[pybsddb] How to manage logs
Amirouche Boubekki
amirouche at hypermove.net
Fri Jun 19 10:54:41 CEST 2015
One last update on this topic. To sum up what I am doing, so that
other people can get the picture quickly:
- building a graph database using pybsddb3, the Python driver for Oracle
Berkeley DB (you already know that)
- there are 2 HASH databases (vertex and edge values) and 2 BTREE
databases (indexes of vertex and edge labels)
- creating a vertex puts one value in the BTREE index and one in the
vertex HASH database
- creating an edge puts one value in the BTREE index and one in the edge
HASH database
- the BTREE databases are only used to query elements (vertices, edges)
by label; this requires one cursor per index, per query
- retrieving an edge by label requires 3 gets (or one if you only need
the edge properties) plus the index lookup (one cursor)
- retrieving a vertex requires one get plus the index lookup (one cursor)
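The lookup path above can be sketched as follows. This is only an
illustration of the access pattern: plain dicts stand in for the real
BTREE index (label -> identifiers, with duplicate keys) and the HASH
value store (identifier -> properties); the names are mine, not the
actual bsddb3 API.

```python
# Stand-ins for the two databases: in bsddb3 the index would be a
# DB_BTREE with duplicate keys (one label maps to many identifiers)
# and the value store a DB_HASH keyed by identifier.
label_index = {}    # label -> list of identifiers (BTREE index stand-in)
vertex_store = {}   # identifier -> properties (HASH store stand-in)

def add_vertex(identifier, label, properties):
    # one put in the index, one put in the value store
    label_index.setdefault(label, []).append(identifier)
    vertex_store[identifier] = properties

def vertices_by_label(label):
    # index lookup (a cursor walk over duplicate keys in the BTREE)
    # followed by one get per matching identifier in the HASH store
    for identifier in label_index.get(label, []):
        yield identifier, vertex_store[identifier]

add_vertex(1, 'concept', {'name': 'dog'})
add_vertex(2, 'concept', {'name': 'cat'})
print(list(vertices_by_label('concept')))
```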
I am loading the conceptnet dataset, 8G packed with msgpack; there are
13,280,680 edges in the dataset.
Loading the data with transactions is *very* slow: I computed that it
would take several hundred hours. So I removed transaction support:
> flags = (
>     DB_CREATE
>     # | DB_INIT_LOG
>     # | DB_INIT_TXN
>     | DB_INIT_MPOOL
>     # | DB_INIT_LOCK
> )
> # self._env.log_set_config(
> #     DB_LOG_AUTO_REMOVE
> #     # | DB_LOG_NOSYNC
> #     , True
> # )
> # self._env.set_lg_max(1024 ** 3)
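The alternative suggested later in the thread is to keep transactions
but batch many puts into one transaction and checkpoint at regular
intervals, so DB_LOG_AUTO_REMOVE can reclaim old log files. A minimal
sketch of that batching logic only, with stand-in callables instead of
real DBEnv.txn_begin() / DBTxn.commit() / DBEnv.txn_checkpoint() calls
(the function name and batch size are illustrative):

```python
def load_batched(edges, begin_txn, put, checkpoint, batch_size=10000):
    # Group puts into one transaction per batch_size operations, then
    # commit and checkpoint so the now-unused log files can be removed.
    txn = begin_txn()
    count = 0
    for key, value in edges:
        put(txn, key, value)
        count += 1
        if count % batch_size == 0:
            txn.commit()
            checkpoint()  # with bsddb3 this would be env.txn_checkpoint()
            txn = begin_txn()
    txn.commit()          # commit the final, possibly partial, batch
    checkpoint()

# Tiny demonstration with fake handles: 25 edges in batches of 10
# yields 3 commits (10 + 10 + 5) and 3 checkpoints.
class FakeTxn:
    commits = 0
    def commit(self):
        FakeTxn.commits += 1

checkpoints = []
load_batched(((i, str(i)) for i in range(25)),
             begin_txn=FakeTxn,
             put=lambda txn, k, v: None,
             checkpoint=lambda: checkpoints.append(True),
             batch_size=10)
print(FakeTxn.commits, len(checkpoints))
```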
A few facts:
- in a previous post I was trying to set DB_LOG_AUTO_REMOVE with
DBEnv.set_flags, which doesn't work
- DB_LOG_NOSYNC is in the Oracle documentation but is not in pybsddb
- the environment is set up with 1G/4G of cache size
After 9 hours, I am at 40% of processed edges and the database size is 7G.
Best regards,
Amirouche
On 2015-06-18 16:15, Amirouche Boubekki wrote:
> On 2015-06-18 16:02, Lauren Foutz wrote:
>> On 6/18/2015 9:48 AM, Amirouche Boubekki wrote:
>>> On 2015-06-18 14:26, Lauren Foutz wrote:
>>>> If the environment and database are transaction enabled, then every
>>>> operation will use transactions, regardless of whether you create
>>>> one
>>>> explicitly. BDB will create a transaction internally and commit it
>>>> when the operation finishes, or abort it on an error.
>>>
>>> It will create one transaction per operation (get, put, delete).
>>
>> Yes.
>>
>>> Does it provide any speedup over using transactions explicitly?
>>
>> No, it tends to be slower since each commit requires that the logs be
>> flushed to disk. It is better to use an explicit transaction, and use
>> the same transaction over multiple put/get/delete operations.
>
> So:
>
> - I should use larger transactions instead,
> - or never use transactions for a given database.
>
>>>
>>> Is a database created *without* transaction compatible with opening
>>> it later *with* transactions?
>>
>> Short answer, no, a database needs to either always support
>> transactions, or never support transactions. Long answer, if you use
>> the function DB_ENV->lsn_reset() to reset the log number in each of
>> the database files, and then delete the environment files (those that
>> start with __db), you may be able to re-open the databases in a new
>> transactionally enabled environment. But I am not certain that will
>> work. The safe bet is to either have the database always support
>> transactions, or never support transactions.
>>
>>>
>>>>
>>>> As for how to reduce the number of logs: using DB_LOG_AUTO_REMOVE
>>>> is a good start, but it will not remove logs until you run a
>>>> checkpoint. So I recommend you execute a checkpoint at regular
>>>> intervals while loading data into the databases.
>>>
>>> Ok! that's what I was missing.
>
> I made the changes, it looks better for now.
>
>>>>
>>>> Also, you should remove the comment in front of DB_INIT_LOG in
>>>> flags,
>>>
>>> ok
>>>
>>>> and also add the flag DB_INIT_LOCK.
>>>
>>> I don't need locks since it's single-threaded, no?
>>
>> Transactions assume locks are used, regardless of whether the program
>> is single threaded or not. Using transactions without locks can lead
>> to undefined behavior such as a program crash due to accessing
>> uninitialized memory.
>
> Ok thanks for your quick responses.
>
>>
>> Lauren Foutz
>>
>>>
>>> Best regards,
>>>
>>>>
>>>> Lauren Foutz
>>>>
>>>> On 6/18/2015 5:58 AM, Amirouche Boubekki wrote:
>>>>> Héllo,
>>>>>
>>>>>
>>>>> I'm loading a dataset (conceptnet5) into Ajgu Db [1] backed by
>>>>> pybsddb3 '6.0.1' and Berkeley DB 5.3.21.
>>>>>
>>>>> The problem I have is that even when I'm not using transactions
>>>>> (passing txn=None) my database fills the disk with log files. There
>>>>> is 2.3 GB of database files (including __db.* files) out of 429 GB
>>>>> total disk space used by the database directory (du -h .).
>>>>>
>>>>> How can I remove those log files during the import of the
>>>>> database? Right now the script can't even finish loading the
>>>>> first file of the dataset.
>>>>>
>>>>> My db environment is configured as follows:
>>>>>
>>>>> ```
>>>>> # init bsddb3
>>>>> self._env = DBEnv()
>>>>> self._env.set_cache_max(*max_cache_size)
>>>>> self._env.set_cachesize(*cache_size)
>>>>> flags = (
>>>>>     DB_CREATE
>>>>>     # | DB_INIT_LOG
>>>>>     | DB_INIT_TXN
>>>>>     | DB_INIT_MPOOL
>>>>> )
>>>>> self._env.set_flags(DB_LOG_AUTO_REMOVE, True)
>>>>> self._env.open(
>>>>>     str(self._path),
>>>>>     flags,
>>>>>     0
>>>>> )
>>>>> ```
>>>>> https://git.framasoft.org/python-graphiti-love-story/AjguGraphDB/blob/f8bf004ee132ac21fcbbb1c925889a16f1d5388d/ajgu/storage.py#L62
>>>>> Every single store is created with the following function
>>>>>
>>>>> ```
>>>>> # create vertices and edges k/v stores
>>>>> def new_store(name, method):
>>>>>     txn = self._txn()
>>>>>     flags = DB_CREATE
>>>>>     elements = DB(self._env)
>>>>>     elements.open(
>>>>>         name,
>>>>>         None,
>>>>>         method,
>>>>>         flags,
>>>>>         0,
>>>>>         txn=txn._txn
>>>>>     )
>>>>>     txn.commit()
>>>>>     return elements
>>>>> ```