[pybsddb] How to manage logs
Amirouche Boubekki
amirouche at hypermove.net
Fri Jun 19 10:54:41 CEST 2015
One last update on this topic. To sum up what I am doing, so that
other people can get the picture quickly:
- building a graph database using pybsddb3, the Python driver for Oracle
Berkeley DB (you already know that)
- there are 2 HASH databases (vertex and edge values) and 2 BTREE
databases (indexes of vertex and edge labels)
- creating a vertex puts one value in the BTREE index and one in the
vertex HASH database
- creating an edge puts one value in the BTREE index and one in the edge
HASH database
- the BTREE databases are only used to query elements (vertices, edges)
by label; this requires one cursor per index, per query
- retrieving an edge by label requires 3 gets (or one if you only need
the edge properties) plus the index lookup (one cursor)
- retrieving a vertex requires one get plus the index lookup (one cursor)
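The lookup path above can be sketched as follows. This is only an
illustration of the access pattern: plain dicts stand in for the real
BTREE index (label -> identifiers, with duplicate keys) and the HASH
value store (identifier -> properties); the names are mine, not the
actual bsddb3 API.

```python
# Stand-ins for the two databases: in bsddb3 the index would be a
# DB_BTREE with duplicate keys (one label maps to many identifiers)
# and the value store a DB_HASH keyed by identifier.
label_index = {}    # label -> list of identifiers (BTREE index stand-in)
vertex_store = {}   # identifier -> properties (HASH store stand-in)

def add_vertex(identifier, label, properties):
    # one put in the index, one put in the value store
    label_index.setdefault(label, []).append(identifier)
    vertex_store[identifier] = properties

def vertices_by_label(label):
    # index lookup (a cursor walk over duplicate keys in the BTREE)
    # followed by one get per matching identifier in the HASH store
    for identifier in label_index.get(label, []):
        yield identifier, vertex_store[identifier]

add_vertex(1, 'concept', {'name': 'dog'})
add_vertex(2, 'concept', {'name': 'cat'})
print(list(vertices_by_label('concept')))
```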
I am loading the conceptnet dataset, 8G packed with msgpack; there are
13,280,680 edges in the dataset.
Loading the data with transactions is *very* slow: I computed that it
would take several hundred hours. So I removed transaction support:
> flags = (
>     DB_CREATE
>     # | DB_INIT_LOG
>     # | DB_INIT_TXN
>     | DB_INIT_MPOOL
>     # | DB_INIT_LOCK
> )
> # self._env.log_set_config(
> #     DB_LOG_AUTO_REMOVE
> #     # | DB_LOG_NOSYNC
> #     , True
> # )
> # self._env.set_lg_max(1024 ** 3)
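The alternative suggested later in the thread is to keep transactions
but batch many puts into one transaction and checkpoint at regular
intervals, so DB_LOG_AUTO_REMOVE can reclaim old log files. A minimal
sketch of that batching logic only, with stand-in callables instead of
real DBEnv.txn_begin() / DBTxn.commit() / DBEnv.txn_checkpoint() calls
(the function name and batch size are illustrative):

```python
def load_batched(edges, begin_txn, put, checkpoint, batch_size=10000):
    # Group puts into one transaction per batch_size operations, then
    # commit and checkpoint so the now-unused log files can be removed.
    txn = begin_txn()
    count = 0
    for key, value in edges:
        put(txn, key, value)
        count += 1
        if count % batch_size == 0:
            txn.commit()
            checkpoint()  # with bsddb3 this would be env.txn_checkpoint()
            txn = begin_txn()
    txn.commit()          # commit the final, possibly partial, batch
    checkpoint()

# Tiny demonstration with fake handles: 25 edges in batches of 10
# yields 3 commits (10 + 10 + 5) and 3 checkpoints.
class FakeTxn:
    commits = 0
    def commit(self):
        FakeTxn.commits += 1

checkpoints = []
load_batched(((i, str(i)) for i in range(25)),
             begin_txn=FakeTxn,
             put=lambda txn, k, v: None,
             checkpoint=lambda: checkpoints.append(True),
             batch_size=10)
print(FakeTxn.commits, len(checkpoints))
```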
A few facts:
- in a previous post I was trying to set DB_LOG_AUTO_REMOVE with
DBEnv.set_flags, which doesn't work
- DB_LOG_NOSYNC is in the Oracle documentation but is not in pybsddb
- the environment is set up with 1G/4G of cache size
After 9 hours, I am at 40% of processed edges and the database size is 7G.
Best regards,
Amirouche
On 2015-06-18 16:15, Amirouche Boubekki wrote:
> On 2015-06-18 16:02, Lauren Foutz wrote:
>> On 6/18/2015 9:48 AM, Amirouche Boubekki wrote:
>>> On 2015-06-18 14:26, Lauren Foutz wrote:
>>>> If the environment and database are transaction enabled, then every
>>>> operation will use transactions, regardless of whether you create
>>>> one
>>>> explicitly. BDB will create a transaction internally and commit it
>>>> when the operation finishes, or abort it on an error.
>>>
>>> It will create one transaction per operation (get, put, delete).
>>
>> Yes.
>>
>>> Does it provide any speedup over using transactions explicitly?
>>
>> No, it tends to be slower since each commit requires that the logs be
>> flushed to disk. It is better to use an explicit transaction, and use
>> the same transaction over multiple put/get/delete operations.
>
> So:
>
> - I should use larger transactions instead,
> - or never use transactions for a given database.
>
>>>
>>> Is a database created *without* transaction compatible with opening
>>> it later *with* transactions?
>>
>> Short answer, no, a database needs to either always support
>> transactions, or never support transactions. Long answer, if you use
>> the function DB_ENV->lsn_reset() to reset the log number in each of
>> the database files, and then delete the environment files (those that
>> start with __db), you may be able to re-open the databases in a new
>> transactionally enabled environment. But I am not certain that will
>> work. The safe bet is to either have the database always support
>> transactions, or never support transactions.
>>
>>>
>>>>
>>>> As for how to reduce the number of logs: using DB_LOG_AUTO_REMOVE
>>>> is a good start, but it will not remove logs until you run a
>>>> checkpoint. So I recommend you execute a checkpoint at regular
>>>> intervals while loading data into the databases.
>>>
>>> Ok! that's what I was missing.
>
> I made the changes, it looks better for now.
>
>>>>
>>>> Also, you should remove the comment in front of DB_INIT_LOG in
>>>> flags,
>>>
>>> ok
>>>
>>>> and also add the flag DB_INIT_LOCK.
>>>
>>> I don't need locks since it's single-threaded, no?
>>
>> Transactions assume locks are used, regardless of whether the program
>> is single threaded or not. Using transactions without locks can lead
>> to undefined behavior such as a program crash due to accessing
>> uninitialized memory.
>
> Ok thanks for your quick responses.
>
>>
>> Lauren Foutz
>>
>>>
>>> Best regards,
>>>
>>>>
>>>> Lauren Foutz
>>>>
>>>> On 6/18/2015 5:58 AM, Amirouche Boubekki wrote:
>>>>> Héllo,
>>>>>
>>>>>
>>>>> I'm loading a dataset (conceptnet5) into Ajgu Db [1] backed by
>>>>> pybsddb3 '6.0.1' and Berkeley DB 5.3.21.
>>>>>
>>>>> The problem I have is that even when I'm not using transactions
>>>>> (passing txn=None) my database fills the disk with log files. There
>>>>> is 2.3 GB of database files (including __db.* files) out of 429 GB
>>>>> total disk space used by the database directory (du -h .).
>>>>>
>>>>> How can I remove those log files during the import of the
>>>>> database? Right now the script can't even finish loading the
>>>>> first file of the dataset.
>>>>>
>>>>> My db environment is configured as follows:
>>>>>
>>>>> ```
>>>>> # init bsddb3
>>>>> self._env = DBEnv()
>>>>> self._env.set_cache_max(*max_cache_size)
>>>>> self._env.set_cachesize(*cache_size)
>>>>> flags = (
>>>>>     DB_CREATE
>>>>>     # | DB_INIT_LOG
>>>>>     | DB_INIT_TXN
>>>>>     | DB_INIT_MPOOL
>>>>> )
>>>>> self._env.set_flags(DB_LOG_AUTO_REMOVE, True)
>>>>> self._env.open(
>>>>>     str(self._path),
>>>>>     flags,
>>>>>     0
>>>>> )
>>>>> ```
>>>>> https://git.framasoft.org/python-graphiti-love-story/AjguGraphDB/blob/f8bf004ee132ac21fcbbb1c925889a16f1d5388d/ajgu/storage.py#L62
>>>>> Every single store is created with the following function
>>>>>
>>>>> ```
>>>>> # create vertices and edges k/v stores
>>>>> def new_store(name, method):
>>>>>     txn = self._txn()
>>>>>     flags = DB_CREATE
>>>>>     elements = DB(self._env)
>>>>>     elements.open(
>>>>>         name,
>>>>>         None,
>>>>>         method,
>>>>>         flags,
>>>>>         0,
>>>>>         txn=txn._txn
>>>>>     )
>>>>>     txn.commit()
>>>>>     return elements
>>>>> ```