[pybsddb] Batch import is slowing down

Amirouche Boubekki amirouche at hypermove.net
Sat Jun 27 18:30:44 CEST 2015


On 2015-06-26 18:27, Jesus Cea wrote:
> On 25/06/15 13:33, Amirouche Boubekki wrote:
> 
>>> Do your structures contains circular references?
>> 
>> There are some. Should I add some weakrefs?
> 
> That is an option. Another option is to break cycles by hand before
> freeing the objects.

Thanks for the reminder :)
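For my own notes, a minimal sketch of both options, stdlib only (the Node class here is hypothetical, not the real structure):

```python
import gc
import weakref

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

# Option 1: break the cycle by hand before dropping the objects,
# so plain refcounting frees them without the cyclic gc.
a, b = Node("a"), Node("b")
a.children.append(b)
b.parent = a            # cycle: a -> b -> a
b.parent = None         # break it by hand
del a, b

# Option 2: hold the back-reference through a weakref, so no
# cycle is ever created in the first place.
a, b = Node("a"), Node("b")
a.children.append(b)
b.parent = weakref.ref(a)   # does not keep `a` alive
assert b.parent() is a
del a                       # refcount drops to zero, `a` is freed
print(b.parent())           # None: the weakref is now dead
```

With option 2 the objects never enter the gc's cycle detector at all, which is what matters when gc pauses are the suspect.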

> 
>>> Could you add this at the beginning of your python code?
>>> 
>>>   import gc
>>>   gc.disable()
>> 
>> It helps a little bit, between 10% and 20%.
> 
> Uhmmm.
> 
> Try this now: load 200.000 objects, shut down the program, then run it
> again and load another 200.000 objects on the already populated
> database. Time both steps.

As you suggested, I have done a few experiments.

The loop alone, without database calls, is negligible: 1 second.

I've done the timings using the timeit module.

These are the timings for each successive loop, done with txn 
checkpoints and 200.000 entries per loop:

> with sync, with gc
>> first: 314 seconds
>> second: 574 seconds
>> third: 791 seconds

> without sync, with gc
>> first: 175 seconds
>> second: 368 seconds
>> third: 603 seconds

> without sync, without gc
>> first: 172 seconds
>> second: 359 seconds
>> third: 581 seconds

For the following test I used a non-ACID, txn-less environment:

> txn-less environment, without gc
>> first: 72 seconds
>> second: 87 seconds
>> third: 101 seconds
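For the record, the timing pattern looks like this; a minimal sketch with a plain dict standing in for the database (store and load_batch are stand-ins, not the real code):

```python
import timeit

store = {}

def load_batch(start, count=200_000):
    # stand-in for the real loop: one put per entry
    for i in range(start, start + count):
        store[str(i).encode()] = b"value"

# time each successive 200.000-entry batch separately,
# as in the figures above
for n in range(3):
    elapsed = timeit.timeit(lambda: load_batch(n * 200_000), number=1)
    print("batch %d: %.2f seconds" % (n + 1, elapsed))
```

Timing each batch separately is what exposes the slowdown: with a healthy store the three figures should be roughly flat, not climbing.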

Configuration of the ACID environment:

         env = DBEnv()                 # assuming: from bsddb3.db import *
         env.set_cache_max(4, 0)       # allow the cache to grow to 4 GB
         env.set_cachesize(3, 0)       # 3 GB cache
         flags = (
             DB_CREATE
             | DB_INIT_LOG
             | DB_INIT_TXN
             | DB_INIT_MPOOL
             | DB_INIT_LOCK
         )
         env.log_set_config(DB_LOG_AUTO_REMOVE, True)  # drop old log files
         env.set_lg_max(1024 ** 3)     # 1 GiB max per log file

Primary tables use a hash backend. In the worst case an entry will do:

- two puts in the vertices hash table,
- one put in the edges hash table,
- three (small) puts spread over two different btrees.
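To make the write amplification concrete, a sketch of that worst case with plain dicts standing in for the BDB tables (all names are hypothetical, and how the three btree puts split across the two btrees is a guess):

```python
# stand-ins for the real tables: two hash tables, two btrees
vertices = {}   # hash: vertex id -> vertex data
edges = {}      # hash: (src, dst) -> edge data
index_a = {}    # btree 1: small index entries
index_b = {}    # btree 2: small index entries

def put_edge(src, dst, label):
    # worst case per entry: 6 puts across 4 tables
    vertices[src] = b"vertex"            # put 1 (vertices hash)
    vertices[dst] = b"vertex"            # put 2 (vertices hash)
    edges[(src, dst)] = label            # put 3 (edges hash)
    index_a[(src, label)] = dst          # put 4 (btree 1)
    index_a[(dst, label)] = src          # put 5 (btree 1)
    index_b[(label, src, dst)] = b""     # put 6 (btree 2)

put_edge(b"a", b"b", b"knows")
```

So one logical entry fans out into six physical puts, which multiplies whatever per-put overhead the environment has.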
