[pybsddb] Batch import is slowing down
Amirouche Boubekki
amirouche at hypermove.net
Sat Jun 27 18:30:44 CEST 2015
On 2015-06-26 18:27, Jesus Cea wrote:
> On 25/06/15 13:33, Amirouche Boubekki wrote:
>
>>> Do your structures contains circular references?
>>
>> There are some. Should I add some weakrefs?
>
> That is an option. Another option is to break cycles by hand before
> freeing the objects.
Thanks for the reminder :)
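For what it's worth, breaking a cycle by hand frees the objects through plain reference counting even with the collector disabled; a minimal sketch with a hypothetical Node class:

```python
import gc
import weakref

gc.disable()  # refcounting only; the cyclic collector is off

class Node:
    def __init__(self):
        self.next = None

a, b = Node(), Node()
a.next, b.next = b, a      # a circular reference
probe = weakref.ref(b)     # watch whether b really gets freed

# Break the cycle by hand before dropping the objects, so plain
# reference counting can reclaim them even while gc is disabled.
a.next = None
del a, b

print(probe() is None)     # b was freed without the collector
gc.enable()
```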
>
>>> Could you add this at the beginning of your python code?
>>>
>>> import gc
>>> gc.disable()
>>
>> It helps a little bit, between 10% and 20%.
>
> Uhmmm.
>
> Try this now: load 200.000 objects, shut down the program and then run
> it again and load another 200.000 objects on the database already
> populated.
> Time both steps.
As you suggested, I have done a few experiments.
The loop by itself, without any database calls, is negligible: 1 second.
I did the timings using the timeit module.
These are the timings for each successive loop, done with txn checkpoints
and 200.000 entries:
> with sync with gc
>> first: 314 seconds
>> second: 574
>> third: 791
> without sync with gc
>> first: 175 seconds
>> second: 368
>> third: 603
> without sync without gc
>> first: 172 seconds
>> second: 359
>> third: 581
For the following test I used a non-ACID, txn-less environment:
> txn less environment without gc
>> first: 72 seconds
>> second: 87
>> third: 101
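For reference, the timings were gathered with the timeit module, roughly like this (load_batch here is only a stand-in for the actual 200.000-entry import loop):

```python
import timeit

def load_batch():
    # Stand-in for the real loop that puts 200.000 entries into
    # the database; replace the body with the actual import code.
    total = 0
    for i in range(200_000):
        total += i
    return total

# Time one full batch, as with each 200.000-entry load above.
elapsed = timeit.timeit(load_batch, number=1)
print(f"batch took {elapsed:.2f} seconds")
```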
Configuration of the ACID environment:
from bsddb3.db import *

env = DBEnv()
# Cache: at most 4 GiB, initially 3 GiB (gbytes, bytes).
env.set_cache_max(4, 0)
env.set_cachesize(3, 0)
flags = (
    DB_CREATE
    | DB_INIT_LOG
    | DB_INIT_TXN
    | DB_INIT_MPOOL
    | DB_INIT_LOCK
)
# Remove old log files automatically; allow log files up to 1 GiB.
env.log_set_config(DB_LOG_AUTO_REMOVE, True)
env.set_lg_max(1024 ** 3)
Primary tables use the hash backend. In the worst case an entry will do:
- 2 puts in the vertices hash
- 1 put in the edges hash table
- 3 (small) puts in two different btrees
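So a single entry costs up to six puts. As a rough sketch of that write pattern, with plain dicts standing in for the DB_HASH and DB_BTREE handles (the names and key layout here are made up, not the actual schema):

```python
# Stand-ins for the BDB tables; the real code uses DB handles
# opened with DB_HASH (vertices, edges) and DB_BTREE (indexes).
vertices = {}
edges = {}
index_a = {}
index_b = {}

def put_entry(src, dst, label):
    # Worst case per entry: six puts total.
    vertices[src] = b''          # 2 puts in the vertices hash
    vertices[dst] = b''
    edges[(src, dst)] = label    # 1 put in the edges hash
    index_a[(label, src)] = b''  # 3 small puts across two btrees
    index_a[(label, dst)] = b''
    index_b[(src, label)] = b''

put_entry(1, 2, b'knows')
```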