[pybsddb] Batch import is slowing down

Amirouche Boubekki amirouche at hypermove.net
Thu Jun 25 13:33:09 CEST 2015


On 2015-06-25 01:44, Jesus Cea wrote:
> Disclosure: This is a mailing list about python bindings for Berkeley
> DB, not about Berkeley DB performance. Moreover, I do professional
> consulting work for this kind of performance "anomaly" :), for both
> Berkeley DB and Python itself.
> 
> That said, let's try some analysis.

Ok, thanks. I'm just a free software and graphdb aficionado and I'm 
not making any money on this. There are a few downloads. The difficult 
part of the library is bsddb, so I'll probably forward people to 
you if they have any issues.

> 
> I don't see anything suspicious in your "db_stat -m" dumps. The number
> of pages touched grows (roughly) linearly with the number of nodes
> added, as expected.
> 
> Before digging further into Berkeley DB, let's rule out that you are
> hitting a "classical" garbage collection anomaly in Python. Tell me
> what you do with the node objects in Python once you finish loading
> them.


> Do you keep the data in RAM?

No. I don't do anything like `vertices.append(vertex)`.

> Do your structures contains circular references?

There are some. Should I add some weakrefs?
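
To make the weakref question concrete, here is a minimal sketch of breaking a vertex/edge cycle with `weakref.ref`. The `Vertex` and `Edge` classes are hypothetical stand-ins, not the library's actual classes, and the immediate freeing shown relies on CPython's reference counting.

```python
import weakref


class Vertex(object):
    """Hypothetical node type, used only for illustration."""

    def __init__(self, name):
        self.name = name
        self.edges = []  # strong references: the vertex owns its edges


class Edge(object):
    """Hypothetical edge type that keeps only a weak back-reference."""

    def __init__(self, start):
        # A weak reference avoids the vertex -> edge -> vertex cycle.
        self._start = weakref.ref(start)
        start.edges.append(self)

    @property
    def start(self):
        # Returns None once the vertex has been freed.
        return self._start()


vertex = Vertex("a")
edge = Edge(vertex)
probe = weakref.ref(vertex)

del vertex  # last strong reference to the vertex goes away
vertex_freed = probe() is None
```

With a strong back-reference instead, the vertex would only be reclaimed by the cyclic garbage collector, which is the situation the question above is about.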

> Could you add this at the beginning of your python code?
> 
>   import gc
>   gc.disable()

It helps a little bit, between 10% and 20%.
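
Instead of disabling the collector for the whole program, it could be paused only around the import. This is a sketch under assumptions: `gc_paused` is a hypothetical helper name, and the list comprehension is a stand-in for the real import loop.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector inside the with-block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state and reclaim cycles created meanwhile.
        if was_enabled:
            gc.enable()
        gc.collect()


with gc_paused():
    enabled_inside = gc.isenabled()
    # Stand-in for the batch import loop.
    nodes = [{"id": i} for i in range(1000)]
```

Reference counting still frees non-cyclic objects while the collector is paused; only objects caught in reference cycles wait for the final `gc.collect()`.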

In the following timings I also excluded the checkpoint:

10000 0:00:03.510748
20000 0:00:03.661955
30000 0:00:04.656853
40000 0:00:03.776255
50000 0:00:05.608382
60000 0:00:04.555878
70000 0:00:04.391641
80000 0:00:04.373011
90000 0:00:04.678581
100000 0:00:03.512782
110000 0:00:07.951906
120000 0:00:03.498123
130000 0:00:04.125034
140000 0:00:06.041967
150000 0:00:09.118047
160000 0:00:03.803080
170000 0:00:04.085018
180000 0:00:04.433703
190000 0:00:05.683068
200000 0:00:04.487975
210000 0:00:12.519483
220000 0:00:04.559587
230000 0:00:04.671076
240000 0:00:04.599431
250000 0:00:03.500114
260000 0:00:05.962410
270000 0:00:10.361552
280000 0:00:27.594989
290000 0:00:16.008858
300000 0:00:19.819034
310000 0:00:40.575722
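
Timings in this format can be collected roughly as follows. This is only a sketch, not the actual import script: `timed_batches`, `add` and `checkpoint` are hypothetical names, and the demo uses a plain list in place of the database.

```python
from datetime import datetime


def timed_batches(items, batch_size, add, checkpoint):
    """Yield (count, elapsed) per full batch, excluding checkpoint time."""
    start = datetime.now()
    for count, item in enumerate(items, 1):
        add(item)
        if count % batch_size == 0:
            elapsed = datetime.now() - start  # stop the clock first
            checkpoint()                      # not included in the timing
            yield count, elapsed
            start = datetime.now()            # restart after the checkpoint


# Demo with stand-ins: a list instead of the database, a no-op checkpoint.
stored = []
results = list(timed_batches(range(25), 10, stored.append, lambda: None))
for count, elapsed in results:
    print(count, elapsed)
```

Printing a `timedelta` gives the `H:MM:SS.ffffff` form used in the listing above. Items left over after the last full batch are added but not reported.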

