[pybsddb] Bulk Data load into a DB

Jeff Johnson jbj at jbj.org
Tue Mar 30 03:04:29 CEST 2010


On Mar 29, 2010, at 8:03 PM, Jesus Cea wrote:

> On 03/29/2010 10:39 PM, Jon Kerr Nilsen wrote:
> Forgot to say that our application interface has bulk support. I just
> meant that we have to do bulk transfers all the time, so the
> application code would be better if we could just stuff key/value
> lists into bdb in one go.
> 
> So now your application opens a transaction, cycles over the key/value
> pairs received, does a "put" for each, and does a final "commit". I guess.
> 
> I will support the bulk interface, but I would like somebody to measure
> the performance improvement. Any volunteers? It would need to be coded in C...
> 
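
Right, that's the pattern. For concreteness, here is a minimal sketch
of that per-record loop with bsddb3; the environment path, database
name and key/value pairs are all illustrative, nothing from RPM:

    import os
    import bsddb3.db as db

    home = '/tmp/bulk-env'                  # illustrative path
    if not os.path.isdir(home):
        os.makedirs(home)

    env = db.DBEnv()
    env.open(home, db.DB_CREATE | db.DB_INIT_TXN | db.DB_INIT_MPOOL |
                   db.DB_INIT_LOCK | db.DB_INIT_LOG)
    d = db.DB(env)
    d.open('data.db', dbtype=db.DB_BTREE,
           flags=db.DB_CREATE | db.DB_AUTO_COMMIT)

    def put_many(pairs):
        # One transaction around the whole batch: a single commit
        # (and log flush) instead of one per record.
        txn = env.txn_begin()
        try:
            for key, value in pairs:
                d.put(key, value, txn=txn)
        except:
            txn.abort()
            raise
        else:
            txn.commit()

    put_many([(b'k1', b'v1'), (b'k2', b'v2')])

A true bulk interface would replace the inner loop with a single
DB_MULTIPLE-style put, which is exactly the part that needs C support
in the bindings.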

Having just converted RPM from a DB_INIT_CDB to a DB_INIT_TXN model,
I have relevant experience with regard to bulk interfaces and more.
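
For anyone unfamiliar with the two models, the conversion boils down
to the environment-open flags. A sketch in bsddb3 terms (the path is
illustrative):

    import bsddb3.db as db

    home = '/path/to/env'                   # illustrative

    # Concurrent Data Store: one writer at a time, no logging,
    # no recovery:
    #   flags = db.DB_CREATE | db.DB_INIT_CDB | db.DB_INIT_MPOOL

    # Transactional Data Store: locking, logging, recovery and
    # txn_begin()/commit():
    flags = (db.DB_CREATE | db.DB_INIT_TXN | db.DB_INIT_MPOOL |
             db.DB_INIT_LOCK | db.DB_INIT_LOG)

    env = db.DBEnv()
    env.open(home, flags)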

Under the DB_INIT_TXN model, RPM queries into Berkeley DB are
significantly (>10x) faster, in spite of the increased locking
overhead (as measured with callgrind).

A large part of the speed-up comes from using secondary indices,
with a DB->associate callback and bulk updates to add the index
entries.
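
In pybsddb terms that machinery looks roughly like the sketch below;
the database names and the key layout are made up for illustration,
they are not RPM's:

    import os
    import bsddb3.db as db

    home = '/tmp/assoc-env'                 # illustrative
    if not os.path.isdir(home):
        os.makedirs(home)
    env = db.DBEnv()
    env.open(home, db.DB_CREATE | db.DB_INIT_TXN | db.DB_INIT_MPOOL |
                   db.DB_INIT_LOCK | db.DB_INIT_LOG)

    primary = db.DB(env)
    primary.open('packages.db', dbtype=db.DB_BTREE,
                 flags=db.DB_CREATE | db.DB_AUTO_COMMIT)

    byname = db.DB(env)
    byname.set_flags(db.DB_DUPSORT)         # many records per name
    byname.open('byname.db', dbtype=db.DB_BTREE,
                flags=db.DB_CREATE | db.DB_AUTO_COMMIT)

    def name_key(pkey, pdata):
        # Derive the secondary key from the primary record; the value
        # is assumed (purely for illustration) to begin with a
        # NUL-terminated name field.
        name = pdata.split(b'\0', 1)[0]
        return name if name else db.DB_DONOTINDEX

    # From here on, every put/delete on 'primary' maintains 'byname'
    # automatically through the callback.
    primary.associate(byname, name_key)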

All subjective, sorry. Designing a benchmark that captures just the
additional speed-up from doing bulk rather than individual updates is
likely to depend too heavily on how much data the benchmark moves,
and (imho) would not be generally relevant, since too many issues are
convolved in Berkeley DB performance.

hth

73 de Jeff


