[pybsddb] Problem scanning large hashed database

Jesus Cea jcea at argo.es
Thu Dec 4 21:47:33 CET 2008



andrew wrote:
> Hi All,
> 
> I have a large-ish production database (1.6M objects) that usually gets
> accessed only for reads or writes on specific keys, for which it is nice
> and fast. However, I occasionally need to be able to do some ad hoc
> queries on this data, and I seem to be running into some problems doing
> that.
> 
> My approach is simply to fetch all keys and then iterate through the
> keys, fetching the data for each key and then writing some of it to an
> output file. I want to do this while the database is still "up", i.e.,
> while a server process that allows access to the database via Pyro is
> still running. Unfortunately, when I run this scanning script (which
> takes minutes just to fetch all the keys) it seems to kill the Pyro
> process with:
> 
> bsddb.db.DBRunRecoveryError: (-30978, 'DB_RUNRECOVERY: Fatal error, run
> database recovery -- PANIC: fatal region error detected; run recovery',
> 'This error occured remotely (Pyro). Remote traceback is available.')
> 
> My questions are:
> 
> 1. Is there a better way to do this ?
> 2. Could it be some sort of locking problem ?
> 3. Has anyone else done anything like this successfully ?

We don't know if you are using transactions, for instance.

A common problem is forgetting that each page visited inside a
transaction keeps a read lock on it until the transaction completes
(commit or abort). So you usually can NOT cursor over 1.6 million
records in a single transaction, because a) your lock table is going to
overflow, and b) you are locking out any other write transaction.
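
For illustration, this is the kind of pattern that bites (a sketch;
"env" and "d" stand for your transactional DBEnv and open DB, names of
my own choosing):

    # Naive full scan: ONE transaction accumulates read locks on
    # every page it touches, and holds them all until the final
    # commit. With 1.6M records the lock table overflows and any
    # concurrent writer is blocked for the whole scan.
    txn = env.txn_begin()
    cursor = d.cursor(txn)
    rec = cursor.first()
    while rec is not None:
        key, data = rec
        # ... process the record ...
        rec = cursor.next()
    cursor.close()
    txn.commit()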

Without knowing your code and requirements, consider splitting the scan
into several small cursor requests, say 100-1000 records each. Then
close the cursor/transaction, open a new one, reposition on the last
key you read, and continue from there until you are done.
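
Something along these lines; a minimal sketch only (the "env", "d",
"process" names and the batch size are mine, error handling is omitted,
and it assumes the default set_get_returns_none() mode, where cursor
gets return None on DB_NOTFOUND instead of raising):

    from bsddb3 import db   # "from bsddb import db" on older setups

    BATCH = 1000  # records per transaction; tune to your lock table

    def scan_in_batches(env, d, process):
        # Scan the whole database, one short transaction per batch,
        # so read locks are released at every commit.
        last_key = None
        while True:
            txn = env.txn_begin()
            cursor = d.cursor(txn)
            if last_key is None:
                rec = cursor.first()
            else:
                # Reposition on the last key we saw and step past it.
                # set() is an exact match, so it also works on hash
                # databases; a btree could use set_range() instead.
                # NOTE: if that record was deleted meanwhile, set()
                # finds nothing and this sketch simply stops.
                rec = cursor.set(last_key)
                if rec is not None:
                    rec = cursor.next()
            count = 0
            while rec is not None and count < BATCH:
                key, data = rec
                process(key, data)
                last_key = key
                count += 1
                rec = cursor.next()
            cursor.close()
            txn.commit()   # releases this batch's read locks
            if rec is None:
                return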

This approach should work, but remember you are going to miss some
records/get some duplicate data if the database is being actively
written. And you lose atomicity across the scan (that is, you can see
records changed by different transactions while scanning).

Anyway, fully scanning the database seems a bad thing to do. You don't
need BDB for that. Are you sure there is no other way?

PS: I think there is a cursor flag (DB_READ_COMMITTED, "degree 2
isolation") that stops it from keeping locks on the pages it has
already left behind. See:

<http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/read.html>

If pybsddb doesn't support this functionality yet (I am not sure about
that), just ask for it.
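
If it is supported in your version, usage would look like this (a
sketch only, reusing the names assumed above):

    txn = env.txn_begin()
    # Degree 2 isolation: read locks are dropped as soon as the
    # cursor moves on, so a long scan no longer starves writers or
    # overflows the lock table. The price is the same caveat as the
    # batched approach: records already scanned can change under you
    # before the transaction commits.
    cursor = d.cursor(txn, db.DB_READ_COMMITTED)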

Berkeley DB is very powerful, but complex. Consider reading the full
documentation at
<http://www.oracle.com/technology/documentation/berkeley-db/db/ref/toc.html>.
It is very well written and highly enlightening. Recommended!

--
Jesus Cea Avion                         _/_/      _/_/_/        _/_/_/
jcea at jcea.es - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
jabber / xmpp:jcea at jabber.org         _/_/    _/_/          _/_/_/_/_/
.                              _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz


