[pybsddb] Data leak in latest bsddb3/berkeleydb packages
jacobhenner at outlook.com
Wed May 7 04:00:59 CEST 2025
Apologies for the imprecise description earlier.
> It would be dangerous if the stale data you see were not previously
> present in the database. Leaking application unrelated data would be
> actually quite ugly.
Yes, this is what I'm reporting, a data leak (and not data corruption).
In the example I shared earlier, I am manipulating a sendmail access.db
database. The modifications are limited to the two loops seen in the
example, plus two more additions similar to the ones in the loops (but
with a literal key and value). Nothing else is deliberately
manipulating these files. When the contents are first written
(pre-modification), I have confirmed that they contain no leaked
data. After the DB operations complete, the output
usually contains leaked data. The specific data that is leaked varies
between runs (for the same database files).
The leaked data likely belongs to the program, but it is unrelated
to the section of code I've shared. As far as I can tell,
there is no opportunity for the leaked data to be passed to the
berkeleydb (or bsddb3) libraries directly, since:
* The leaked data is not accessed in the module where the code I sent
earlier runs.
* There are no parameters, contextvars, object attributes, etc.,
accessible from the code I shared that would contain the data that is
leaked.
* Aside from the two `db[...] = b"..."` (with literals in place of ...)
that I mentioned earlier, all other operations that involve the DB are
present in my example. Since the keys/values only consist of literals
and the string pattern visible in the example, there's no interaction
with the library that would directly introduce the leaked data.
The leaked data is sufficiently distinct from the ordinary contents of
a sendmail access.db file to be noticed immediately. I can confirm that
the data that is being leaked was never part of the database, at any
point. When viewing the raw file, imagine seeing a block (or substring)
of pretty-printed JSON, a partial ini file, or HTML. All of these cases
have been observed, and the leaked data comes from very different parts
of the codebase than the part that manipulates these databases. The
berkeleydb library is not used anywhere in the codebase beyond the
section that corresponds to the example I shared. The file created by
`with tempfile.NamedTemporaryFile() as temp_db: ...` is never supplied
to any other parts of the code, and it closes (and gets deleted)
automatically when the context manager exits, so I don't see an
opportunity for this to be an application-level file operations issue.
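A check along these lines could make the observation reproducible
instead of relying on eyeballing the raw file. This is only a minimal
sketch: the helper names (`printable_runs`, `unexpected_runs`) and the
sample bytes are hypothetical and not taken from the original example.
It extracts runs of printable ASCII from raw bytes and reports those
that contain none of the fragments the database is expected to hold.

```python
import re

def printable_runs(raw: bytes, min_len: int = 8) -> list:
    """Return every run of printable ASCII at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return re.findall(pattern, raw)

def unexpected_runs(raw: bytes, allowed: list, min_len: int = 8) -> list:
    """Return printable runs that contain none of the allowed fragments."""
    return [run for run in printable_runs(raw, min_len)
            if not any(fragment in run for fragment in allowed)]

# Hypothetical raw bytes: one legitimate key/value pair followed by a
# stray JSON fragment of the kind described above.
sample = (b"\x00\x0bexample.com\x00REJECT\x00\x00"
          + b'{"token": "hypothetical"}' + b"\x00")
suspicious = unexpected_runs(sample, allowed=[b"example.com", b"REJECT"])
```

Note the limits of this approach: a leaked run that happens to sit
directly next to legitimate text would be merged with it and missed,
and runs shorter than `min_len` are ignored.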
I have not confirmed that the leak is specific to the berkeleydb/bsddb3
packages, in the sense that perhaps it could be an issue with Python
3.12.7 itself. I can try again with other versions of Python to see if
there is any change in behavior. However, I searched for
similar issues affecting Python 3.12 and reviewed the release
notes of 3.12.8 and 3.12.9, and I did not see anything that suggested a
bug in 3.12.7. I imagine that if there were such a dangerous and
noticeable bug in Python itself it would have been detected by now, but
of course there's always the possibility that some edge case is being
encountered.
Regards,
Jacob Henner
On Wed, 2025-05-07 at 00:51 +0200, Jesus Cea wrote:
> On 6/5/25 23:00, Jacob Henner wrote:
> > I believe I've encountered a data corruption bug in the latest
> > bsddb3 and berkeleydb packages. It's possible that it also affects
> > other versions, but I've only tested with the latest.
>
> Please, for your test code use berkeleydb. bsddb3 is legacy and
> unsupported by now.
>
> That said, your code doesn't check anything at all. What am I supposed
> to see there? What data corruption are you seeing?
>
> Your description is a bit confusing and your code doesn't check
> anything at all, but reading between the lines I understand that after
> you write data to a database you inspect the binary file (via "raw"
> tools, not the DB interface) and you find content unrelated to the
> data you stored. If that is the case, it is not "data corruption" but
> "data leakage".
>
> Please confirm that is the case so that you and I are on the same
> page.
>
> Have you tried to reproduce this using the Berkeley DB C interface?
>
> The Berkeley DB C library and my Python bindings do in fact reuse
> buffers, and I find it unsurprising that when writing a partial page
> the portion not overwritten keeps the old content. That stale data is
> present in the file, but not accessible via DB calls. This is no
> different from "deleting" a file and still seeing its content if you
> examine the raw blocks on disk. That is how "undelete" worked in the
> old days.
>
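The buffer-reuse effect described in the quoted paragraph above can be
illustrated with a toy model. This is only a sketch of the general
mechanism, not Berkeley DB's actual page handling: `PAGE_SIZE`,
`write_page`, and the payloads are all hypothetical. A page buffer is
reused without being cleared, so a shorter write leaves the tail of the
earlier content in the page that would go to disk.

```python
PAGE_SIZE = 64  # tiny page for illustration; real pages are much larger

def write_page(buf: bytearray, payload: bytes) -> bytes:
    """Copy payload to the start of a reused buffer and return the page.

    The rest of the buffer is deliberately NOT cleared, to mimic reuse.
    """
    if len(payload) > len(buf):
        raise ValueError("payload larger than page")
    buf[:len(payload)] = payload
    return bytes(buf)

buf = bytearray(PAGE_SIZE)                    # starts zero-filled
first_page = write_page(buf, b"A" * 40)       # an older, longer record
second_page = write_page(buf, b"short")       # a newer, shorter record

# The bytes of the old record that were not overwritten are still there.
stale_tail = second_page[5:40]
```

In this model the stale bytes always come from an earlier write to the
same buffer, which is the harmless case. It says nothing about whether
unrelated application memory can end up in a page.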
> > [This code][1] writes a BerkeleyDB database to a temporary file, and
> > then manipulates that file using bsddb3/berkeleydb. When examining
> > the output, I've noticed that it includes unexpected data. The
> > unexpected data is recognizable as data that the program reads in
> > other sections, but it's data that is not accessible within the
> > scope of the code manipulating the database file. This leads me to
> > believe there might be a buffer/memory management issue within the
> > native extension.
>
> Just buffer reuse. Business as usual.
>
> If your database, for example, contains a million records and you
> delete them, you see no data using the DB interface but you will see
> all that old content in the raw database file. That is the way things
> work [*].
>
> [*] Zero-filling free space would be an option, and maybe it could be
> a configuration option, but if the problem is that an attacker can see
> "stale" data by examining the raw file, she could have done the same
> using the regular DB interface while that data was live and available.
>
> It would be dangerous if the stale data you see were not previously
> present in the database. Leaking application unrelated data would be
> actually quite ugly.
>
> So, my question would be:
>
> Is the "stale" data you are seeing data previously present in the
> database or data leaked from the application, completely unrelated to
> the DB? This is the critical distinction we must determine.
>
> In the first case, that is expected. In the second case, we have a
> problem.
>
> Please be precise with your words: data corruption means that you
> store some data and read something different. That would be bad. Data
> leak means that you are seeing data in the binary file, not using the
> DB interface, that is not supposed to be there. That can be bad (if
> leaking data unrelated to the DB) or harmless (leftovers from
> deletions/page splitting/whatever).
>
> Try this with any other database, for instance, sqlite. You will be
> surprised :-)
>
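The sqlite comparison in the quoted text above can be tried with the
standard library alone. The following is a minimal sketch under stated
assumptions: the table layout, the `MARKER` value, and the
`raw_after_delete` helper are hypothetical, and `secure_delete` is
switched off explicitly because some SQLite builds enable it by default.
It deletes a row through the DB interface and then looks for the
row's bytes in the raw file.

```python
import os
import sqlite3
import tempfile

MARKER = b"STALE-MARKER-0123456789"  # hypothetical payload, easy to spot

def raw_after_delete():
    """Insert a marked row, delete it, and return (visible_rows, raw_bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")
        con = sqlite3.connect(path)
        # Make sure freed space is not zeroed, whatever the build default is.
        con.execute("PRAGMA secure_delete = OFF").fetchall()
        con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v BLOB)")
        con.executemany(
            "INSERT INTO t (id, v) VALUES (?, ?)",
            [(1, b"keep-1"), (2, b"padding-" + MARKER), (3, b"keep-3")],
        )
        con.commit()
        con.execute("DELETE FROM t WHERE id = 2")
        con.commit()
        # Through the DB interface the row is gone.
        visible = con.execute(
            "SELECT COUNT(*) FROM t WHERE id = 2").fetchone()[0]
        con.close()
        # Read the raw file only after the connection is closed.
        with open(path, "rb") as fh:
            raw = fh.read()
    return visible, raw

visible, raw = raw_after_delete()
marker_still_in_file = MARKER in raw
```

As with the Berkeley DB case under discussion, the leftover bytes here
were previously stored in the database, so this only demonstrates the
expected kind of stale data, not a leak of unrelated memory.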
> If you are actually seeing leaked non-DB application memory, that
> would be serious. Is that the case?
>
> If you DON'T see application private memory leaks, but you still think
> this is a problem, feel free to insist on this mailing list.
>
> Thanks for reaching out.
>
> --
> Jesús Cea Avión
> jcea at jcea.es - https://www.jcea.es/
> Twitter: @jcea
> jabber / xmpp:jcea at jabber.org
> "Things are not so easy"
> "My name is Dump, Core Dump"
> "El amor es poner tu felicidad en la felicidad de otro" - Leibniz