The LMDB file format

published 2016-04-03 [ home ]

At Lima, we use the LMDB database extensively, and we looked into its source code in order to debug a few issues we had. To help us diagnose broken databases, and also because I wanted to learn more about it, I wrote a parser for the LMDB database file format in Lua in my spare time, which I published under the Open Source MIT license.

In this post, I will give an overview of how a data.mdb file is structured. Note that I will sometimes skip over some things and operate under the assumption that you are running on a little-endian platform with 4 kB pages, etc.

Portability

The first thing to know is that the data.mdb format is platform-specific, which means that you cannot necessarily open a database created on a machine on another one. In practice, on ARM and x86, the only thing you have to care about is whether the machine is 32 or 64 bits. On a 64 bit machine, you can read the data from a 32 bit database by exporting it to a text format using the mdb_dump utility compiled in 32 bits and reimporting it with a 64 bits mdb_load. With lua-mdb, you can read a 32 bits database directly by passing the bits option to the constructor.

Pages

A data.mdb file is made to be mapped into memory, and as such is organized into a set pages of 4 kB. All pages start with a header which contains the page number and some flags which determine the type of the page. For all page types except overflow pages, the header also includes the bounds of the free space in the page.

Meta pages

The first two pages (page numbers 0 and 1) are called meta pages. Meta pages start with a magic number (0xBEEFC0DE), a version (currently 1), and the address and size of the memory mapping. The address is only valid if the mapping is fixed, i.e. if LMDB is used with the option MDB_FIXEDMAP.

After that, we find two structures describing what LMDB calls databases. The first one is the “free” database, which is used to track free pages, and the second one is the “main” database.

Finally come the number of the last page used in the file and the ID of the last transaction that wrote to the page successfully. Indeed, those meta pages is the mechanism LMDB uses to implement MVCC. There can be a single writer at a time, and it will always write to page ID N % 2, where N is the current write transaction ID. Meanwhile, readers will read page N % 2 + 1, providing readers - writer isolation. This means that, when opening a .mdb file, the latest data can be found in the databases pointed to by the meta page with the highest transaction ID.

Databases

As we saw, a meta page contains two structures describing databases. Databases in LMDB are technically B+ trees. The structures contain some environment flags used at runtime, information about the tree including its depth, the number of pages of each type and the total number of entries, and finally the page number of the root of the tree.

Pages of the tree itself can be one of three types: branch pages, leaf pages and overflow pages. There are two other types (BRANCH2 and SUBP) which are only used with specific DB options (MDB_DUPFIXED and MDB_DUPSORT) and that we will ignore.

Branch pages

Branch pages represent internal nodes of the tree. After the header, they contain an array of “pointers” (technically, 16 bit indices within the page) to “branch nodes” within that page. The number of branch nodes, which is the number of keys in the page, can be deduced from the boundaries of the free space in the header.

Branch nodes are allocated from the bottom of the page up. They start with a page number, then a key prefixed by its size. Branch node pointers are sorted according to the key and the current key comparison function, and the page number indicates the page which should be visited to look up keys between the key of the node (included) and the key of the next node. That page can be a branch or a leaf.

Leaf pages

Like branch pages, leaf pages start with a number of pointers to leaf nodes. Leaves store the actual keys and values contained into the DB. Small values are stored in the leaf node itself and large values are stored in overflow pages.

Leaf nodes are slightly more complex than branch nodes. They contain flags, which determine if the value is contained in the node or in an overflow page, and the key which corresponds to the node. For small values, they contain the value itself, and for larger values they contain the page number of the corresponding overflow page.

Overflow pages

Large values are stored in so-called overflow pages. Those pages are pretty simple: after the header, they contain the raw value (not prefixed by its size). This value can be larger than the size of a page, which explains the name: they overflow on later pages. Note that this means those later pages do not start by the usual header.

Learning more

I have explained the basics of the data.mdb file format. Should you have to debug broken databases, it will help you get started. If you need more, read the lua-mdb source code or the LMDB source code (it is only about 10000 lines, although sometimes not easy to understand given that its author adopts a style that favors performance to readability).