AvocadoDB uses AppendOnly memory-mapped files with frequent fsync. Derived data (indices, etc.) is stored in the main memory only. This article explains why that particular combination leads to high performance and consistent data at the same time―even in case of a system failure.
Classical database systems – a bulk of data and insufficient main memory
Put simply, there are three possible settings regarding databases:
- Setting 1: All data fits into the main memory.
- Setting 2: The complete data pool does not fit into the main memory all at once, but the main memory is large enough to store all the data accessed in an average time span.
- Setting 3: Even the sub-set of data accessed in an average time span is too large for the main memory.
Classical database systems had to cope with setting 3 because main memory was too expensive to store the majority of data.
Basically, classical database systems had to manage the main memory themselves. To manage all data sets that exceeded the capacity of the main memory they needed sufficiently intelligent algorithms which the system software couldn’t provide (i.e., to stream the data through main memory for full table scans).
To the extent to which the majority of data fits into the main memory (setting 2) you can apply simpler algorithms. One possible simple algorithm is to record for each block (memory page on the server) the last time it was used. If you have to load another block from the hard disk (because you need to access it), you delete the least recently used block from the main memory.
The operating system can cope with that kind of simple algorithm. For a couple of reasons it’s better at implementing this algorithm:
- Firstly, the operating system is able to access the hardware (memory-mapping unit) to an application has no access.
- Secondly, the operating system has access to information that is not available to an application (i.e., the operating systems knows which file system is a hard disk and which is a SSD).
- Thirdly, the operating system is able to account for other applications that run in parallel.
- Fourthly, the operating system can be optimized for different architectures (e.g., different processors – even of the same brand – have differently large caches and differently fast access to main memory).
Besides, a system which uses the operating system’s virtual memory admin is optimal even in case you don’t need virtual memory. Even if you restrict yourself to working solely with setting 1, you couldn’t write a more efficient system.
Classical database systems aren’t optimal for setting 1 because the management of memory by the application is always associated with expenses―regardless whether the data fits or not.
Given the price trend of main memory and the resulting amount of memory available in most servers, settings 1 and 2 become more important. Here, the new systems are superior to the old ones.
Why we use memory-mapped files and access them append-only
Now, we’re explaining the reason why we use memory-mapped files and access them append-only.
Actually, virtual memory and memory-mapped files are identical in almost all aspects. The Linux kernel treats both of them almost identically. It’s the same method that decides if a memory page is to be swapped out, and which one, respectively. The only difference is that the virtual memory page is written to the swap space and the memory-mapped file page is written to the file system.
The main difference between memory-mapped files and swap space is the fact that data stored in the swap space is lost as soon as the process is terminated. Memory-mapped files are preserved even if the program (i.e., the DB process) is terminated. Memory-mapped files are ready for re-use as soon as you start the program.
We’re using memory-mapped files to store raw data that will be preserved after the termination of an AvocadoDB process (or a system failure).
However, basically it’s the operating system that decides when to copy data from the main memory to the file system. Because the file system is definitely slower than the main memory, the system tries to minimize writing transactions. That’s not what we have in mind. We want to get the data written to the file system as soon as possible to minimize the possibility of data loss. To that end we use the operating system call fsync.
Using fsync you instruct the operating system to write all of the data stored in a file (even a memory-mapped file) on the disk. Actually there is the additional step to force the disk to copy the data from the disk’s own cache to the persistent storage. More on this topic some other time…
We write append-only
- to conduct writing operations with utmost efficiency,
- to be able – after a system crash and restart – to determine exactly what was written before the crash.
Append-only means to just extend a file with new data instead of overwriting the data itself. This approach permits us to just search for a file’s valid start-section after a failure or reboot.
If you’d overwrite existing data, you’d have to verify each block’s validity.
The fiend is in the details. Fsync does not assure a particular order in which the blocks are written on the hard disk. Your computer could (and will according to Murphy’s Law) crash down exactly during fsync. Assume that we write the blocks B1, B2, … B10 in that chronological order; assume further that these blocks are written by fsync in the following order B1, B3, B2, B6, B4, B10, B5, and furthermore assume that the system crash occurred exactly before blocks B7, B8, and B9 could be written. Our method would enable us to detect that B1 to B6 were written and that this particular start-section is still valid. At the same time this method is able to ignore B10 (because it detects that B7 was never written, deciding to ignore all following blocks).
A system that overwrites data blocks not only would have to verify all blocks, but wouldn’t be able to ignore B10. It wouldn’t have a (simple) contingency to detect that chronologically antecedent data blocks weren’t written―they could anywhere. The problem with that is that information in B10 could refer to information in B8 (directly or indirectly, even just logically). This could lead to an inconsistent status at the time of rebooting where the database uses B10 but can’t use B8.
We’d like to conclude with the reason why we store only raw data: First of all, you can’t store the supporting data (i.e., indices) as AppendOnly without difficulties. Secondly, you have to keep a watchful eye on data-set consistency. Verifying the consistency of data at the time of rebooting is presumably as expensive as generating the indices from scratch.
Basically, that’s the reason why we store raw data as AppendOnly memory-mapped files with frequent fsync. At the same time, we are storing the derived data (indices, etc.) only in the main memory (where the operating system could transfer it to the swap space, of course).