In this post we’ll explain how ArangoDB stores collection data on disk and look at its storage space requirements, compared to other popular NoSQL databases such as CouchDB and MongoDB.
How ArangoDB allocates disk space
ArangoDB stores documents in collections. The collection data is persisted on disk so it does not get lost in case of a server restart.
When a collection gets created (either explicitly or by inserting the first document into it), a separate directory is created for the collection on disk. ArangoDB will also create a so-called “journal” file for the collection that the document data will be written into. The “journal” file will have a file size of 32 MB by default.
The value of 32 MB is configurable at server restart by setting the “--database.maximal-journal-size” option. It will be used for all subsequently created journal files. Existing journal files will not be changed, though. The journal size is also configurable on a per-collection basis by setting the “journalSize” option when creating the collection. The minimum value is 1 MB.
When the journal file is first created, it is prefilled with zero-bytes so it will already take up as much disk space as the journal size value was set to. That means even if a collection contains only one small document, the journal file will already take up the 32 MB on disk. Adding more documents will then fill up blocks in the existing journal file, not taking additional disk space.
Only when the journal file is full will a new journal file be created. The previous journal file then becomes a “datafile”. The distinction between journal files and datafiles is that journal files are actively written to whereas datafiles are immutable. It should be noted that after the initial allocation with zero-bytes, journal files are written in append-only fashion. There are no in-place modifications of document data in journals or datafiles.
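The rotation behaviour described above can be sketched as a small model. This is a simplified illustration, not ArangoDB's actual implementation: it ignores per-document headers and assumes documents can be packed into files without gaps. The function names are made up for this example.

```python
import math

MB = 1024 * 1024


def journal_files_needed(bytes_written: int, journal_size: int = 32 * MB) -> int:
    """Count the fixed-size files (datafiles plus the active journal).

    Simplified model: a collection always has one journal, and a new
    one is only created once the current one is completely full.
    """
    return max(1, math.ceil(bytes_written / journal_size))


def allocated_bytes(bytes_written: int, journal_size: int = 32 * MB) -> int:
    """Disk space claimed by the journal and datafiles, prefilled or not."""
    return journal_files_needed(bytes_written, journal_size) * journal_size


if __name__ == "__main__":
    for written in (1, 10 * MB, 32 * MB + 1, 100 * MB):
        print(written, allocated_bytes(written) // MB, "MB allocated")
```

In this model disk usage stays flat while a journal fills up and then jumps by one full journal size, which is the step-wise growth described in the text.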
As mentioned before, ArangoDB claims disk space for collection data in chunks of a fixed size. Disk usage does not increase with each inserted document, but only when a journal file gets full and is rotated. The journal size is therefore a major factor in disk space consumption, which is why it is a configurable value in ArangoDB.
ArangoDB will also allocate 2 MB of storage space per collection to store document structure and data type information. These so-called “shapes” are an important design aspect of ArangoDB that is explained below. Furthermore, ArangoDB will create an initial compaction file per collection, with the same size as the journal file.
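Adding up the allocations described so far gives a rough estimate of what a freshly created collection takes up on disk. The following is only back-of-the-envelope arithmetic based on the sizes named in this post; the function name is made up, and any additional file system or directory overhead is ignored.

```python
MB = 1024 * 1024
SHAPE_FILE_SIZE = 2 * MB  # per-collection shape storage, as stated in the post


def initial_footprint(journal_size: int = 32 * MB) -> int:
    """Estimated bytes allocated for a new collection."""
    journal = journal_size      # prefilled journal file
    compaction = journal_size   # initial compaction file, same size as the journal
    return journal + compaction + SHAPE_FILE_SIZE


if __name__ == "__main__":
    for size_mb in (32, 4):
        print(f"{size_mb} MB journal: {initial_footprint(size_mb * MB) / MB} MB")
```

With the default journal size this estimate comes to 66 MB, and with a 4 MB journal to 10 MB.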
Some storage design considerations
The preallocation/prefilling that ArangoDB employs might not make sense at first, but it is done for a reason. In some environments, overwriting an existing (prefilled) file is faster than appending new blocks to an existing file. Furthermore, allocating storage in bigger chunks might help reduce file system fragmentation.
Though preallocation/prefilling can improve performance considerably, it comes at a cost: first of all, disk usage is not directly proportional to the number/size of documents inserted. More importantly, the storage overhead may be relatively high for small datasets.
In ArangoDB, the overhead of preallocation/prefilling is configurable by setting the journal size appropriately. The value can be adjusted at collection level. The effect of different journal sizes was measured in this test and can be found below in the columns “ArangoDB, 32 MB journal” and “ArangoDB, 4 MB journal”.
CouchDB does not seem to preallocate space and therefore it does not have much initial overhead for small datasets. Storage space can be saved by compressing the data. Using Snappy compression in CouchDB reduced disk usage by about 20 to 60 %, depending on the dataset used. The compressed data sizes are available in the results below in column “CouchDB, snappy” (with compression) and “CouchDB” (no compression).
In MongoDB, the disk space allocation is done in blocks of increasing size (64 MB, 128 MB, 256 MB, 512 MB, 1 GB, 2 GB). The first block consumes 64 MB already. There is also a startup option --smallfiles that modifies this series to [16 MB … 512 MB] to reduce space overhead. This is turned off by default and was not covered in these tests. MongoDB by default will also preallocate the next block before the current block is filled up. This has the advantage that the next block is likely to be already available when needed, but makes it consume even more disk space. This preallocation can also be turned off by starting with option --noprealloc. The results for this are present in the column “MongoDB, no prealloc”. As in ArangoDB, disk usage in MongoDB is not directly proportional to the number/size of documents inserted.
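The block series described above can be turned into a small estimate of how much space the data blocks claim. This is a sketch under assumptions, not MongoDB's actual allocator: it models preallocation as always being exactly one block ahead, assumes every block after the listed series is again 2 GB, and ignores any other files MongoDB may create. The function names are made up for this example.

```python
BLOCK_SIZES_MB = [64, 128, 256, 512, 1024, 2048]


def block_size_mb(index: int) -> int:
    """Size of the n-th block; assumes the last size repeats after the series."""
    return BLOCK_SIZES_MB[min(index, len(BLOCK_SIZES_MB) - 1)]


def allocated_mb(data_mb: float, prealloc: bool = True) -> int:
    """Estimated MB claimed by data blocks for a given amount of stored data."""
    total = 0
    count = 0
    # The first block is always allocated, even for a tiny dataset.
    while count == 0 or total < data_mb:
        total += block_size_mb(count)
        count += 1
    if prealloc:
        # Default behaviour: the next block is allocated ahead of time.
        total += block_size_mb(count)
    return total


if __name__ == "__main__":
    for data in (1, 100, 500):
        print(data, allocated_mb(data, prealloc=True), allocated_mb(data, prealloc=False))
```

The model shows why preallocation weighs so heavily on small datasets: the preallocated block is always larger than the one currently in use.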
While preallocation space overhead matters for small datasets, its effect becomes less important for bigger datasets. For bigger datasets, it’s especially important how efficient the storage format of a database is and if patterns in the data can be exploited to compress it.
Storing the incoming JSON data as plain text would be too inefficient, so ArangoDB stores data in a binary format, and so do CouchDB and MongoDB.
ArangoDB separates the document structure and the actual document data when saving a document. Document structure information, consisting of attribute names and attribute data types, is stored as so-called “shapes”. The document data stored will only contain a shape-id (a reference to an existing shape), and multiple documents can point to the same shapes. This helps in reducing disk usage when many or even all documents in a collection have the same structure.
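The idea of separating structure from data can be illustrated with a toy store. This is a conceptual sketch only and does not reflect ArangoDB's binary format: it handles flat documents, uses Python type names as a stand-in for data types, and the class and attribute names are made up for this example.

```python
class ShapedStore:
    """Toy document store that keeps each distinct structure only once."""

    def __init__(self):
        self.shapes = {}     # shape (attribute names + types) -> shape id
        self.documents = []  # list of (shape id, attribute values)

    def insert(self, doc: dict) -> int:
        names = sorted(doc)
        # A shape is the ordered list of attribute names and their types.
        shape = tuple((name, type(doc[name]).__name__) for name in names)
        # Reuse the id of an existing shape, or register a new one.
        shape_id = self.shapes.setdefault(shape, len(self.shapes))
        # Only the values are stored per document; names live in the shape.
        self.documents.append((shape_id, [doc[name] for name in names]))
        return shape_id


if __name__ == "__main__":
    store = ShapedStore()
    store.insert({"name": "Alice", "zip": 12345})
    store.insert({"zip": 54321, "name": "Bob"})
    store.insert({"name": "Carol"})
    print(len(store.documents), "documents,", len(store.shapes), "shapes")
```

The first two documents share one shape, so their attribute names are stored once; the longer the attribute names and the more uniform the documents, the more this saves.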
CouchDB supports compression out of the box, but this comes at some performance cost (CPU cycles will be spent on compressing and uncompressing data), so it is turned off by default. Apart from that, it seems that CouchDB stores all attribute names individually for each document inserted, even if all documents of a collection/database share identical attribute names.
MongoDB uses an optimised binary data representation (BSON) as the internal storage format, and also seems to store repeated document structure information redundantly.
Actual storage space requirements
We have measured the actual disk usage in ArangoDB for some real-world and artificial datasets. For ArangoDB, we have used journal sizes of 32 MB (the default value) and 4 MB to illustrate the difference. Furthermore, we have imported the same datasets into other document databases, CouchDB and MongoDB, to see how much disk space they require. We’ve used CouchDB 1.2 without file compression and with Snappy compression. We’ve tested MongoDB 2.1.3 with and without preallocation.
The following datasets have been tested:
|Dataset name||Description||Number of documents||Average document size (bytes)|
|names1000||person records, containing names and addresses, artificially created with source data from US census bureau, ZIP code and state lists||1,000||331|
|enron email corpus||e-mail data, published by Federal Energy Regulatory Commission||41,299||3,895|
|access logs||Apache web server access logs||1,357,246||258|
|aol search queries||search engine queries, published by AOL||3,459,421||111|
All datasets were JSONified and imported into the aforementioned databases using arangoimp (ArangoDB), the _bulk_docs API (CouchDB), and mongoimport (MongoDB). For each dataset, the total actual disk allocation in bytes as reported by the filesystem was measured.
Storage space requirements on an ext4 filesystem were as follows. All values are reported in MB (1,048,576 bytes).
|Dataset name||ArangoDB, 32MB journal||ArangoDB, 4MB journal||CouchDB||CouchDB, snappy||MongoDB||MongoDB, no prealloc|
|names1000||* 66.36||* 10.36||0.66||0.47||208.01||80.01|
|enron email corpus||203.75||171.38||163.04||95.22||464.01||208.01|
|wiki50||* 66.36||* 10.36||1.13||0.63||208.01||80.01|
|aol search queries||578.36||542.36||2067.70||843.86||2000.01||976.01|
* If you wonder why the actual values differ from the expected 32 MB and 4 MB: ArangoDB creates a journal file and a compaction file that together take up twice the specified journal size, plus a shape file of 2 MB.
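The savings from Snappy compression in CouchDB can be recomputed from the two CouchDB columns in the table above. The snippet below only does that arithmetic on the published values; the variable and function names are made up for this example.

```python
# (uncompressed MB, Snappy-compressed MB), copied from the results table
COUCHDB_MB = {
    "names1000": (0.66, 0.47),
    "enron email corpus": (163.04, 95.22),
    "wiki50": (1.13, 0.63),
    "aol search queries": (2067.70, 843.86),
}


def savings_percent(plain_mb: float, compressed_mb: float) -> float:
    """Disk space saved by compression, as a percentage of the plain size."""
    return (1 - compressed_mb / plain_mb) * 100


if __name__ == "__main__":
    for name, (plain, compressed) in COUCHDB_MB.items():
        print(f"{name}: {savings_percent(plain, compressed):.1f} % saved")
```

All four datasets fall within the range of roughly 20 to 60 % mentioned earlier, with the largest saving on the largest dataset.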
If you intend to store lots of documents with identical structures, ArangoDB might be for you. You might well save some storage space. This is especially true if you plan on using long attribute names. ArangoDB stores attribute names in the shape data, so attribute names are not stored repeatedly for documents that use the same shapes.
Setting the journal size to a lower value than the default 32 MB might be sensible if you plan to have a lot of small (in terms of number of documents) collections in ArangoDB. If you create such collections often, the smaller journal size can save considerable disk space.
This should not be done for collections that you plan to insert documents into frequently. Reducing journal size for such collections might still reduce disk space usage but may also decrease write performance because more files need to be created and synced. So this is a trade-off.