I have varying levels of familiarity with Google’s original leveldb and three of its derivatives. RocksDB is one of the three. In each of the four leveldb offerings, the code is optimized for a given environment. Google’s leveldb is optimized for a cell phone, which has much more limited resources than a server. RocksDB is optimized for flash arrays on a large servers (per various Rocksdb wiki pages). Note that a flash array is a device of much higher throughput than a SATA or SSD drive or array. It is a device that sits on the processor’s bus. RocksDB’s performance benchmark page details a server with 24 logical CPU cores, 144GB ram, and two FusionIO flash PCI devices. Each FusionIO device cost about $10,000 at the time of the post. So RocksDB is naturally tuned for extremely fast and expensive systems. Here is an example Arangodb import on a machine similar to the RocksDB performance tester:

The above graph shows an import of a sample database into a single ArangoDB server that is using the RocksDB backend. The server had a single FusionIO card, 128G Ram, and 32 CPU cores. The import took a little over 33 minutes. The median import rate was 17,620 records per 10 second interval. The worst case interval was 4,055 records.

Many ArangoDB customers have more modest hardware budgets. In fact, the AWS r3.xlarge instance with 4 logical CPU cores, SSD drives, and 32GB ram is a common selection. This hardware is very different from RocksDB’s benchmark as well as Google’s original cell phone environment. Here is the same Arangodb import on a machine similar to the AWS r3.xlarge instance:

The white dots are a repeat of the Facebook like machine with a FusionIO card. The black dots are a single server with 32G ram, SATA array, and 4 cpu cores. It is no surprise that the import took much longer, 2 hours and 36 minutes. The median import rate was 3,921 per 10 second interval. But there 47 intervals where zero records imported.

This is not good. The rate of writes per 10 second interval varies from 0 to 6,800. Any given write can take from 5 milliseconds to 108 seconds (overall median is 140 milliseconds). The 108 seconds becomes extremely problematic for client applications and cluster health tests. One ArangoDB customer has client code that times out after 10 seconds, then retries up to five times. If the server does not respond within an aggregate 60 seconds, the application aborts. Complex queries could cause the entire cluster to block if one server stops accepting writes. The ArangoDB cluster monitors all its component servers. If the RocksDB compactions overwhelm the CPU and a server starts failing to give heartbeats for 5 seconds, it gets marked as down and demoted. The demotion is intended to handle rare hardware failure events, not frequent software overloads. Again, this is not good.

The root of the problem is deep within Google’s original leveldb design and continues with Facebook’s RockDB. RocksDB lets the user write to the database in an unconstrained fashion. The writes accumulate within a memory buffer. Once the memory buffer is full, a new memory buffer is swapped into active duty while the old one is written to disk. If the new memory buffer completely fills before the old is completely written, incoming user writes stop until the old’s write to disk finishes. This is just one of two scenarios where RocksDB stops all user writes. The memory buffers write to disk and become “level 0 table files”. Over time RocksDB combines the “level 0 table files” via compaction to form “level 1” and higher table files. RocksDB will again stop all user writes if the number of “level 0” table files grows too large awaiting compaction to “level 1”. There is a need to slow user writes gradually in hopes of avoiding sudden stops of user writes.

This is the same data for the 4 cpu server has before with the Fusion IO data removed. The Y scale is changed to emphasize the wide variation in writes per 10 second interval.

RocksDB contains a fixed algorithm to slow user writes. The algorithm’s goal is to reduce the chance of completely stopping writes. The problem is that the RockDB algorithm assumes a flash array environment. There are some user accessible tuning variables, but the resulting tuning is still very jagged and highly ineffective with ArangoDB users’ likely hardware (as the previous diagram demonstrates).

In the upcoming 3.3 release, ArangoDB is adding two special algorithms to address write stops within hardware environments used by our customers. This is the same example Arangodb import utilizing the new algorithms:

The gold dots represent the writes per 10 second interval with ArangoDB’s added algorithms. The same server hardware is used to produce the gold dots as the black, only change is the additional algorithms. The median import rate 2,460 writes per 10 second interval. The worst case 720 writes. But there are no zero intervals and write quickly settles into a steady rate after 15 minutes.

The first of the two algorithms adds dynamic pacing of the user writes. “Dynamic” means the pacing algorithm automatically tunes to the user’s hardware environment. The pacing algorithm monitors the actual write rate RocksDB is able to achieve when flushing write buffers and performing level compactions. The monitored rates, statistically smoothed over a one hour period, become the pacing limit for incoming user write operations. Stated simply, the pacing algorithm does not let the user write faster than RocksDB can write to the disk.

The second algorithm helps computers with fewer CPU cores prioritize RocksDB’s flush operations, level 0 compactions, and all other level compactions. The prioritization focuses the CPU time on the compactions operations most needed to avoid the scenarios where RocksDB stops all user writes. The prioritization also ensures that other essential activities, such as ArangoDB cluster communications (heartbeat messages) and user read/query requests, get adequate CPU time. The pacing algorithm implicitly receives the impact of this prioritization algorithm by receiving the changes in compaction write rates.

These two algorithms address the majority, but not all, of the environmental differences between RocksDB’s performance hardware expectation and a typical ArangoDB user’s hardware. There is additional work on the horizon to continue improving RocksDB for ArangoDB customers. The expectation is that ArangoDB can increase RocksDB’s average write rate without returning to the widely unpredictable intervals of native RocksDB.

The results presented in this blog post are preliminary and based on an optimized devel version of ArangoDB not yet suitable for production use.

Join Matthew on Monday, December 4th at Facebook HQ (2PM PT) to learn more about RocksDB with ArangoDB.

Join the Meetup here