ArangoDB is multithreaded and able to use several CPU-cores at once. Because of that access to common data structures to these threads have to be protected from concurrent access. ArangoDB currently uses mutexes, spinlocks and RW-locks for that. With the ongoing development of the MVCC the number of situations where protected access is needed grows significantly. If locking is done too often the scalability is effectively limited to one core. So this test was done to estimate the costs, and evaluate other solutions – so called lockless programming with atomics.

Want to learn more about ArangoDB? Get in touch with the team for tech insides, Q&A and more.

A simple program was written to compare the different techniques. Since ArangoDB should have significantly more reader jobs and threads than writers the test pattern chosen was 49 reader threads and one writer. The test program uses an uint64_t variable, with all threads accessing it. Most use cases will increment the number (“write”), some will only put the loop counter into it (thus slightly different output of the counters, referenced as “set”). All test cases were configured to run one million loops. A concurrent start of all workers is desired. Threads spawned are held back by a C++ 11 construct. It’s the condition_variable so they can be started to do their load testing work at (roughly) the same point in time. The time measured is started from releasing the condition, and all worker threads rejoined the master.

Following are the platforms we used for the test:

  • Windows (x64 Windows 8.1 with AMD Phenom II X2)
  • Windows (x64 GRML Small with AMD Phenom II X2)
  • MAC OS X (x64 Intel(R) Core(TM) i3 CPU 540 @3GHz)
  • Linux ARM (Allwinner A20 SOC on Cubie Truck)
  • Linux AMD64 (Intel Core2 Duo CPU E8500 @3.1GHz)

Since the idea was to get an overview of the different behaviors on the platforms, varying hardwares were chosen without virtualization. The windows computer was booted with GRML x64 small iso to revalidate whether the significantly different test results were hardware related.

Methods tested

Here is a list of the implemented algorithms with explanations:

  • Read: reads the current value and increases control counter
  • Write: reads the old value, increases it and writes it back
  • Set: writes a loop counter into the protected variable

0: Unlocked

To get a reference to the costs of the locking, the probably race conditions prone flat increment & read was executed.

1: Mutexes

The classical way would be to protect the access to the variable by a mutex usind std::mutex. This is known to have much overhead.

2: Writelock / Writelock

These “should” be cheaper than mutexes. We wanted to see whats happening if you don’t differentiate between read / write. We do pthread wrlock / wrlock for the *nix-es, SRWLock Exclusive / Exclusive on windows

3: Writelock / Readlock

These “should” be cheaper than mutexes. We do pthread wrlock / rdlock for the *nix-es, SRWLock Exclusive / Shared on windows

4: Atomic Read & Write

The new kid in town since C++ 11: Also called lockless. Atomic is here to ensure no races are to be expected while accessing a variable. The possible syntax can result in a very compact code, at which you may not always be aware the ‘a++’ you’re reading is actually involves the overhead of atomic operations. The implementation is strongly dependent on the CPU architecture and its support.

5: Atomic Read “consume”, Atomic Set “release” for setting

This way the expenses of the atomic operation comes quicker to the code readers eye; its using the setter & getter methods. The test case doesn’t increment the int, it simply sets the counter as new value – thus the resulting numbers look different. Atomic Load is used to retrieve the value in the readers; It offers several heuristics; this test chooses the std::memory_order_consume for reading, std::memory_order_release for setting..

6: Atomic Read “acquire”, Atomic Set “release” for setting

This way the expenses of the atomic operation comes quicker to the code readers eye; its using the setter & getter methods. The test case doesn’t increment the int, it simply sets the counter as new value – thus the resulting numbers look different. Atomic Load is used to retrieve the value in the readers; It offers several heuristics; this test chooses the std::memory_order_acquire, std::memory_order_release for setting.

7: Atomic Read “consume”, Atomic Set “cst – consistent” for setting

This way the expenses of the atomic operation comes quicker to the code readers eye; its using the setter & getter methods. The test case doesn’t increment the int, it simply sets the counter as new value – thus the resulting numbers look different. Atomic Load is used to retrieve the value in the readers; It offers several heuristics; this test chooses the std::memory_order_consume, std::memory_order_seq_cst for setting.

8: Atomic Read “acquire”, Atomic Set “cst – consistent” for setting

This way the expenses of the atomic operation comes quicker to the code readers eye; its using the setter & getter methods. The test case doesn’t increment the int, it simply sets the counter as new value – thus the resulting numbers look different. Atomic Load is used to retrieve the value in the readers; It offers several heuristics; this test chooses the std::memory_order_acquire, std::memory_order_seq_cst for setting.

9: Atomic Read “consume”, Atomic Set “relaxed” for setting

This way the expenses of the atomic operation comes quicker to the code readers eye; its using the setter & getter methods. The test case doesn’t increment the int, it simply sets the counter as new value – thus the resulting numbers look different. Atomic Load is used to retrieve the value in the readers; It offers several heuristics; this test chooses the std::memory_order_consume, std::memory_order_relaxed for setting.

10: Atomic Read “acquire”, Atomic Set “relaxed” for setting

This way the expenses of the atomic operation comes quicker to the code readers eye; its using the setter & getter methods. The test case doesn’t increment the int, it simply sets the counter as new value – thus the resulting numbers look different. Atomic Load is used to retrieve the value in the readers; It offers several heuristics; this test chooses the std::memory_order_acquire, *std::memory_order_relaxed for setting.

11: Atomic Read “Acquire”, Atomic exchange weak for writing

This way the expenses of the atomic operation comes quicker to the readers eye; its using the setter & getter methods. This test case implements incrementing using while loops, as suggested by the documentation. Atomic Load is used to retrieve the value in the readers; It offers several heuristics; this test chooses the std::memory_order_acquire, std::memory_order_relaxed for setting.

12: Atomic Read “consume”, Atomic exchange weak for writing

This way the expenses of the atomic operation comes quicker to the readers eye; its using the setter & getter methods. This test case implements incrementing using while loops, as suggested by the documentation. Atomic Load is used to retrieve the value in the readers; It offers several heuristics; this test chooses the std::memory_order_consume, std::memory_order_relaxed for setting.

Test Results

Heres a table of the time in s the tests took to execute:

Testcase / OS Windows 8.1 GRML Linux X64 Mac OS X Linux ARM Linux X64
#0 0.033 0.0300119 0.02503 0.18299 0.02895
#1 479.878 5.7003600 118.47100 21.29970 4.47721
#2 1.45997 4.5296900 142.61700 23.29240 4.72051
#3 4.70791 6.3525200 7.65026 23.82040 7.87677
#4 0.056999 0.0454769 0.03771 0.94990 0.035302
#5 0.033999 0.0263720 0.02774 0.58076 0.017803
#6 0.032999 0.0286388 0.02785 0.58125 0.017604
#7 0.060998 0.0528450 0.03783 0.57926 0.033422
#8 0.060999 0.0536420 0.03807 0.57851 0.033546
#9 0.032999 0.0299869 0.02606 0.55443 0.017258
#10 0.033999 0.0235679 0.02593 0.54985 0.017839
#11 0.058999 0.0534279 0.06688 0.56929 0.030724
#12 0.059998 0.0676858 0.07563 0.57466 0.036788

We can see here that Windows delivers an extraordinary bad performance with std::Mutex; Dave explains why: It seems as if in windows std::mutex is similar to std::recursive_mutex using the Critical_Section system call which has a significant larger overhead and is to be avoided. SRW Locks are sufficient for ArangoDBs use of mutexes, and are therefore preferred. However, on Mac OS X, RW-Locks don’t deliver better performance, neither on Linux/ARM.

Performance wise one can clearly state that you should use linux if you want the biggest bang for the buck.

Implementation wise the conclusion is that one can’t use C++11’s std::Mutex as porting layer – depending on the requirements of the systems one should create ones own wrappers around locking mechanisms.

Atomic types deliver very good performance and if possible should be preferred to Mutexes / Locks. C++11 is delivering sufficiently persistent performance for these over all platforms in our test so one can lean on it. The performance of using all syntactic sugar (i.e. using the post increment operator on an atomic uint instance) of C++11 is equal to the writer/getter method. To visualize the overhead in the source code the instances should be prefixed like atomic_counter.

An interesting side note is, that one can see the overhead of Atomics on ARM is a little bit higher than on X86 (which was expected because of the weaker memory model).

Test Program

Most important – showing the means of measuring – heres the test program: (To compile in windows, create a default console project, paste in everything enable the stdafx.)

The code is available as gist.

It was run for the tests like this: