
Lock-free protection of data structures that are frequently read

Motivation

In multi-threaded applications running on multi-core systems, it often happens that certain data structures are frequently read but relatively seldom changed. An example would be a database server that keeps a list of databases: the list changes rarely, but must be consulted for every single query hitting the server. In such situations one needs to guarantee fast read access as well as protection against inconsistencies, use after free, and memory leaks.

Therefore we seek a lock-free protection mechanism that scales to many threads on modern machines and uses only C++11 standard library methods. The mechanism should be easy to use, easy to understand, and easy to prove correct. This article presents such a solution; it is probably not new, but we have not found it described anywhere else.

The concrete challenge at hand

Assume a global data structure on the heap and a single atomic pointer P to it. If (fast) readers access it completely unprotected, then a (slow) writer can create a completely new data structure and switch the pointer to the new structure with a single atomic operation. Since writing is not time critical, one can easily use a mutex to ensure that there is only a single writer at any given time. The only problem is to decide when it is safe to destroy the old value, because the writer cannot easily know that no reader is still accessing it. The challenge is aggravated by the fact that without thread synchronization it is unclear when a reader actually sees the new pointer value, in particular on a multi-core machine with a complex system of caches.
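To make the setting concrete, here is a minimal sketch of that situation (names like `Data`, `readValue` and `update` are ours for illustration, not part of any library):

```cpp
#include <atomic>

struct Data { int value; };          // stands in for the protected structure

std::atomic<Data*> P{new Data{1}};   // the single global atomic pointer

// A reader simply dereferences the current pointer:
int readValue() {
  Data* p = P.load();                // memory_order_seq_cst by default
  return p->value;                   // unsafe if the writer deletes p now!
}

// A writer swaps in a freshly built copy:
void update(int v) {
  Data* oldP = P.exchange(new Data{v});
  // When is it safe to `delete oldP`?  That is exactly the open question.
  (void) oldP;                       // leaked here on purpose
}
```

The rest of the article is about answering the question in the last comment.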

If you want to see our solution directly, scroll down to “Source code links”. We first present a classical good approach and then try to improve on it.

Hazard pointers and their hazards

The “classical” lock-free solution to this problem is hazard pointers (see this paper and this article on Dr. Dobb's). The basic idea is that each reading thread first registers the location of its own “hazard” pointer in some list; whenever it wants to access the data structure, it sets its own hazard pointer to the value of P it uses to access the data, and restores it to nullptr when it is done with the read access.

A writer can then replace the old value of P with a pointer to a completely new value and afterwards scan all registered hazard pointers to see whether any thread still accesses the old value. If all store operations to the hazard pointers and to P use memory_order_seq_cst (see this page for an explanation), the following is guaranteed: if a reader thread sees the old version of P, then it observed the change of its own hazard pointer earlier; because of the guaranteed sequential order of all stores with memory_order_seq_cst, the writer thread therefore also observes the hazard pointer value before its own change of P.
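The protocol can be sketched as follows (a single hazard-pointer slot for brevity; the function names are ours, and real implementations keep one slot per registered thread):

```cpp
#include <atomic>

struct Data { int value; };

std::atomic<Data*> P{new Data{1}};
std::atomic<Data*> hazard{nullptr};   // this reader's hazard pointer

int readValue() {
  Data* p;
  do {
    p = P.load();                     // all loads/stores use seq_cst here
    hazard.store(p);                  // announce: "I am using p"
  } while (P.load() != p);            // re-check: did P change meanwhile?
  int v = p->value;                   // safe: the writer sees our hazard pointer
  hazard.store(nullptr);              // done with the read access
  return v;
}

void update(int v) {                  // the single writer (e.g. under a mutex)
  Data* oldP = P.exchange(new Data{v});
  while (hazard.load() == oldP) {
    // spin until no reader uses oldP; real implementations retire oldP
    // to a list and scan later instead of busy-waiting
  }
  delete oldP;
}
```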

This is a very powerful and neat argument, and it only uses the guaranteed memory model of the C++11 standard in connection with atomic operations in the STL. It has very good performance characteristics, because the readers just have to ensure memory_order_seq_cst by means of a memory fence or equivalent instructions, and since one can assume that the individual hazard pointers reside in different cache lines, there is no unnecessary cache invalidation.

However, this approach is not without its own hazards (pun intended). The practical problems in my opinion lie in the management of the hazard pointer allocations and deallocations and in the awkward registration procedure. A complex multi-threaded application can have various different types of threads, some dynamically created and joined. At the same time it can have multiple data structures that need the sort of protection discussed here. The position of the actual hazard pointer structure is thread-local information, and one needs a different one for each instance of a data structure that needs protection.

What makes matters worse is that at the time of thread creation the main thread function often does not have access to the protected data at all, due to data encapsulation and object-oriented design. One also does not want to do the allocation of hazard pointer structures lazily, since this hurts the fast path for read access.

If one were to design a “DataGuardian” class that does all the management of hazard pointers itself, it would have to store the locations of the hazard pointers in thread-local data; but then that data would have to be static, and it would thus not be possible to use different hazard pointers for different instances of the DataGuardian. We have actually tried this and failed to deliver a simple and convenient implementation. This frustration led us to our solution, which we describe next.

Lock-free reference counting

The fundamental idea is to use a special kind of reference counting in which a reading thread atomically increments a reference counter before it reads P and the corresponding data, and decrements the counter after it is done with the reading. However, the crucial difference to this standard approach is that every thread uses a different counter, all residing in pairwise different cache lines! This is important because it means that the atomic read-modify-write operations are relatively quick, since no contention with corresponding cache invalidations occurs.

Before we do any more talking, here is the code for the simple version of the DataProtector class, first DataProtector.h:

#include <atomic>
#include <unistd.h>

template<int Nr>
class DataProtector {
    struct alignas(64) Entry {
      std::atomic<int> _count;
    };

    Entry* _list;

    static std::atomic<int> _last;
    static thread_local int _mySlot;

  public:

    DataProtector () {
      _list = new Entry[Nr];
      // Just to be sure:
      for (int i = 0; i < Nr; i++) {
        _list[i]._count = 0;
      }
    }

    ~DataProtector () {
      delete[] _list;
    }

    void use () {
      int id = getMyId();
      _list[id]._count++;   // this is implicitly using memory_order_seq_cst
    }

    void unUse () {
      int id = getMyId();
      _list[id]._count--;   // this is implicitly using memory_order_seq_cst
    }

    void scan () {
      for (int i = 0; i < Nr; i++) {
        while (_list[i]._count > 0) {
          usleep(250);
        }
      }
    }

  private:

    int getMyId () {
      int id = _mySlot;
      if (id >= 0) {
        return id;
      }
      id = _last;   // on CAS failure, id is updated to the current value
      while (true) {
        int newId = id + 1;
        if (newId >= Nr) {
          newId = 0;
        }
        if (_last.compare_exchange_strong(id, newId)) {
          _mySlot = newId;
          return newId;
        }
      }
    }

};

And a minuscule part in DataProtector.cpp for the definition of two static variables, one of which is thread-local:

#include "DataProtector.h"
template<int Nr> thread_local int DataProtector<Nr>::_mySlot = -1;
template<int Nr> std::atomic<int> DataProtector<Nr>::_last(0);
template class DataProtector<64>;

In a multi-threaded application one would declare the following, either globally or in some object instance:

std::atomic<Data*> P;
DataProtector<64> Prot;

A reader uses this as follows:

Prot.use();
Data* Pcopy = P;   // uses memory_order_seq_cst by default
// continue accessing data via Pcopy
Prot.unUse();

A writer simply does (protected by some mutex):

Data* newP = new Data();
Data* oldP = P;
P = newP;
Prot.scan();
delete oldP;

The code mostly speaks for itself, because this is actually a very simple approach: we administer multiple slots with reference counters, making sure that each resides in a different cache line by using alignment. Each thread chooses a slot once and for all (we store the number in static thread-local storage), valid for all instances of the DataProtector class. This leads to a very fast path for reading data.

The writer, of which there is only one at a time (enforced by a mutex), first builds a completely new copy of the protected data structure and then switches the atomic pointer P to the new value. From this point on all readers only see the new version. To ensure that no more readers access the old version, the writer simply scans all reference counters in the DataProtector class and waits until it has seen a 0 in each of them. It is not necessary to see zeros in all of them at the same time; it is enough to have seen a zero in each slot once. After that it is safe to destroy the old value of the protected data structure.

The proof that this works is just as simple as in the hazard pointer case above: The changes to the reference counters as well as the change to the global pointer P by the writer are all done with memory_order_seq_cst. That is, the C++ memory model guarantees that all these changes are observed by all threads in the same sequential order. A reader that observes the old value of P (and then subsequently reads the old version of the data structure) has incremented its reference counter before reading P. Therefore it observes the change to the counter before it observes the change to P by the writer. Thus the writer must also observe the change to the counter as happening before the change to P. Therefore it will always see some positive counter as long as a reader is still accessing the old value of P and the corresponding data structure.

We assumed that it is not a problem that the writer is somewhat slow, because writes are infrequent. Therefore, locking a mutex, reading all reference counters (whose number is of the order of magnitude of the number of reader threads), and waiting for each of them to drop to 0 once is not a performance problem.

The benefits of this approach are as follows: All management happens encapsulated in the DataProtector class, which is extremely simple to use. We have discussed the performance characteristics above and show a benchmark and comparison with other methods below.

There is a single convenience improvement, which we describe in the following section.

Convenience with scope guards

To make the usage for the readers even more convenient and to reduce possibilities for bugs, we create a facility to use scope guards. We do this in the form of a small UnUser class, which is nested in the DataProtector class. Modern C++11 features like type inference (auto) further help. After this modification, the reader uses the DataProtector as follows:

{
  auto unuser(Prot.use());
  Data* Pcopy = P;   // uses memory_order_seq_cst by default
  // continue accessing data via Pcopy
}

The unuser instance will have type DataProtector::UnUser, and the use method of the DataProtector returns the right instance, such that the destructor of the UnUser class automagically calls the unUse method of the DataProtector class when the object goes out of scope. The unUse method can then in turn be private. Without further ado, here is the code of the UnUser class:

// A class to automatically unuse the DataProtector:
class UnUser {
    DataProtector* _prot;
    int _id;

  public:
    UnUser (DataProtector* p, int i) : _prot(p), _id(i) {
    }

    ~UnUser () {
      if (_prot != nullptr) {
        _prot->unUse(_id);
      }
    }

    // A move constructor
    UnUser (UnUser&& that) : _prot(that._prot), _id(that._id) {
      // Note that return value optimization will usually avoid
      // this move constructor completely. However, it has to be
      // present for the program to compile.
      that._prot = nullptr;
    }

    // Explicitly delete the others:
    UnUser (UnUser const& that) = delete;
    UnUser& operator= (UnUser const& that) = delete;
    UnUser& operator= (UnUser&& that) = delete;
    UnUser () = delete;
};

There is nothing special about it; note that the rule of five is observed and that the implemented move constructor allows return value optimization to kick in, such that the value returned by the use method of the DataProtector is directly constructed in the stack frame of the reader:

UnUser use () {
  int id = getMyId();
  _list[id]._count++;   // this is implicitly using memory_order_seq_cst
  return UnUser(this, id);  // return value optimization!
}

As already mentioned, the unUse method is now private (and takes the slot id as an argument); other than that, the code of the DataProtector is unchanged.
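For reference, here is the whole machinery condensed into one self-contained sketch (slightly reorganized for brevity: the UnUser destructor decrements the slot counter directly instead of calling the private unUse method, and `Data`, `readValue` and `update` are again placeholders of ours):

```cpp
#include <atomic>
#include <thread>

template<int Nr>
class DataProtector {
  struct alignas(64) Entry { std::atomic<int> _count{0}; };
  Entry _list[Nr];
  static std::atomic<int> _last;
  static thread_local int _mySlot;

  int getMyId () {
    int id = _mySlot;
    if (id >= 0) { return id; }
    id = _last;
    while (true) {
      int newId = (id + 1) % Nr;
      if (_last.compare_exchange_strong(id, newId)) {
        _mySlot = newId;
        return newId;
      }
    }
  }

 public:
  class UnUser {
    DataProtector* _prot;
    int _id;
   public:
    UnUser (DataProtector* p, int i) : _prot(p), _id(i) {}
    UnUser (UnUser&& that) : _prot(that._prot), _id(that._id) {
      that._prot = nullptr;
    }
    ~UnUser () {
      if (_prot != nullptr) { _prot->_list[_id]._count--; }
    }
    UnUser (UnUser const&) = delete;
    UnUser& operator= (UnUser const&) = delete;
  };

  UnUser use () {
    int id = getMyId();
    _list[id]._count++;      // seq_cst, announces the read access
    return UnUser(this, id);
  }

  void scan () {
    for (int i = 0; i < Nr; i++) {
      while (_list[i]._count > 0) { std::this_thread::yield(); }
    }
  }
};

template<int Nr> std::atomic<int> DataProtector<Nr>::_last{0};
template<int Nr> thread_local int DataProtector<Nr>::_mySlot = -1;

struct Data { int value; };

std::atomic<Data*> P{new Data{1}};
DataProtector<64> Prot;

int readValue () {
  auto unuser(Prot.use());   // counter of our slot incremented
  Data* p = P.load();
  return p->value;           // unuser's destructor decrements on scope exit
}

void update (int v) {        // single writer, under a mutex in real code
  Data* oldP = P.exchange(new Data{v});
  Prot.scan();               // wait until every slot has shown a zero
  delete oldP;               // now no reader can still hold oldP
}
```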

Source code links

All code is available online in this github repository:

https://github.com/neunhoef/DataProtector

There, we also publish the test code, which is used in the following section to measure performance.

Additionally, this is actually being used in published software in the source code of ArangoDB, see here and here for details.

Performance comparison with other methods

To assess the performance of our DataProtector class, we have done a comparison with the following methods:

  1. DataGuardian with hazard pointers: This is our own implementation of hazard pointers.
  2. Unprotected access: This is just unprotected access, which is of course not an option at all, but interesting nevertheless. One sees that the readers essentially just consult their caches, which are updated eventually. There is no guarantee against use-after-free at all.
  3. A mutex implementation: This is a very simple-minded implementation in which all readers and the writer share a global mutex.
  4. A spin-lock implementation using boost atomics: Again a very simple-minded implementation of spin-locks.
  5. DataProtector: This is our new class described in this article.

The test program simply starts a number of reader threads which constantly read a dummy data structure, thereby detecting use-after-delete (which would manifest as seeing a nullptr). We count reads per second, both in total and per thread.
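Conceptually, each reader thread in the benchmark runs a loop like the following (a hypothetical sketch of ours, not the actual test code from the repository; `stopFlag` and the magic value 42 are invented for illustration):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<bool> stopFlag{false};  // set by main after the measurement interval

// Spin on reads and count them; the use()/unUse() calls of the method
// under test would wrap the P.load() below.
uint64_t readerLoop(std::atomic<int*>& P) {
  uint64_t count = 0;
  while (!stopFlag.load(std::memory_order_relaxed)) {
    int* p = P.load();
    if (p == nullptr || *p != 42) {  // a deleted value would show up here
      break;                         // use-after-delete detected
    }
    ++count;
  }
  return count;                      // divided by elapsed seconds -> reads/s
}
```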

Here are the results on an n1-standard-16 instance on Google Compute Engine (GCE) for various numbers of threads. The code has been compiled with g++ -std=c++11 -O3 -Wall. Results are in million reads per second (M/s), and million reads per second and thread (M/s/thread):

|Nr |DataGuardian|unprotected|Mutex      |Spinlock   |DataProtector
|Thr|  M/s  M/s/T|  M/s M/s/T|  M/s M/s/T|  M/s M/s/T|  M/s M/s/T 
-------------------------------------------------------------------
| 1 | 34.0  34.06| 2259  2259|60.42 60.42|102.1 102.1| 84.3 84.35
| 2 | 64.6  32.31| 4339  2169| 9.33  4.66| 97.5  48.7|160.1 80.05
| 3 | 92.6  30.86| 6162  2054| 7.89  2.63| 93.5  31.1|234.3 78.12
| 4 |120.5  30.13| 8021  2005|13.96  3.49| 89.9  22.4|300.4 75.12
| 5 |146.3  29.26| 9814  1962|11.43  2.28| 88.3  17.6|367.4 73.49
| 6 |168.6  28.10|11302  1883| 9.23  1.53| 83.9  13.9|431.2 71.87
| 7 |208.3  29.76|12790  1827| 8.43  1.20| 81.0  11.5|498.5 71.21
| 8 |220.1  27.51|14075  1759| 8.39  1.04| 81.2  10.1|554.7 69.34
| 9 |258.1  28.68|14040  1560| 8.24  0.91| 81.0   9.0|542.1 60.23
|10 |282.4  28.24|14078  1407| 8.36  0.83| 81.9   8.1|540.9 54.09
|11 |301.0  27.37|14029  1275| 8.10  0.73| 81.0   7.3|524.4 47.67
|12 |321.2  26.76|14060  1171| 7.99  0.66| 82.2   6.8|518.4 43.20
|13 |345.0  26.54|14031  1079| 7.79  0.59| 82.3   6.3|510.9 39.30
|14 |366.7  26.19|14086  1006| 7.63  0.54| 82.8   5.9|502.5 35.89
|15 |385.9  25.73|14092   939| 7.54  0.50| 83.4   5.5|498.5 33.23
|16 |408.6  25.54|14276   892| 7.53  0.47| 82.9   5.1|491.9 30.74
|20 |408.0  20.40|14317   715| 7.98  0.39| 84.5   4.2|488.9 24.44
|24 |402.6  16.77|14196   591| 8.29  0.34| 84.0   3.5|525.3 21.88
|28 |398.6  14.23|14235   508| 8.43  0.30| 83.4   2.9|501.8 17.92
|32 |389.9  12.18|14123   441| 8.53  0.26| 83.4   2.6|528.9 16.52
|48 |385.2   8.02|14202   295| 8.62  0.17| 81.0   1.6|488.4 10.17
|64 |375.1   5.86|14196   221| 8.91  0.13| 79.2   1.2|508.3  7.94

The first column is the number of reader threads; each of the five following columns shows first the total number of reads across all threads in millions per second, and then the same number divided by the number of threads, i.e. the reads per second per thread. The results show some random variation and are very similar when using the clang compiler.

One can see that both the hazard pointers in the DataGuardian class and the DataProtector class scale well until the number of actual CPUs (16 vCPUs are 8 cores with hyperthreading) is reached. On such a machine, 554 million reads per second with 8 threads is a good result; it means that every thread achieves about 70 M reads per second and thus spends only around 14 nanoseconds on each read. This shows that in this uncontended situation the atomic read-modify-write operations are quite fast.


Edits and thanks

After Josh Haberman's comments I have made _last static and fixed the code for getMyId(). Thanks to Josh for his nice comments and for pointing out this bug.

Max Neunhöffer

Max is one of the C/C++ developers working on the ArangoDB core. In particular, he is responsible for the sharding extension and additionally converts the latest ideas from database science into C/C++ code. Furthermore, he enjoys giving public talks about the technical aspects of the ArangoDB development.

4 Comments

  1. Josh Haberman on August 13, 2015 at 1:15 am

    I believe you may have re-invented RCU: https://en.wikipedia.org/wiki/Read-copy-update

    This code follows the basic structure of RCU. The traditional RCU primitives correspond to DataProtector primitives like so:

    – prot.use() is rcu_read_lock()
    – prot.unUse() is rcu_read_unlock()
    – prot.scan() is synchronize_rcu()

    RCU was originally a kernel-side concept, but it has since expanded to user-space also, see:

    http://liburcu.org/

    https://lwn.net/Articles/573424/ (especially see the table about the tradeoffs — fascinating!)

    liburcu is clearly not as simple as what is described in this article. One reason is that they are using pre-C11 C, which has no standardized atomics or barriers. It is also much more aggressive about reducing overheads: the DataProtector code described here incurs a test plus a (possibly contended) atomic increment/decrement pair for every read critical section. liburcu offers several variants of the implementation, only one of which is close to this amount of read-side overhead.

    I said “possibly contended” even though the code here endeavors not to have contention. But it appears it still can run into contention in the case where there are more threads than the compile-time template parameter of the DataProtector class. In that case the id space will wrap around and multiple threads will be assigned the same slot, leading to contention. The more you exceed this, the more contention you will get. This is an unfortunate drawback of this code’s simplicity.

    Also, it seems like there is a bug in the code. _mySlot is thread-local, but _last is an instance variable. It seems like two DataProtectors (which will have independent _last values) could assign the same _mySlot to different threads. This would cause unnecessary contention even if you don’t exceed the thread limit. It seems like _last should be static (global).

    I don’t mean this commentary to come off negatively. I think there is a *lot* of value in the way you have managed to factor this so that the DataProtector class is short and simple. Lots of lock-free algorithms have been known for a while, but in many cases they aren’t practical because they aren’t factored in such a way that they have convenient APIs. SMR/Hazard Pointers is a great example of this. I would love to see if it’s possible for this class to address these issues while still remaining simple.

    • Max Neunhöffer on August 13, 2015 at 10:54 pm

      Thanks for your nice and thorough comments.

      Indeed, this is RCU at work. Thanks for the references, I knew it is not new, but I did not know the name.

      I think what is new in this implementation of RCU is that it is using C++11 standard tools only, that everything is nicely encapsulated in a single class, and that the usage is extremely simple: No preparation per thread is needed, nothing needs to be called on a regular basis from each thread and there is no limitation on the time a reader spends inspecting a version of the data.

      This of course comes at a price, which is a pair of an atomic increment and an atomic decrement operation in usually uncontended memory. The measurements indicate that the costs for this are small in comparison to the read operations we usually do in ArangoDB, and small in comparison to a proper lock operation, at least on x86.

      You are of course right, _last should be static such that threads using different DataProtector instances have a better chance to pick different slots. Thinking about it again, it occurs to me now that getMyId() is actually buggy, when multiple threads call it concurrently. I have improved the code in the github repository using a compare_exchange_strong call. As presented originally it could even lead to a buffer overrun!

      One other advantage of this implementation is that correctness does not depend on the fact that different threads use different slots. Even if there are more threads than the compile time template parameter, proper protection is guaranteed, although there might be contention in some reference counters. In practice, I believe this problem can be neglected.

      Therefore we propose this simple class as a practical implementation of RCU for real-world software.

  2. Brent Spell on August 13, 2015 at 3:53 pm

    I believe your getMyId method has some problems. First, the id variable can be assigned to a value equal to Nr, which should result in a buffer overrun on the _list array.

    Additionally, the test (_last > Nr) and assignment _last = 0 are not synchronized, so it is possible that several threads could be assigned the same ids, if _last were assigned to 0 by several threads in a concurrent schedule. This should not affect correctness, but it will cause some false sharing as multiple threads are assigned the same slot.

    A simpler implementation without these problems would be to allow _last to increase without bound and assign slots mod Nr, something along the lines of the following.

    int getMyId () {
      if (_mySlot < 0) {
        _mySlot = _last++ % Nr;
      }
      return _mySlot;
    }

    • Max Neunhöffer on August 14, 2015 at 9:21 am

      You are of course right. Unfortunately I did not yet see your post when I was making some changes last night, because it was still in moderation.

      In the meantime I have changed the getMyId method to use compare_exchange_strong, which is OK, because that is only used in the rare case that a thread accesses any data protector instance for the first time, so it is not performance critical. Furthermore, the _last variable is now static, which makes more sense, as was pointed out by Josh above. The changes are in github and in the text of the article.

      Thanks for your comment and your kind words.
