This is a story of an excursion to the bottom of a deep rabbit hole, where I discovered a foot gun in gcc‘s libgcc. The investigation has cost me several days and I hope that by writing this up I can entertain others and save them the journey.

TL;DR

If a C++ application is compiled with GCC on Linux and statically linked against a non-GLibC C-library (like libmusl), then there is a danger of a data race which leads to a busy loop happening after main() and all static destructors have finished. The race happens, if the application does not use pthread_cancel explicitly and if the very first exception which is thrown in the processes’ life is thrown in two different threads at the same time.

Background

This is about a problem which came up during CI-testing a large pull request for ArangoDB. Although all tests were green, every once in a while an arangod process would not terminate properly. The test would eventually time out and Jenkins would show me a bad red ball and deliver a core from a SIGABORT.

Before I tell you about the core, let me explain the setting. ArangoDB is a multi-threaded application written in C++. We build the executable on Linux in a Docker container using Alpine Linux and libmusl as a C-library, because we can then build a single self-contained statically linked executable which runs on basically all variants of Linux (which run on a x86_64 architecture). This simplifies packaging and greatly helps our support team, since they only have to deal with a single binary for each version. The binary even runs on Windows using the
“Windows subsystem for Linux” (WSL)! See details about this in this article.

The Problem

So what did the core show? There was only a single thread left, and the stack trace was this:

Note that main() has already terminated! We are in the middle of __do_global_dtors_aux, which is a function GCC creates to run all destructors for global objects. As it turned out later, all these destructors have already completed and we are already at the stage in which the C++ runtime does some cleanup and shutdown on its own. The process was aborted by the CI system, but at this stage, I was completely puzzled as to what the process might have been doing to prevent the proper shutdown.

And: How could it possibly be our fault or our PR’s fault???

The effect did not happen in every test run, but it was happening often enough to be annoying. And so my journey started…

First findings

Inspecting the core at first did not reveal much, since I did not know my way around the C++ runtime code in libgcc. Also, I could no longer single-step (since it was a core dump) and, obviously, I could not reproduce the problem in a debugger. However, I relatively quickly found a short loop in the assembler code and could look at the data structures it manipulates, to prove in the core that a busy loop was happening (if you do not know x86_64 assembler, just ignore this part and believe me that there is a busy loop):

(I shortened __deregister_frame_info_bases to __deregister for the
sake of brevity in the presentation!)

With %rbx being 0x7f4d1568e460 and the memory at this address looking like this:

it is clear that there is a busy loop, since the jump to __deregister+163 never happens. Looking at the C code in the GCC-library libgcc, we were in a loop which searches a singly linked list of objects of type struct object (whatever that is!):

(comments and dead code removed for brevity!)

On 64-bit machines, the object contains exactly six 64-bit words and the next pointer happens to point to the object itself, thus creating a circular data structure!

This finding took me quite some time, since I had to dig out old assembler knowledge and carefully look around in the core to find this. The theory was confirmed by our monitoring on the CI machines, which indeed showed that towards the end of the test run a single core was running at 100% CPU.

At this stage, I had no idea about the data structures in libgcc. However, in the C-code I saw two singly linked lists unseen_objects and seen_objects, which were static in the file libgcc/unwind-dw2-fde.c, so all the code dealing with them is in this module!

The only place modifying the seen_objects list is further down in the same file in the function _Unwind_Find_FDE, where objects in the unseen_objects list are processed and moved to the seen_objects list. As far as I could see, the code was completely correct (although I notoriously get code about linked lists wrong!). Furthermore, it was neatly protected by a mutex, as can be seen here. Surely, this code:

would protect the data structures against concurrent access by different threads. How could the list become circular???

Caught in the act

I had just proven that the code is correct and that the effect I saw in the core could never happen!

Out of sheer desperation I thought to myself that I had to actually catch it in the act of ruining the data structure. Furthermore, I was interested which piece of our code would trigger such an event.

The next thing I did was to run the tests in their native environment (many tests at the same time, high load on the machines), and put a probe into the unmodified executable using the Linux perf tool.

Much to my surprise, this was actually quite simple, I used the following commands to set up the probe and record the events:

This should essentially create an event every time the seen_objects list was modified. And in the case of an event I would see not only the complete call trace but also the value of the local variables ob and p. If I can catch it with this enabled, I should at least see which piece of our code triggered the bad behavior!

After a few tries the shutdown busy loop actually happened! I was very curious as to what I would find, and indeed, I now had a new core with the actual address of the loopy list and an event log of perf which should show the bad modification! And what I found was this (this is exactly happening here):

So indeed, the same exception was thrown at approximately the same time (50ms difference, which is of course an eternity!), but on a busy system with many processes running, a thread can be suspended for some time before it gets a time slice again. The output of the local variable ob actually shows that the same object was inserted into the singly linked list twice. At the time of the second insert the list is ruined and made loopy, because of the following. Assume the list looks like this:

and the same object B is inserted again. Because of some descending sorting by addresses which is going on, the position in which B is inserted is directly behind B itself. Normally, to insert a D there, the next pointer of D is changed to point to C and the next pointer of B is changed to point to D, yielding this:

But if D and B are identical, the result will be:

with B’s next pointer pointing to B itself. So what must have happened is that two threads were trying to do this manipulation at the same time. One must have grabbed the object from the unseen_objects list, but then been suspended. Then the other grabbed the object, too, and subsequently both inserted the same object into the seen_objects list.

However: How could this theory possibly be true, if there is a mutex around the whole code???

What goes wrong?

It turns out that the mutex is not actually used! However, this is cleverly hidden by C preprocessor magic. The idea is that the mutex in this place is not necessary, if the whole program is single-threaded. So, it sounds like a good idea to improve performance by not using the mutex in single threaded programs.

However, how do you detect that the program is single-threaded?

This is done very differently on different operating systems and even C-library versions. The basic idea is that if the pthreads library is present, then we assume that the program is multi-threaded and have to use the mutex. The gory details can be found here and the following 150 lines of code. It all comes down to the function
__gthread_active_p. The relevant implementation for us is here, this is the not FreeBSD and not Solaris version and looks like this:

It uses so-called “weak references” which are resolved at link time. In actual fact, the address of a library function is loaded and the linker replaces it by 0 if the library function is not in the executable. On GLibC systems, the (internal) function __pthread_key_create is used, which seems to be present if and only if pthread_create is. This seems to be reliable, since on GLibC the problem does not occur. On Android with the “Bionic” C-library, pthread_create is used directly, I do not know how reliably this works, since I did not try it. There is a comment above this piece of code saying that many libraries wrap pthread_create and thus the symbol could be there without libpthreads actually being linked. It seems that therefore, in the last case, which happens to be active with libmusl on Alpine Linux, the function pthread_cancel is used.

This actually works for non-statically linked executables! I assume that this is the case since if pthread_create is used, the whole shared library libc.musl-x86_64 is linked and this includes all pthread_* functions. However, when I link statically with -static, the linker manages to put pthread_create in but leave out pthread_cancel, since I am not explicitly using it. C++ programs will indeed rarely use pthread_cancel, since this functionality is not covered by the standard template library (STL)!

So therefore, what actually happens is that on Alpine Linux, with libmusl as C-library, in a program which uses multiple threads but not pthread_cancel explicitly, and is statically linked, this mutex is not being used!

When exactly does it happen?

So when is _Unwind_Find_FDE actually called? I still have not understood all the fine details. However, what I have understood is this: When an exception is thrown, the C++ runtime has to figure out how to unroll the stack, execute all cleanup destructors and catch sections, and resume the execution at the right place. For performance reasons, nothing to help with this is done at runtime, on the grounds that exception handling should be “zero overhead” as long as no exceptions are actually thrown. However, the compiler does a lot of preparations at compile time. It basically creates for each “object” (whatever this actually is!), a list of “frame descriptor entries” (FDEs), which basically contain the information where in the binary code each frame ends and begins. For example, the C++ program:

already contains multiple frames, I guess one for the try section, one for the catch section and one for the scope containing the local variable c. If an exception is thrown deep in callSomething(), all the C++ runtime has is the return address which points directly behind the call to callSomething(). It has to figure out that this is in the frame for the try and that now the catch frame has to be executed (or not, if the type does not match) and then do the cleanup for c.

In any case, all these data structures are prepared by the compiler, but also manipulated at runtime, and they reside in the unseen_objects and seen_objects lists. In all cases I have seen, there is a single object in unseen_objects at the beginning of main(), probably for the whole of the static executable, which contains lots of FDEs.

When the first exception is thrown in the life of the process, _Unwind_Find_FDE is called, and it will discover the single entry in unseen_objects. At this time, it is processed, the FDEs are potentially sorted, and the object is then added to the seen_objects list. I think this is some lazy initialization pattern, potentially squeezing a tiny bit of improvement for the process startup time.

In any case, if the mutex protecting all this is absent, and the first exception happening in the life of the process actually happens exactly at the same time in two threads, then we have a data race.

If things go wrong, the same (singleton) object is inserted twice into seen_objects and we have a shutdown busy loop. The process seems to run fine until then, although there might be situations in which the runtime data structures are destroyed, if there is more than one object in the list. However, I have not seen this happening in the wild.

To summarize:

If:

  • we are on Linux,
  • we are not using GLibC,
  • we are not on Android with the Bionic C-library,
  • we statically link the executable,
  • we do not explicitly use pthread_cancel,
  • the first exception that happens, happens in two threads exactly at the same time
  • in one thread the execution is suspended at an unlucky time during the list manipulation in _Unwind_Find_FDE, one object,

and all of this happens together, we get a shutdown busy loop after the termination of the main() function.

Example program to show the bug

Here is a sample program exposing the problem (store this as exceptioncollision.cpp in the current directory):

You can run Alpine Linux in a Docker container like this:

Then install g++ and gdb:

And compile the program as follows:

Finally (even outside the Docker container), do:

and you will soon see the busy loop.

Use

to enable the “fix” described below.

I have submitted a bug report against libgcc to the gcc team here. This was at first closed quickly on the grounds that it is a problem of libmusl. However, when I subsequently reported the problem to them on their mailing list, they quickly replied that the problem is actually in libgcc after all, and the first bug report was reopened. Let’s see how this issue is resolved between the two teams

How to circumvent the problem

Once understood, there are different ways to circumvent the problem. The first obvious one seems to be to use pthread_cancel explicitly. This will then activate lots of thread-related functionality including the particular mutex we have been talking about. Unfortunately, at the time of this writing I am running in a different problem with this. An example program exposing the problem is here (store this as exceptionbang.cpp in the current directory):

Simply compile on Alpine Linux like this:

It will immediately crash. It turns out that other crucial thread related symbols must additionally be inserted to create strong references. One would at least need pthread_once additionally and then it seems to work. So this approach is really a temporary solution but no permament one.

The next idea is to simply throw an exception whilst still single threaded. This seems to work beautifully, this PR implements it for the devel branch of ArangoDB.

That is, a single line fix at the beginning of main() like this:

solves the problem. However, this is also no permanent solution, since the C++ runtime would continue to treat our program as single-threaded and who knows what other problems arise down the line.

What is a possible fix?

From the perspective of a developer of a multi-threaded application the optimization to skip a mutex in the single threaded case seems to be a bit over the top, considering the complication this causes. So one way would be to simply use the mutex in all cases, at least in this place, where the mutex never seems to be a serious (or even measurable) bottleneck.

However, the whole gthreads business is probably also used elsewhere, where performance is more important. Nevertheless, most applications nowadays use more than a single thread and so the easiest would be to switch off the whole optimization for single threaded applications, at least when running on libmusl.

The second possibility would be to do the initialization of the FDE lists not in a lazy fashion, but eagerly at process start. This also seems to be a simplification of the code. It could, however, make the startup of a process slower in some cases, I guess. Maybe it is an important optimization to initialize only when the first exception actually happens, I do not know.

Which leaves the third idea of getting the recognition of single-threadedness right, even on libmusl and C++ programs which do not use pthread_cancel explicitly. This is probably the way to go. I am looking forward to hearing what the gcc developers suggest.

Lessons learned

In the course of the last few days I have learned a lot about the implementation of exceptions in C++ and gcc, about FDEs, code that runs before and after main(), about linking, weak references, different C-libraries, stack unwinding, gdb, assembler code and about the Linux perf tools. It has been fun and enlightening, but it also was a difficult investigation.

As lessons I take away:

  • Do not over-optimize at the cost of unnecessary complexity.
  • C++ runtime and exception handling is complicated.
  • C++ exceptions can be quite expensive when actually thrown, since the optimization is for the common case that no exception is thrown.
  • C-code can be hard to understand, in particular in the presence of lots of platform-specific special cases and preprocessor magic.
  • The Linux perf tools are an incredible asset for debugging.
  • Open source makes such an investigation possible in the first place.
  • The teams developing these fundamental tools like gcc and libmusl are very responsive and actually help when problems arise.

Once again, I hope my findings will save you the journey of investigating the same problem for days. If you want to know more about multi-model or ArangoDB, I recommend reading the white paper I wrote about the topic. You can download it here.