ArangoDB has an HTTP interface to talk to its clients. Sometimes people want to secure this connection and use SSL or TLS instead. That is where we use OpenSSL. It provides all the methods needed to implement HTTPS on top of an HTTP server. It worked well, and the corresponding code is only some 300 lines of C++. The biggest obstacle was the documentation. You can basically only learn from examples. That’s what we did. However, we finally encountered a bizarre bug. ArangoDB uses a number of threads to handle I/O in an asynchronous manner. The underlying library for I/O is libev. We spawn three threads by default, each with its own event loop. With HTTP everything was working fine, and even HTTPS was no problem.
Until we made a mistake during testing. We started ArangoDB as an HTTPS server and ran our unit tests. Everything showed green until we accidentally connected an HTTP client to the HTTPS port. The client could not connect and returned an error message – as expected. Meanwhile the unit tests started to fail for a few seconds and then recovered. What! This is a completely different socket connection. How is it possible that one connection influences the other?
First idea: We managed to somehow mangle the sockets. So we started to print out the file descriptors for accept, read, write and close. As there are a lot of unit tests, the output is quite messy. But after hours of debugging, we convinced ourselves that everything looked fine.
Second idea: Somehow OpenSSL does not like our threads. We reduced the number of threads to one and the problem went away. Then obviously we did something stupid – maybe we forgot to set the OpenSSL thread functions. Nope, but we used the “old” interface CRYPTO_set_id_callback and not the new CRYPTO_THREADID_set_callback. So we changed to the new interface, ran the tests with three threads again – and the error still occurred. We switched back to one thread – still getting the error. Going back to the old CRYPTO_set_id_callback and one thread – guess what? We got the error again. So the earlier run with one thread had passed only by chance. Back to start.
Third idea: The SSL context is shared between all the SSL connections. Maybe something is amiss there. We created 100 such contexts and used them in a round robin fashion. No change in behaviour. Back to start – again.
Fourth idea: Well, not really an idea. We started to look through the 300 lines of code again and again. The documentation of OpenSSL is in my opinion not really helpful. We started digging through Stack Overflow (again). Finally we found a remark about an obscure behaviour when using asynchronous I/O. SSL_read or SSL_write might fail, with SSL_get_error reporting SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE. In this case, you need to repeat the call later with the exact same arguments. I did not find this in the documentation – maybe it is there, maybe not. After checking this, we convinced ourselves that we were using the same buffers.
Fifth idea: Whilst looking at the buffers, we found one bug in our code. SSL_write might fail with SSL_ERROR_WANT_READ. We handled that. But SSL_write might also fail with SSL_ERROR_WANT_WRITE. We missed that. Bug fixed – no change in behaviour. The test still failed.
Sixth idea: Wild guessing. We removed the socket close and the SSL_free. And the bug went away – well, eventually we ran out of file descriptors. Again: What is going on? How can this be? We were sure that we had ruled out a socket file descriptor mismatch in step 1. Another idea: remove the reuse address flag. Maybe, just maybe, we are reusing a file descriptor while OpenSSL still believes it is open. Think about the “exact buffer” – SSL is doing some caching behind the curtain. But again no change.
Seventh idea: No ideas left, digging deeper through Stack Overflow. And finally we found the missing link to our problem: the error queue. If you call (for example) SSL_read, it might produce an error, or two, or three, or whatever. Just looking at SSL_get_error is not enough to clear the queue. Even worse, these errors are visible in different contexts: an error left behind by one connection shows up when another connection is handled afterwards. That is how the failed HTTP client broke the unrelated unit tests. Putting in three lines of ERR_clear_error, just before any call to an SSL operation, solved the problem. So the bug was entirely our fault, except that the SSL documentation did not clearly say that one has to do this. We also assumed that using different SSL contexts separates the SSL connections, queues and caches completely. Wrong again. Never assume anything with OpenSSL unless you can prove it by looking at the code.
However, we started to hate OpenSSL because the documentation is really not helpful. Writing OpenSSL itself is a huge and really complex task – I am sure about that, and we are very thankful that someone did this as open source. But the interface is so strange and complex that you need good documentation, which is simply not there. So my hope is that whoever is going to clean up OpenSSL (libressl?) or write a new SSL library will also provide a good reference document, explain the underlying concepts, and give good examples and best practices to follow.