So you have measured and tuned your system as described in Part I and Part II of this blog post series. Now you want to get some figures on how many end users your system will be able to serve. To do that, you define “scenarios” that are typical for what your users do.
One such user scenario could, for example, be:

  • log in
  • do something
  • log out

Since your users won’t nicely queue up and wait for other users to finish their business, the pace at which you need to test your system is “start n scenarios every second”. Many scenarios simulating different users may be running in parallel. If one scenario takes 10 seconds to finish and you start one per second, your system needs to be able to process 10 users in parallel. If it can’t handle that, you will see more than 10 sessions running in parallel, and the time required to handle a scenario will grow. The server’s resource usage will climb and climb until it finally bursts into flames.
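This back-of-the-envelope relation (average concurrency = start rate × scenario duration, also known as Little’s law) can be written down directly:

```python
# Little's law: average concurrency = arrival rate * average duration
starts_per_second = 1   # we start one scenario per second
scenario_seconds = 10   # one scenario takes 10 seconds to finish

concurrent_sessions = starts_per_second * scenario_seconds
print(concurrent_sessions)  # -> 10 users in parallel
```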

To find the bottlenecks in your scenario, you use statsd probes as we did in the last articles. Since we want to evaluate the possible bandwidth of our system, we will drive it into the red area and finally make it burst into flames. To be clear: we need to do this to see the actual limits.

How to create usage scenarios

You have to simulate clients in parallel. One solution would be to have one thread per client. However, spawning threads isn’t cheap, and you would eventually overload the load-injecting system itself. Since our client spends most of its time waiting for ArangoDB’s replies anyway, we can use an event loop to handle the test runners “in parallel”: gevent.

Obey the rules of the eventloop!

We’re using pyArango, which uses python-requests as its backend. Since python-requests uses blocking I/O (i.e. it waits for the server’s reply), we have to replace it with a non-blocking version. We can use monkey patching to swap it out under the hood:
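A minimal sketch of that monkey patching (assuming gevent is installed) – the patch has to run before anything else imports the standard socket module:

```python
# gevent's monkey patching must happen first, before requests/pyArango
# get a chance to import the blocking standard-library socket module.
from gevent import monkey
monkey.patch_all()  # swaps socket, ssl, time, ... for cooperative versions

# Only import pyArango / requests after this point; they then transparently
# use gevent's non-blocking sockets, so one OS thread can keep many
# simulated user sessions waiting on ArangoDB replies at the same time.
import socket
print(socket.socket.__module__)  # the socket class now comes from gevent
```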

As you can see, we report dis-/connects to a statsd gauge, so we can also monitor that this isn’t happening too frequently (as it’s performance relevant, too).

Handling workers

Gevent has the concept of greenlets as lightweight processing units. We’re going to map one greenlet to one scenario.

The main Python process spawns the greenlets (we’ll call them workers from now on). One greenlet runs the life cycle of one simulated user session.

There are two ways to terminate greenlets. One is having the main process collect them; in that case their resources are only freed when the process joins them, which we could only do after our test scenario finishes. These greenlets would therefore linger around using memory until the very end of the test run. Since we also intend to run endurance tests, that would pile up until our main memory is used up – not that smart. So we take the other way: we raise GreenletExit to tell gevent it should clean them up:
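A hedged sketch of that pattern (the worker body is a placeholder; raising GreenletExit inside a greenlet counts as a clean exit, so gevent frees its resources right away instead of the main process collecting it at the very end of the test run):

```python
import gevent
from gevent import GreenletExit

completed = []

def worker(user_id):
    """One simulated user session: log in, do something, log out."""
    gevent.sleep(0.05)           # placeholder for the real request round-trips
    completed.append(user_id)
    # Clean self-termination: gevent can reclaim this greenlet immediately,
    # which matters for endurance runs that would otherwise pile up memory.
    raise GreenletExit()

# Start the scenarios at a fixed pace and never join them -- they clean
# themselves up. (Pacing shortened here; the real test uses one per second.)
for user_id in range(3):
    gevent.spawn(worker, user_id)
    gevent.sleep(0.3)
```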

Scaling the loadtest

Our python load simulation process has to:

  • simulate our application flow
  • generate json post documents
  • generate http requests
  • parse the http replies
  • parse the json replies
  • check the replies for errors

Thus we also need a certain amount of CPU. Since we’ve chosen the single-thread approach with greenlets, we’re limited to the maximum bandwidth one Python process can handle. Since we want the simulator to emit a stable pattern, we have to watch our python process and its CPU usage in top. While on a multi-core system multithreaded processes may exceed 100% CPU, that’s not true for our single-threaded Python process. To keep the pattern stable we shouldn’t use more than 80% CPU.

If we want to generate more load, we need more Python processes; we can then, for example, split the user base we run the test on amongst those processes.

How to best evaluate the possible bandwidth?

Now that we know everything is working – how do we evaluate the possible bandwidth of a given system?

While ArangoDB may survive a short ‘burst’, that’s not actually what we want to measure. We’re interested in finding out what load it can sustain over a longer time frame, including the work of database-internal background tasks etc. We define a ramp time during which the system has to keep up with a certain load pattern; 4 minutes could be a good ramp time for a first evaluation. Once it survives the ramp time, we push harder by increasing the load pattern. That way we should be able to quickly narrow in on the possible bandwidth.

Once we have a rough expectation of what the system should survive, we can run longer tests with only that pattern, for 10 minutes or even hours. Depending on the complexity of your scenario, the best way to create that ramp is to spawn more Python processes – either on one or on several load-injector machines. You should also emit the process count to your monitoring system, so you can see your ramping steps alongside the other monitoring metrics.

How to recognize if the SUT is maxed out?

There are several indicators that your ArangoDB aka “system under test” (SUT) is maxed out:

  • Number of file descriptors – (that’s why we added the ability to monitor it to collectd beforehand) Each connection of a client to the database equals a file descriptor; add the database’s virtual memory, open log files and, in a cluster, the connections between the nodes, etc. This should be a rather constant number. Once you see it rising, the number of clients working in parallel is increasing because single clients aren’t finished on time – their database requests are taking longer
  • Error rate on the client – You should definitely catch errors in your client flow, and once they occur, tell statsd they’re happening and write some log output. However, you need to make sure they’re not caused by e.g. ID re-use and thus overlapping queries, which wouldn’t happen in normal operation
  • CPU load – Once all CPUs are maxed out to a certain degree, no more load is possible. Once that happens you’re CPU-bound
  • I/O amount – Once the disk can’t read/write more data at once, you’re I/O bound
  • Memory – This one is the hardest to detect, since the system can do a lot with memory, like using it for disk cache or simulating available memory from swap. Being memory-bound therefore usually only shows indirectly, by disk and CPU being well below red levels
  • Memory maps – These are used to back bigger chunks of memory on disk or to load dynamic libraries into processes. The Linux kernel gives an overview of them in /proc/<pid>/maps. We will be able to monitor these with collectd 5.8
  • Available V8 contexts – While a wide range of AQL queries can be executed without using JavaScript/V8, transactions, user-defined functions and Foxx microservices require them, and thus they may become a scarce resource
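As a quick illustration of the first indicator: on a Linux system, the file descriptors a process holds can simply be counted via /proc (Linux-specific, matching the collectd setup from the earlier parts):

```python
import os

def open_fd_count(pid='self'):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir('/proc/%s/fd' % pid))

print(open_fd_count())  # our own process; pass a PID to inspect the server
```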

Getting an example running

Initializing the data

To get a rough overview of ArangoDB’s capability, we create a tiny graph that we can later use in document queries or traversals:
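The original initialization script isn’t reproduced here; the following is a hedged sketch of what it could look like with pyArango (collection names, attributes and the user-to-group fan-out are assumptions). `db` is a pyArango database handle, e.g. `Connection(arangoURL=..., username=..., password=...)["_system"]`:

```python
def init_data(db, n_users=100, n_groups=10):
    """Create a tiny graph: user vertices, group vertices and 'member_of'
    edges linking each user to one group (assumed layout, not the original)."""
    users = db.createCollection(name="users")
    groups = db.createCollection(name="groups")
    member = db.createCollection(className="Edges", name="member_of")

    group_ids = []
    for g in range(n_groups):
        doc = groups.createDocument({"name": "group_%d" % g, "logins": 0})
        doc.save()
        group_ids.append(doc._id)

    for u in range(n_users):
        doc = users.createDocument({"_key": "user_%d" % u, "logins": 0})
        doc.save()
        # link the user to one group, round-robin
        edge = member.createEdge()
        edge["_from"] = doc._id
        edge["_to"] = group_ids[u % n_groups]
        edge.save()
```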

The actual test code

We create an example that runs a transaction with a graph query, updating some of the depth-1 vertices we find. We run some other queries to directly modify the primary vertex. Think of this use case as a login procedure that counts logins on the user and on the groups the user is assigned to. We control the range of users and the load pattern we want to inject via command line parameters:

The additional dependencies of the above code can be installed like this on a Debian system:
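The exact package list wasn’t preserved here; a plausible set (package names are assumptions, run the apt-get line as root) would be:

```shell
# pip from the distribution, the Python libraries via pip
sudo apt-get install python-pip
pip install gevent pyArango statsd
```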

Analysing test results

We start a first load injector at 17:51: ./ 100 2000 1 and a second one, increasing the load, at 17:55: ./ 2000 4000 1

This is what we see in kcollectd from the testrun (Graph 1 at the bottom):

We analyze these graphs starting at the lowest one and going up graph by graph. Since kcollectd can only group gauges on one axis and can’t do any value translation, the graphs are grouped by value ranges so that we can see meaningful variances. In Graphite we could add a second Y-axis, add scales etc. to group values by their meaning instead of their magnitude.

Graph 1


As mentioned above, we’re also monitoring error counts via a statsd counter gauge, so we can actually see when the desired test pattern can’t be achieved – whether due to e.g. account overlapping or server-side errors.

In the current test run we see that this error count starts to rise at 17:57 – thus the server fails to withstand the load pattern. In the error output of the test we see the actual cause:

So the resource we run out of is V8 contexts for executing our transaction. The query percentiles, however, show that the execution time stays constant.

Graph 2


Inspecting the next graph, we see that the number of concurrently active scenarios rises sharply as we launch the second test agent. Once the errors rise, the number of active clients drops again, since failing sessions finish quicker.

Graph 3


In the 3rd graph we see the number of memory-mapped files double from below 4k to above 8k – roughly the expected figure when doubling the test scenario.

Graph 4


In the 4th graph we see that the user-space CPU usage doubles, while the system CPU usage rises sharply. Part of working with more threads is locking resources shared between threads, so the kernel-space CPU usage rises as the server goes into overload.

Graph 5


The 5th graph shows the server’s thread count and its number of open file descriptors. As we raise the load, we see both of them climbing. The effect we see is that the load test client opens more TCP connections (which are also file handles), and the server spawns more threads to keep up with the load scenario – but the available throughput on this system doesn’t grow.

Thus more and more threads are launched, waiting for CPU time to do their work. At some point these threads start to run into the timeout for acquiring V8 contexts for the transactions. While the client keeps opening more connections to satisfy the load pattern we wanted to create, the server can’t follow anymore.


We successfully demonstrated a test client simulating user scenarios. Since we do much of our workload in a transaction and don’t transfer large replies to the client, the python/gevent combination can easily create the load pattern we desired. The SUT can sustain a throughput of 1 scenario per second. In subsequent ramping test runs one may try to evaluate in finer-grained steps whether e.g. 1.5 scenarios per second are possible. With such a result, an endurance test run can be done over a longer time period.