TL;DR: Our initial benchmark has raised a lot of interest. Initially we wanted to show that multi-model can compete with other solutions. Due to the open and competitive way we have conducted the benchmark, the discussions around it have led to improvements in all products: better algorithms, faster drivers and better ways to use the databases.
From the outset we published all code and data and asked the vendors of all tested products, as well as the general public, not only to run the tests on their own machines but also to suggest improvements to the data models, test code, database configuration, driver usage and server configuration. This led to a lively discussion, lots of pull requests and even to the release of improved versions of the database products themselves!
This process exceeded all our expectations and is yet another great example of community collaboration not only for fact finding but also for product improvements. Obviously, the same benchmark code will always show slightly different results when run on different hardware, operating systems, network setups and with more or less RAM. Therefore, a reliable result of a benchmark can essentially only be achieved by allowing everybody to run it on their own machines.
The technical setup is described in the above blog post. Let me briefly repeat the key facts.
We wanted to test a client/server setup, where the client is implemented in node.js. The server and the client run on different machines.
We took realistic data from a social network, which allowed us to run document-based as well as graph queries. For more details, see here.
For some databases it is possible to define a schema on the profile data set. As we want to test the schema-less implementation of the database engines, we have not defined a schema – with the exception of _key which is defined as string and contains a unique hash index for fast lookup.
The test cases assume that there is enough main memory available. The test machine has 60GB of memory. If you run the tests on a machine with only a few GB of RAM, the results will look different. But that is not the test case we had in mind, because in a production environment you normally want to avoid swapping at all costs. That is why we also measured the memory usage.
We have received many requests to test a particular database. The testing framework is open-source and available on GitHub at
If you have a database DatabaseDB that you want to test, please create a directory called databasedb and within this directory provide a description.js file, which implements the database calls used in the tests. If possible, create an import.sh to generate the database – for example, see the script import.sh in neo4j. Then issue a pull request and we will run the tests on the GCE used for the initial tests.
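As a rough illustration of what such a description.js could look like, here is a minimal sketch backed by an in-memory map instead of a real database. The method names (createCollection, saveDocument, getDocument) and the callback style are assumptions made for this example; the actual interface is defined by the existing description.js files in the repository and may differ.

```javascript
'use strict';

// Hypothetical adapter shape for a database called "DatabaseDB".
// The real benchmark interface may use other names and signatures.
function createInMemoryAdapter() {
  const collections = new Map(); // collection name -> Map(_key -> document)

  return {
    name: 'DatabaseDB',

    createCollection(name, cb) {
      collections.set(name, new Map());
      cb(null);
    },

    saveDocument(collection, doc, cb) {
      const store = collections.get(collection);
      if (!store) return cb(new Error('unknown collection: ' + collection));
      store.set(doc._key, doc);
      cb(null, doc._key);
    },

    getDocument(collection, key, cb) {
      const store = collections.get(collection);
      if (!store) return cb(new Error('unknown collection: ' + collection));
      // A real driver would answer asynchronously over the network;
      // this fake calls back immediately.
      cb(null, store.get(key));
    }
  };
}

module.exports = createInMemoryAdapter();
```

A real adapter would replace the map operations with calls to the database's node.js driver and keep the same callback contract.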
New Products / Versions
New versions of Neo4J and OrientDB are available, so we reran the tests. Michael Hunger has pointed out that the single write test compares apples and oranges, as Neo4J guarantees durability. We have therefore split this test into two use cases: single write and single write sync. The latter waits until the write has been synced to disk.
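To make the difference between the two write tests concrete, here is a small sketch using only the node.js standard library. It is not the benchmark code and does not talk to any database; it appends records to a temporary file and, in the "sync" variant, calls fsync after every record, which is the same kind of guarantee a durable database write has to pay for. The function names and the file layout are made up for this example.

```javascript
'use strict';
const fs = require('fs');
const os = require('os');
const path = require('path');

// Append one record. With sync=true, block until the operating system
// reports that the data has been flushed to the storage device.
function appendRecord(fd, record, sync) {
  fs.writeSync(fd, JSON.stringify(record) + '\n');
  if (sync) {
    fs.fsyncSync(fd);
  }
}

// Write `count` records to a temporary file and report elapsed time.
function timeWrites(count, sync) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'write-test-'));
  const file = path.join(dir, 'log.jsonl');
  const fd = fs.openSync(file, 'a');
  const start = process.hrtime.bigint();
  for (let i = 0; i < count; i++) {
    appendRecord(fd, { _key: String(i) }, sync);
  }
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  fs.closeSync(fd);
  const lines = fs.readFileSync(file, 'utf8').trim().split('\n').length;
  fs.rmSync(dir, { recursive: true, force: true });
  return { lines, elapsedMs };
}
```

Calling timeWrites(1000, false) and timeWrites(1000, true) on the same machine gives a feeling for how much the sync step costs on a given disk; the absolute numbers depend entirely on the storage hardware.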
Changes in the OrientDB Test
A new version 2.1 RC 4 of OrientDB is available. This version implements the shortest path algorithm in a two-sided way (look at the remark).
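For readers unfamiliar with the idea, a two-sided (bidirectional) search can be sketched as follows. This is a generic textbook breadth-first variant, not OrientDB's implementation. It assumes an undirected, unweighted graph given as an adjacency map; for directed edges the backward search would need the reversed edges.

```javascript
'use strict';

// adjacency: Map<string, string[]> describing an undirected graph.
// Returns the number of edges on a shortest path, or -1 if none exists.
function shortestPathLength(adjacency, source, target) {
  if (source === target) return 0;

  // The distance maps double as "visited" sets for each direction.
  let distA = new Map([[source, 0]]);
  let distB = new Map([[target, 0]]);
  let frontierA = [source];
  let frontierB = [target];

  while (frontierA.length > 0 && frontierB.length > 0) {
    // Always expand the smaller frontier by one full level.
    if (frontierA.length > frontierB.length) {
      [frontierA, frontierB] = [frontierB, frontierA];
      [distA, distB] = [distB, distA];
    }
    const nextFrontier = [];
    for (const node of frontierA) {
      const d = distA.get(node);
      for (const neighbor of adjacency.get(node) || []) {
        if (distB.has(neighbor)) {
          // The two searches meet: both sides expand whole levels,
          // so the first meeting already yields a shortest path.
          return d + 1 + distB.get(neighbor);
        }
        if (!distA.has(neighbor)) {
          distA.set(neighbor, d + 1);
          nextFrontier.push(neighbor);
        }
      }
    }
    frontierA = nextFrontier;
  }
  // One side ran out of vertices: source and target are not connected.
  return -1;
}
```

The benefit shows up when no path exists or when one endpoint sits in a small component: the search stops as soon as the smaller side is exhausted instead of exploring everything reachable from the other end.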
Another major change is a different data-model. OrientDB provides so-called lightweight edges, which need to be turned on when creating the database, as you can see in their documentation. It is possible to use ALTER DATABASE to enable lightweight edges, but this will not change existing edges. Unfortunately, that meant we had to recreate the database from scratch – which took a while. Therefore we have created a new database dump using lightweight edges for your convenience, if you want to rerun the tests yourself. You can find the dump on S3.
An official node.js driver, orientjs, is available. It is a fork of the oriento driver with minor changes; look here for details. There is also a new version of the oriento driver. Both drivers show the same performance, so we have now switched to the official fork.
It is possible to define a schema. As mentioned above, we wanted to test the schema-less implementation in all databases. Therefore we have not enabled a fixed schema in OrientDB for the final tests. Defining a schema in OrientDB reduces the average resident memory from 18GB to 15GB and speeds up the aggregation, but on the other hand slows down the single reads and neighbors.
Changes in Neo4J
A new version Enterprise 2.3 SNAPSHOT of Neo4J is available. We have upgraded to this version.
Michael Hunger has provided a much better warmup phase. This is now used in the tests and it has improved the shortest path dramatically.
We have switched from node-neo4j to neo4j for the node.js driver as suggested here. As mentioned in the blog, we have observed some glitches in the driver when doing a lot of single reads and writes in parallel. Following Michael Hunger’s suggestions we have used the async library to limit the outstanding requests issued to 32 concurrent requests for Neo4j. However, doing a direct test with Apache Bench shows a much higher throughput. Therefore we assume that there still are some improvements possible within the driver.
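The idea of capping the number of outstanding requests can be sketched with plain promises. Note that this is not the code used in the benchmark, which relied on the async library; runLimited and its parameters are names invented for this illustration, and each task is assumed to be a function returning a promise.

```javascript
'use strict';

// Run all tasks, keeping at most `limit` of them in flight at once.
// Resolves with the results in task order; rejects on the first failure.
function runLimited(tasks, limit) {
  return new Promise((resolve, reject) => {
    if (limit < 1) throw new RangeError('limit must be at least 1');

    const results = new Array(tasks.length);
    let next = 0;      // index of the next task to start
    let active = 0;    // tasks started but not yet finished
    let finished = 0;
    let failed = false;

    if (tasks.length === 0) {
      resolve(results);
      return;
    }

    function launch() {
      while (!failed && active < limit && next < tasks.length) {
        const index = next++;
        active++;
        Promise.resolve()
          .then(() => tasks[index]())
          .then((value) => {
            results[index] = value;
            active--;
            finished++;
            if (finished === tasks.length) {
              resolve(results);
            } else {
              launch(); // a slot is free again, start the next task
            }
          })
          .catch((err) => {
            failed = true;
            reject(err);
          });
      }
    }

    launch();
  });
}
```

With a list of request functions, runLimited(requests, 32) would keep 32 requests outstanding, which corresponds to the limit described above.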
The dbms.pagecache.memory parameter has been set to 10GB.
We have also created a new database dump for the 2.3 version.
Changes in ArangoDB
It is possible to configure the durability on a per-collection or per-write-request basis. For the durable-write test, the durability has been enabled on a collection basis.
Changes in MongoDB
It is possible to wait after a write-request until the data has been saved to the journal file, see under journaled. This option is used for the durable-write test.
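In MongoDB terms this corresponds to a write concern with the j flag set. The fragment below only builds the option object; the helper name writeConcernFor is made up for this sketch, and how the object is handed to an insert call depends on the driver version in use.

```javascript
'use strict';

// Build a MongoDB write concern. With durable=true the server only
// acknowledges the write after it has reached the on-disk journal.
function writeConcernFor(durable) {
  return durable ? { w: 1, j: true } : { w: 1 };
}
```

The relaxed variant would be used for the plain write test and the journaled variant for the durable-write test.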
The Hard Path
As mentioned in the original blog post, we originally started with 20 pairs – one, however, blew up the tests. We are proud to report that now OrientDB and Neo4J are capable of finishing the search for the missing path – maybe thanks to our tests:
- ArangoDB: 4ms
- Neo4J: 254ms
- OrientDB: 282.233 ms
The throughput measurements on the test machine for ArangoDB define the baseline (100%) for the comparisons. Lower percentages point to higher throughput and accordingly, higher percentages indicate lower throughput.
Overall test results:
For our tests we run the workloads 5 times (from scratch), averaging the results. For details about the hardware see the original blog post.
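The normalization described above can be sketched as follows. The assumption here is that each run yields an elapsed time per database, that the runs are averaged, and that the percentage is the mean time relative to the baseline's mean time, so that a lower percentage means higher throughput. The database names and numbers in the example call are placeholders, not measured results.

```javascript
'use strict';

// Arithmetic mean of a non-empty list of numbers.
function mean(values) {
  if (values.length === 0) throw new Error('need at least one run');
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// timings: { databaseName: [elapsed seconds per run, ...] }
// Returns each database's mean time as a percentage of the baseline's.
function relativeToBaseline(timings, baselineName) {
  const baseline = mean(timings[baselineName]);
  const result = {};
  for (const [name, runs] of Object.entries(timings)) {
    result[name] = Math.round((mean(runs) / baseline) * 100);
  }
  return result;
}

// Placeholder numbers for illustration only.
const example = relativeToBaseline(
  { baselineDb: [10, 11, 9, 10, 10], otherDb: [20, 19, 21, 20, 20] },
  'baselineDb'
);
console.log(example); // { baselineDb: 100, otherDb: 200 }
```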
“Therefore, a reliable result of a benchmark can essentially only be achieved by allowing everybody to run it on their own machines.”
Maybe I do not really understand this statement, but I would strongly disagree with it.
Each benchmark should be executed (and be reproducible) on a machine with the same spec, e.g., EC2 instances of a certain type. A generic description of the benchmark itself would be even better, i.e.,
- this is the source dataset,
- these are the queries that should be answered, and
- this is the expected result of them.
So that every vendor has the chance to provide the best solution for their DB to the best of their knowledge (because they (should) know their system best). Everything is open-source, transparent and well-documented. Then a reader (hopefully) has the chance to get a (relatively) unbiased impression of the strengths and weaknesses of each DB.
just my 5p 😉
The statement was meant in the following way:
We had a particular use case in mind. We created a setup on Google GCE and published the specs of the machines used, the benchmark program as open-source and the data set. You can find the full description here: http://www.arangodb.com/2015/06/multi-model-benchmark/#Appendix. Therefore, everyone can run the tests using the same environment.
So, the statement is meant in the way you describe, namely that anyone can rerun the tests using the description given.
I have read a comment from Luca from OrientDB about those tests (my underlining):
“[…] (actually on “neighbors” OrientDB is the fastest). By looking at the kind of benchmark we understood why: it’s not what you can expect by a classic 2-nd level neighbors, but it returns only the IDs. On Arango, like any other Relational DBMS, you have primary keys that are on indexes.
So that particular query uses the index without even fetching the real documents. That’s why seems faster, but retrieving the ids is an edge case, without any particular meaning in a real use case. If you’re looking for neighbors you usually are interested on any information about the neighbors, like name, city, etc. Not just the IDs. […]”
What is your comment? Could you fix the tests (or add some tests) to actually fetch a real document, instead of only an id?
In https://groups.google.com/d/msg/orient-database/nW9k_IISz6U/yPpta_lK_VMJ there is some dispute regarding the integrity of the OrientDB database used in the benchmark. How was the data imported into the database? I couldn’t find this in the article, my apologies if I missed it. Is it worth open sourcing this procedure so it can be improved?
I’ve got a definition file from Luca, which I can publish. In order to import the data, a new version of OrientDB (2.1RC5) is required, because the original data contains some illegal UTF-8 sequences. The old database dump was created using a node program. I will push the definition file to GitHub.
We had started with a version fetching the whole documents, but some of the databases did not survive this test. However, newer versions are now available. So, I can rerun the tests using neighbors
Sounds reasonable, thanks.
I like the openness you gents at Arango are showing to make this as fair a benchmark as possible. I find that refreshing in a world often full of cutthroat enterprises.
The benchmark was never supposed to make any database look bad. I was quite surprised by some of the discussion, which ended in a flame war. I wanted to show that multi-model does not mean you have to sacrifice performance. There will be a lot of use cases where one database is much better than another, while in a different use case it is the other way around.
The goal was to show that ArangoDB is fast enough; we want to convince people with features like microservices, extensibility, ease of use and flexibility.
“The benchmark was never supposed to make any database look bad.”
Unfortunately, that is the result of any benchmark. It is a comparison and rarely do all the products being compared come out looking the same. At least one always comes out looking the worst, even if the actual result isn’t all that bad. The others simply look better.
The objectivity and the accuracy of a benchmark are what are important and the fact you are allowing the other vendors to help improve the benchmark code and data is great. I also find it interesting how well MongoDB came out in some of the results, considering it doesn’t even really support graphs directly.
One thing I am also not certain about, what is the “write-sync” test? I don’t recall reading what is being tested with that.
I have created an import script and uploaded it to https://github.com/weinberger/nosql-tests/tree/master/orientdb. I will also rerun the tests. Please note that currently a schema-full database is created.
@ftvsko:disqus for OrientDB I’m currently using the following query:
SELECT set(out_Relation.key, out_Relation.out.Relation._key) FROM Profile WHERE _key = :key
Which query do you suggest to fetch the complete documents?
Nice that you like our approach. We have put a lot of effort into it and there is still a lot of work to do 😉 Therefore it is a bit sad that some people accuse us of manipulation when we need a few days to react.
I was also very surprised by the initial results. For example, the shortest path results were confusing. I was expecting better results, especially when there is no shortest path at all. The algorithms are now fixed in the database. But it seems that shortest path does not play a big role in real projects.
A lot of improvements have happened in the meantime. Almost all of these improvements were inside the database, not in the node driver, which some had blamed as the culprit early on.
“The objectivity and the accuracy of a benchmark are what are important and the fact you are allowing the other vendors to help improve the benchmark code and data is great.”
That is exactly what I meant. Normally you only get benchmarks like “on my computer I got” without any chance to test and improve. One can debate whether the selected use cases are suitable or not. And surely the results will look different for a different set of tests, a different kind of server or a different operating system. The use cases we have selected are for real data and from a real project. However, we wanted to give everyone the opportunity to verify the tests or adjust the environment to her/his needs.
What does the write-sync test do?
s.molinari sorry for the delay, I have been away a few days working on customer projects.
Michael Hunger has pointed out that Neo4J gives more guarantees when writing, namely that the data has been added to the transaction log and that this log has been synced to disk. Therefore, if you kill Neo4J directly after receiving the answer, the newly written data will be recovered. For MongoDB and ArangoDB, the default guarantees are much more relaxed. Therefore this was not a fair comparison. So, we added a new category “write-sync”. However, with Neo4J there is no option to relax this, so we removed it from the “write” test.
Please note that “write-sync” will be extremely sensitive to the storage used. It will be much slower on a hard disk. On a hard disk you might get as low as 40 document writes per second (without parallelism). And even for SSDs there will be a huge difference depending on the model. There even were models which simply lied about fsync. See http://de.triagens.com/frank-celler/2011/10/benchmarking-ssd-with-mongodb-and-couchdb-part-3/ for details.
@scamo:disqus see my comment above
The discussion around the node driver Oriento has really escalated into a flame war.
There are requests where the performance of the driver should not play any role. For instance, aggregation is only one call and shortest path is 19 calls. In these cases the performance improvement from minutes to milliseconds is clearly attributable to the new server. The bulk read and write requests are a mix: part of the performance is lost in the node driver, but this is true for all databases.
BTW an update of the blog post is available, see http://www.arangodb.com/2015/07/multi-model-benchmark-round-1/