home shape

Getting Unique Values: Efficient Data Retrieval in ArangoDB

While paging through the issues in the ArangoDB issue tracker I came across issue #987, titled “Trying to get distinct document attribute values from a large collection fails”.

The issue was opened around 10 months ago when ArangoDB 2.2 was around. We improved AQL performance somewhat since then, so I was eager to see how the query would perform in ArangoDB 2.6, especially when comparing it to 2.2.

For reproduction I quickly put together some example data to run the query on:

var db = require("org/arangodb").db; 
var c = db._create("test"); 
for (var i = 0; i < 4 * 1000 * 1000; ++i) {
  c.save({ _key: "test" + i, value: (i % 100) }); 
}
require("internal").wal.flush(true, true);

This produces a collection named test with 4 million documents. Each document has a numeric value attribute, which in total has 100 unique values. I remembered from a conversation with the guy that opened the issue that the number of distinct values was 100 or even slightly lower. I didn’t bother to create an index on the value attribute, which might have sped up the query.

With data available, it was time to run the query and measure its execution time:

var time = require("internal").time;
var start = time();
db._query("FOR doc IN test COLLECT value = doc.value RETURN value");
time() - start;

Running this in 2.2.7 took 3 minutes and 18 seconds before bursting with the following error message:

#
# Fatal error in CALL_AND_RETRY_2
# Allocation failed - process out of memory
#

In the 2.2 branch AQL queries were fully transformed to JavaScript and executed using V8. Obviously that didn’t work well with large collections. That was one of the reasons why version 2.3 saw a major rewrite of the AQL engine.

As a consequence, running the query in 2.3 (2.3.5) worked fine. Execution took around 28 seconds. The same was true for 2.4 (2.4.8) and 2.5 (2.5.5).

Finally, running the query in 2.6.0 completed in just 3.2 seconds.

The reasons for the speedup are the optimizations done for COLLECT (see COLLECTing with a hash table, the Return value optimization for AQL and some minor optimizations within AQL that didn’t get a pretty working title.

Looks like in sum all the optimizations put into 2.6 really pay out.

Jan Steemann

Jan Steemann

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

Leave a Comment





Get the latest tutorials, blog posts and news: