Efficient Data Collection with Hash Tables: ArangoDB Insights

ArangoDB 2.6 will feature an alternative hash implementation of the AQL COLLECT operation. The new implementation can speed up some AQL queries that cannot exploit indexes on the COLLECT group criteria.

This blog post provides a preview of the feature and shows some nice performance improvements. It also explains the COLLECT-related optimizer parts and how the optimizer will decide whether to use the new or the traditional implementation.

Introduction to COLLECT

A quick recap: in AQL, the COLLECT operation can be used for grouping and optionally counting values.

Here’s an example, using flight data:

FOR flight IN flights
  COLLECT from = flight._from WITH COUNT INTO count
  RETURN { from: from, count: count }

This query will iterate over all documents in the collection flights and count the number of flights per distinct _from value (origin airport). The query result will contain one entry per unique from value plus a counter for each:

[
  { "from" : "airports/ABE", "count" : 6205 },
  { "from" : "airports/ABQ", "count" : 39346 },
  { "from" : "airports/ACV", "count" : 362 },
  ...
  { "from" : "airports/YAP", "count" : 285 },
  { "from" : "airports/YKM", "count" : 879 },
  { "from" : "airports/YUM", "count" : 2275 }
]

As the COLLECT will group its result according to the specified group criteria (flight._from in the above query), it needs a way of determining the group to which each input value belongs.

Before ArangoDB 2.6, there was a single method for determining the group. Starting with ArangoDB 2.6, the query optimizer can choose between two different COLLECT methods, the sorted method and the hash method.

Sorted COLLECT method

The traditional method for determining the group values is the sorted method. It has been available in ArangoDB since the very start.

The sorted method of COLLECT requires its input to be sorted by the group criteria specified in the COLLECT statement. Because there is no guarantee that the input data are already sorted in the same way, the query optimizer will automatically insert a SORT statement into the query in front of the COLLECT. In case there is a sorted index present on the group criteria attributes, the optimizer may be able to optimize away the SORT again. If there is no sorted index present on the group criteria attributes, the SORT will remain in the execution plan.

Here is the execution plan for the above query using the sorted method of COLLECT. We can see the extra SortNode with id #7 being added by the optimizer in front of the COLLECT:

[Screenshot: execution plan showing the extra SortNode (id #7) in front of the AggregateNode]

The sorted method of COLLECT is efficient because it can write out a group result whenever an input value starts a new group. It therefore does not need to keep the whole COLLECT result in memory. The downside of the sorted method is that it requires its input to be sorted, which means an extra SORT has to be added for input that is not already sorted appropriately.
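To illustrate the idea, here is a minimal JavaScript sketch (not ArangoDB's actual implementation) of a sorted, counting COLLECT: because the input is sorted by the group value, a group is complete as soon as the value changes, and its result can be emitted immediately:

```javascript
// Sketch of the sorted COLLECT method. The input must already be sorted by
// the group criterion, so each group can be written out as soon as a
// different value appears. Only one group is kept in memory at a time.
function sortedCollect(sortedValues) {
  const result = [];
  let current = null; // the group value we are currently counting
  let count = 0;
  for (const value of sortedValues) {
    if (current !== null && value !== current) {
      // a new group starts, so the previous group result can be emitted
      result.push({ from: current, count: count });
      count = 0;
    }
    current = value;
    ++count;
  }
  if (current !== null) {
    result.push({ from: current, count: count }); // flush the last group
  }
  return result;
}

console.log(sortedCollect(["ABE", "ABE", "ABQ", "ACV", "ACV", "ACV"]));
// → [ { from: 'ABE', count: 2 }, { from: 'ABQ', count: 1 }, { from: 'ACV', count: 3 } ]
```

Note that the sketch relies entirely on the input being pre-sorted; feeding it unsorted input would produce one entry per run of equal values rather than one per group.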

Hash COLLECT method

Since ArangoDB 2.6, the query optimizer can also employ the hash method for COLLECT. The hash method works by assigning the input values of the COLLECT to slots in a hash table. It does not require its input to be sorted. Because the entries in the hash table do not have a particular order, the query optimizer will add a post-COLLECT SORT statement. With this extra sort of the COLLECT result, the optimizer ensures that the output of the sorted COLLECT will be the same as the output of the hash COLLECT.

Here is the execution plan for the above query when using the hash method of COLLECT. Here we can see the extra SortNode with id #7 being added post-COLLECT:

[Screenshot: execution plan showing the extra SortNode (id #7) after the AggregateNode]

The hash method is beneficial because it does not require sorted input and thus no extra SORT step in front. However, as the input is not sorted, it is never clear when a group is actually finished. The hash method therefore needs to build the whole COLLECT result in memory until the input is exhausted. Then it can safely write out all group results. Additionally, the result of the hash COLLECT is unsorted. Therefore the optimizer will add a post-COLLECT sort to ensure the result will be identical to a sorted COLLECT.
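Again as a rough JavaScript sketch (not ArangoDB's actual implementation), the hash method can be pictured as a hash table of counters that is only written out once the input is exhausted, followed by the post-COLLECT sort:

```javascript
// Sketch of the hash COLLECT method. Input may arrive in any order, but the
// whole set of groups must be kept in memory until the input is exhausted.
function hashCollect(values) {
  const groups = new Map(); // group value -> count
  for (const value of values) {
    groups.set(value, (groups.get(value) || 0) + 1);
  }
  // the hash table entries have no particular order, so the result is
  // sorted afterwards to match the output of the sorted method
  return Array.from(groups, ([from, count]) => ({ from, count }))
    .sort((a, b) => (a.from < b.from ? -1 : a.from > b.from ? 1 : 0));
}

console.log(hashCollect(["ACV", "ABE", "ABQ", "ABE", "ACV", "ACV"]));
// → [ { from: 'ABE', count: 2 }, { from: 'ABQ', count: 1 }, { from: 'ACV', count: 3 } ]
```

The final `.sort()` corresponds to the post-COLLECT SORT the optimizer adds; if the caller does not care about order, it could be dropped, which is exactly what the SORT null hint described later achieves.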

Which method will be used when?

The query optimizer will always take the initial query plan and specialize its COLLECT nodes to use the sorted method. It will also add the pre-COLLECT SORT to the original plan.

In addition, for every COLLECT statement not using an INTO clause, the optimizer will create a plan variant that uses the hash method. In that plan variant, the post-COLLECT SORT will be added. Note that a WITH COUNT INTO is still ok here, but that using a regular INTO clause will disable the usage of the hash method:

FOR flight IN flights
  COLLECT from = flight._from INTO allFlights
  RETURN { from: from, flights: allFlights }

If more than one COLLECT method can be used for a query, the created plans will be shipped through the regular optimization pipeline. In the end, the optimizer will pick the plan with the lowest estimated total cost, just as it does for all other queries.

The hash variant does not require an up-front sort of the COLLECT input, and will thus be preferred over the sorted method if the optimizer estimates many input elements for the COLLECT and cannot use an index to process them in already sorted order. In this case, the optimizer will estimate that post-sorting the result of the hash COLLECT will be more efficient than pre-sorting the input for the sorted COLLECT.

The main assumption behind this estimation is that the result of any COLLECT statement will contain at most as many elements as there are input elements to it. Therefore, the output of a COLLECT is likely to be smaller (in terms of rows) than its input, making post-sorting more efficient than pre-sorting.

If there is a sorted index on the COLLECT group criteria that the optimizer can exploit, the optimizer will pick the sorted method because, thanks to the index, it can optimize away the pre-COLLECT sort, leaving no sorts in the final execution plan.

To override the optimizer decision, COLLECT statements now have an OPTIONS modifier. This modifier can be used to force the optimizer to use the sorted variant:

FOR flight IN flights
  COLLECT from = flight._from WITH COUNT INTO count OPTIONS { method: "sorted" }
  RETURN { from: from, count: count }

Note that specifying hash as the method will not force the optimizer to use the hash method. The reason is that the hash variant cannot be used for all queries (only COLLECT statements without an INTO clause are eligible). If OPTIONS are omitted or any method other than sorted is specified, the optimizer will ignore the option and use its regular cost estimations.

Understanding execution plans

Which method is actually used in a query can be found out by explaining the query and inspecting its execution plan.

A COLLECT is internally handled by an object called AggregateNode, so that is what we have to look for. In the above screenshots, the AggregateNodes are tagged with either hash or sorted. This can also be checked programmatically by looking at the aggregationOptions.method attribute in the JSON result of an explain().

Here is some example code to extract this information, already filtered down to the AggregateNodes of the query:

var query = `
  FOR flight IN flights
  COLLECT from = flight._from WITH COUNT INTO count
  RETURN { from: from, count: count }
`;
var stmt = db._createStatement(query);
var plan = stmt.explain().plan;
plan.nodes.filter(function(node) {
  return node.type === 'AggregateNode';
});

For the above query, this will produce something like this:

[
  {
    "type" : "AggregateNode",
    ...
    "aggregationOptions" : {
      "method" : "hash"
    }
  }
]

Here we can see that the query is using the hash method.

Optimizing away post-COLLECT sorts

If a query uses the hash method for a COLLECT but the sort order of the COLLECT result is irrelevant to the user, the user can provide a hint to the optimizer to remove the post-COLLECT sort.

This can be achieved by simply appending a SORT null after the original COLLECT statement. Here we can see that this removes the post-COLLECT sort:

execution-plan
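Applied to the example query from the beginning of this post, the hinted version looks like this:

```
FOR flight IN flights
  COLLECT from = flight._from WITH COUNT INTO count
  SORT null
  RETURN { from: from, count: count }
```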

Performance improvements

The improvements achievable by using the hash method instead of the sorted method obviously depend on whether there are appropriate indexes present for the group criteria. If an index can be exploited, the sorted method may be just fine. However, there are cases when no indexes are present, for example, when running arbitrary ad-hoc queries or when indexes are too expensive (indexes need to be maintained on insert/update/remove operations and will also use memory).

Following are a few comparisons of the sorted and the hash methods in case no indexes can be used.

Here’s the setup for the test data. This generates 1M documents with both unique and repeating string and numeric values. For the non-unique values, we’ll use 20 different categories:

var test = db._create("test");
for (var i = 0; i < 1000000; ++i) {
  test.insert({
    uniqueNumber: i,                     // 1M distinct numeric values
    uniqueString: "test" + i,            // 1M distinct string values
    repeatingNumber: (i % 20),           // 20 distinct numeric values
    repeatingString: "test" + (i % 20)   // 20 distinct string values
  });
}

Now let’s run the following query on the data and measure its execution time:

FOR v IN test
  COLLECT value = v.@attribute WITH COUNT INTO count
  RETURN { value: value, count: count }

The worst case is when the COLLECT will produce as many output rows as there are input rows. This will happen when using a unique attribute as the grouping criterion. We’ll run tests on both numeric and string values.

Here are the execution times for unique inputs. It can be seen that the hash method here will be beneficial if the post-COLLECT sort can be optimized away. As demonstrated above, this can be achieved by adding an extra SORT null after the COLLECT statement. If the post-COLLECT sort is not optimized away, it will make the hash method a bit more expensive than the sorted method:

collect method       @attribute                duration
-------------------------------------------------------
sorted               uniqueNumber               11.92 s
hash                 uniqueNumber               13.40 s
hash (sort null)     uniqueNumber               10.13 s
sorted               uniqueString               22.04 s
hash                 uniqueString               27.35 s
hash (sort null)     uniqueString               12.12 s

Now let’s check the results when we group on an attribute that is non-unique. Following are the results for numeric and string attributes with 20 different categories each:

collect method       @attribute                duration
-------------------------------------------------------
sorted               repeatingNumber             5.56 s
hash                 repeatingNumber             0.94 s
hash (sort null)     repeatingNumber             0.94 s
sorted               repeatingString            10.56 s
hash                 repeatingString             1.09 s
hash (sort null)     repeatingString             1.09 s

In these cases, the result of the COLLECT will be much smaller than its input (we’ll only get 20 result rows out instead of 1M). Therefore the post-COLLECT sort for the hash method will not make any difference, but the pre-COLLECT sort for the sorted method will still need to sort 1M input values. This is also the reason why the hash method is significantly faster here.

As usual, your mileage may vary, so please run your own tests.

Jan Steemann

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

7 Comments

  1. CoDEmanX on April 24, 2015 at 4:08 pm

    Will there be a performance gain if you use WITH COUNT INTO and sort by count on repeating numbers and strings?

    • jsteemann on April 27, 2015 at 11:24 am

      It is definitely possible though not guaranteed. I’ll try to explain when to expect a speedup and when not.

      I understood the goal is to sort by the count values. Then the COLLECT will require a post-SORT by count in the AQL query. This SORT cannot be optimized away and it cannot use any index, because its input values are generated during the grouping phase of the `COLLECT`.
      However, if the input values to the COLLECT repeat a lot (as you mentioned), the output of the COLLECT should be much smaller than its input. And the smaller the output of the COLLECT is, the less expensive a SORT on it will become.

      That means if your query is something like `FOR doc IN collection COLLECT value = doc.value WITH COUNT INTO count SORT count DESC RETURN { value: value, count: count }` and there are many repeating values in `doc.value` and no sorted index on this attribute, the query can employ the hash table collect and run faster than before, when it required the input of the COLLECT to be sorted by `doc.value`. This sorting would have been the most expensive part of the query, and the goal of the hash table collect is to avoid it. So yes, here you should see a speedup.

      When there is already a sorted index on the group criterion (i.e. `doc.value` in the previous query), then the optimizer will likely use it to avoid the pre-COLLECT sorting altogether. In this case, it won’t use the hash table collect and query execution time shouldn’t change too much (at least not directly related to COLLECT and SORT, we have applied some other optimizations elsewhere that might still speed it up).

Finally, if after a COLLECT there are as many output values as there are input values, then sorting by the final count values will still be expensive (though unavoidable). But you might still save the pre-COLLECT sort due to the optimizer using the hash table version of COLLECT if there is no index on the group criterion.

      • CoDEmanX on May 5, 2015 at 10:54 am

        Thanks Jan for the in-depth explanation! I should definitely see an improved performance in my case, because there were no indexes at all.

I aggregated all fields taken from an RDBMS, so flat documents with a limited number of fields per collection, but without indexes, since generating them would have taken a lot of time and memory (and a script to set them all up, >10k). The time consumption was still acceptable, but should be a matter of seconds instead of hours now.

        • jsteemann on May 8, 2015 at 4:19 pm

          Fingers crossed! Please let us know if performance of your queries improves with ArangoDB 2.6. Thanks!

          • CoDEmanX on May 8, 2015 at 5:45 pm

            Will there be a 2.6 beta build for windows anytime soon? I’d like to test it sooner than later, but I’m a bit afraid of the windows compilation process. Linux builds are way easier, but would require the source data to be available inside the vm if I’m not mistaken.



          • CoDEmanX on June 1, 2015 at 11:24 pm

            I just re-ran my script to aggregate all fields with a minor change to the AQL query to utilize WITH COUNT INTO and on a slightly different data source (negligible though), and it’s 120x faster!!! (from 8 hours down to 4 minutes). I did/do not use indexes, since I would need them on every attribute on all hundred collections. This is an awesome improvement. Thanks, Jan!



          • jsteemann on June 2, 2015 at 8:51 am

            Thanks for the feedback. A 120x improvement looks good! Props also go to @weinberger for suggesting the hash collect method at all!


