Geo Index Implementation

This section will cover the MMFiles-based geo index. The algorithm is optimized for in-memory access and good CPU cache utilization. The main goal for our geo queries is to reject as many distant candidate points as quickly as possible.

One limitation of a purely geostring-based approach shows up when querying for points near a target (see blog post Part I). Points that are close together on the surface can end up with entirely different geostring prefixes and therefore cannot be scanned without seeks. To optimize for nearest-neighbor queries, we implemented a type of Metric Tree.

To consistently achieve fast queries, the Hilbert geostrings are combined with a binary search tree; the current implementation uses an AVL tree structure.

The AVL tree strictly balances itself to guarantee O(log n) lookup and insertion performance: the heights of the two subtrees of every node differ by at most one. In each leaf of the AVL tree we store a list of at most six points before the leaf is split up again. The points in each leaf should be close together on the Hilbert curve; the basic assumption is always that geostrings which are close will also be close-ish on the earth's surface. Search queries profit from the tree structure, which allows them to quickly narrow down the set of candidate points.
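A leaf of this shape could be sketched as follows, with an overflowing leaf split at the median geostring so that each half stays contiguous on the Hilbert curve. The names, capacity handling, and split strategy are illustrative, not ArangoDB's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a tree leaf keyed by Hilbert geostrings.
constexpr std::size_t kLeafCapacity = 6;

struct Leaf {
  std::vector<std::uint64_t> geostrings;  // sorted Hilbert values

  // Insert keeping the list sorted; returns true if the leaf overflowed
  // and the caller must split it.
  bool insert(std::uint64_t g) {
    geostrings.insert(
        std::upper_bound(geostrings.begin(), geostrings.end(), g), g);
    return geostrings.size() > kLeafCapacity;
  }

  // Splitting at the median keeps points that are adjacent on the Hilbert
  // curve (and therefore usually close on the surface) in the same leaf.
  Leaf split() {
    Leaf right;
    auto mid = geostrings.begin() + geostrings.size() / 2;
    right.geostrings.assign(mid, geostrings.end());
    geostrings.erase(mid, geostrings.end());
    return right;
  }
};
```

The median split is one plausible policy; it keeps both halves contiguous on the curve so later range scans stay seek-free.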

To support typical CRUD operations in an index, we need to be able to efficiently insert and delete elements. This means that for every new element we need to find the right leaf in the tree. A comparison function is required that allows us to navigate the index tree quickly, i.e. to decide, starting from the root node, whether to walk down into the left or the right child.

ArangoDB's geo index achieves this by storing the distance to a set of global fixpoints at every internal node (pot). These fixpoints F_1, …, F_6 are points on the surface of the earth, currently the north and south poles, as well as equidistant points on the equator. There is nothing inherently special about this choice; the only requirement is that the fixpoints should have approximately the same distance from each other.
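To make this concrete, here is one plausible choice of six such fixpoints (the vertices of an octahedron) as unit vectors in earth-centered coordinates. This is a sketch under that assumption; the exact set ArangoDB ships may differ:

```cpp
#include <array>
#include <cmath>

struct Vec3 { double x, y, z; };

// One possible fixpoint arrangement: the two poles plus four points spread
// evenly along the equator. Every fixpoint is 90 or 180 degrees away from
// every other one, i.e. approximately equally spaced.
const std::array<Vec3, 6> kFixPoints = {{
    {0.0, 0.0, 1.0},   // north pole
    {0.0, 0.0, -1.0},  // south pole
    {1.0, 0.0, 0.0},   // equator, 0 deg longitude
    {0.0, 1.0, 0.0},   // equator, 90 deg east
    {-1.0, 0.0, 0.0},  // equator, 180 deg
    {0.0, -1.0, 0.0},  // equator, 90 deg west
}};

// Angle in radians between two unit vectors, via the dot product.
double angleBetween(const Vec3& a, const Vec3& b) {
  return std::acos(a.x * b.x + a.y * b.y + a.z * b.z);
}
```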

Each pot contains the maximum distance of all child pots or points to each of the six fixpoints, which defines an area on the earth's surface. These maximum distances essentially describe a geometric shape that changes dynamically to fit the contained coordinates. Unlike the R-tree, the VP-tree, or other structures typically used for geo indexes, these bounds are not based on a fixed geometry such as a bounding rectangle or circle.
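These per-fixpoint maxima yield a lower bound on the distance from a query point to anything inside a pot, via the reverse triangle inequality. A minimal sketch of that pruning test, with hypothetical names, assuming true metric distances rather than squared ones:

```cpp
#include <algorithm>
#include <array>
#include <cmath>

constexpr int kNumFixPoints = 6;

// Per-subtree bound data: the maximum distance from any contained point to
// each fixpoint (hypothetical layout, not ArangoDB's actual struct).
struct Pot {
  std::array<double, kNumFixPoints> maxDistToFix;
};

// Reverse triangle inequality: for any point p in the pot,
//   d(q, p) >= d(q, F_i) - d(p, F_i) >= d(q, F_i) - maxDistToFix[i],
// so the largest such value lower-bounds the distance from the query q to
// the closest point the subtree could possibly contain.
double lowerBound(const std::array<double, kNumFixPoints>& queryDistToFix,
                  const Pot& pot) {
  double bound = 0.0;
  for (int i = 0; i < kNumFixPoints; ++i) {
    bound = std::max(bound, queryDistToFix[i] - pot.maxDistToFix[i]);
  }
  return bound;
}

// Skip the whole subtree if even its closest possible point lies outside
// the current search radius.
bool canPrune(const std::array<double, kNumFixPoints>& queryDistToFix,
              const Pot& pot, double searchRadius) {
  return lowerBound(queryDistToFix, pot) > searchRadius;
}
```

A nearest-neighbor search would shrink `searchRadius` as better candidates are found, pruning ever more subtrees.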

The distance calculation from a coordinate to each of the fixpoints can be made very fast, because we do not actually need to compute accurate distances. A measure like (x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2 (the squared Euclidean distance between the two points in 3D space) is sufficient to compare distances on an approximate sphere. Additionally, distances are internally stored as 16-bit integers to save space.
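As a sketch of both tricks: project coordinates onto the unit sphere, compare by squared chord (Euclidean) distance, and quantize into 16 bits. The squared chord length is monotone in the great-circle distance, so comparisons stay correct; the specific 16-bit scaling here is an assumption, not the actual encoding:

```cpp
#include <cmath>
#include <cstdint>

struct Vec3 { double x, y, z; };

// Project (lat, lon) in degrees onto the unit sphere.
Vec3 toUnitSphere(double latDeg, double lonDeg) {
  const double d2r = std::acos(-1.0) / 180.0;  // degrees to radians
  double lat = latDeg * d2r, lon = lonDeg * d2r;
  return {std::cos(lat) * std::cos(lon),
          std::cos(lat) * std::sin(lon),
          std::sin(lat)};
}

// Squared chord distance: no square root or trigonometry needed, yet it
// orders points exactly like the great-circle distance does.
double squaredChordDistance(const Vec3& a, const Vec3& b) {
  double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return dx * dx + dy * dy + dz * dz;
}

// chord^2 on the unit sphere lies in [0, 4], so it fits into 16 bits after
// scaling (illustrative scaling, not ArangoDB's exact format).
std::uint16_t quantizeDistance(double sqChord) {
  return static_cast<std::uint16_t>(sqChord / 4.0 * 65535.0);
}
```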

Performance Comparison

Just for fun I will benchmark our code against an implementation of the Vantage-Point Tree spatial index data structure. The VP-Tree is a space partitioning data structure that allows for efficient querying of nearest neighbors.

VP-Tree Partitioning

This benchmark is not intended to be fully conclusive; it just serves to give an idea of the performance characteristics of our algorithm versus an implementation of a well-known algorithm for the same problem. The diagrams below show how long it takes to create both indexes and fill them with points, as well as the total duration of searching for the 256 nearest neighbors of 500 randomly chosen target coordinates. As you can see, the ArangoDB data structure can be filled faster than the VP-Tree and has comparable query performance.

Adding points to Data-Structure

Duration of 500 nearest neighbor queries for 500 random points

The code uses an in-memory version of ArangoDB's geo index structure, which is used within our MMFiles storage engine. The VP-Tree implementation used is from here and is a fairly straightforward adaptation. The benchmark performs nearest-neighbor queries for randomly selected locations on different kinds of geolocation sample data:
There are two kinds of datasets. The numbered ones (10.000, 100.000, …) are filled with randomly generated geo-coordinates, whereas the world cities dataset can be found here and contains (unsurprisingly) the names and locations of all the cities in the world.

The source code of the benchmark is available on GitHub. It should run on any Unix-like system with a modern C++ compiler. For the ArangoDB part of the benchmark, the relevant code looks like this:
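The snippet itself is not reproduced in this excerpt. As a rough stand-in, a self-contained harness with the same overall shape (a timed fill phase, then a timed batch of nearest-neighbor queries for random targets) might look like the following; the brute-force `nearestNeighbors` function and all names here are invented placeholders, not ArangoDB's API:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

struct Point { double lat, lon; };

// Brute-force stand-in for the index under test (a real run would insert
// into the geo index instead). Uses a planar squared distance, which is
// enough to give the harness something to measure.
std::vector<Point> nearestNeighbors(const std::vector<Point>& pts, Point q,
                                    std::size_t count) {
  std::vector<Point> result(pts);
  auto k = std::min(count, result.size());
  std::partial_sort(result.begin(), result.begin() + k, result.end(),
                    [q](Point a, Point b) {
                      auto d = [q](Point p) {
                        double dx = p.lat - q.lat, dy = p.lon - q.lon;
                        return dx * dx + dy * dy;
                      };
                      return d(a) < d(b);
                    });
  result.resize(k);
  return result;
}

// Returns {fill seconds, query seconds} for numPoints inserts followed by
// numQueries k-nearest-neighbor lookups at random targets.
std::pair<double, double> runBenchmark(std::size_t numPoints,
                                       std::size_t numQueries,
                                       std::size_t k) {
  std::mt19937 rng(42);  // fixed seed for reproducibility
  std::uniform_real_distribution<double> lat(-90.0, 90.0), lon(-180.0, 180.0);

  auto t0 = std::chrono::steady_clock::now();
  std::vector<Point> index;
  for (std::size_t i = 0; i < numPoints; ++i)
    index.push_back({lat(rng), lon(rng)});
  auto t1 = std::chrono::steady_clock::now();

  for (std::size_t i = 0; i < numQueries; ++i)
    nearestNeighbors(index, {lat(rng), lon(rng)}, k);
  auto t2 = std::chrono::steady_clock::now();

  using secs = std::chrono::duration<double>;
  return {secs(t1 - t0).count(), secs(t2 - t1).count()};
}
```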

For some more information you can also watch this great talk by my colleague Richard.

Roadmap For Extended Geo Support In ArangoDB

The integration of the geo_cursor was already a good step forward for ArangoDB, but of course there is still more useful functionality to implement.

As mentioned above, the current geo index is optimized for our MMFiles storage engine but not yet for our new RocksDB engine, which has been available since ArangoDB 3.2. More and more users are switching to RocksDB as their storage engine, and we will integrate an optimized solution to support efficient geo queries with this engine as well. Work on this has already started.

Of course, speed is not everything, so we also want to provide a broader set of geo functionality by integrating full GeoJSON support, including polygons, multi-polygons and other geometry primitives.

With these functionalities one can run more complex queries and build, for example, location-aware recommendation engines by combining the graph data model with geo-location aspects, or use multiple data models. For instance, in the age of self-driving cars, one could find the nearest available maintenance team (geo query) with the right permission (graph model) to repair a given problem (sent automatically to the database as, e.g., a JSON document or key/value pair). We have already started this integration and plan to release GeoJSON functionality with ArangoDB 3.4 (early 2018).