home shape

Index types and how indexes are used in ArangoDB: Part I

As in other database systems, indexes can be used in ArangoDB to speed up data retrieval queries, sometimes by many orders of magnitude. Getting the indexes set up the right way is essential for good query performance, so this is an important topic that affects most ArangoDB installations.

This is Part I of how indexes are used by ArangoDB where we discuss what types of indexes are available in the database. In Part II, we will dig deeper into how to actually add indexes to a data model and speed up specific queries. Read Part II here.

This blog post relates to ArangoDB 3.2 and 3.3 versions.

New to multi-model and graphs? Check out our free ArangoDB Graph Course.

Introduction to different types of indexes

ArangoDB provides several different index types, which partly also have storage engine-specific differences. It is useful to understand the characteristics of each index type in order to make the best use of indexes.

Primary index

Every collection created in ArangoDB will have at least the so-called primary index. The primary index will be created automatically when the collection is created, and it cannot be removed or changed, nor can it be created explicitly. The primary index responsibility is to index the values of the `_key` attribute of the collection and to make sure that document keys in the collection are actually unique.

The primary index will be used for lookups by the `_key` or `_id` attribute of the collection, but only for equality lookup queries. Here are a few example queries that will use the primary index:


FOR doc IN collection
  FILTER doc._key == 'test'
  RETURN doc

FOR doc IN collection
  FILTER doc._id == 'collection/abc' OR doc._id == 'collection/def'
  RETURN doc

FOR doc IN collection
  FILTER doc._key IN ['abc', 'def']
  RETURN doc

The primary index will also be used for operations that update, replace or remove documents by their primary key(s). It will not be used for range queries or sort operations.

The internal representation of the primary index is storage engine-specific:

  • For the MMFiles storage engine, the primary index is an in-memory index. Lookups in the primary index will have an amortized complexity of O(1) for the MMFiles engine, meaning that this index is very efficient here.
  • For the RocksDB storage engine, the primary index is a sorted on-disk index internally, so it will have worse performance than O(1).

Edge index

Every edge collection that is created in ArangoDB will also have an edge index automatically. The edge index is responsible for indexing the `_from` and `_to` values, in order to support quick lookup of connected outgoing or incoming edges for graph traversals.
Edge indexes of a collection cannot be removed or changed, nor can they be explicitly created. Edge indexes are non-unique, meaning that there can be multiple edges connected to the same document by default.

The index will be used by queries that lookup by either `_from` or `_to`. A few example queries that will use the edge index of an edge collection:


FOR doc IN edgeCollection
  FILTER doc._from == 'collection/test'
  RETURN doc
FOR doc IN OUTBOUND 'collection/test' edgeCollection
  RETURN doc

Again, this type of index will be used for equality lookups only, but not for range queries. And again, the internal representation of an edge index depends on the storage engine used:

  • For the MMFiles storage engine, the edge index is an in-memory index. Lookups in the MMFiles engine’s edge index are therefore very efficient.
  • For the RocksDB storage engine, the edge index is a sorted on-disk index, but it will have an automatic hash cache in front of it that will cache the documents connected to each edge in memory. It should, therefore, be relatively efficient after the initial cache warmup.

With the RocksDB engine, the edge index may also be used for sorting by `_from` or `_to` for certain kinds of queries.

Hash index

In contrast to the indexes we have seen so far, this type of index can be created on demand by end users. A hash index can be created on a single attribute or on multiple attributes at the same time, as needed.

Creating an index on a single attribute is a no-brainer: the index will then only support queries that do lookups by that specific attribute. However, when an index covers multiple attributes (a so-called combined index), the rules get a bit more complex. And again, the rules are storage engine-specific.

For the MMFiles engine, a hash index is an actual in-memory hash index, which is rebuilt in memory from the on-disk data when a collection is (re)loaded. As it is a hash index, the query must do equality lookups for all index attributes in order to make use of the index. Range lookups are not supported for an MMFiles hash index, nor can it be used for sorting.

When creating an index on multiple attributes, a query must do equality lookups on all index attributes in order to make use of this index.

The following examples assume there is an index present on the attributes `country` and `zip`:


// good: "country" and "zip" both used, both with equality lookups
FOR doc IN collection
  FILTER doc.country == 'DE' AND doc.zip == '50674'
  RETURN doc

The next queries will not use that hash index with the MMFiles engine, because they do not query both index attributes by equality.


// bad: "zip" not used
FOR doc IN collection
  FILTER doc.country == 'DE'
  RETURN doc

// bad: "zip" is used for non-equality lookup
FOR doc IN collection
  FILTER doc.country == 'DE' AND doc.zip >= '50000' AND doc.zip <= '50999'
  RETURN doc

With the RocksDB engine, a hash index internally is again a sorted on-disk index. The name “hash” index was kept here mainly for compatibility reasons. A hash index in RocksDB will be used for queries that do equality lookups on the index attribute(s), if either all index attributes are covered by the query or a leftmost prefix of the index attributes is covered by the query. The last used index attribute in the leftmost prefix can also be used for a
range lookup.

The following queries will all use a hash index on attributes `country` and `zip` (in this order, the order of indexes attributes will matter) with the RocksDB engine:


// good: leftmost prefix ("country") is used
FOR doc IN collection
  FILTER doc.country == 'DE'
  RETURN doc

// good: leftmost prefix ("country") is used
FOR doc IN collection
  FILTER doc.country >= 'D' AND doc.country <= 'D'
  RETURN doc

// good: both index attributes are used, leftmost prefix does an equality lookup
FOR doc IN collection
  FILTER doc.country == 'DE' AND doc.zip == '50674' 
  RETURN doc

// good: both index attributes are used, leftmost prefix does an equality lookup
FOR doc IN collection
  FILTER doc.country == 'DE' AND doc.zip >= '50000' AND doc.zip <= '50999'
  RETURN doc

The following queries will not use a hash index on `country` and `zip` (in this order) with the RocksDB engine:


// bad: no leftmost prefix of the index attributes is used
FOR doc IN collection
  FILTER doc.zip == '50674' 
  RETURN doc

// bad: leftmost prefix uses range lookup. this will only use the index for 
// filtering on "country", and do post-filtering on "zip" without the index
FOR doc IN collection
  FILTER doc.country >= 'D' AND doc.country <= 'D' AND doc.zip == '50674'
  RETURN doc

A hash index can also be used for sorting on the index attribute(s) with the RocksDB engine, provided that the SORT clause uses the same direction (either all ascending or all descending) for all index attributes. Again, the index will support sorting on a leftmost prefix of the index attributes:


FOR doc IN collection
  FILTER doc.country >= 'D' AND doc.country <= 'D' 
  SORT doc.country
  RETURN doc

FOR doc IN collection
  FILTER doc.country == 'DE' AND doc.zip >= '50000' AND doc.zip <= '50999'
  SORT doc.country, doc.zip
  RETURN doc

User-defined hash indexes can be unique or non-unique. Declaring an index unique will give it a slight bonus in the query optimizer’s heuristics. If an index cannot be declared unique, the optimizer will use an average selectivity estimate to determine how many documents will be returned by the index on average. This selectivity estimate will be referred to by the query optimizer when there are multiple indexes to choose from.

A user-defined hash index can optionally be declared sparse. Making an index sparse will make it ignore all documents for which at least one of the index attributes does not exist or has a value of `null`. In this case, such document will not make it into the index. This will keep the index size smaller in case there are many documents with the index attribute(s) either not present or `null`. However, when an index is declared sparse, the optimizer may not consider it usable in all situations.

Example:


// bad: the provided range includes `null`, but the index does not contain 
// documents for which the index attribute value is `null`
FOR doc IN collection
  FILTER doc.country <= 'F'
  RETURN doc

Declaring an index sparse will make it unsafe to use in many cases from the perspective of the query optimizer, so this option should be used with a bit of care.

Skiplist index

In the MMFiles engine, the skiplist index actually uses an in-memory skiplist behind the scenes. A skiplist is a sorted data structure, so the skiplist index is a general purpose index type that can be used to support a lot of different types of queries (equality lookups, range scans, sorting).

In the RocksDB engine, the skiplist index shares the same implementation as the hash index, so all notes about the RocksDB-based hash index apply for the RocksDB-based skiplist index too. Again, the name “skiplist” was kept for compatibility reasons only.

Regardless of the underlying storage engine, the skiplist index will be used for queries that do equality lookups on the index attribute(s), if either all index attributes are covered by the query or a leftmost prefix of the index attributes is covered by the query. The last used index attribute in the leftmost prefix can also be used for a range lookup.

The following queries will all use a skiplist index on attributes `country` and `zip` (in this order, the order of index attributes does matter):


// good: leftmost prefix ("country") is used
FOR doc IN collection
  FILTER doc.country == 'DE'
  RETURN doc

// good: leftmost prefix ("country") is used
FOR doc IN collection
  FILTER doc.country >= 'D' AND doc.country <= 'D'
  RETURN doc

// good: both index attributes are used, leftmost prefix does an equality lookup
FOR doc IN collection
  FILTER doc.country == 'DE' AND doc.zip == '50674' 
  RETURN doc

// good: both index attributes are used, leftmost prefix does an equality lookup
FOR doc IN collection
  FILTER doc.country == 'DE' AND doc.zip >= '50000' AND doc.zip <= '50999'
  RETURN doc

The following queries will not use a skiplist index on `country` and `zip` (in this order) with the RocksDB engine:


// bad: no leftmost prefix of the index attributes is used
FOR doc IN collection
  FILTER doc.zip == '50674' 
  RETURN doc

// bad: leftmost prefix uses range lookup. this will only use the index for 
// filtering on "country", and do post-filtering on "zip" without the index
FOR doc IN collection
  FILTER doc.country >= 'D' AND doc.country <= 'D' AND doc.zip == '50674'
  RETURN doc

A skiplist index can also be used for sorting on the index attribute(s), provided that the SORT clause uses the same direction (either all ascending or all descending) for all index attributes. Again, the index will support sorting on a leftmost prefix of the index attributes:


FOR doc IN collection
  FILTER doc.country >= 'D' AND doc.country <= 'D' 
  SORT doc.country
  RETURN doc

FOR doc IN collection
  FILTER doc.country == 'DE' AND doc.zip >= '50000' AND doc.zip <= '50999'
  SORT doc.country, doc.zip
  RETURN doc

Fulltext index

The fulltext index in ArangoDB can be used to split a text attribute into individual words, which will then all be inserted into the index for the document that contains them. Later on, one can query documents by individual words or a combination of words.

The fulltext index is only used in queries that make use of the AQL function `FULLTEXT`.

Provided that there is a fulltext index present on attribute `text`, the following query would make use of that fulltext index to find all documents that contain the word `bart`:


FOR doc IN FULLTEXT(collection, 'text', 'bart')
  RETURN doc

We are currently working on ArangoSearch (coming with ArangoDB 3.4) which will be a superior replacement for the existing fulltext index. It offers relevance ranking, allows indexing of multiple attributes and supports multiple languages (including Chinese & Russian). You can already try it out by downloading a 3.4 Milestone and going through ArangoSearch tutorial.

Geo index

The geo index in ArangoDB will store a geo coordinate from a configurable location attribute of documents, and can be used to efficiently find all documents that are closest to a certain geo coordinate. Additionally it supports querying documents whose location is in a circle around a certain geo coordinate.

Historically, the geo index was only used in queries that made use of the AQL functions `NEAR` (read more) and `WITHIN` (read more).

In newer versions of ArangoDB (this includes 3.3) it will also be used for queries that filter on distance between the geo coordinate covered by the index and some other coordinate.

For example, the following query will use a geo index on attributes `latitude` and `longitude` (in this order):


FOR doc IN collection 
  FILTER DISTANCE(50.9322, 6.94, doc.latitude, doc.longitude) <= 40000
  RETURN doc

A geo index can also be used to efficiently find the documents whose location is closest to a certain geo coordinate:


FOR doc IN collection 
  SORT DISTANCE(50.9322, 6.94, doc.latitude, doc.longitude)
  LIMIT 25
  RETURN doc

Note that the sort order must be ascending here in order for the order index to be used. Queries like this are most efficient if the provided `LIMIT` value is rather low.

We are currently working on significantly extending and improving geo index functionality coming with ArangoDB 3.4. It will offer more advanced querying functionality, performance optimizations and support for standard GeoJSON types (polygons, multi-polygons, polylines, intersections, etc.).

In Part II of this blog post series, we discuss how to actually add indexes to a data model and speed up specific use cases.

Jan Steemann

Jan Steemann

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

Leave a Comment





Get the latest tutorials, blog posts and news: