ArangoSearch Tutorial
This tutorial consists of two parts: Introduction – describing the concepts behind ArangoSearch and how it works; and Hands-on Tutorial – showing how to use ArangoSearch with practical examples using the IMDB movies dataset.
Introduction
In 3.4 we introduced a new database object called “ArangoSearch View” which is meant to be treated as another data source accessible via AQL. Search uses a special type of materialized view to provide full-text search across multiple existing collections.
The motivation behind having such view in ArangoDB is related to the fact that nowadays every single user is faced with huge amounts of semi-structured and unstructured data and when querying the data he has only a vague assumption of what he is looking for. That means when executing a query he is not only interested in datasets satisfying search criteria but particularly the most relevant ones. In order to achieve that, ArangoSearch combines two information retrieval models: boolean and generalized ranking retrieval so that each document “approved” by the boolean model gets its own rank from the ranking model.
What is ArangoSearch
Under the hood ArangoSearch view is an index powered by IResearch library combined of 2 data structures:
- Inverted index, which is designed to allow blazingly fast searches
- Columnar store, which is designed to provide fast access to arbitrary data by an internal primary key
Inverted index
In this paragraph, we’re going to have a look at how ArangoSearch inverted index conceptually looks like.
Let’s assume we have 2 documents:
1 2 |
{ "someText" : "quick brown fox jumps over the lazy dog" } // doc1 { "someText" : "quick brown dog leaps over the lazy fox in winter" } // doc2 |
As the next step, let’s split text string containing in someText
field into individual words, sort them and remove duplicates. As the result we’ll end up with the following table:
Word | Doc1 | Doc2 |
brown | x | x |
dog | x | x |
fox | x | x |
in | x | |
jumps | x | |
lazy | x | |
leaps | x | |
over | x | x |
quick | x | x |
the | x | x |
winter | x |
Now, if we want to search for documents containing any of the words: “fox”, “jumps”, we just need to find documents, in which any of the aforementioned words appears, and return them:
Word | Doc1 | Doc2 |
fox | x | x |
jumps | x | |
Total | 2 | 1 |
Naive ranking
Both documents match, but the first document has more matches than the second, so as the naive similarity measure (that just counts the number of matching terms), we can say that the first query is more relevant to the specified search criteria than the second one.
View concept
The integration of search index in ArangoDB is different compared to the other well-known databases (e.g. OrientDB, Couchbase). Usually, there is an index instance per queryable data source (e.g. collection or table). Since ArangoDB is a multi-model database, it allows you to efficiently handle graphs (consisting of multiple collections) and document collections, that’s why the view concept has been introduced.
An ArangoSearch view contains references to documents stored in different collections. This makes it possible to perform complex federated searches even over a complete graph including vertex and edge collections.
ArangoSearch brings 2 new steps to a regular ArangoDB data processing flow:
- Analysis
- Ranking (described above)

Analysis
Analysis, as the part of view DML processing flow, is meant to be treated as transformation (e.g. tokenization of the phrase in the example above) of field value in a document into a stream of “tokens”. The definition of “token” depends on the application, e.g. in case of text processing one can treat tokens as words in a phrase.
ArangoSearch analyzers are accessible in AQL via new TOKENS
function. TOKENS
function may be used in regular AQL expressions outside of ArangoSearch view context which is quite helpful, e.g. for debugging.
In 3.4 ArangoSearch ships with a set of built-in text processing analyzers for the most common languages like English, Chinese, German, Spanish, French, Swedish, Russian and more. Each analyzers performs language specific tokenization and stemming. Unfortunately there is no way to customize data analysis behavior in 3.4, namely all language-specific analyzers tokenize the value into case-sensitive word stems as per the specific locale and not discard any stopwords.
Example 1. English text analyzer
1 2 3 4 5 6 7 8 9 10 11 12 13 |
db._query('RETURN TOKENS("ArangoDB is a multi-model database", "text_en")') [ [ "arangodb", "is", "a", "multi", "model", "databas" ] ] |
Example 2. Chinese text analyzer
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
db._query("RETURN TOKENS('ArangoDB是一个多模型数据库', 'text_zh')") [ [ "arangodb", "是", "一个", "多", "模型", "数据", "库" ] ] |
Example 3. Identity analyzer
There is another built-in “identity” analyzer. It is used as a placeholder in some functions if no analyzer was explicitly specified by a user. We’ll give you more examples in the next chapters below.
“Identity” analyzer does nothing but treats input value as an atom, so no transformation is applied.
1 2 3 4 5 6 7 8 |
db._query('RETURN TOKENS("ArangoDB is a multi-model database", "identity")') [ [ "ArangoDB is a multi-model database" ] ] |
Hands-on Tutorial Part
Indexing and querying IMDB dataset
For querying examples in this tutorial we’re going to use IMDB dataset which is taken from the IMDB service http://www.imdb.com. One can follow the link to get all the instructions of how to load the data into ArangoDB.
Once data is loaded, your ArangoDB database will contain two collections:
- imdb_vertices (63027 documents)
- imdb_edges (225060 documents)
Create a view
Let’s first create your first ArangoSearch view. Please open arangosh and type:
1 |
var v = db._createView("v_imdb", "arangosearch"); |
After performing the command above, ArangoDB v_imdb
will be created in your database. To validate that you can retrieve a list of the existing views in the database.
1 |
db._views(); |
Since we omitted the 3rd argument when creating the view, ArangoSearch view was created with a default set of properties. You can inspect them by executing the following command:
1 |
v.properties() |
or
1 |
db._view("v_imdb").properties() |
Please note that links
is empty, which means that created view has no data, but it’s already queryable, e.g. the following query will return nothing:
1 |
db._query("FOR d IN v_imdb RETURN d") |
View can be dropped with one of the following commands, but that’s not the case for now:
1 |
v.drop(); |
or
1 |
db._view("v_imdb").drop() |
Link a view with a collection
In order to index some data you have to instantiate a link between the view and corresponding collection. Let’s start with creating a link definition:
1 2 3 4 |
var link = { includeAllFields: true, fields : { description : { analyzers : [ "text_en" ] } } }; |
Link definition above describes the following intention:
- We want to index every single field in every document in a linked collection (
includeAllFields : true
) - We want to apply
identity analyzer
to every field in a document except thedescription
field, which has to be indexed withtext_en
analyzer to properly tokenize and stem its data.
Let’s now establish a connection between collection imdb_vertices
and view v_imdb
:
1 |
v.properties({ links: { imdb_vertices: link }}) |
After executing the command above, links
properties should contain the following value:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
"links" : { "imdb_vertices" : { "analyzers" : [ "identity" ], "fields" : { "description" : { "analyzers" : [ "text_en" ] } }, "includeAllFields" : true, "storeValues" : "none", "trackListPositions" : false } } |
Let’s validate that all documents from linked collection are indexed by the view. The following query should return exactly the same number of documents as in the collection imdb_vertices
:
1 |
db._query("FOR d IN v_imdb COLLECT WITH COUNT INTO count RETURN count") |
Basic querying
Let’s try to find some movies:
1 |
db._query("RETURN LENGTH(FOR d IN v_imdb SEARCH d.type == 'Movie' RETURN d)") |
The query above returns 12862 documents.
We can specify search criteria to get only drama movies:
1 |
db._query("FOR d IN v_imdb SEARCH d.type == 'Movie' AND d.genre == 'Drama' RETURN d") |
The query above returns 2690 documents.
Or even get drama movies that are relatively short, between 10 and 50 minutes:
1 |
db._query("FOR d IN v_imdb SEARCH d.type == 'Movie' AND d.genre == 'Drama' AND d.runtime IN 10..50 RETURN d") |
The query above returns only 35 documents.
SEARCH vs FILTER
Curious user may note that we used SEARCH
statement instead of familiar FILTER
in the examples above, there is subtle but fundamental difference between the keywords.
In the context of ArangoSeach view FILTER
always means post-processing, it doesn’t use any indices, but rather performs a full-scan over the result produced by a corresponding view expression.
SEARCH
keyword does guarantee to use ArangoSearch indices allowing the most efficient execution plan.
SEARCH expression has the following noticeable differences compared to FILTER expressions:
- SEARCH expression, in contrast to
FILTER
, is meant to be treated as a part of theFOR
operation, not as an individual statement - There may be only one
SEARCH
expression related to a correspondingArangoSearch view
expression - Though
ArangoSearch view
contains documents having at least 1 indexed field,SEARCH
expression works only with indexed attributes (see Link properties), non-indexed attributes are treated as non-existent - Data source specific AQL extensions (e.g.
PHRASE
,STARTS_WITH
,BOOST
) are allowed only within a context ofSEARCH
expression
It’s always recommended to prefer SEARCH
expression over the FILTER
expression, that’s almost always possible.
The following query gives you an example of how SEARCH
and FILTER
expressions can be combined. It will return all drama movies within a random duration interval:
1 |
db._query("FOR d IN v_imdb SEARCH d.type == 'Movie' AND d.genre == 'Drama' FILTER d.runtime IN RAND()*10..RAND()*100 RETURN d") |
To learn more about SEARCH
expressions and its fundamental differences compared to FILTER
expression please check our AQL documentation.
Phrase search
With a phrase search one can search for documents containing an exact sentence or phrase rather than comparing a set of keywords in random order. In ArangoSearch phrase search is available via PHRASE function.
The following example gives me documents that contain phrase “Star wars” in the description:
1 |
db._query("FOR d IN v_imdb SEARCH PHRASE(d.description, 'Star wars', 'text_en') RETURN d") |
Proximity search
Proximity searching is a way to search for two or more words that occur within a certain number of words from each other.
PHRASE function described above allows performing complex proximity queries as well.
In the next example I’m looking for the word sequence to be present in the description of a movie: in <any word> galaxy
1 |
db._query("FOR d IN v_imdb SEARCH PHRASE(d.description, 'in', 1, 'galaxy', 'text_en') RETURN d") |
The following snippet gathers results of several parameterized proximity searches:
1 2 3 4 5 6 7 |
var res = new Array(); for (var i = 1; i <= 5; ++i) { res.push( db._query("FOR d IN v_imdb SEARCH PHRASE(d.description, 'in', @prox, 'galaxy', 'text_en') RETURN d", { prox: i }).toArray() ); } |
With a PHRASE function you can perform really complex proximity searches, e.g. story <any word> <any word> man <any word> <any word> undercover
:
1 |
db._query("FOR d IN v_imdb SEARCH PHRASE(d.description, 'story', 2, 'man', 2, 'undercover', 'text_en') RETURN d") |
Min match
MIN_MATCH function is another very interesting and useful ArangoSearch extensions. Essentially that is something in between of disjunction and conjunction. It accepts multiple SEARCH
expressions and numeric value denoting a minimum number of search expressions to be satisfied. The function matches documents where at least a specified number of search expressions are true.
In the following example we’re looking for 2 episodes of “Stars Wars” saga, namely “Phantom Menace” and “New hope”:
1 |
db._query("FOR d IN v_imdb SEARCH MIN_MATCH(PHRASE(d.description, 'Phantom Menace', 'text_en'), PHRASE(d.description, 'Star Wars', 'text_en'), PHRASE(d.description, 'New Hope', 'text_en'), 2) RETURN d") |
Analyzer
In last example we used PHRASE
functions 3 times with the same analyzer. The query above can be simplified with the help of special ANALYZER function:
1 |
db._query("FOR d IN v_imdb SEARCH ANALYZER(MIN_MATCH(PHRASE(d.description, 'Phantom Menace'), PHRASE(d.description, 'Star Wars'), PHRASE(d.description, 'New Hope'), 2), 'text_en') RETURN d") |
The query above will produce the same output as the previous one.
ANALYZER function does nothing but overrides analyzer in a context of SEARCH expression with another one, denoted by a specified analyzer, making it available for search functions. By default every SEARCH function that accepts analyzer as a parameter has identity analyzer in its context.
Ranking in ArangoSearch
In the beginning of the tutorial, we’ve described naive ranking similarity measure. Of course, it doesn’t work well in real-world use cases. Though understanding of how ranking works requires some advanced knowledge, in this section we’re going to give you an overview on ranking model AragnoSearch uses for measuring documents relevance.
The ranking, in general, is the process of evaluating document rank/score according to specified search criteria. For text processing, ArangoSearch uses Vector Space Model.
It postulates the following: in the space formed by the terms of the query the document vectors that are closer to a query vector are more relevant.
Practically the closeness is expressed as the cosine of the angle between two vectors, namely cosine similarity.

In other words, we treat documents and search criteria as vectors and evaluate cosine of the angle between them. Then we can sort documents by the evaluated cosine score value and finally retrieve documents that are closer (more relevant) to a specified query than the others.
Let’s consider the following example:
Assume we have 3 documents in a collection:
1 2 3 |
{ "text" : "quick dog" } // doc1 { "text" : "brown fox" } // doc2 { "text" : "quick brown fox" } // doc3 |
Now imagine we have a query:
1 |
FOR d in myView SEARCH d.text IN ["quick", "brown"] RETURN d |
For this query let’s imagine two dimensional space formed by words quick and brown

Then let’s put every document to the graph above:

From the plot above it’s clear that doc3 is the most relevant to the given query.
Ranking in AQL
In order to compute cosine, we have to evaluate vector components at first. There are a number of probability/statistical weighting models but for now, ArangoSearch provides two, probably the most famous schemes:
Under the hood both models rely on 2 main components:
Term frequency (TF) – in the simplest case defined as the number of times that term t occurs in document d. Inverse document frequency (IDF) is a measure of how much information the word provides, i.e. whether the term is common or rare across all documents.
In AQL both weighting methods are exposed via corresponding functions: BM25, TFIDF. Note that both function accept AQL corresponding loop variable as the first argument. In 3.4 these functions have the following limitation: that’s not possible to use evaluated score value in generic AQL expressions (e.g. filters or calculations), the aforementioned functions must be used in SORT statement related to an ArangoSearch view.
Let’s imagine that we’re extremely interested in a nice sci-fi movie. I want to find something that matches the following set of keywords: “amazing, action, world, alien, sci-fi, science, documental, galaxy”:
1 |
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en') LIMIT 10 RETURN d") |
Without any ranking I got “The Prime of Miss Jean Brodie” in the very beginning of result set. That’s is definitely not what I did want to watch. Nevertheless, you may get different results, that’s non-deterministic since we haven’t specified any sorting condition there.
Let’s rank result with VSM and BM25 algorithm:
1 |
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en') SORT BM25(d) LIMIT 10 RETURN d") |
After inspecting result set you note that first place is taken by “Religulous” movie. That’s again something that is not particularly relevant to our original query. The reason is the following: in the query above, we didn’t explicitly specified sorting order. Default value is ASC which means that we implicitly asked ArangoDB to show us the most irrelevant results first. Let’s fix it by explicitly specifying our intention to get the most relevant docs in the beginning of result set:
1 |
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en') SORT BM25(d) DESC LIMIT 10 RETURN d") |
And now the most relevant document is “AVPR: Aliens vs. Predator – Requiem”. Yes, that’s definitely better.
Query time relevance tuning
Another crucial point of ArangoSearch is the ability to fine-tune document scores evaluated by relevance models at query time. That functionality is exposed in AQL via BOOST function. Similarly to ANALYZER function, BOOST function accepts any other valid SEARCH expression and boost factor. It does nothing but imbues the provided boost value into SEARCH expression context making it available for scorer functions. The documents that match boosted part of search expression will get higher scores.
Let’s continue playing with sci-fi movies. The results we got in previous section are already good enough, but what if we want something more relevant to outer space, the movie that can show us the beauty of distant interstellar galaxies. For the sake of our interest to sci-fi movies, we can ask ArangoSearch to prefer “galaxy” amongst the others keywords:
1 |
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') || BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en') SORT BM25(d) DESC LIMIT 10 RETURN d") |
And the winner is “Star Trek Collection”, super relevant result indeed.
For information retrieval experts ArangoSearch provides an easy and straightforward way of fine-tuning weighting schemes at query time, e.g. BM25 accepts free coefficients as the parameters and may be turned into BM15:
1 |
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') || BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en') SORT BM25(d, 1.2, 0) DESC LIMIT 10 RETURN d") |
With BM25 we have a new champion: “The Ice Pirates” movie.