ArangoSearch

ArangoSearch Tutorial

This tutorial consists of two parts: Introduction – describing the concepts behind ArangoSearch and how it works; and Hands-on Tutorial – showing how to use ArangoSearch with practical examples using IMDB dataset.

Introduction

In 3.4 we’re introducing a new database object called “ArangoSearch View” which is meant to be treated as another data source accessible via AQL. Search uses a special type of materialized view to provide full-text search across multiple existing collections.

The motivation behind having such view in ArangoDB is related to the fact that nowadays every single user is faced with huge amounts of semi-structured and unstructured data and when querying the data he has only a vague assumption of what he is looking for. That means when executing a query he is not only interested in datasets satisfying search criteria but particularly the most relevant ones. In order to achieve that, ArangoSearch combines two information retrieval models: boolean and generalized ranking retrieval so that each document “approved” by the boolean model gets its own rank from the ranking model.

What is ArangoSearch

Under the hood ArangoSearch view is an index powered by IResearch library combined of 2 data structures:

  • Inverted index, which is designed to allow blazingly fast searches
  • Columnar store, which is designed to provide fast access to arbitrary data by an internal primary key

Inverted index

In this paragraph, we’re going to have a look at how ArangoSearch inverted index conceptually looks like.

Let’s assume we have 2 documents:

As the next step, let’s split text string containing in someText field into individual words, sort them and remove duplicates. As the result we’ll end up with the following table:

Word Doc1 Doc2
brown x x
dog x x
fox x x
in x
jumps x
lazy x
leaps x
over x x
quick x x
the x x
winter x

Now, if we want to search for documents containing any of the words: “fox”, “jumps”, we just need to find documents, in which any of the aforementioned words appears, and return them:

Word Doc1 Doc2
fox x x
jumps x
Total 2 2

Naive ranking

Both documents match, but the first document has more matches than the second, so as the naive similarity measure (that just counts the number of matching terms), we can say that the first query is more relevant to the specified search criteria than the second one.

View concept

The integration of search index in ArangoDB is different compared to the other well-known databases (e.g. OrientDB, Couchbase). Usually, there is an index instance per queryable data source (e.g. collection or table). Since ArangoDB is a multi-model database, it allows you to efficiently handle graphs (consisting of multiple collections) and document collections, that’s why the view concept has been introduced.

ArangoSearch view doesn’t store any data but references to actual documents (in columnar store). A single ArangoSearch view contains documents coming from different collections making it possible to perform complex federated searches even over the whole graph.

ArangoSearch brings 2 new steps to a regular ArangoDB data processing flow:

  • Analysis
  • Ranking (described above)
ArangoSearch - The View Concept

Analysis

Analysis, as the part of view DML processing flow, is meant to be treated as transformation (e.g. tokenization of the phrase in the example above) of field value in a document into a stream of “tokens”. The definition of “token” depends on the application, e.g. in case of text processing one can treat tokens as words in a phrase.

ArangoSearch analyzers are accessible in AQL via new TOKENS function. TOKENS function may be used in regular AQL expressions outside of ArangoSearch view context which is quite helpful, e.g. for debugging.

In 3.4 ArangoSearch ships with a set of built-in text processing analyzers for the most common languages like English, Chinese, German, Spanish, French, Swedish, Russian and more. Each analyzers performs language specific tokenization and stemming. Unfortunately there is no way to customize data analysis behavior in 3.4, namely all language-specific analyzers tokenize the value into case-sensitive word stems as per the specific locale and not discard any stopwords.

Example 1. English text analyzer

Example 2. Chinese text analyzer

Example 3. Identity analyzer

There is another built-in “identity” analyzer. It is used as a placeholder in some functions if no analyzer was explicitly specified by a user. We’ll give you more examples in the next chapters below.

“Identity” analyzer does nothing but treats input value as an atom, so no transformation is applied.

Hands-on Tutorial Part

Indexing and querying IMDB dataset

For querying examples in this tutorial we’re going to use IMDB dataset which is taken from the IMDB service http://www.imdb.com. One can follow the link to get all the instructions of how to load the data into ArangoDB.

Note: one can use https://github.com/KinoLien/gitzip to download individual folders from github

Once data is loaded, your ArangoDB database will contain two collections:

  • imdb_vertices (63027 documents)
  • imdb_edges (225060 documents)

Create a view

Let’s first create your first ArangoSearch view. Please open arangosh and type:

After performing the command above, ArangoDB v_imdb will be created in your database. To validate that you can retrieve a list of the existing views in the database.

Since we omitted the 3rd argument when creating the view, ArangoSearch view was created with a default set of properties. You can inspect them by executing the following command:

or

Please note that links is empty, which means that created view has no data, but it’s already queryable, e.g. the following query will return nothing:

View can be dropped with one of the following commands, but that’s not the case for now:

or

Note: Please consult our documentation to get comprehensive description of ArangoSearch view properties

Link a view with a collection

In order to index some data you have to instantiate a link between the view and corresponding collection. Let’s start with creating a link definition:

Link definition above describes the following intention:

  • We want to index every single field in every document in a linked collection (includeAllFields : true)
  • We want to apply identity analyzer to every field in a document except the description field, which has to be indexed with text_en analyzer to properly tokenize and stem its data.

Let’s now establish a connection between collection imdb_vertices and view v_imdb:

After executing the command above, links properties should contain the following value:

Note: Note: Please consult our documentation to get comprehensive description of ArangoSearch link properties

Let’s validate that all documents from linked collection are indexed by the view. The following query should return exactly the same number of documents as in the collection imdb_vertices:

Basic querying

Let’s try to find some movies:

The query above returns 12862 documents.

We can specify search criteria to get only drama movies:

The query above returns 2690 documents.

Or even get drama movies that are relatively short, between 10 and 50 minutes:

The query above returns only 35 documents.

SEARCH vs FILTER

Curious user may note that we used SEARCH statement instead of familiar FILTER in the examples above, there is subtle but fundamental difference between the keywords.

In the context of ArangoSeach view FILTER always means post-processing, it doesn’t use any indices, but rather performs a full-scan over the result produced by a corresponding view expression.

SEARCH keyword does guarantee to use ArangoSearch indices allowing the most efficient execution plan.

SEARCH expression has the following noticeable differences compared to FILTER expressions:

  • SEARCH expression, in contrast to FILTER, is meant to be treated as a part of the FOR operation, not as an individual statement
  • There may be only one SEARCH expression related to a corresponding ArangoSearch view expression
  • Though ArangoSearch view contains documents having at least 1 indexed field, SEARCH expression works only with indexed attributes (see Link properties), non-indexed attributes are treated as non-existent
  • Data source specific AQL extensions (e.g. PHRASE, STARTS_WITH, BOOST) are allowed only within a context of SEARCH expression

It’s always recommended to prefer SEARCH expression over the FILTER expression, that’s almost always possible.

The following query gives you an example of how SEARCH and FILTER expressions can be combined. It will return all drama movies within a random duration interval:

To learn more about SEARCH expressions and its fundamental differences compared to FILTER expression please check our AQL documentation.

Phrase search

With a phrase search one can search for documents containing an exact sentence or phrase rather than comparing a set of keywords in random order. In ArangoSearch phrase search is available via PHRASE function.

The following example gives me documents that contain phrase “Star wars” in the description:

Proximity search

Proximity searching is a way to search for two or more words that occur within a certain number of words from each other.
PHRASE function described above allows performing complex proximity queries as well.

In the next example I’m looking for the word sequence to be present in the description of a movie: in <any word> galaxy

The following snippet gathers results of several parameterized proximity searches:

With a PHRASE function you can perform really complex proximity searches, e.g. story <any word> <any word> man <any word> <any word> undercover:

Min match

MIN_MATCH function is another very interesting and useful ArangoSearch extensions. Essentially that is something in between of disjunction and conjunction. It accepts multiple SEARCH expressions and numeric value denoting a minimum number of search expressions to be satisfied. The function matches documents where at least a specified number of search expressions are true.

In the following example we’re looking for 2 episodes of “Stars Wars” saga, namely “Phantom Menace” and “New hope”:

Analyzer

In last example we used PHRASE functions 3 times with the same analyzer. The query above can be simplified with the help of special ANALYZER function:

The query above will produce the same output as the previous one.

ANALYZER function does nothing but overrides analyzer in a context of SEARCH expression with another one, denoted by a specified analyzer, making it available for search functions. By default every SEARCH function that accepts analyzer as a parameter has identity analyzer in its context.

Ranking in ArangoSearch

Note: due to the restriction in cluster-wide ArangoSearch views in 3.4 one may get different query results compared to the given below. To achieve the exactly same results, please ensure you’re using single server or cluster collection with only 1 shard.

In the beginning of the tutorial, we’ve described naive ranking similarity measure. Of course, it doesn’t work well in real-world use cases. Though understanding of how ranking works requires some advanced knowledge, in this section we’re going to give you an overview on ranking model AragnoSearch uses for measuring documents relevance.

The ranking, in general, is the process of evaluating document rank/score according to specified search criteria. For text processing, ArangoSearch uses Vector Space Model.
It postulates the following: in the space formed by the terms of the query the document vectors that are closer to a query vector are more relevant.

Practically the closeness is expressed as the cosine of the angle between two vectors, namely cosine similarity.

ArangoSearch Tutorial image 2

In other words, we treat documents and search criteria as vectors and evaluate cosine of the angle between them. Then we can sort documents by the evaluated cosine score value and finally retrieve documents that are closer (more relevant) to a specified query than the others.

Let’s consider the following example:

Assume we have 3 documents in a collection:

Now imagine we have a query:

For this query let’s imagine two dimensional space formed by words quick and brown

ArangoSearch Tutorial image 4

Then let’s put every document to the graph above:

ArangoSearch Tutorial image 1

From the plot above it’s clear that doc3 is the most relevant to the given query.

Ranking in AQL

In order to compute cosine, we have to evaluate vector components at first. There are a number of probability/statistical weighting models but for now, ArangoSearch provides two, probably the most famous schemes:

Under the hood both models rely on 2 main components:
Term frequency (TF) – in the simplest case defined as the number of times that term t occurs in document d. Inverse document frequency (IDF) is a measure of how much information the word provides, i.e. whether the term is common or rare across all documents.

In AQL both weighting methods are exposed via corresponding functions: BM25, TFIDF. Note that both function accept AQL corresponding loop variable as the first argument. In 3.4 these functions have the following limitation: that’s not possible to use evaluated score value in generic AQL expressions (e.g. filters or calculations), the aforementioned functions must be used in SORT statement related to an ArangoSearch view.

Let’s imagine that we’re extremely interested in a nice sci-fi movie. I want to find something that matches the following set of keywords: “amazing, action, world, alien, sci-fi, science, documental, galaxy”:

Without any ranking I got “The Prime of Miss Jean Brodie” in the very beginning of result set. That’s is definitely not what I did want to watch. Nevertheless, you may get different results, that’s non-deterministic since we haven’t specified any sorting condition there.

Let’s rank result with VSM and BM25 algorithm:

After inspecting result set you note that first place is taken by “Religulous” movie. That’s again something that is not particularly relevant to our original query. The reason is the following: in the query above, we didn’t explicitly specified sorting order. Default value is ASC which means that we implicitly asked ArangoDB to show us the most irrelevant results first. Let’s fix it by explicitly specifying our intention to get the most relevant docs in the beginning of result set:

And now the most relevant document is “AVPR: Aliens vs. Predator – Requiem”. Yes, that’s definitely better.

Query time relevance tuning

Another crucial point of ArangoSearch is the ability to fine-tune document scores evaluated by relevance models at query time. That functionality is exposed in AQL via BOOST function. Similarly to ANALYZER function, BOOST function accepts any other valid SEARCH expression and boost factor. It does nothing but imbues the provided boost value into SEARCH expression context making it available for scorer functions. The documents that match boosted part of search expression will get higher scores.

Let’s continue playing with sci-fi movies. The results we got in previous section are already good enough, but what if we want something more relevant to outer space, the movie that can show us the beauty of distant interstellar galaxies. For the sake of our interest to sci-fi movies, we can ask ArangoSearch to prefer “galaxy” amongst the others keywords:

And the winner is “Star Trek Collection”, super relevant result indeed.

For information retrieval experts ArangoSearch provides an easy and straightforward way of fine-tuning weighting schemes at query time, e.g. BM25 accepts free coefficients as the parameters and may be turned into BM15:

With BM15 we have a new champion: “The Ice Pirates” movie.

Do you like ArangoDB?
icon-githubStar this project on GitHub.
close-link