
RC4 of ArangoDB 3.5: Configurable Analyzers & other ArangoSearch Upgrades

Step by step, we are getting closer to the official release of ArangoDB 3.5. First of all, we want to send a biiiiig “Thank You!” to all the testers so far for all your feedback! Super helpful for us!

This Release Candidate post is dedicated to the four new features of ArangoSearch which extend the capabilities and provide pretty huge performance improvements, especially for queries including search & sorting.

Probably the most interesting new thing you can do (and the most requested) is to configure analyzers and provide custom stopword-lists among other configuration options for analyzers. In addition, you can also now specify attribute(s) to index documents already sorted, to speed up queries that sort results on these attributes. The icing on the search cake is using scoring values directly within AQL queries to fine-tune the search results even further…

Please note: This is a pre-release version of ArangoDB 3.5 and should be used for testing purposes only!

Get the final Community Edition or Enterprise Edition.

Alright, let’s dive in…

Configurable Analyzers to Fine-Tune Queries

With the initial release of ArangoSearch in ArangoDB 3.4, we already provided a large variety of language analyzers for languages like English, Spanish, German, Dutch, Chinese and many more. But users could neither configure these analyzers nor provide custom stopword lists. The analyzers had pre-defined rules based on the chosen locale and also indexed everything, including stopwords.

With v3.5, analyzers are now configurable. You can provide your own language-specific stopword lists, control word stemming (“databases” or “database” -> “databas”) and also execute case-sensitive search queries. We hope that this upgrade will provide a nice extension of the capabilities you need in your existing or upcoming projects.
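
As a rough sketch of what such a configuration could look like in `arangosh`, the snippet below describes a custom English `text` analyzer and registers it through the `@arangodb/analyzers` module. The analyzer name `text_en_custom` and the stopword list are made-up placeholders, and the two helper functions exist only to keep the example self-contained. Please check the analyzer documentation of your release candidate for the exact options.

```javascript
// Sketch only: the name "text_en_custom" and the stopwords are
// illustrative assumptions, not defaults shipped with ArangoDB.
function customAnalyzerDefinition() {
  return {
    name: "text_en_custom",
    type: "text",
    properties: {
      locale: "en.utf-8",
      case: "none",     // do not change case, so matching stays case-sensitive
      accent: false,    // convert accented characters to their base characters
      stemming: true,   // e.g. "databases" and "database" -> "databas"
      stopwords: ["the", "a", "an", "of"] // custom stopword list
    },
    features: ["frequency", "norm", "position"]
  };
}

// `analyzers` is expected to be the arangosh module
// require("@arangodb/analyzers"); it is passed in as a parameter so the
// helper can also be exercised outside of arangosh.
function registerAnalyzer(analyzers, def) {
  return analyzers.save(def.name, def.type, def.properties, def.features);
}
```

Inside `arangosh` this could be called as `registerAnalyzer(require("@arangodb/analyzers"), customAnalyzerDefinition());`. The analyzer can then be referenced by name in the link definition of a view.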

Sorted Indexes for Much Improved Performance

It is not uncommon to need sorted text search results within an application.

Starting with ArangoDB 3.5, you can now create ArangoSearch views that are already sorted by some attributes. This pre-computed sorting allows for much faster SEARCH queries that include a SORT.

Upon view creation you can specify the sort direction for each attribute (asc/desc). If the sort condition specified within your query matches the sort condition of your view, the query reads the results directly from the index and no expensive sort operation needs to be executed.

Creating a sorted view can be done via `arangosh`:

db._createView('myView', 'arangosearch', { links : { ... }, primarySort: [ { field: 'myField', direction: 'asc' }, { field: 'anotherField', direction: 'desc' } ] })
db._query('FOR d IN myView SEARCH ... SORT d.myField ASC RETURN d'); // no sorting at query time
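
Because the speed-up depends on the SORT in the query matching the sort condition of the view, one option is to derive both from a single definition. The following is a minimal sketch in plain JavaScript (it also works inside `arangosh`). The helper `sortClause` and the variable name `d` are illustrative assumptions, not part of the ArangoDB API.

```javascript
// Single source of truth for the sort order, reused for the view and the query.
const primarySort = [
  { field: "myField", direction: "asc" },
  { field: "anotherField", direction: "desc" }
];

// Hypothetical helper: turn a primarySort definition into an AQL SORT clause.
// It assumes simple attribute names that need no escaping.
function sortClause(sortDef, docVar) {
  const parts = sortDef.map(
    (s) => `${docVar}.${s.field} ${s.direction.toUpperCase()}`
  );
  return "SORT " + parts.join(", ");
}

const clause = sortClause(primarySort, "d");
// clause is "SORT d.myField ASC, d.anotherField DESC"
```

The same `primarySort` array can be passed to `db._createView(...)`, and `clause` can be spliced into the query string between the SEARCH expression and RETURN, so the two cannot drift apart.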

Check out the brief tutorial below to learn how to create sorted views and use them in your application (Performance Benchmarks will follow with the GA release).

We would be very keen to learn about the performance gains you see for your application with sorted views or any other feedback you have. Please let us know via hackers@arangodb.com.

Scorers as Numbers for Extended Query Capabilities

In many use cases, you might want to search for information that matches or exceeds a certain relevance. Relevance to a certain search term is calculated with BM25 and TFIDF algorithms which analyze a document and provide a score for each search term defined in queries.

With ArangoDB 3.5, you can now access these scores and use them within queries, e.g. to further fine-tune your search results. Imagine a movie dataset in which a rating is stored for each movie. If you are searching for science fiction movies which include the terms “galaxy”, “space” and “space battle”, you can now combine the relevance scores calculated by the BM25 or TFIDF algorithms with the movie rating, create a combined score (e.g. Rating * Relevance) and sort results based on this new score.

An AQL query using this new capability could look like this:

FOR d IN view SEARCH ... LET myScore = BM25(d)*LOG(1 + d.myRank) RETURN { doc: d, score: myScore }
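
To make the arithmetic of the combined score concrete, here is a small plain-JavaScript illustration that mirrors `BM25(d) * LOG(1 + d.myRank)` outside the database. The `bm25` values and titles are invented sample numbers, not output of ArangoSearch. In a real query the relevance score comes from the `BM25()` function.

```javascript
// Invented sample data: `bm25` stands in for the relevance score that
// BM25(d) would return, `myRank` for the stored rating attribute.
const hits = [
  { title: "A", bm25: 2.0, myRank: 1 },
  { title: "B", bm25: 1.0, myRank: 9 },
  { title: "C", bm25: 3.0, myRank: 0 }
];

// Same formula as in the AQL example above; Math.log is the natural
// logarithm, like LOG() in AQL.
function combinedScore(hit) {
  return hit.bm25 * Math.log(1 + hit.myRank);
}

// Highest combined score first.
const ranked = hits
  .map((hit) => ({ title: hit.title, score: combinedScore(hit) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked.map((r) => r.title)); // order: B, A, C
```

Note how "C" has the highest relevance but ends up last, because a rating of 0 gives LOG(1 + 0) = 0. Whether that is desirable depends on your data, so the weighting formula is worth tuning.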

Restrict Search to Specific Collections

In the initial release of ArangoSearch, it was not possible to restrict search queries to specific collections within an ArangoSearch view. One index was created for all collections within that view.

With ArangoDB 3.5, each collection that is part of a view gets its own inverted index. Therefore, it is now possible to restrict a query to specific collections of the view.

One example would be a SearchView peopleView that covers the collections suppliers and customers to allow searching over some common attribute names.

For some queries I want to explicitly search for documents in the customers collection. Here I can now use the option to limit the SEARCH to this collection:

FOR doc IN peopleView 
  SEARCH doc.dayOfBirth == '04.25' OPTIONS { collections : [ 'customers' ] } 
  RETURN doc

We hope that these improvements are useful for everyone using ArangoSearch functionalities.

Happy testing! It would be fantastic to hear your feedback via GitHub.

Get ArangoDB 3.5

Julie Ferrario
