Estimated reading time: 12 minutes
This post dives into Natural Language Processing, using word embeddings to search movie descriptions in ArangoDB.
In this post we:
- Discuss the background of word embeddings
- Introduce the current state-of-the-art models for embedding text
- Apply a model to produce embeddings of movie descriptions in an IMDb dataset
- Perform similarity search in ArangoDB using these embeddings
- Show you how to query the movie description embeddings in ArangoDB with custom search terms
Very cool tutorial, thanks for sharing. I am really excited about using ArangoDB with semantic queries, and this is a great overview. A couple of questions:
* If I understand correctly, this approach uses the DistilBERT model in Python to calculate embeddings for documents, which are then stored in ArangoDB.
* I have seen elsewhere the use of ArangoSearch, which I think did tokenization and embedding directly in the database. Do I understand the difference between these approaches correctly?
* The query uses the expression below to calculate the dot product of the query embedding and the document embedding. This implies a slower single-threaded approach, though if ArangoDB calculates this value for multiple documents concurrently under the hood, it would still benefit from multi-core processors. Any thoughts/comments on performance?
LET numerator = (SUM(
    FOR i IN RANGE(0, 767)
        RETURN TO_NUMBER(NTH(descr_emb, i)) * TO_NUMBER(NTH(v.word_emb, i))
))
Yes, that’s correct!
Yes, ArangoSearch allows you to perform tokenization and full-text search directly in the database. At this point, word embeddings aren’t directly supported, which is what this tutorial lets you do. ArangoSearch does support vector space models such as BM25 and TF-IDF for scoring search results. Please see here if you want to learn more about them.
Great question! The answer is that it depends. If you’re querying a single server, it will use a sequential scan (so a single thread). If you’re querying a collection on a cluster, and the collection is sharded across different servers, then there will be concurrency at the database server level, but within each server process the shard is still scanned sequentially.
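For readers who want to see what the numerator expression quoted in the question computes, here is a minimal Python sketch of the same multiply-and-sum over two embeddings. The function names are hypothetical, and the cosine-similarity step is an assumption: the thread only shows the numerator, and the variable name suggests it is later divided by a normalizing term. This is an illustration of the arithmetic, not the tutorial's actual code.

```python
import math

# Assumption: 768 dimensions, matching RANGE(0, 767) in the quoted AQL.
EMBEDDING_DIM = 768


def dot_product(query_emb, doc_emb):
    """Sum of element-wise products, mirroring the AQL numerator."""
    if len(query_emb) != len(doc_emb):
        raise ValueError("embeddings must have the same length")
    # float() plays the role of TO_NUMBER in the AQL expression.
    return sum(float(q) * float(d) for q, d in zip(query_emb, doc_emb))


def cosine_similarity(query_emb, doc_emb):
    """Assumed scoring step: numerator divided by the product of vector norms."""
    numerator = dot_product(query_emb, doc_emb)
    denominator = math.sqrt(dot_product(query_emb, query_emb)) * math.sqrt(
        dot_product(doc_emb, doc_emb)
    )
    if denominator == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return numerator / denominator


def rank_sequentially(query_emb, docs):
    """Score every (key, embedding) pair one after another, like a sequential scan."""
    scored = [(key, cosine_similarity(query_emb, emb)) for key, emb in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The `rank_sequentially` helper visits one document at a time, which is the same shape of work as the single-server sequential scan described in the reply: the cost grows linearly with the number of documents and with the embedding dimension.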