
Word Embeddings in ArangoDB

Estimated reading time: 12 minutes

This post will dive into the world of Natural Language Processing by using word embeddings to search movie descriptions in ArangoDB.

In this post we:

  • Discuss the background of word embeddings
  • Introduce the current state-of-the-art models for embedding text
  • Apply a model to produce embeddings of movie descriptions in an IMDb dataset
  • Perform similarity search in ArangoDB using these embeddings
  • Show you how to query the movie description embeddings in ArangoDB with custom search terms
Check it out on GitHub.
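The search step above boils down to ranking stored vectors by cosine similarity to a query vector. Here is a minimal sketch in plain Python, with toy 4-dimensional vectors standing in for the 768-dimensional DistilBERT embeddings (all titles and numbers are made up for illustration):

```python
# Toy "embedding store": 4-dimensional vectors stand in for the
# 768-dimensional DistilBERT sentence embeddings used in the post.
movies = {
    "Alien":   [0.9, 0.1, 0.0, 0.2],
    "Amelie":  [0.1, 0.8, 0.3, 0.0],
    "Gravity": [0.7, 0.2, 0.1, 0.3],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def most_similar(query_emb, store):
    # Rank every stored embedding by similarity to the query embedding,
    # mirroring the AQL scan over the movie collection.
    return sorted(store, key=lambda t: cosine_similarity(query_emb, store[t]),
                  reverse=True)

query_emb = [0.8, 0.15, 0.05, 0.25]  # embedding of a hypothetical search phrase
ranking = most_similar(query_emb, movies)
```

In the post itself this ranking runs inside ArangoDB as an AQL query over the stored embedding attributes rather than in Python.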


Alexander Geenen

Alex is a Machine Learning Ecosystem Engineer at ArangoDB. He is passionate about the practical application of new developments in the fast-moving field of Machine Learning.

2 Comments

  1. Fabio Mencoboni on July 2, 2021 at 2:24 pm

    Very cool tutorial, thanks for sharing. I am really excited about using ArangoDB with semantic queries, and this is a great overview. A couple of questions:
    * If I understand correctly, this approach is using the DistilBERT model in Python to calculate embeddings for documents, which are then stored in ArangoDB.
    * I have seen elsewhere the use of ArangoSearch, which I think did tokenization and embedding directly in the database. Do I understand the difference between these approaches correctly?
    * The query uses the expression below to calculate the dot-product of the query embedding to document embedding. This implies a slower single-thread approach, though if ArangoDB is calculating this value for multiple documents concurrently under the hood it would still get the benefit of multi-core processors. Any thoughts/comments on performance?
    LET numerator = SUM(
        FOR i IN RANGE(0, 767)
            RETURN TO_NUMBER(NTH(descr_emb, i)) * TO_NUMBER(NTH(v.word_emb, i))
    )

    • Alexander Geenen on July 6, 2021 at 1:44 pm

      Hi Fabio,

      If I understand correctly, this approach is using the DistilBERT model in Python to calculate embeddings for documents, which are then stored in ArangoDB.

      Yes, that’s correct!

      I have seen elsewhere the use of ArangoSearch, which I think did tokenization and embedding directly in the database. Do I understand the difference between these approaches correctly?

      Yes, ArangoSearch lets you perform tokenization and full-text search directly in the database. At this point, word embeddings aren’t directly supported there, which is the gap this tutorial fills. ArangoSearch does, however, support vector space models such as BM25 and TF-IDF for scoring search results. Please see here if you want to learn more about them.
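As a rough sketch of the TF-IDF idea behind those scorers (illustrative only, on a made-up mini-corpus, and not how ArangoSearch implements it internally):

```python
import math

# Hypothetical mini-corpus of movie descriptions, split into tokens.
docs = [
    "a space crew fights an alien",
    "a young woman in paris helps strangers",
    "an astronaut stranded in space",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, corpus):
    # Term frequency in this document, weighted by how rare the term is
    # across the corpus (inverse document frequency).
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

def score(query, doc_tokens, corpus):
    # A document's score is the sum of the TF-IDF weights of the query terms.
    return sum(tf_idf(t, doc_tokens, corpus) for t in query.split())

# Rank documents for the query "space alien" (best match first).
ranked = sorted(range(len(docs)),
                key=lambda i: score("space alien", tokenized[i], tokenized),
                reverse=True)
```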

      The query uses the expression below to calculate the dot-product of the query embedding to document embedding. This implies a slower single-thread approach, though if ArangoDB is calculating this value for multiple documents concurrently under the hood it would still get the benefit of multi-core processors. Any thoughts/comments on performance?

      Great question! The answer is that it depends. If you’re querying a single server, it will use a sequential scan (so a single thread). If you’re querying a collection on a cluster, and the collection is sharded across different servers, then there will be concurrency at a database server level, but within those server processes it will also be scanned sequentially.
