This post will dive into the world of Natural Language Processing by using word embeddings to search movie descriptions in ArangoDB.

In this post we:

  • Discuss the background of word embeddings
  • Introduce the current state-of-the-art models for embedding text
  • Apply a model to produce embeddings of movie descriptions in an IMDb dataset
  • Perform similarity search in ArangoDB using these embeddings
  • Show you how to query the movie description embeddings in ArangoDB with custom search terms

N.B. Before you run this notebook!!!

If you are running this notebook on Google Colab, please make sure to enable hardware acceleration using either a GPU or a TPU. If it runs with CPU only, generating the word embeddings will take an incredibly long time! Hardware acceleration can be enabled by navigating to Runtime -> Change runtime type. This will present you with a popup where you can select an appropriate Hardware Accelerator.

Open In Colab

Word Embeddings

In this notebook, we will use word embeddings to perform searches based on movie descriptions in ArangoDB.

We'll start by breaking down how to convert a string into a set of word embeddings produced by a state-of-the-art Transformer model. Then we'll use a higher-level API to create embeddings and compare them so that you can see their expressive power! Finally, we'll create embeddings for movie descriptions in our IMDb graph and perform similarity searches and query-based searches.

Introduction to Word Embeddings

Transforming text so that it can be used efficiently and processed correctly by computers has long been an open problem in Natural Language Processing (NLP). Word embeddings are a development that has been considered a breakthrough in this area. If you've ever used voice assistants such as Amazon Alexa or Siri, a translation service such as Google Translate, or a search engine - you've come into contact with applications of word embeddings!

Now you might be wondering: What is a word embedding?

Word embeddings are a numerical representation of text, consisting of ordered sequences of numbers called vectors. The intuition behind these embeddings is that words that appear in similar contexts and share similar meanings should have similar embeddings. In practice, these vectors are calculated from the context a word appears in, and each word gets its own embedding: the embedding for king differs from the embedding for man. Since these embeddings capture a representation of meaning, we can also use them to do approximate arithmetic. For example, if we compute king - man + woman, we end up with a vector that is close to the queen word embedding.

It's important to note that context matters. This effect is most pronounced when your text contains homonyms: words that have the same spelling but different meanings. For example, address has a different meaning in each of the following sentences:

"He was about to address the congregation."

"I would like to update my current address."

In this case, the word embedding computed for the word address will differ per sentence (provided you are using a contextual embedding model, such as the Transformer models discussed below).
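To make this concrete, here is a minimal sketch (not part of the tutorial pipeline) that embeds address in both sentences with the same distilbert-base-uncased model we load later in the Setup section and compares the two context-dependent vectors. The embed_word helper is purely illustrative and assumes the word maps to a single token in the model's vocabulary; the exact similarity value will vary, but it should be noticeably below 1.0.

# Illustrative sketch: the same word gets a different embedding in different contexts.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed_word(sentence, word):
    # Return the contextual embedding of `word` within `sentence`
    # (assumes `word` is a single token in the vocabulary).
    tokens = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state[0]   # (seq_len, hidden_dim)
    ids = tokens["input_ids"][0].tolist()
    return hidden[ids.index(tokenizer.convert_tokens_to_ids(word))]

a = embed_word("He was about to address the congregation.", "address")
b = embed_word("I would like to update my current address.", "address")
print(F.cosine_similarity(a, b, dim=0).item())          # noticeably below 1.0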

These embeddings are created by training a neural network model to predict the context around a target word (the target word is the one we're trying to make an embedding for). The first word embedding models used a sliding window around the target word as the relevant context. An example of a sliding window around "fox" can be seen below:

[Image: skip-gram with a window size of 2]

[Example of a sliding window. From Wittum, Gabriel & Hoffer, Michael & Lemke, Babett & Jabs, Robert & Naegel, Arne. (2020). Automated methods for the comparison of natural languages. Computing and Visualization in Science. 23. 10.1007/s00791-020-00325-2.]
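As a rough, purely illustrative sketch of what this sliding-window context looks like, the snippet below (with a hypothetical context_pairs helper) enumerates the (target, context) pairs that a window of size 2 would produce for each word in a sentence:

# Illustrative sketch: enumerate (target, context) training pairs
# the way an early skip-gram-style model with window size 2 would.
def context_pairs(sentence, window=2):
    words = sentence.lower().split()
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

print(context_pairs("the quick brown fox jumps over the lazy dog"))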

Recently, research has yielded a new mechanism for taking context into account: Attention. Attention is a mechanism that allows a model to focus on the relevant parts of an input text as needed. The idea of attention stems from the notion that all the words used in a sentence are interconnected, and the model should use those hidden connections at the appropriate time to link words together. These connections can span entire sentences (and, in some cases, even paragraphs), so they aren't limited to a single sliding window, dramatically increasing the expressive power of these representations.

[Image: attention weights within a sentence]

[An example of attention in a sentence. From the Google AI Blog]

Here we can see an example of attention in action. The attention mechanism correctly places a higher emphasis on "animal" when calculating the attention for the word "it" in the sentence.

The first model that introduced this mechanism (and afterward spawned a whole host of improvements) is known as the Transformer. It has achieved state-of-the-art results on tasks related to word embeddings, such as language modeling.

The building block of a Transformer model is known as multi-headed attention, which is a set of multiple attention layers applied in parallel to the same input. This allows the model to pay attention to different aspects of the input text simultaneously. Multiple multi-headed attention blocks are stacked on top of one another until we reach a final representation for each word in the input text. This may seem difficult to understand, but don't worry! We've included an interactive visualization of multi-headed attention in the tutorial below.

[Image: the internals of a Transformer model]

[The internals of a Transformer embedding model. From Castellucci, Giuseppe & Bellomaria, Valentina & Favalli, Andrea & Romagnoli, Raniero. (2019). Multi-lingual Intent Detection and Slot Filling in a Joint BERT-based Model.]
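If you prefer code to diagrams, here is a minimal sketch of the scaled dot-product attention computed inside each attention head, $\mathrm{softmax}(QK^T / \sqrt{d})\,V$. The tensors below are random toy values, not the real model's learned weights:

# Illustrative sketch: scaled dot-product attention for a toy sequence of 5 tokens.
# A real Transformer head derives Q, K and V from the token embeddings
# via learned projections; here we just use random values.
import torch
import torch.nn.functional as F

seq_len, d = 5, 8
q = torch.randn(seq_len, d)          # queries
k = torch.randn(seq_len, d)          # keys
v = torch.randn(seq_len, d)          # values

scores = q @ k.T / d ** 0.5          # (seq_len, seq_len) token-to-token scores
weights = F.softmax(scores, dim=-1)  # each row sums to 1: how much one token attends to the others
output = weights @ v                 # context-mixed representation for each token

print(weights)
print(output.shape)                  # torch.Size([5, 8])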

The representations produced by these models will be the core focus of this notebook. If you would like a more in-depth visual explanation of word embeddings and Transformer models, you can read these excellent posts by Jay Alammar: The Illustrated Word2vec and The Illustrated Transformer.

Once a word embedding model such as a Transformer has been trained on a large dataset, typically consisting of millions of sentences, the embeddings are ready for use in downstream tasks.

This notion of "train once - use everywhere" is a concept called Transfer Learning. By pre-training embeddings on a large corpus, the resultant embeddings are well-generalized and are effective in various settings. In this notebook, we'll be using pre-trained embeddings when calculating our movie similarities.

However, it is important to note that there is no such thing as one size fits all, as different domains use words in different contexts with different meanings. To address this, you can apply what is known as fine-tuning - which is a process where you take the pre-trained model and train it further on a corpus for your specific application. We won't be going into further detail here, but if you would like to learn more, here are some resources to get you started:

Setup

In [1]:
%%capture
!git clone -b oasis_connector --single-branch https://github.com/arangodb/interactive_tutorials.git
!rsync -av interactive_tutorials/ ./ --exclude=.git
!chmod -R 755 ./tools
!git clone -b imdb_complete --single-branch https://github.com/arangodb/interactive_tutorials.git imdb_complete
!rsync -av imdb_complete/data/imdb_dump/ ./imdb_dump/
!pip3 install torch
!pip3 install transformers
!pip3 install sentence-transformers
!pip3 install bertviz
!pip3 install pyarango
!pip3 install "python-arango>=5.0"
In [2]:
import itertools
import json
import requests
import sys
import oasis
import time
import textwrap

from pyArango.connection import *
from arango import ArangoClient
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F
import torch
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from bertviz import head_view

Creating Word Embeddings

Here we're creating our Transformer model:

In [3]:
%%capture
model_name = "distilbert-base-uncased"
model = AutoModel.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

What we've loaded consists of two components: the model itself and a tokenizer. The tokenizer splits the input string into a series of tokens.

In [4]:
tokenized = tokenizer("This is an input sentence!", return_tensors="pt")
tokenized
Out[4]:
{'input_ids': tensor([[ 101, 2023, 2003, 2019, 7953, 6251,  999,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

We can also decode the tokens:

In [5]:
tokenizer.decode(tokenized["input_ids"].tolist()[0])
Out[5]:
'[CLS] this is an input sentence! [SEP]'

When we do so, we can see that some special tokens have been inserted. [CLS] is the class token: it is used for downstream classification tasks and, in our case, its embedding represents our entire input string.

Now that we have a tokenized representation, we can then calculate the actual word embeddings:

In [6]:
model_output = model(**tokenized)
print(model_output.last_hidden_state)
tensor([[[-0.2255, -0.1828,  0.0393,  ..., -0.0928,  0.0159,  0.5379],
         [-0.6093, -0.5398, -0.1058,  ..., -0.3815,  0.2062,  0.4208],
         [-0.4014, -0.3205,  0.1996,  ..., -0.2355, -0.1038,  0.9821],
         ...,
         [ 0.1199, -0.0851,  0.0276,  ..., -0.2006, -0.2935,  0.0138],
         [-0.0806, -0.2896,  0.0197,  ...,  0.1106,  0.1365,  0.1611],
         [ 0.9731,  0.0738, -0.5174,  ...,  0.2108, -0.7434, -0.1991]]],
       grad_fn=<NativeLayerNormBackward>)

Visualizing Multi-headed Attention

We can also visualize the attention heads across all of the stacked multi-headed attention blocks below. Running this cell will produce an interactive graphic!

In [7]:
sentence = "Jack was tired so he went to sleep"
tokenized_sent = tokenizer(sentence, return_tensors="pt")
preds = model(**tokenized_sent)

attention = preds[-1]
tokens = tokenizer.convert_ids_to_tokens(tokenized_sent["input_ids"][0].tolist())
head_view(attention, tokens)
[Interactive attention visualization output]

The visualization above is interactive. The first thing that you can filter on is which layer of multi-headed attention you are viewing. The transformer model we're using has 6 layers, so you can switch between the various layers using the dropdown. Each of the colors corresponds with one of the 12 individual attention heads in that layer. These can be toggled on and off. By hovering over individual tokens, you can view the attention to and from that token! The opacity of each connection signifies its strength.

We've included an example of how you can interact with this below. Here you can see that one attention head (pink) focuses on "jack" when we hover over "jack", but another (green) focuses on "he"! Other attention heads are less opaque (for example, the purple attention head on "he" and "was"), but as we mentioned before, these are less strong.

Word Embedding Similarities

Now that we've seen how some of the parts of this model fit together, we can make use of a higher-level abstraction. The sentence-transformers library gives us an easier API for computing sentence-level embeddings, which are produced by pooling the model's token embeddings (for some models this is simply the [CLS] token mentioned above).

In [8]:
%%capture
model = SentenceTransformer("paraphrase-TinyBERT-L6-v2")

def embed_and_compare(inputs):
  # Encode all inputs in one batch and convert the result to a torch tensor
  input_embeddings = torch.from_numpy(model.encode(inputs))

  n = input_embeddings.shape[0]

  # Compare every unique pair of inputs
  for a, b in itertools.combinations(range(n), 2):
    print(f"1st input: {inputs[a]}")
    print(f"2nd input: {inputs[b]}")

    cosine_sim = F.cosine_similarity(input_embeddings[a], input_embeddings[b], dim=0).numpy()
    print(f"Cosine similarity: {cosine_sim:.3f}")
    print("\n")

We can use this package to calculate the similarity of terms such as synonyms and antonyms. As mentioned previously, embeddings are vectors, so we can use a metric called cosine similarity, which measures the similarity between two non-zero vectors. This metric is the cosine of the angle between the vectors, and for embeddings $\vec{a}$ and $\vec{b}$, both of length $n$ (read: the dimensionality of the embedding), it is calculated as follows: $$ \text{cosine\_similarity}(\vec{a}, \vec{b}) = \cos(\theta) = \frac{ \sum\limits_{i=1}^{n}{a_i b_i} }{ \sqrt{\sum\limits_{j=1}^{n}{a_j^2}} \sqrt{\sum\limits_{k=1}^{n}{b_k^2}} } $$

Here $\theta$ denotes the angle between the vectors. Cosine similarity can take on values between -1 and 1. A value of 1 means that the vectors are identical, while -1 means that they are pointing in opposite directions (so they are not similar).
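If you'd like to sanity-check the formula, the short sketch below writes it out directly with numpy for two arbitrary toy vectors and compares the result with the torch helper used in embed_and_compare above:

# Illustrative sketch: cosine similarity computed straight from the formula,
# then compared with torch's built-in helper.
import numpy as np
import torch
import torch.nn.functional as F

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

manual = np.sum(a * b) / (np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(b ** 2)))
builtin = F.cosine_similarity(torch.from_numpy(a), torch.from_numpy(b), dim=0).item()

print(f"manual:  {manual:.6f}")
print(f"builtin: {builtin:.6f}")   # the two values match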

Using this metric, let's calculate the similarity between some synonyms and antonyms!

In [9]:
# Handle synonyms and antonyms
terms = [
    "happy",
    "cheerful", 
    "sad"
]
embed_and_compare(terms)
1st input: happy
2nd input: cheerful
Cosine similarity: 0.760


1st input: happy
2nd input: sad
Cosine similarity: 0.172


1st input: cheerful
2nd input: sad
Cosine similarity: 0.257


As you can see, the synonyms "happy" and "cheerful" have a very high cosine similarity, while the antonym "sad" is far less similar.

We can also use these embeddings to calculate sentence similarities! Let's embed a few similar-sounding sentences, throw in an unrelated one too, and see how similar they are:

In [10]:
sentences = [
    "This is an input sentence",
    "Totally unrelated thing.",
    "This is an input query.",
    "This is another sentence!",
]
embed_and_compare(sentences)
1st input: This is an input sentence
2nd input: Totally unrelated thing.
Cosine similarity: 0.068


1st input: This is an input sentence
2nd input: This is an input query.
Cosine similarity: 0.676


1st input: This is an input sentence
2nd input: This is another sentence!
Cosine similarity: 0.596


1st input: Totally unrelated thing.
2nd input: This is an input query.
Cosine similarity: 0.081


1st input: Totally unrelated thing.
2nd input: This is another sentence!
Cosine similarity: 0.140


1st input: This is an input query.
2nd input: This is another sentence!
Cosine similarity: 0.311


As you can see, the unrelated sentence "Totally unrelated thing." (which shares no words with the other three) had a very low similarity in all three of its comparisons. The model also shows that it can handle words used in similar contexts ("sentence" vs. "query"), scoring "This is an input sentence" and "This is an input query" as the most similar pair!

ArangoDB Setup

Now that we have a better idea of how Transformer word embedding models can aid us in text search and comparison, let's apply them to a graph. To do so, we'll first need to configure our database and load the data.

Create the temporary database:

In [11]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="WordEmbeddings", credentialProvider="https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB")

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)
Requesting new temp credentials.
Temp database ready to use.
In [12]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])
https://tutorials.arangodb.cloud:8529
Username: TUTiwd95vued8fcexwudqlj2
Password: TUT9fr7cpwkh6j8iz9cnwua2a
Database: TUTsc8awj1iweafrehvogoccs

Feel free to use the above URL to check out the ArangoDB WebUI!

Import IMDB Example Dataset

[Image: IMDb example dataset]

Next we will import the IMDb Example Dataset, which contains information about various movies, actors, directors, ... as a graph. N.B. The included arangorestore will only work on Linux or Windows systems; if you want to run this notebook on a different OS, please consider using the appropriate arangorestore from the Download area.

Linux

In [13]:
! ./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "imdb_dump"
2021-06-22T08:04:10Z [205] INFO [05c30] {restore} Connected to ArangoDB 'http+ssl://tutorials.arangodb.cloud:8529'
2021-06-22T08:04:11Z [205] INFO [3b6a4] {restore} no properties object
2021-06-22T08:04:11Z [205] INFO [9b414] {restore} # Re-creating document collection 'imdb_vertices'...
2021-06-22T08:04:11Z [205] INFO [9b414] {restore} # Re-creating document collection 'Users'...
2021-06-22T08:04:11Z [205] INFO [9b414] {restore} # Re-creating edge collection 'imdb_edges'...
2021-06-22T08:04:12Z [205] INFO [9b414] {restore} # Re-creating edge collection 'Ratings'...
2021-06-22T08:04:12Z [205] INFO [6d69f] {restore} # Dispatched 4 job(s), using 2 worker(s)
2021-06-22T08:04:12Z [205] INFO [94913] {restore} # Loading data into document collection 'imdb_vertices', data size: 4752344 byte(s)
2021-06-22T08:04:12Z [205] INFO [94913] {restore} # Loading data into document collection 'Users', data size: 15255 byte(s)
2021-06-22T08:04:13Z [205] INFO [6ae09] {restore} # Successfully restored document collection 'Users'
2021-06-22T08:04:13Z [205] INFO [94913] {restore} # Loading data into edge collection 'imdb_edges', data size: 4634754 byte(s)
2021-06-22T08:04:17Z [205] INFO [75e65] {restore} # Current restore progress: restored 1 of 4 collection(s), read 25292083 byte(s) from datafiles, sent 4 data batch(es) of 8514829 byte(s) total size, queued jobs: 1, workers: 2
2021-06-22T08:04:20Z [205] INFO [69a73] {restore} # Still loading data into document collection 'imdb_vertices', 16777216 byte(s) restored
2021-06-22T08:04:22Z [205] INFO [6ae09] {restore} # Successfully restored document collection 'imdb_vertices'
2021-06-22T08:04:22Z [205] INFO [94913] {restore} # Loading data into edge collection 'Ratings', data size: 1239753 byte(s)
2021-06-22T08:04:22Z [205] INFO [75e65] {restore} # Current restore progress: restored 2 of 4 collection(s), read 46363374 byte(s) from datafiles, sent 7 data batch(es) of 29585981 byte(s) total size, queued jobs: 0, workers: 2
2021-06-22T08:04:27Z [205] INFO [75e65] {restore} # Current restore progress: restored 2 of 4 collection(s), read 46363374 byte(s) from datafiles, sent 7 data batch(es) of 29585981 byte(s) total size, queued jobs: 0, workers: 2
2021-06-22T08:04:28Z [205] INFO [69a73] {restore} # Still loading data into edge collection 'imdb_edges', 16777216 byte(s) restored
2021-06-22T08:04:32Z [205] INFO [75e65] {restore} # Current restore progress: restored 2 of 4 collection(s), read 59512095 byte(s) from datafiles, sent 9 data batch(es) of 46363164 byte(s) total size, queued jobs: 0, workers: 2
2021-06-22T08:04:34Z [205] INFO [69a73] {restore} # Still loading data into edge collection 'Ratings', 13148721 byte(s) restored
2021-06-22T08:04:34Z [205] INFO [6ae09] {restore} # Successfully restored edge collection 'Ratings'
2021-06-22T08:04:37Z [205] INFO [75e65] {restore} # Current restore progress: restored 3 of 4 collection(s), read 67900703 byte(s) from datafiles, sent 10 data batch(es) of 59512062 byte(s) total size, queued jobs: 0, workers: 2
2021-06-22T08:04:42Z [205] INFO [75e65] {restore} # Current restore progress: restored 3 of 4 collection(s), read 67900703 byte(s) from datafiles, sent 10 data batch(es) of 59512062 byte(s) total size, queued jobs: 0, workers: 2
2021-06-22T08:04:44Z [205] INFO [69a73] {restore} # Still loading data into edge collection 'imdb_edges', 33554432 byte(s) restored
2021-06-22T08:04:47Z [205] INFO [75e65] {restore} # Current restore progress: restored 3 of 4 collection(s), read 76289311 byte(s) from datafiles, sent 11 data batch(es) of 67900587 byte(s) total size, queued jobs: 0, workers: 2
2021-06-22T08:04:52Z [205] INFO [75e65] {restore} # Current restore progress: restored 3 of 4 collection(s), read 79024524 byte(s) from datafiles, sent 12 data batch(es) of 76289152 byte(s) total size, queued jobs: 0, workers: 2
2021-06-22T08:04:54Z [205] INFO [69a73] {restore} # Still loading data into edge collection 'imdb_edges', 44678253 byte(s) restored
2021-06-22T08:04:54Z [205] INFO [6ae09] {restore} # Successfully restored edge collection 'imdb_edges'
2021-06-22T08:04:55Z [205] INFO [a66e1] {restore} Processed 4 collection(s) in 45.872036 s, read 79024524 byte(s) from datafiles, sent 12 data batch(es) of 79024520 byte(s) total size

Movie Description Embeddings

Start by retrieving all movies that we want to produce description embeddings for.

In [14]:
cursor = database.aql.execute(
"""
FOR d IN imdb_vertices 
   FILTER d.type == "Movie"
   FILTER d.description != "No overview found."
   RETURN {
     _id: d._id,
     description: d.description
    }
"""
)
movie_descriptions = list(cursor)

# let's take this list of movie descriptions and put it in a dataframe for ease of use
movies_df = pd.DataFrame(movie_descriptions)
movies_df = movies_df.dropna()

Next, we can produce the embeddings.

In [15]:
# Now iterate over the descriptions and produce the sentence embeddings
batch_size = 32

all_embs = []

for i in tqdm(range(0, len(movies_df), batch_size)):
  descr_batch = movies_df.iloc[i:i+batch_size].description.tolist()
  embs = model.encode(descr_batch)
  all_embs.append(embs)

all_embs = np.concatenate(all_embs)
# Store each movie's embedding as a plain list in the dataframe
movies_df["word_emb"] = [emb.tolist() for emb in all_embs]

Now let's upload these embeddings to our ArangoDB database in batches.

In [16]:
BATCH_SIZE = 250
movie_collection = database["imdb_vertices"]

for i in range(0, len(movies_df), BATCH_SIZE):
  update_batch = movies_df.iloc[i:i+BATCH_SIZE][["_id", "word_emb"]].to_dict("records")
  movie_collection.update_many(update_batch)

Similarity Search Using Embeddings

Once the embeddings have been uploaded, we can query the database and use these embeddings to find the most similar movies based on their description's embeddings!

In [17]:
cursor = database.aql.execute(
"""
  FOR m in imdb_vertices
    FILTER m._id == "imdb_vertices/28685"
    RETURN { "title": m.title, "description": m.description }
""")

# Iterate through the result cursor
for doc in cursor:
  print(doc)
{'title': 'The Karate Killers', 'description': 'International spies Napoleon Solo (Robert Vaughn) and Illya Kuryakin (David McCallum) travel around the globe in an effort to track down a secret formula that was divided into four parts and left by a dying scientist with his four of five daughters, all of whom live in different countries. His widow, Amanda, is murdered at the beginning by the counter-spies of the organization THRUSH. Evil THRUSH agent Randolph also wants the formula, and is aided by his karate-chopping henchmen.'}

Let's see if we can retrieve movies that are similar to "The Karate Killers".

In [18]:
cursor = database.aql.execute(
"""
LET descr_emb = (
  FOR m in imdb_vertices
    FILTER m._id == "imdb_vertices/28685"
    FOR j in RANGE(0, 767)
      RETURN TO_NUMBER(NTH(m.word_emb,j))
)

LET descr_mag = (
  SQRT(SUM(
    FOR i IN RANGE(0, 767)
      RETURN POW(TO_NUMBER(NTH(descr_emb, i)), 2)
  ))
)

LET dau = (

    FOR v in imdb_vertices
    FILTER HAS(v, "word_emb")

    LET v_mag = (SQRT(SUM(
      FOR k IN RANGE(0, 767)
        RETURN POW(TO_NUMBER(NTH(v.word_emb, k)), 2)
    )))

    LET numerator = (SUM(
      FOR i in RANGE(0,767)
          RETURN TO_NUMBER(NTH(descr_emb, i)) * TO_NUMBER(NTH(v.word_emb, i))
    ))

    LET cos_sim = (numerator)/(descr_mag * v_mag)

    RETURN {"movie": v._id, "title": v.title, "cos_sim": cos_sim}

    )

FOR du in dau
    SORT du.cos_sim DESC
    LIMIT 50
    RETURN {"movie": du.title, "cos_sim": du.cos_sim} 
""")

# Iterate through the result cursor
for doc in cursor:
  print(doc)
{'movie': 'The Karate Killers', 'cos_sim': 1}
{'movie': 'The Saint', 'cos_sim': 0.5992472001022718}
{'movie': 'The Ipcress File', 'cos_sim': 0.5548649945964909}
{'movie': 'This Gun for Hire', 'cos_sim': 0.5544204616173648}
{'movie': 'Indiana Jones and the Kingdom of the Crystal Skull', 'cos_sim': 0.5536571008669408}
{'movie': 'Scanners', 'cos_sim': 0.5526100236437222}
{'movie': 'Little Nikita', 'cos_sim': 0.5506570565818808}
{'movie': 'Hannibal', 'cos_sim': 0.546326160655643}
{'movie': '13 Tzameti', 'cos_sim': 0.5455302288472099}
{'movie': 'National Treasure: Book of Secrets', 'cos_sim': 0.5433633759999847}
{'movie': 'Shooter', 'cos_sim': 0.5424165239671833}
{'movie': 'Cutaway', 'cos_sim': 0.5411610216760899}
{'movie': 'The Master of Disguise', 'cos_sim': 0.5403693664436306}
{'movie': 'The Interpreter', 'cos_sim': 0.5375088814337703}
{'movie': 'Moon 44', 'cos_sim': 0.537464421963995}
{'movie': 'Avalanche Express', 'cos_sim': 0.5366415573053243}
{'movie': 'The Jewel of the Nile', 'cos_sim': 0.5351865012229464}
{'movie': 'Missing', 'cos_sim': 0.5342071102881718}
{'movie': 'Hollow Man 2', 'cos_sim': 0.532559642516418}
{'movie': 'La cage aux folles II', 'cos_sim': 0.5318047426024654}
{'movie': 'A Countess from Hong Kong', 'cos_sim': 0.5247826129785597}
{'movie': 'Mission: Impossible II', 'cos_sim': 0.5246244467645046}
{'movie': 'The Big Empty', 'cos_sim': 0.5211710976344817}
{'movie': 'Far Cry', 'cos_sim': 0.5199747820724628}
{'movie': 'Sky Captain and the World of Tomorrow', 'cos_sim': 0.5178535077960535}
{'movie': 'Armour of God II: Operation Condor', 'cos_sim': 0.5172624811067572}
{'movie': 'Frantic', 'cos_sim': 0.5152937620962541}
{'movie': 'The Third Man', 'cos_sim': 0.5152438329036567}
{'movie': 'The Contractor', 'cos_sim': 0.5115278849057233}
{'movie': 'Epic Movie', 'cos_sim': 0.5106933684426239}
{'movie': '8MM', 'cos_sim': 0.5015512304777318}
{'movie': 'Condorman', 'cos_sim': 0.5015009862314348}
{'movie': 'Dirty Hands', 'cos_sim': 0.49912704764497756}
{'movie': 'Spaceballs', 'cos_sim': 0.4991064824570654}
{'movie': 'Terminal Velocity', 'cos_sim': 0.4976728169679435}
{'movie': "Charlie Ve'hetzi", 'cos_sim': 0.49703448563228253}
{'movie': 'Around the World in 80 Days', 'cos_sim': 0.495163883579988}
{'movie': 'The List of Adrian Messenger', 'cos_sim': 0.49498964581228283}
{'movie': 'Journey to the Center of the Earth (V)', 'cos_sim': 0.4936975544475762}
{'movie': 'Duplicity', 'cos_sim': 0.4936802720223825}
{'movie': 'Breakheart Pass', 'cos_sim': 0.49348741589575695}
{'movie': 'You Only Live Twice', 'cos_sim': 0.49331722665168987}
{'movie': 'Dumb & Dumber', 'cos_sim': 0.4926515336540689}
{'movie': 'Shoot the Piano Player', 'cos_sim': 0.49153979225260724}
{'movie': 'Romasanta', 'cos_sim': 0.49093239661261306}
{'movie': 'Cypher', 'cos_sim': 0.4907920797730376}
{'movie': 'Wrongfully Accused', 'cos_sim': 0.4905935451517069}
{'movie': 'El Rey de la montaña', 'cos_sim': 0.49035660275056303}
{'movie': 'Cabo Blanco', 'cos_sim': 0.4903399704081535}
{'movie': 'Naked Weapon', 'cos_sim': 0.48930599630114474}

Here we're using the cosine similarity to retrieve the closest matches to "The Karate Killers" based on the embeddings of their movie descriptions. We're calculating this metric with the same equation we introduced earlier: $$ \frac{ \sum\limits_{i=1}^{n}{a_i b_i} }{ \sqrt{\sum\limits_{j=1}^{n}{a_j^2}} \sqrt{\sum\limits_{k=1}^{n}{b_k^2}} } $$

Once we calculate the cosine similarities, we can then SORT the movies and return the top 50 most similar movies!

Search Using Query Embeddings

We aren't limited to the embeddings that are already stored in the graph. We can also embed custom search terms and use them as a query.

In [19]:
# Query something specific

search_term = "jedi stars fighting"
search_emb = model.encode(search_term).tolist()

Here we're embedding our search terms into one vector. Below we're then loading this embedding into the same query that we used above:

In [20]:
emb_str = f"""
LET descr_emb = (
  {search_emb}
)
"""
cursor = database.aql.execute(
emb_str + """
LET descr_size = (
  SQRT(SUM(
    FOR i IN RANGE(0, 767)
      RETURN POW(TO_NUMBER(NTH(descr_emb, i)), 2)
  ))
)

LET dau = (

    FOR v in imdb_vertices
    FILTER HAS(v, "word_emb")

    LET v_size = (SQRT(SUM(
      FOR k IN RANGE(0, 767)
        RETURN POW(TO_NUMBER(NTH(v.word_emb, k)), 2)
    )))

    LET numerator = (SUM(
      FOR i in RANGE(0,767)
          RETURN TO_NUMBER(NTH(descr_emb, i)) * TO_NUMBER(NTH(v.word_emb, i))
    ))

    LET cos_sim = (numerator)/(descr_size * v_size)

    RETURN {"movie": v._id, "title": v.title, "cos_sim": cos_sim}

    )

FOR du in dau
    SORT du.cos_sim DESC
    LIMIT 50
    RETURN {"movie": du.title, "cos_sim": du.cos_sim} 
""")

# Iterate through the result cursor
for doc in cursor:
  print(doc)
{'movie': 'Star Wars: Episode III: Revenge of the Sith', 'cos_sim': 0.5821351784708155}
{'movie': 'Star Wars: Episode V: The Empire Strikes Back', 'cos_sim': 0.5560696612919952}
{'movie': 'Star Wars: Episode II - Attack of the Clones', 'cos_sim': 0.5450031257231361}
{'movie': 'Star Wars: Episode VI - Return of the Jedi', 'cos_sim': 0.5379025574457419}
{'movie': 'Star Wars: Revelations', 'cos_sim': 0.523136347761933}
{'movie': 'Star Wars: Episode I - The Phantom Menace', 'cos_sim': 0.4907419353395951}
{'movie': 'Battle Beyond the Stars', 'cos_sim': 0.48619554247981983}
{'movie': 'Aragami', 'cos_sim': 0.48151278329679276}
{'movie': 'Xian si jue', 'cos_sim': 0.4811747948463836}
{'movie': 'Death Note 2: The Last Name', 'cos_sim': 0.47559611384088374}
{'movie': 'Star Trek', 'cos_sim': 0.46015266178066017}
{'movie': 'Star Wars: The Clone Wars', 'cos_sim': 0.4508071372951382}
{'movie': 'Lethal Weapon 4', 'cos_sim': 0.44977667866267623}
{'movie': 'The Star Wars Holiday Special', 'cos_sim': 0.44668274393318164}
{'movie': 'Ghostbusters Collection', 'cos_sim': 0.4393276129722392}
{'movie': 'Lone Wolf McQuade', 'cos_sim': 0.43396932777064795}
{'movie': 'Highlander III: The Sorcerer', 'cos_sim': 0.4338269719799663}
{'movie': 'Batman Forever', 'cos_sim': 0.42712738717824594}
{'movie': 'The Storm Riders', 'cos_sim': 0.4267595486179909}
{'movie': 'The One', 'cos_sim': 0.42216612398814457}
{'movie': 'The Last Unicorn', 'cos_sim': 0.41211424683909065}
{'movie': 'Han cheng gong lüe', 'cos_sim': 0.4088065705668135}
{'movie': 'Predator Collection', 'cos_sim': 0.40649964927690974}
{'movie': 'Fanboys', 'cos_sim': 0.40509197697131816}
{'movie': 'Buzz Lightyear of Star Command: The Adventure Begins', 'cos_sim': 0.4013855902162252}
{'movie': 'The Last Starfighter', 'cos_sim': 0.3990636300673203}
{'movie': 'Star Wars: Episode IV - A New Hope', 'cos_sim': 0.3973798291628842}
{'movie': 'Starship Troopers 3: Marauder', 'cos_sim': 0.3937467262691094}
{'movie': 'AVPR: Aliens vs. Predator - Requiem', 'cos_sim': 0.39317817765790497}
{'movie': 'The White Dragon', 'cos_sim': 0.3910749145346981}
{'movie': 'Targets', 'cos_sim': 0.3901472530278403}
{'movie': 'Kung Fu Panda', 'cos_sim': 0.39002439074630024}
{'movie': 'The Covenant', 'cos_sim': 0.3874615517371047}
{'movie': 'Planet Terror', 'cos_sim': 0.3873849216040952}
{'movie': 'Hao xia', 'cos_sim': 0.38348617433409254}
{'movie': 'Shenmue: The Movie', 'cos_sim': 0.38253578887972384}
{'movie': 'Magicians', 'cos_sim': 0.381430920153646}
{'movie': 'Alien Agent', 'cos_sim': 0.3812589309607092}
{'movie': 'Battle Royale', 'cos_sim': 0.38102117685054016}
{'movie': 'Mortal Kombat', 'cos_sim': 0.3802985860954904}
{'movie': 'Shinobi: Heart Under Blade', 'cos_sim': 0.3796447039766749}
{'movie': 'Spaceballs', 'cos_sim': 0.37861355658967394}
{'movie': 'Family Guy: Blue Harvest', 'cos_sim': 0.3784594793380485}
{'movie': 'Big Trouble in Little China', 'cos_sim': 0.3768939796309322}
{'movie': 'Hellboy II: The Golden Army', 'cos_sim': 0.37567653033183485}
{'movie': 'Universal Soldier: The Return', 'cos_sim': 0.3754388589039555}
{'movie': 'Critters 4', 'cos_sim': 0.37266948187690785}
{'movie': 'Prince of Persia: The Sands of Time', 'cos_sim': 0.37216179830557766}
{'movie': 'Avatar', 'cos_sim': 0.3716035222637364}
{'movie': 'The Prestige', 'cos_sim': 0.36977452888554946}

We managed to find a bunch of Star Wars movies - hopefully, what you expected to see!

Next Steps

In this tutorial we've learned how to take unstructured text and use it to perform similarity searches in ArangoDB. If you would like to continue learning more about ArangoDB, here are some next steps to get you started!