SatelliteGraphs in ArangoDB 3.7

Open In Colab

ArangoDB is a distributed database that can query large datasets spread across multiple nodes. Great scale often comes at a price, though: in this case, network traffic and coordination.

When executing queries involving graph traversals, shortest path, or k-shortest paths computations in an ArangoDB cluster, data has to be exchanged between different servers. In particular, graph traversals are usually executed on a Coordinator because they need global information. This results in a lot of network traffic and slow query execution.

SatelliteGraphs are the natural extension of SatelliteCollections to graphs. SatelliteCollections improve join operations by replicating a small collection to all nodes; SatelliteGraphs apply the same idea to the collections of a graph.

ArangoDB, being a multi-model database, is often used for use cases where large amounts of data are stored in collections sharded across multiple database nodes for scalability and performance.

Consider, for example, the massive amount of sensor data generated by IoT use cases. The corresponding metadata describing the individual sensors (location, type, accuracy, …) is stored in a graph, which allows simple graph queries to retrieve a particular subset of sensors. A simplified version of this use case is shown in the following Jupyter notebook. You can see the output in this article or click the Open in Colab button to get access to a temporary ArangoDB Oasis database and run it for yourself.

The first few code blocks contain some of the setup:

  1. Install and import the necessary packages
  2. Set up a function that provides us with a temporary Oasis database
  3. Set up a simple cleanup function
In [0]:
%%capture
!git clone -b oasisConnector --single-branch https://github.com/cw00dw0rd/ArangoNotebooks.git
!rsync -av ArangoNotebooks/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"
In [0]:
import json
import requests
import sys
import pprint
import oasis

from pyArango.connection import *
from pyArango.collection import Collection, Edges, Field
from pyArango.graph import Graph, EdgeDefinition
from pyArango.collection import BulkOperation
In [0]:
def cleanupCollections(db):
  # Drop the example collections; ignore any that do not exist yet
  for name in ['Location', 'Sensor', 'SensorLocation', 'Sensordata']:
    try:
      db[name].delete()
    except Exception:
      pass
  # Drop the example graph, if present
  try:
    db.graphs['MySatelliteGraph'].delete()
  except Exception:
    pass
  # Refresh the local view of the database, then drop whatever is left
  db.reload()
  db.dropAllCollections()

Now, connect to the temporary Oasis database and clean up the collections.

In [18]:
pp = pprint.PrettyPrinter()

# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName='satelliteGraphs37', tempURL='https://d383fa0b596a.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')

## Connect to the temp database
conn = oasis.connect(login)
db = conn[login["dbName"]] 
pp.pprint(login)

# Cleanup (just in case the example is rerun)
cleanupCollections(db)
Requesting new temp credentials.
Temp database ready to use.
{'dbName': 'TUTu0n0wpjwuopfgxrnwiq3',
 'hostname': 'd383fa0b596a.arangodb.cloud',
 'password': 'TUT2wpdufx1ug5ea1wn47l4th',
 'port': 8529,
 'username': 'TUTp29lfmdha1hdeq2a4qw64h'}

For this example, we generate the IoT sensor data documents and save them to the Sensordata collection.

In [19]:
# Define a large (i.e., in reality sharded) collection
collection = db.createCollection(name="Sensordata")
docs = []
for i in range(100):
    doc = collection.createDocument()
    doc["id"] = i
    doc["data"] = "Large amount of data"
    docs.append(doc)

# Returns number of inserted documents
collection.bulkSave(docs)
Out[19]:
100

Setting up a SatelliteGraph requires the same type of graph definition as any other graph, but we call the createSatelliteGraph function instead.

Once the graph has been created, we can add our vertices and edges to it.

In [20]:
class Location(Collection):
    _fields = {
        "location": Field()
    }
class Sensor(Collection):
    _fields = {
        "id": Field()
    }
class SensorLocation(Edges):
    _fields = {
        "lifetime": Field()
    }

class MySatelliteGraph(Graph):
    _edgeDefinitions = [EdgeDefinition("SensorLocation", fromCollections=["Location"], toCollections=["Sensor"])]
    _orphanedCollections = []

theSatelliteGraph = db.createSatelliteGraph("MySatelliteGraph")
print("Our first SatelliteGraph: " + str(theSatelliteGraph))

# Add data to MySatelliteGraph
s1 = theSatelliteGraph.createVertex('Sensor', {"id": 1})
s2 = theSatelliteGraph.createVertex('Sensor', {"id": 2})
l1 = theSatelliteGraph.createVertex('Location', {"location": "CA"})
l2 = theSatelliteGraph.createVertex('Location', {"location": "WA"})
theSatelliteGraph.link('SensorLocation', l1, s1, {"lifetime": "eternal"})
theSatelliteGraph.link('SensorLocation', l2, s2, {"lifetime": "eternal"})
Our first SatelliteGraph: ArangoGraph: MySatelliteGraph
Out[20]:
ArangoEdge '_id: SensorLocation/14020089, _key: 14020089, _rev: _apjeYl6--_, _to: Sensor/18020027, _from: Location/16020130': <store: {'lifetime': 'eternal'}>
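
For readers who do not use pyArango, the same graph definition can also be written as a plain request body. The sketch below only builds and prints that body; it assumes the graph-creation endpoint of ArangoDB's HTTP API (`POST /_api/gharial`) and an `options.replicationFactor` value of `"satellite"`, so please check the HTTP API documentation for your ArangoDB version before relying on it. `build_satellite_graph_payload` is a helper name made up for this example.

```python
import json

def build_satellite_graph_payload(name, edge_definitions):
    """Build a request body describing a SatelliteGraph.

    Assumption: the server accepts "satellite" as the replicationFactor
    in the graph options (ArangoDB 3.7 or later)."""
    return {
        "name": name,
        "edgeDefinitions": [
            {"collection": collection, "from": list(from_cols), "to": list(to_cols)}
            for collection, from_cols, to_cols in edge_definitions
        ],
        "orphanCollections": [],
        "options": {"replicationFactor": "satellite"},
    }

# Same edge definition as the MySatelliteGraph class in this notebook
payload = build_satellite_graph_payload(
    "MySatelliteGraph",
    [("SensorLocation", ["Location"], ["Sensor"])],
)
print(json.dumps(payload, indent=2))
```

Keeping the definition as data makes it easy to compare against what the pyArango classes declare: one edge collection, `SensorLocation`, pointing from `Location` to `Sensor`.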

Without SatelliteGraphs, the join query in this notebook would involve a lot of network traffic, because it would need to fetch all the data before executing the graph traversal.

But as the graph-based metadata is small, we can define it as a SatelliteGraph, which is synchronously replicated to all DB-Servers that are part of a cluster. DB-Servers can then execute graph traversals, shortest path, and k-shortest paths computations locally. Having all collections defined in the graph stored locally greatly improves performance for such queries, while still maintaining the benefits of a distributed environment.
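
To make the difference concrete, here is a toy model in plain Python (not ArangoDB code). It assumes a hypothetical placement of vertices on three servers and counts how often a traversal would have to fetch a vertex from another server; with a full copy on every server, that count is zero. The server names and the `count_remote_fetches` helper are made up for illustration.

```python
from collections import deque

# Tiny metadata graph: Location -> Sensor edges (same shape as in this notebook)
edges = {
    "Location/CA": ["Sensor/1"],
    "Location/WA": ["Sensor/2"],
    "Sensor/1": [],
    "Sensor/2": [],
}

# Hypothetical placement when the graph is sharded across three servers
sharded_placement = {
    "Location/CA": "server-A",
    "Location/WA": "server-B",
    "Sensor/1": "server-C",
    "Sensor/2": "server-A",
}

def count_remote_fetches(start, executing_server, placement=None):
    """Breadth-first traversal from `start`, counting visited vertices that
    are not stored on `executing_server`. `placement=None` models a
    SatelliteGraph, where every server holds a full copy of the graph."""
    remote = 0
    queue = deque([start])
    seen = {start}
    while queue:
        vertex = queue.popleft()
        if placement is not None and placement[vertex] != executing_server:
            remote += 1
        for neighbor in edges[vertex]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return remote

sharded = count_remote_fetches("Location/CA", "server-A", sharded_placement)
satellite = count_remote_fetches("Location/CA", "server-A")
print("remote fetches (sharded):  ", sharded)
print("remote fetches (satellite):", satellite)
```

The model ignores everything a real cluster does (batching, Coordinators, replication cost on writes); it only illustrates why a locally available graph removes the cross-server lookups during a traversal.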

In [21]:
# Join between the SatelliteGraph and "sharded" collection
print("Joining SatelliteGraph and 'sharded' collection")
aql = """
FOR loc IN Location
    FILTER loc.location == "CA"
    FOR sensor IN 1..1 OUTBOUND loc._id GRAPH "MySatelliteGraph"
      // Join with large collection on the sensor id
      FOR sensordata IN Sensordata
        FILTER sensordata.id == sensor.id
        RETURN {
         "sensor" : sensor.id,
         "data" : sensordata.data
         }
  """


queryResult = db.AQLQuery(aql, rawResults=True, batchSize=1)
document = queryResult[0]
print(document)

# Next Steps
print()
print("If you are running this notebook in Google Colab, use these credentials to access the ArangoDB Web UI at:")
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
Joining SatelliteGraph and 'sharded' collection
{'sensor': 1, 'data': 'Large amount of data'}

If you are running this notebook in Google Colab, use these credentials to access the ArangoDB Web UI at:
https://d383fa0b596a.arangodb.cloud:8529
Username: TUTp29lfmdha1hdeq2a4qw64h
Password: TUT2wpdufx1ug5ea1wn47l4th
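
To see what the query computes without a database, the same logic can be sketched in plain Python over in-memory lists: filter the locations, follow the outbound SensorLocation edges one step, and match each reached sensor with its Sensordata documents by id (for the CA location this is id 1). The `_id` values and the `join_sensor_data` helper are made up for this illustration.

```python
# In-memory stand-ins for the collections used in this notebook
locations = [{"_id": "Location/l1", "location": "CA"},
             {"_id": "Location/l2", "location": "WA"}]
sensors = {"Sensor/s1": {"id": 1}, "Sensor/s2": {"id": 2}}
sensor_location = [{"_from": "Location/l1", "_to": "Sensor/s1"},
                   {"_from": "Location/l2", "_to": "Sensor/s2"}]
sensordata = [{"id": i, "data": "Large amount of data"} for i in range(100)]

def join_sensor_data(location_name):
    """Mimic the AQL query: filter locations, traverse one step outbound,
    then join the reached sensors with the large collection on `id`."""
    results = []
    for loc in locations:
        if loc["location"] != location_name:
            continue
        # 1..1 OUTBOUND traversal: follow edges that start at this location
        for edge in sensor_location:
            if edge["_from"] != loc["_id"]:
                continue
            sensor = sensors[edge["_to"]]
            # Join with the large collection
            for row in sensordata:
                if row["id"] == sensor["id"]:
                    results.append({"sensor": sensor["id"], "data": row["data"]})
    return results

print(join_sensor_data("CA"))
```

In a cluster, the two inner steps are the interesting part: the traversal touches only the small, replicated graph, while the final loop reads the large collection.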

If you would like to dive deeper into this example, feel free to use the Explain feature from the ArangoDB Web UI.

If you have been running the Colab up to this point, simply use the credentials that were generated for you above.

Otherwise, if you have not run the notebook in Colab, click the Open in Colab button at the top of the page.
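
If you prefer to inspect an explain result from Python, one rough indicator is how many cluster-communication nodes the plan contains. The helper below is a sketch that assumes the explain result is a dictionary with a `plan.nodes` list whose entries carry a `type` field, as returned by ArangoDB's explain API. The sample plan is hand-written and abbreviated for illustration; it is not actual output from this notebook, and `count_cluster_nodes` is a made-up helper name.

```python
from collections import Counter

# Node types that mark data being passed between servers in a cluster plan
CLUSTER_NODE_TYPES = {"RemoteNode", "GatherNode", "ScatterNode", "DistributeNode"}

def count_cluster_nodes(explain_result):
    """Count cluster-communication nodes in an explain result dictionary."""
    nodes = explain_result.get("plan", {}).get("nodes", [])
    return Counter(n["type"] for n in nodes if n.get("type") in CLUSTER_NODE_TYPES)

# Hand-written, abbreviated example input (NOT real output from this notebook)
sample_plan = {"plan": {"nodes": [
    {"type": "SingletonNode"},
    {"type": "EnumerateCollectionNode"},
    {"type": "TraversalNode"},
    {"type": "RemoteNode"},
    {"type": "GatherNode"},
    {"type": "ReturnNode"},
]}}

print(dict(count_cluster_nodes(sample_plan)))
```

Comparing these counts for the same query with and without a SatelliteGraph may give a first hint of how much work moved to the DB-Servers, though the full plan in the Web UI remains the better source.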

Please keep in mind that this database is temporary and will be automatically deleted. If you would like to have a permanent deployment with ArangoDB Oasis, sign up for free!

If you would like to continue exploring ArangoDB and all of the new features of 3.7, you can download the beta here.
