Using SmartGraphs in ArangoDB

This tutorial describes how the SmartGraph feature in ArangoDB works and what benefits you gain from using it.

NOTE: The feature presented here is enterprise-only, so in order to follow along, you need to download the Enterprise Edition of ArangoDB (free for evaluation).

The Dataset

In this tutorial, we are using a dataset of the Pokec social network, originally published by SNAP. In the preceding Pregel Tutorial, we applied a community detection algorithm to find communities in this “chaotic” graph dataset. The algorithm generated an additional numeric attribute community on each of the vertices: all vertices with an identical value form one community. By construction, vertices have more edges to members of the same community and fewer edges to members of other communities.

After this modification, we used arangoexport and the Smartifier (see the Smartifier Tutorial). The Smartifier generated two jsonl files, one for vertices and one for edges, in a SmartGraph-compatible format; we are going to use them within this tutorial. If you have followed the preceding tutorials, you have already generated those files; if not, you can download them here (smartified_pokec.tar.gz, ~1.4 GB). However, it is recommended to follow the preceding tutorials to understand how to transform your existing dataset into a SmartGraph dataset. The provided dataset is rather small for a SmartGraph use case, but large enough to show the performance differences.

ArangoDB Setup

For this tutorial, and specifically for the later performance analysis, we used a minimal high-availability ArangoDB cluster setup consisting of:

  • 3 Coordinators
  • 3 DBServers
  • 3 Agents

We used 3 physical machines, each serving exactly one Coordinator, one DBServer, and one Agent.
They were started with the ArangoDB Starter.
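
With the Starter, such a setup boils down to one command per machine. A minimal sketch, assuming the first machine is reachable as machineA (the data directory is just an example):

  # On the first machine:
  arangodb --starter.data-dir=./arangodb

  # On the second and third machine, join the first one:
  arangodb --starter.data-dir=./arangodb --starter.join machineA

Once all three Starters have found each other, each machine runs one Agent, one DBServer, and one Coordinator.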

The SmartGraph idea

High-performance scaling of a graph database is a challenging topic. We will not even go into the challenges of distributing the data and giving consistency guarantees; for graphs, they are pretty much the same as for documents with distributed joins. Offering highly performant queries is even more of a challenge. In terms of graph queries, there are only two types:

  1. Category A: Batch processing (queries that need computation on large portions of the graph, or even the entire graph)
  2. Category B: Point lookups (queries that start at a specific point (vertex) and search for patterns in a local area)

For Category A queries, ArangoDB offers the Pregel framework, as described in the Pregel Tutorial. This tutorial will focus on Category B queries doing point lookups.

Those queries are typically solved by so-called traversals:

    1. We first find the starting vertex.
    2. We iterate through all connected edges and find the vertex on the other side. Here we apply condition checks to see whether the pattern can still be fulfilled; if so, we continue.
    3. For each vertex, we check the depth and validate whether the pattern requires a deeper search; if so, we repeat from step (2) with this vertex.

So the most costly operations in a traversal are looking up edges and then looking up vertices. This is all fine as long as the data is located on the same physical machine, where these lookups are comparably cheap.
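
In AQL, such a traversal is written declaratively, and the engine performs the steps above internally. A minimal sketch (graph name, start vertex, and filter attribute are placeholders, not part of the dataset):

  FOR v, e, p IN 1..3 ANY 'vertices/startKey' GRAPH 'myGraph'  // step (1): starting vertex, depth 1 to 3
    FILTER p.vertices[1].someAttribute == 'someValue'          // step (2): condition checks on the path
    RETURN v                                                   // step (3): depth handling is done by the 1..3 bound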

In a scale-out environment, the assumption is that it is no longer possible to store the data on a single physical machine; instead, we want to shard the data across several machines, such that each machine is only responsible for a smaller portion.
Now picture the poor traverser standing at a vertex and searching for edges. These edges could be stored on a different machine, so at this point we may require a network hop, which is, of course, expensive (~300x more expensive than an in-memory lookup). After that, it needs to find the connected vertex. With bad luck, that could again be on a different machine, requiring another expensive network hop.

[Image: Querying with bad distribution]

In this image, we see a bad distribution of the graph across the cluster. The path the traverser has to walk is colored in orange. As you can see, traversing this path requires switching between DBServers 6 times.

If we instead invest in a better distribution of the same graph, by storing some vertices on different DBServers, we can potentially speed up the execution. In the new and better distribution, we try to minimize the number of edges that point to data on different machines:

[Image: Querying with good distribution]

This image shows a better distribution of the graph across the machines. The traverser needs to walk an identical path as before, but this time only a single switch of machines is necessary, which gives a significant speedup.
Note here: a single traversal query typically requires computing thousands or millions of such paths, so optimizing the time required for each one adds up nicely.

SmartGraphs in ArangoDB

How can we now achieve such a good distribution?

In most real-world use cases, there is a known attribute on the data that groups vertices nicely together. Typically, this is a geographical region or a customer/product identifier.

If there is no such attribute, it can be computed in a preprocessing step, following the Pregel Tutorial. ArangoDB can then use this attribute to shard the vertices and the edges: the attribute value becomes a prefix of every vertex key (e.g. 73418751:P1098613, where 73418751 is the community), so all vertices of one community are stored in the same shard. In many cases, this has the nice effect that the majority of connected vertices end up on the same physical machine, and edges between machines are rare.
Furthermore, the SmartGraph feature has means to identify which of the neighboring vertices are located on the same machine, and can apply optimizations/computations for those vertices directly on the data instead of sending it across the network.

Importing the dataset

But enough theory. Now let’s get our hands dirty and start with getting the dataset into ArangoDB. It is important to note that the generated data format is SmartGraph-compatible, but importing it will not automatically turn it into a SmartGraph; we could just as well import it in a non-smart way. Let us first import it as a SmartGraph. To do so, we first have to create a new graph: we open the web interface and switch to the Graphs section.

There we click on Add Graph and fill out the dialog as follows:

[Image: Add Graph dialog, SmartGraph creation]

This creates a SmartGraph that is sharded by the attribute community and uses 9 shards for each collection. In this case, we have two collections, profiles_smart and relations_smart. The graph is named pokec_smart, and we will use this name to reference it within our queries.
Note here: the attribute community has to be present on each vertex (in this case, on each profile); it is not required on the edges.
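
If you prefer the command line, the same SmartGraph can also be created from arangosh. A sketch using the smart-graph module (Enterprise Edition only):

  var smartGraph = require("@arangodb/smart-graph");
  smartGraph._create(
    "pokec_smart",
    [smartGraph._relation("relations_smart", "profiles_smart", "profiles_smart")],
    [],  // no additional orphan vertex collections
    { smartGraphAttribute: "community", numberOfShards: 9 }
  );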

Now we need to import the dataset. For this, we open up a console and navigate to the .jsonl files of our dataset. Then we start with importing the profiles:
arangoimp --server.endpoint tcp://localhost:8530 --collection profiles_smart --file profiles.jsonl --type json

And afterwards the relations (the prefix options expand the plain vertex keys in the _from and _to attributes into full document identifiers in profiles_smart):
arangoimp --server.endpoint tcp://localhost:8530 --collection relations_smart --file relations.jsonl --type json --to-collection-prefix profiles_smart --from-collection-prefix profiles_smart

In order to do a performance comparison of standard vs. SmartGraph, we now do an identical import, but instead of the SmartGraph feature we use the general graph, which is also available in the Community Edition. The process is nearly identical: we start again in the web interface and fill out the Add Graph dialog as follows:

[Image: Add Graph dialog, general graph creation]

Now we again have a graph with 9 shards for each collection, but the data will be evenly distributed across those shards, not taking communities into account. In this case, we have two collections, profiles_random and relations_random.
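
Again, this can alternatively be done from arangosh. A sketch using the general-graph module (available in the Community Edition as well):

  var generalGraph = require("@arangodb/general-graph");
  generalGraph._create(
    "pokec_random",
    [generalGraph._relation("relations_random", "profiles_random", "profiles_random")],
    [],  // no additional orphan vertex collections
    { numberOfShards: 9 }
  );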

We use the following import statements (only the collection names are swapped):

arangoimp --server.endpoint tcp://localhost:8530 --collection profiles_random --file profiles.jsonl --type json

arangoimp --server.endpoint tcp://localhost:8530 --collection relations_random --file relations.jsonl --type json --to-collection-prefix profiles_random --from-collection-prefix profiles_random

Now we should have two graphs imported, pokec_random and pokec_smart. They contain identical data, but distribute it differently across the database servers.
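
As a quick sanity check, both imports should report identical document counts. In arangosh:

  db.profiles_smart.count();    // should match db.profiles_random.count()
  db.relations_smart.count();   // should match db.relations_random.count()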

Querying a SmartGraph

In order to query the SmartGraph, no additional knowledge is needed; it can be queried in exactly the same way as a non-smart graph. We can use a traversal statement and pass the name of the graph; the statement is identical for both graphs, e.g. for the random one:

  FOR v, e, p IN 3 ANY 'profiles_random/73418751:P1098613' GRAPH 'pokec_random'
    FILTER p.vertices[1].AGE == 21
    FILTER p.vertices[2].AGE == 22
    FILTER p.vertices[3].AGE == 44
    RETURN {key: v._key, AGE: v.AGE}
Result:
[Image: Query result]

Under the hood, however, SmartGraphs perform a lot of optimizations, taking knowledge of the sharding into account and optimizing the execution steps of the queries with it. Therefore, SmartGraphs typically achieve better performance than randomly sharded graphs.
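
To inspect these optimizations, you can look at a query’s execution plan without running it. A sketch in arangosh, reusing the traversal from above against the SmartGraph:

  db._explain(
    "FOR v, e, p IN 3 ANY 'profiles_smart/73418751:P1098613' GRAPH 'pokec_smart' " +
    "FILTER p.vertices[1].AGE == 21 RETURN v._key"
  );

Comparing the plans for pokec_smart and pokec_random makes the different handling of the two graphs visible.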

Performance comparison

In this chapter, I would like to show what level of speedup we are able to achieve with this improved distribution.
Let’s first get some information about the dataset we are using:

  • Profiles (~1.6 million entries) are vertices with a larger data footprint.
  • Relations (~30.5 million entries) are edges between Profiles, without additional data.

Running our simple community detection on the dataset, the resulting relations show:

  • ~17.5 million edges connecting vertices of the same community.
  • ~13 million edges connecting vertices of different communities.

This is by far not an optimal cut for SmartGraphs (only about 57% of all edges stay within a community), but it will nevertheless give measurable benefits in query time. The optimal ratio for this feature would be all edges within the same community and none between different communities.

Hardware Setup:

In this setup we used three physical machines with the following specs:

  • CPU: 2 * 8 cores @ 2.00 GHz
  • RAM: 64 GB
  • Storage: SSD

Each machine is serving one Coordinator, one DBServer and one Agent.

Performance outcome:

We execute identical queries against the randomly distributed graph and against the SmartGraph. First, we have a query that applies a filter on all vertices of the path:

[Image: Query 1 on the random distribution, with execution time]

[Image: Query 1 on the SmartGraph, with execution time]

With the random distribution, we needed 4.5 s to compute the result. With SmartGraphs, the result could be retrieved in under a second, yielding a performance gain of roughly a factor of 4.5.

The second query checks for different features at different traversal depths; the filter is a bit more complex, but the result is much smaller. So in general, the query is expected to return faster:
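
A sketch of what such a query can look like (the exact query is shown in the screenshots below; the gender and region attributes are assumptions based on the Pokec profile schema, not taken from the measured query):

  FOR v, e, p IN 3 ANY 'profiles_smart/73418751:P1098613' GRAPH 'pokec_smart'
    FILTER p.vertices[1].gender == 1                       // assumed attribute
    FILTER p.vertices[2].region == p.vertices[1].region    // assumed attribute
    FILTER p.vertices[3].AGE > 40
    RETURN v._key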

[Image: Query 2 on the random distribution, with execution time]

[Image: Query 2 on the SmartGraph, with execution time]

Indeed, it resolves the result faster: ~800 ms with the random distribution. With SmartGraphs, however, it only requires ~100 ms, so almost a factor 8 performance gain.

From our analysis, we can see the following:
The more specific the filter conditions are, the more effect the SmartGraph has on performance. In some bad cases, e.g. vertices whose edges mostly point to other communities, the SmartGraph feature does not give a significant boost. Nevertheless, it is never slower than the randomly distributed graph, so it is always worth a shot.

When do SmartGraphs help and when not?

From our experience, we can see that SmartGraphs help in a lot of cases, but they do not solve everything. So in this paragraph, we will shed some light on the upsides and downsides of this feature.

Where it shines:

  • If there is a good distribution of the graph into communities, specifically if you can find/generate an attribute such that most edges point from vertices to other vertices inside the same community, and cross-community edges are rare.
  • Best case: there are no cross-community edges at all. This can, for example, be the case if you have one community per customer.

The reason is (as you can see in the chapter The SmartGraph idea) that cross-community edges most likely require a network hop within the query, whereas intra-community edges can be handled on a single machine, so a lot of optimizations can be applied to those edges.

How can you identify a good distribution?
In many cases, a distribution is inherent in the use case: in social networks, for example, it is most likely that a user has many friends from the same country or region and fewer friends abroad. So an attribute storing the country or region is already in the data and can be used for the sharding directly. The attribute could also be a customer id, if a single customer generates quite a lot of data but customers do not interact with each other. Here it is quite easy to store the customer’s id on every item of their data and use that for distribution.

If no such attribute is inherent in your dataset, you can execute the same steps as in the preceding tutorials, where we tried to identify good communities by the relations of vertices to one another. However, this requires some execution time and some manual effort.

Limitations of this feature:
It does not give a good benefit if your dataset does not have good communities (i.e. it is not possible to find any constellation where many more edges are intra-community than cross-community). However, it should never be worse than the random distribution, so you are still safe.

The feature only scales well if you have at least as many communities as you have shards in the graph, because each community will be stored entirely within one shard. So if you have loads of shards but only a handful of communities, most shards will be empty, and you will waste scaling power. Furthermore, the distribution of data is in most cases not even: most likely you will have bigger shards, where a large community is stored, and smaller shards, where only smaller communities are stored.
“Worst case”: each vertex is more or less its own community, so almost all edges are cross-community. In this case, the random distribution will be as efficient as the SmartGraph one.

Interesting to note:
There is one specific edge case:
If you do not need the cluster to scale, but only need it for high availability, you can create a SmartGraph with a single shard and a higher replication factor.
In this setup, you get (almost) local query performance and synchronous replication with automatic failover, all at once.
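
A sketch of such a setup in arangosh (the graph and collection names here are just examples, not part of this tutorial’s dataset):

  var smartGraph = require("@arangodb/smart-graph");
  smartGraph._create(
    "pokec_ha",
    [smartGraph._relation("relations_ha", "profiles_ha", "profiles_ha")],
    [],
    { smartGraphAttribute: "community", numberOfShards: 1, replicationFactor: 3 }
  );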