In a nutshell, graph databases store schema-free objects (vertices or nodes), which can hold arbitrary data (properties), and the relations between those objects (edges). Edges typically have a direction, going from one object to another. Together, vertices and edges form a network of data points which is called a “graph”.
In discrete mathematics, a graph is defined as a set of vertices and edges. In computing, it is considered an abstract data type which is really good at representing connections or relations – unlike the tabular data structures of relational database systems, which are ironically very limited in expressing relations.
Graph Database Basics
Graphs come in different forms. They can be undirected, directed or form a so-called Directed Acyclic Graph (DAG).
Undirected – edges connect pairs of nodes without having a notion of direction
Directed – edges have a direction associated with them
Directed Acyclic Graph (DAG) – edges have a direction and there are no cycles. One example of a DAG is a tree topology.
Stored edges always have a direction: they point _from one vertex _to another. Seen from a certain vertex, incoming edges are called inbound and outgoing edges are called outbound. During queries, the stored direction can be ignored, so a query can follow edges in any direction.
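To make the idea of stored direction concrete, here is a minimal sketch in plain Python. It is not ArangoDB code: the vertex names and the `neighbors` helper are invented for illustration, and only the `_from` / `_to` attribute names come from the text above.

```python
from typing import Dict, List

# Each stored edge is directed: it points _from one vertex _to another.
# The vertex names are hypothetical sample data.
EDGES: List[Dict[str, str]] = [
    {"_from": "alice", "_to": "bob"},
    {"_from": "carol", "_to": "alice"},
    {"_from": "bob", "_to": "dave"},
]


def neighbors(vertex: str, direction: str = "outbound") -> List[str]:
    """Return the neighbors of `vertex`, following edges in `direction`.

    direction is "outbound", "inbound", or "any" (ignore stored direction).
    """
    if direction not in ("outbound", "inbound", "any"):
        raise ValueError(f"unknown direction: {direction}")
    result = []
    for edge in EDGES:
        # Outbound: the edge starts at this vertex.
        if direction in ("outbound", "any") and edge["_from"] == vertex:
            result.append(edge["_to"])
        # Inbound: the edge ends at this vertex.
        if direction in ("inbound", "any") and edge["_to"] == vertex:
            result.append(edge["_from"])
    return result


print(neighbors("alice", "outbound"))
print(neighbors("alice", "inbound"))
print(neighbors("alice", "any"))
```

The edge itself is stored only once. Whether it counts as inbound or outbound depends on the vertex you look from, and "any" simply accepts both cases.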
Typical Query Patterns in a Graph Database
Graph databases offer specialized algorithms to analyze the relationships of data.
The simplest algorithm is a so-called graph traversal. A graph traversal begins at a defined start vertex, follows edges from vertex to vertex, and ends at a defined depth.
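A depth-limited traversal can be sketched in a few lines of Python. This is an illustration of the general technique (a breadth-first search with a depth cutoff), not a description of how ArangoDB implements traversals. The sample graph and the `traverse` function are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical adjacency list: vertex -> vertices reachable via outbound edges.
GRAPH: Dict[str, List[str]] = {
    "you": ["ann", "ben"],
    "ann": ["cid"],
    "ben": ["cid", "dee"],
    "cid": ["eve"],
    "dee": [],
    "eve": [],
}


def traverse(graph: Dict[str, List[str]], start: str,
             max_depth: int) -> List[Tuple[str, int]]:
    """Visit vertices breadth-first from `start`, up to `max_depth` edges away.

    Returns (vertex, depth) pairs in the order they were visited.
    """
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        vertex, depth = queue.popleft()
        order.append((vertex, depth))
        if depth == max_depth:
            continue  # Do not follow edges beyond the requested depth.
        for nxt in graph.get(vertex, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order


print(traverse(GRAPH, "you", 2))
```

With a maximum depth of 2, the vertex `eve` is never reached because it is three edges away from `you`.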
When filters on the properties of a vertex or an edge are applied during a graph traversal, the pattern matching algorithm is used.
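As a rough sketch of filtering during a traversal, the Python example below only follows edges whose properties satisfy a caller-supplied condition. The edge properties (`type`, `since`) and the `filtered_traversal` function are hypothetical and chosen for illustration. A real graph database would evaluate such filters inside its query engine.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical edges with properties stored directly on the edge.
EDGES: List[Dict] = [
    {"_from": "you", "_to": "ann", "type": "friend", "since": 2015},
    {"_from": "you", "_to": "ben", "type": "colleague", "since": 2019},
    {"_from": "ann", "_to": "cid", "type": "friend", "since": 2020},
    {"_from": "ben", "_to": "dee", "type": "friend", "since": 2021},
]


def filtered_traversal(edges: List[Dict], start: str, max_depth: int,
                       edge_filter: Callable[[Dict], bool]) -> List[str]:
    """Traverse outbound edges from `start`, following only matching edges."""
    reached = []
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue
        for edge in edges:
            # Skip edges that do not start here or do not match the filter.
            if edge["_from"] != vertex or not edge_filter(edge):
                continue
            target = edge["_to"]
            if target not in visited:
                visited.add(target)
                reached.append(target)
                queue.append((target, depth + 1))
    return reached


# Friends and friends of friends, ignoring colleagues.
friends_of_friends = filtered_traversal(
    EDGES, "you", 2, lambda e: e["type"] == "friend")
print(friends_of_friends)
```

Because the filter is applied while traversing, `dee` is not reached here: the only route to `dee` goes through a `colleague` edge that the filter rejects.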
You can also analyze the shortest distance between two given vertices or nodes. This query pattern is called shortest path.
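For an unweighted graph, one common way to find a shortest path is a breadth-first search that remembers how each vertex was reached. The Python sketch below shows that approach under the assumption that every edge counts as one step. The sample graph and the `shortest_path` function are invented for illustration and are not ArangoDB's implementation.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical adjacency list of outbound edges.
GRAPH: Dict[str, List[str]] = {
    "you": ["ann", "ben"],
    "ann": ["cid"],
    "ben": ["eve"],
    "cid": ["eve"],
    "eve": [],
}


def shortest_path(graph: Dict[str, List[str]], source: str,
                  target: str) -> Optional[List[str]]:
    """Return one shortest path from `source` to `target`, or None."""
    if source == target:
        return [source]
    parents: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for nxt in graph.get(vertex, []):
            if nxt in parents:
                continue  # Already reached via a path that is at least as short.
            parents[nxt] = vertex
            if nxt == target:
                # Walk the parent pointers back to the source.
                path = [target]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


print(shortest_path(GRAPH, "you", "eve"))
```

Breadth-first search explores vertices in order of their distance from the source, so the first time the target is reached, the path found uses the fewest possible edges.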
An easy way to imagine a graph is to think about a social network. In a social network, you have friends, and those friends may have other friends besides you (Gasp!). You may even be friends with those ‘other’ friends.
This relationship between you, your friends, and their friends is part of what forms your social network. These connections can easily be translated into a graph, and in fact it can be very useful to structure a social network as a graph. You and your friends would be represented as individual vertices (nodes), and the things that tie you together or describe your relationships would be edges, the lines that connect the nodes.
So the simplest edge would be the line that connects you to a friend. However, what if this connection went one step further and described more about your relationship? You could include details that you have in common, such as the fact that you both love avocados (who doesn’t!?), and then when you wanted to find friends to join you for the Avocado Festival you could easily query that information. This would allow for things such as suggesting new friends, finding events based on your and your friends’ matching interests, or even recording important dates such as the date you became friends or other shared life events.
The details that make up the things you like, the things your friends like, and the things that you share in common can be thought of as the properties of you and your friendships. This concept of modeling your data with descriptive labels is how data is modeled in a property graph. Property graphs use relevant semantic labels to model your data and its connections. This means that data can be structured in a way that is easily understood by a human. Since the data is modeled using relevant terms, it can also be queried in an easy-to-read way. ArangoDB allows for storing information on the vertices as well as on the connecting edges. That’s why you can store the things you and your friends have in common on the edges while keeping the personal properties on the individual vertices.
I used the example of a social network here, but networks exist everywhere. If you would like a full dive into a real-world example using airport and flight data, be sure to take the next step with our Graph Course for Freshers, which takes you from zero knowledge to advanced queries.
Using Graphs in ArangoDB
Unlike many NoSQL databases, ArangoDB is a native multi-model database. You can store your data as key/value pairs, graphs or documents and access any or all of your data using a single declarative query language. You can combine different models in one query. And, due to its native multi-model approach, you can build high performance applications and scale horizontally with all three data models.
ArangoDB as a Graph Database
The graph capabilities of ArangoDB are similar to a property graph database but add more flexibility in terms of data modeling as vertices and edges are both full JSON documents.
For each document, a unique _id attribute is stored automatically. To build a relation (i.e., an edge) between two documents (i.e., vertices), both _id attributes are stored in a special edge document as its _from and _to attributes, forming a directed connection between two arbitrary vertices. Edges are then stored in a special edge collection.
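The shape of such documents can be illustrated with plain Python dictionaries. This is only a sketch of the data layout described above. The collection name `persons`, the property names, and the `make_edge` helper are assumptions made for this example, and in a real deployment the database assigns and stores the `_id` values itself.

```python
import json
from typing import Dict

# Two vertex documents. The _id values follow the "collection/key" form;
# the collection name "persons" is hypothetical.
alice = {"_id": "persons/alice", "name": "Alice", "likes": ["avocados"]}
bob = {"_id": "persons/bob", "name": "Bob", "likes": ["avocados", "hiking"]}


def make_edge(source: Dict, target: Dict, **properties) -> Dict:
    """Build an edge document that links two vertex documents by their _id."""
    edge = {"_from": source["_id"], "_to": target["_id"]}
    edge.update(properties)  # Edges are full documents and can hold data too.
    return edge


friendship = make_edge(
    alice, bob, since="2019-06-01", shared_interests=["avocados"])
print(json.dumps(friendship, indent=2))
```

The edge refers to the vertices only by their `_id`, so the vertex documents themselves stay unchanged when a relation is added or removed.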
ArangoDB enables efficient and scalable graph query performance by using a special hash index on the _from and _to attributes (i.e., an edge index). This allows for constant lookup times. Using an edge index, ArangoDB can process graph queries very efficiently.
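The idea behind a hash index on the `_from` and `_to` attributes can be sketched with two Python dictionaries. This is a simplified illustration of the concept, not ArangoDB's actual implementation. The `EdgeIndex` class and the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class EdgeIndex:
    """Toy edge index: two hash maps keyed by _from and by _to."""

    def __init__(self) -> None:
        self._by_from: Dict[str, List[Dict]] = defaultdict(list)
        self._by_to: Dict[str, List[Dict]] = defaultdict(list)

    def add(self, edge: Dict) -> None:
        # Register the edge under both of its endpoints.
        self._by_from[edge["_from"]].append(edge)
        self._by_to[edge["_to"]].append(edge)

    def outbound(self, vertex_id: str) -> List[Dict]:
        # One hash lookup, independent of the total number of edges.
        return self._by_from.get(vertex_id, [])

    def inbound(self, vertex_id: str) -> List[Dict]:
        return self._by_to.get(vertex_id, [])


index = EdgeIndex()
index.add({"_from": "persons/alice", "_to": "persons/bob"})
index.add({"_from": "persons/alice", "_to": "persons/carol"})
index.add({"_from": "persons/carol", "_to": "persons/bob"})

print([e["_to"] for e in index.outbound("persons/alice")])
print([e["_from"] for e in index.inbound("persons/bob")])
```

Finding the edges of a vertex costs a single dictionary lookup on average, rather than a scan over all edges, which is what keeps each traversal step cheap as the graph grows.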
Graph databases usually store edges connected to vertices directly at the vertex object. In ArangoDB this is handled differently (if you want to take a technical dive into ArangoDB’s approach, see this article about index-free adjacency vs. hybrid indexes).
Vertices and edges are both full JSON documents and can hold arbitrary data. With this approach, combined with the edge index, ArangoDB is one of the few graph databases capable of horizontal scaling. Each edge and vertex can contain complex data in the form of nested properties, and all graph functions are deeply integrated into the ArangoDB Query Language (AQL).
Graph Use Cases
Using the graph database capabilities of ArangoDB opens the door for a lot of interesting use cases, including:
Graph Database Features
ArangoDB provides a broad spectrum of graph database features:
- Graph traversals
- Shortest path(s)
- Pattern matching
- Graph Viewer
- Integrations to Keylines & Cytoscape
- Horizontal Scaling with Graph Data and Queries
- Distributed graph processing via Pregel
ArangoDB supports document, graph, and key/value data models. Due to this natively integrated support, users can also take the result of a JOIN operation, geospatial query, text search or any other access pattern as a starting point for further graph analysis and vice versa – all in one query, if needed. This is an advantage of a native multi-model database like ArangoDB.
A graph can be visualized and manipulated directly within the ArangoDB WebUI. The WebUI provides many configurations for displaying edges and vertices. Here is a view of the IMDB dataset with the search depth set to 4, results limited to 300, the edge visualization type set to curved, and custom vertex and edge labels. This gives a quick view of genres, movies in those genres, and actors who played in those movies.
A nice feature of the Graph Viewer is the ability to select a node and set it as your start node. Here we chose James Cameron as the start node and can now see the movies he was involved in and, depending on the depth set, further relationships from there. So, for this example, we see that he directed both Avatar and Titanic, which in this dataset are both classified as Action movies, and we can also see other Action movies.
We provide this functionality out of the box to make visualizing your data easy. If you would like to take your graph visualization one step further, it’s easy to use other visualization libraries with ArangoDB. Get started with the tutorial on how to visualize your data with Cytoscape.
If you’re interested in learning how to access the graph capabilities of ArangoDB, the ArangoDB Graph Course is a great place to start.
Access to graph functionality isn’t enough when using a database. The database must also perform well. ArangoDB is built from the ground up as a native multi-model database, and in order to be a suitable solution, it needs to perform on par with leading single-model databases. ArangoDB supports key/value, document, and graph data models. If you would like to learn more about our multi-model approach, check out the white paper.
To know whether we are competitive with the leading single-model databases, we put together a performance benchmark that is used internally to make sure we are achieving our performance goals. We made this benchmark open-source and publicly available for anyone to inspect and reproduce.
As a native multi-model database, ArangoDB can compete with single-model databases on their home turf. In our performance benchmark, we compared our performance to MongoDB, PostgreSQL, OrientDB, and Neo4j. The benchmark tests include single read and write, aggregation, shortest path, neighbor lookup, and memory utilization.
If you would like to read the full article that goes into detail on how we arrived at these results, have a look at the NoSQL Performance Benchmark.
Scaling with Graphs
As your application grows, chances are the size of your graph will grow along with it. In order to make sure graph traversals stay as performant as possible, even when being sharded across multiple servers in a cluster, ArangoDB provides a solution in the form of SmartGraphs.
The primary hit to performance comes from network latency. As shown below, when doing a traversal with data sharded across a cluster, multiple back and forth network hops are usually necessary and this can have a large impact on performance. For larger datasets this is a common situation and SmartGraphs reduces the needed network hops by intelligently sharding data.
In many datasets there are highly interconnected communities, but few connections between these communities. Whatever logic you apply at the application layer to organize your graph, for instance grouping by customer or by region, can in turn be used to shard the graph across the cluster.
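The effect of community-aware sharding can be illustrated with a small Python sketch. It is not ArangoDB's SmartGraphs algorithm: the sample data, the region attribute, and all function names are assumptions made for this example. It only counts how many edges cross a shard boundary under two placement strategies, since each such edge stands for a potential network hop during a traversal.

```python
from typing import Dict, List, Tuple

# Hypothetical customers, each tagged with a community attribute (a region).
VERTICES: Dict[str, str] = {
    "c1": "emea", "c2": "emea", "c3": "emea",
    "c4": "apac", "c5": "apac",
}

# Dense connections inside each region, one connection between regions.
EDGES: List[Tuple[str, str]] = [
    ("c1", "c2"), ("c2", "c3"), ("c1", "c3"),
    ("c4", "c5"),
    ("c3", "c4"),
]


def shard_by_community(vertices: Dict[str, str],
                       num_shards: int) -> Dict[str, int]:
    """Place all vertices of the same community on the same shard."""
    communities = sorted(set(vertices.values()))
    community_shard = {c: i % num_shards for i, c in enumerate(communities)}
    return {v: community_shard[c] for v, c in vertices.items()}


def shard_round_robin(vertices: Dict[str, str],
                      num_shards: int) -> Dict[str, int]:
    """Spread vertices evenly over shards, ignoring communities."""
    return {v: i % num_shards for i, v in enumerate(sorted(vertices))}


def cross_shard_edges(edges: List[Tuple[str, str]],
                      placement: Dict[str, int]) -> int:
    """Count edges whose endpoints live on different shards."""
    return sum(1 for s, t in edges if placement[s] != placement[t])


print(cross_shard_edges(EDGES, shard_round_robin(VERTICES, 2)))
print(cross_shard_edges(EDGES, shard_by_community(VERTICES, 2)))
```

In this toy dataset, placing vertices by region leaves only the single inter-region edge crossing shards, while the community-agnostic placement splits most edges across shards.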