Some graph database vendors propagandize index-free adjacency for the implementation of graph models. There has been some discussion on Wikipedia about what makes a database a graph database. These vendors tried to push the definition of index-free adjacency as foundation of graph databases, but were stopped by the community.
Therefore a graph database remains “a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data” – independent of the way the data is stored internally. It’s really the model and the implemented algorithms that matter.
Interested in trying out ArangoDB? Fire up your database in just a few clicks with ArangoDB Oasis: the Cloud Service for ArangoDB. Start your free 14-day trial here.
Still, the question remains if index-free adjacency as an implementation detail is a suitable concept for a graph database. The idea is that each node contains a list of pointers of its edges, therefore avoiding look-ups. In a distributed world, it’s clear that such a definition is meaningless, as the edges can live on different servers and there is no such thing as a “pointer” of an edge. In this case, all potential improvements discussed below are negligible in comparison to the communication latency, so much cleverer algorithms are needed in the distributed case.
So, let’s concentrate on the single server case and dive into the computer science behind it.
The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n). While this sounds plausible at first, it is simply wrong. Using a novel index, which combines hashes with linked-list, it is possible to gain the same complexity O(n) when traversing the whole graph. However, index-free adjacency has some severe pitfalls. It has drawbacks when there exist super-nodes. Access can be done fast, but deleting or modifying edges of such nodes becomes a nightmare. If you have a typical social network, then celebrities will have millions of followers. If you hit such a node, modifying operations suddenly become too complex to handle without indexes.
Can one do better? Yes, by using our novel hybrid index, one gets the best of both worlds. By combining a hash-index with a linked-list we were able to achieve the same access speed as an index-free implementation, while being way more efficient when deleting or modifying, especially when dealing with edges of super-nodes. Let’s explore the complexity analysis behind this approach.
If you store the vertices at each node as list of direct pointers, then traversing all neighbors has complexity O(k), if a vertex has k edges. Note that this is the best possible complexity because O(k) is the size of the answer. Deleting a single edge also has the same complexity of O(k) (assuming a doubly linked list), which is much worse than optimal. Furthermore, usually one will want to be able to traverse edges in both directions, which makes it necessary to store direct pointers on both vertices that are incident with an edge. A consequence of this is that deleting a supernode is even worse: To remove all incident edges one has to visit every adjacent vertex – and perform a potentially expensive removal operation for each of them.
Using indexes naively will not remedy the problem, as the complexity of traversing the edges raises to O(log E) + O(k) where E is the total number of edges. The choice of a hybrid index between hash and linked-list, solves this dilemma. We store all edges in a big hash table, and at the same time we put for each vertex V all those edges that are incident with V into a doubly linked list. Details of the implementation can be found in our github repository.
This allows to proceed as follows: Finding the starting point for the linked list of a vertex is a single hash lookup by the vertex key, which is O(1) with a very good constant. If a vertex has k neighbors, traversing all its neighbors has complexity O(k) because of the linked-list aspect of the index, we simply have to run through the linked list once we have found the start in O(1), but O(1) + O(k) = O(k). For single-edge modifications and single-edge deletions we first look up by the vertex key, in case the edge is the first in the linked list of the vertex, and if this fails to find the edge, we do one further lookup to find the edge by its own key. Thus, we can find the edge in O(1) and then modify or delete it. This is only possible, because the edge index is both a big hash table for all edges and a linked list for each vertex’s neighbours. Overall, the complexity is now O(k) for neighbors – as good as theoretically possible – and O(1) for single-edge modifications – again the best possible result.
Because of the much better complexity for modifications, it is unnecessary to investigate the involved constants. For the traversal case it is worthwhile to do so, though. So, how large are the speedups promised by direct pointers – at least when no modifications of the graph are involved? If the graph is unrealistically small (and fits into the L2 cache), then it is plausible that pointers will be faster. However, the potential savings with a direct pointer approach is negligible, since the direct pointer also has to fetch at least one cache line. Even this small benefit is immediately lost again if one does not use C++ or a similar low-level language.
On the other hand, if you have a large graph that does not completely fit into memory, a drawback of index-free adjacency becomes apparent. If you look at the iconic graph query, namely a pattern matching in a large graph, an index might even be faster, because you do not need to look at these vertices at all. You only need the index itself, which is much smaller than the complete node data. Therefore it is much more memory and cache friendly, which is of most importance with nowadays’ technology.
By selecting the right index, it is possible to reach the same read complexity with superior write complexity.
In addition, using an index keeps the payload out of the main memory. This makes the algorithms much more memory and cache friendly.
If you are interested in performance of such a solution have a look here.