
The ArangoDB Query Language (AQL) was designed to accomplish a few important goals, including:
- Be a human-readable query language
- Be client independent
- Support complex query patterns
- Support all ArangoDB data models with one language
The goal of this guide is to ensure that you get the most out of the above design goals, by providing some suggested best practices. Just like many programming languages, AQL allows the developer to format their code in whatever way comes naturally to them. However, just like with programming languages, there are different style guides and best practices that the community agrees upon. These guidelines can make reading and maintaining code easier and sometimes even more performant.
By the end of this guide you will be aware of what ArangoDB considers best practices for both formatting and performance, allowing you to write fast and clean graph queries in AQL.
This guide is split into two sections:
- Formatting:
  - Syntax formatting
  - Styling conventions
- IMDB Dataset Example Notebook
This guide mostly focuses on best practices, styling, and considerations when writing queries and assumes some prior knowledge of AQL. As such, it will be beneficial if you are already familiar with AQL and performing graph queries.
If you would like to get up to speed with graphs, AQL, and running queries, take a look at the Graph Course for Freshers, which takes you from beginner to advanced graph queries in AQL.
Formatting
The first section of the guide will contain:
- A review of the different graph syntax
- Terms used throughout the guide
- AQL styling conventions
I like your style
As was mentioned in the introduction, AQL has a flexible formatting schema, similar to most popular programming languages. Not requiring strict formatting in AQL was a conscious choice, meant to allow developers the flexibility to write queries in an easy and natural way.
This flexibility is especially helpful for developers new to AQL and removes some of the barriers to learning the query language. However, as your AQL queries start to become a part of your application’s business logic, maintainability and readability gain importance. It becomes important to have rules set up that allow other developers to review and contribute changes as needed. When the business needs change and queries need to be updated or new ones added, having a well-thought-out formatting guide can save time and effort.
This section will lay the foundation for basic styling decisions and then we will expand upon these guidelines in the syntax and examples sections.
The following statement is the first line of a typical graph query in AQL. There are already a few things worth pointing out, along with some insight into why we chose to format it this way.
FOR v, e, p IN 1..1 OUTBOUND
Guideline #1: Capitalization
In AQL we often capitalize the keyword or function being used in the query. This capitalization is an example of a convention and not a requirement. However, variables declared in queries are case-sensitive and typically lowercase.
✘ for v,e,p in 1..1 outbound
✔ FOR v,e,p IN 1..1 OUTBOUND
These guidelines are just that, guidelines, not requirements. The important thing is that you are thinking about your queries in terms of readability and if the lowercase version works for you and your team, that is what is important.
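As a rough illustration of this convention, the Python sketch below uppercases a handful of AQL keywords in a query string. The `AQL_KEYWORDS` set and the `capitalize_keywords` helper are hypothetical names introduced here, the keyword list is deliberately incomplete, and the sketch makes no attempt to skip string literals, comments, or backtick-quoted names.

```python
import re

# Illustrative, incomplete subset of AQL keywords (an assumption for this sketch).
AQL_KEYWORDS = {
    "for", "in", "outbound", "inbound", "any", "graph",
    "filter", "return", "with", "prune", "options", "limit",
}

# Matches identifier-like words; numbers such as 1..1 are left alone.
_WORD = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def capitalize_keywords(query: str) -> str:
    """Uppercase known keywords and leave every other token untouched."""
    def fix(match):
        word = match.group(0)
        return word.upper() if word.lower() in AQL_KEYWORDS else word
    return _WORD.sub(fix, query)


print(capitalize_keywords("for v,e,p in 1..1 outbound"))
# FOR v,e,p IN 1..1 OUTBOUND
```

Variable names keep their original case, which matters because variables in AQL are case-sensitive.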
Guideline #2: Naming
In the following FOR loop we have supplied names for the 3 possible variables to be emitted in a graph traversal.
FOR v,e,p IN 1..1 OUTBOUND
This guideline will most likely be a review for many developers, but it is just as important in AQL as in any other language to choose descriptive and explicit names. In fact, one could argue that instead of choosing nondescript letters for our variable names, we should spell out what each variable holds:
FOR vertex,edge,path IN 1..1 OUTBOUND
This comes down to personal preference and what you decide is most readable.
You can use variable names, or reference document attributes, that match AQL keywords by wrapping them in backticks, for example:
FOR `filter` IN collection
  RETURN `filter`
This functionality exists to help in situations where this conflict cannot be avoided. The ArangoDB recommended guideline is to instead avoid using names that conflict with any AQL keyword.
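When queries are assembled in application code and a conflicting name cannot be avoided, the quoting can be automated. The following is a minimal sketch only: `RESERVED` is an incomplete, hand-picked subset of AQL keywords and `quote_if_keyword` is a hypothetical helper, neither of which comes from ArangoDB or its drivers.

```python
# Illustrative, incomplete subset of AQL keywords (an assumption for this sketch).
RESERVED = {
    "for", "in", "filter", "return", "sort", "limit",
    "let", "collect", "with", "graph", "prune", "options",
}


def quote_if_keyword(name: str) -> str:
    """Wrap a name in backticks when it collides with a known keyword.

    AQL keywords are case-insensitive, so the comparison ignores case.
    """
    if name.lower() in RESERVED:
        return "`" + name + "`"
    return name


print([quote_if_keyword(n) for n in ["filter", "actor", "Sort"]])
# ['`filter`', 'actor', '`Sort`']
```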
Guideline #3: Next Line
When forming AQL queries you have the freedom to add whitespace and line breaks wherever it makes sense to you. For example, if we were to add a FILTER statement:
FOR vertex, edge, path IN 1..1 OUTBOUND
  startVertex GRAPH "graphName"
  FILTER vertex._key == "KeyValue"
  RETURN path
It is convention to use two spaces on the next line after each FOR statement. The purpose of this is to use white space to show where the bulk of the query statements are happening and resembles function declarations in programming languages.
Guideline #4: Commenting
Putting comments in your AQL queries is an easy and inexpensive way to provide clarity to potentially complex queries. AQL supports two styles of commenting:
Single line commenting:
// Your comment here
Multi-line comment (recommended):
/* Your multi-line
   comment here. */
There are certain situations where using the single line commenting format causes issues when attempting to copy-paste queries between systems. Either style will work properly with AQL but using the multi-line style of commenting is more portable and is our recommended style for comments.
Graph Syntax
This section serves two purposes:
- Highlight style and formatting decisions with graph traversals
- Review basic graph syntax
This is the graph syntax example pulled from our documentation and it shows all of the possible options available for graph queries. In this section we will go through each line and clarify the terms being used and the style decisions made.
[WITH vertexCollection1[, vertexCollection2[, ...vertexCollectionN]]]
FOR vertex[, edge[, path]]
  IN [min[..max]]
  OUTBOUND|INBOUND|ANY startVertex
  GRAPH graphName || edgeCollection1, ..., edgeCollectionN
  [PRUNE pruneCondition]
  [OPTIONS options]
WITH
[WITH vertexCollection1[, vertexCollection2[, ...vertexCollectionN]]]
The first line in the query starts with the WITH keyword. WITH is a versatile keyword in AQL and in this context it is required for AQL queries in a cluster. While the WITH keyword is only required for cluster queries it is recommended that you use it with all graph queries. There are a few advantages to this:
- Provides clarity to collections used in query.
- Makes your queries future ready, when/if you move from single server to cluster.
- Read locks collections, avoiding deadlocking queries.
FOR
FOR vertex[, edge[, path]]
There is nothing new to point out here: we capitalize FOR, and in this representation we use the full names for the variables. Remember, it is not required to use these variable letters or names; it is just a convention. You could instead use a, b, c or node, line, route, or whatever works best for you and your team.
IN
IN [min[..max]]
Here again, we capitalize IN and supply the min..max value. We put this on its own line to keep the traversal depth separate from the other parts of the query, which we find improves readability. As a general rule, going to the next line for different portions of the query provides some nice whitespace and can be helpful when you need to make changes.
Direction
OUTBOUND|INBOUND|ANY startVertex
We place the direction keyword on the same line as the startVertex because they both deal with navigation. This is another thing that helps keep the query readable and allows you to think about queries in bite-sized chunks, which is useful for new users reading the query and when you are debugging your own queries.
Graph
GRAPH graphName || edgeCollection1, ..., edgeCollectionN
This line, in the query, is the difference between using a named graph or an anonymous graph. There are some key differences between the two but, style-wise, it is pretty straightforward.
Using the GRAPH keyword followed by the ‘graphName’ provides:
- Readability
- Maintainability by being able to update one graph definition used with multiple queries
- Potential performance decrease when traversing large graphs
Using an anonymous graph provides:
- Query-time flexibility
- Reduced readability
- Performance improvements, due to only traversing the specified collections
As you can see, the decision of which graph to use in AQL is not clear-cut. With a named graph, you keep queries clean and easily manageable, at the cost of a potential loss in performance when traversing graphs that contain a large number of collections.
With an anonymous graph, you trade readability and maintainability for some flexibility and a possible performance gain. You would only improve performance if your graph contains many large collections that don’t need to be traversed in most of your queries.
In AQL, the decision to use a defined named graph instead of an anonymous graph comes down to how your data is modeled and the queries your application needs to run. This is what we will continue to explore throughout this guide.
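To make the two forms concrete, here is a small Python sketch that builds the same one-step traversal against either a named graph or an anonymous graph. The `traversal_query` helper is hypothetical; the names `IMDB` and `imdb_edges` are the graph and edge collection used in the example notebook later in this guide. Names are interpolated directly into the string, so in this sketch they are assumed to come from trusted code rather than user input.

```python
def traversal_query(start, graph=None, edge_collections=None):
    """Build a 1..1 OUTBOUND traversal over a named or an anonymous graph.

    Exactly one of `graph` or `edge_collections` must be given.
    """
    if (graph is None) == (edge_collections is None):
        raise ValueError("pass either graph or edge_collections, not both")
    if graph is not None:
        source = 'GRAPH "{}"'.format(graph)      # named graph
    else:
        source = ", ".join(edge_collections)     # anonymous graph
    lines = [
        "FOR vertex, edge, path",
        "  IN 1..1",
        "  OUTBOUND " + start,
        "  " + source,
        "  RETURN path",
    ]
    return "\n".join(lines)


print(traversal_query("@start", graph="IMDB"))
print(traversal_query("@start", edge_collections=["imdb_edges"]))
```

The only line that differs between the two outputs is the one naming the graph or the edge collections, which is why the choice is mostly about data modeling and performance rather than style.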
Condition and Options
Conditions
The final two lines deal with the conditions for finding the desired documents. This may be done by using FILTER, PRUNE, and any other appropriate statements to narrow down the results for your traversal. These follow the general rules covered previously in this guide and conventionally you would go to the next line for each new statement.
Options
The OPTIONS statement requires you to supply an object, and typically you will see this object written in a JavaScript-like format. This really comes down to what feels most natural to you, but as an example here is how we format it:
OPTIONS {
  bfs: true,
  uniqueVertices: 'path',
  uniqueEdges: 'path'
}
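If the options live in application code, the same layout can be generated from a dictionary. This is a sketch under a few assumptions: `format_options` is a hypothetical helper, the keys are assumed to be plain identifiers that need no quoting, and values are rendered with `json.dumps`, which yields double-quoted strings (AQL accepts both single and double quotes).

```python
import json


def format_options(options: dict, indent: str = "  ") -> str:
    """Render a dict as an OPTIONS object with one attribute per line."""
    lines = ["OPTIONS {"]
    items = list(options.items())
    for i, (key, value) in enumerate(items):
        comma = "," if i < len(items) - 1 else ""
        lines.append("{}{}: {}{}".format(indent, key, json.dumps(value), comma))
    lines.append("}")
    return "\n".join(lines)


print(format_options({"bfs": True, "uniqueVertices": "path", "uniqueEdges": "path"}))
```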
We have covered some of the formatting, styling, and various conventions used in AQL graph traversals in the previous section. While some of it may have seemed obvious, I hope it serves as a good reference for deciding how to structure your queries, from a styling perspective.
In the next section we will put this styling to use and cover performance considerations when writing graph queries. We will take a look at some example queries and review some common pitfalls when coming from other query languages that can help you keep your queries fast and clean.
Examples
Below is an interactive Google Colab notebook that walks you through some examples of ways to keep your graph queries fast, clean, and readable. We use the IMDB dataset and AQL to solve various queries that benefit from our graph AQL best practice guidelines.
%%capture
!git clone https://github.com/cw00dw0rd/ArangoNotebooks.git
!rsync -av ArangoNotebooks/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"
!pip3 install graphviz
import json
import requests
import sys
import oasis
import time
from pyArango.connection import *
from arango import ArangoClient
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials("BestPracticesTutorial")
# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch
database = oasis.connect_python_arango(login)
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])
IMDB Dataset
This notebook uses the IMDB dataset. It is loaded with detailed documents for actors, directors, and movies. The edges link the movies to the actors and movie genres.
You can access the ArangoDB WebUI and ArangoDB Graph Viewer to explore the data further at any time. Just click the link generated above, sign in with your temporary credentials, and you have access to a temporary but fully functional ArangoDB database.
!chmod 755 ./tools/arangorestore
!./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3 --input-directory "./data"
# Setup the IMDB Graph
if database.has_graph('IMDB'):
    IMDB = database.graph('IMDB')
else:
    IMDB = database.create_graph('IMDB')
if not IMDB.has_edge_definition('imdb_edges'):
    IMDB.create_edge_definition(
        edge_collection='imdb_edges',
        from_vertex_collections=['imdb_vertices'],
        to_vertex_collections=['imdb_vertices']
    )
else:
    IMDB.replace_edge_definition(
        edge_collection='imdb_edges',
        from_vertex_collections=['imdb_vertices'],
        to_vertex_collections=['imdb_vertices']
    )
# We set this up to display our graph results in the notebook.
from graphviz import Digraph
from IPython.display import Image
def visualize(result, size='10'):
    graph_name = 'imdb'
    g = Digraph(graph_name, filename=graph_name, format='jpeg', engine='neato')
    g.attr(overlap='false', size=size)
    g.attr('node', shape='circle', fixedsize='false', margin='0', color='blue', style='filled', fillcolor='#dbe2e2', fontname='arial')
    g.attr('edge', shape='arrow', color='gray')
    # Each result item is a path with its own 'vertices' and 'edges' lists.
    for item in result:
        for vertex in item['vertices']:
            g.node(vertex['_id'], label=vertex['label'])
        for edge in item['edges']:
            g.edge(edge['_from'], edge['_to'])
    return g
Let's Get Started
This first example is a simple lookup of the movies that Will Smith has acted in. Take note that in this query we are taking advantage of many of our guidelines:
- Capitalized keywords
- Lowercase variables
- Double space following a FOR statement
- Multi-line comment style
- Verbose variables
Continue reading to take a look at some important performance considerations taken with this query.
aql = database.aql
actorName = 'Will Smith'
# Execute the query
cursor = aql.execute(
    """
    WITH imdb_vertices
    FOR actor IN imdb_vertices
      FILTER actor.name == @name
      FOR vertex, edge, path
        IN 1..1
        OUTBOUND actor /* The actor found from the first FOR loop */
        GRAPH 'IMDB'
        FILTER path.vertices[1].type == 'Movie'
        RETURN path
    """,
    bind_vars={'name': actorName}
)
# Iterate through the result cursor
result = [doc for doc in cursor]
visualize(result)
Indexes with Graph Traversals
The decision to FILTER on the path in this example is an important consideration. In graph traversals, simply filtering on the vertex or edge variable does not utilize indexes.
The query goes from a vertex to an edge, then the edge indicates what the next connected vertex will be. The traversal continues in this way only finding out what the next vertex will be, once it reaches the next edge document. This is a natural process for a graph traversal, but it also means simply looking up all of the necessary documents from the index isn't possible, as it is not known what those documents will be until then.
Instead of filtering on every document that comes along, you can wait until you have a full path, from your start vertex to your destination, and match your criteria against the path. This can significantly improve performance, since this process can utilize indexes.
The following code block compares the AQL optimizer rules applied to two versions of the previous lookup: one query does the FILTER on the path and the other does the FILTER on the vertex.
import itertools
fast_profile = aql.explain(
    """
    WITH imdb_vertices
    FOR actor IN imdb_vertices
      FILTER actor.name == 'Will Smith'
      FOR vertex, edge, path
        IN 1..1
        OUTBOUND actor /* The actor found from the first FOR loop */
        GRAPH 'IMDB'
        FILTER path.vertices[1].type == 'Movie'
        RETURN path
    """
)
slow_profile = aql.explain(
    """
    WITH imdb_vertices
    FOR actor IN imdb_vertices
      FILTER actor.name == 'Will Smith'
      FOR vertex, edge, path
        IN 1..1
        OUTBOUND actor /* The actor found from the first FOR loop */
        GRAPH 'IMDB'
        FILTER vertex.type == 'Movie'
        RETURN path
    """
)
print("Filter on Path ".ljust(50, ' '), "Filter on Vertices")
print("--------------------------------------------------------------------------------")
for rule in itertools.zip_longest(fast_profile['rules'], slow_profile['rules']):
    if rule[0] is not None:
        p1 = (rule[0].ljust(40, ' '))
    else:
        p1 = None
    if rule[1] is not None:
        p2 = (rule[1].ljust(40, ' '))
    else:
        p2 = None
    print(p1, "|".ljust(10, ' '), p2)
Looking at the optimizer rules applied for each traversal shows us two important rules that can be applied.
In the example that filters on the path, the optimizer is able to apply:
- remove-filter-covered-by-traversal
- remove-unnecessary-calculations-2
These rules indicate that because we are filtering on the path, our FILTER was indeed covered by our traversal. This also results in no longer needing to perform calculations on these documents, which is an expensive operation. With larger graph queries, taking advantage of these optimizations results in noticeable performance improvements.
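Reading two rule lists side by side gets harder as they grow, so it can help to print only the rules that differ. The sketch below is one way to do that; `rule_difference` is a hypothetical helper, and the input lists are illustrative stand-ins rather than real explain output (`rule-shared` is a made-up placeholder name). In the notebook you would pass `fast_profile['rules']` and `slow_profile['rules']` instead.

```python
def rule_difference(first_rules, second_rules):
    """Return (only_in_first, only_in_second), preserving the original order."""
    first = set(first_rules)
    second = set(second_rules)
    only_first = [rule for rule in first_rules if rule not in second]
    only_second = [rule for rule in second_rules if rule not in first]
    return only_first, only_second


# Illustrative input only, not captured from a real query plan.
path_rules = [
    "rule-shared",
    "remove-filter-covered-by-traversal",
    "remove-unnecessary-calculations-2",
]
vertex_rules = ["rule-shared"]

only_path, only_vertex = rule_difference(path_rules, vertex_rules)
print("Only with FILTER on path:  ", only_path)
print("Only with FILTER on vertex:", only_vertex)
```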
That's Nice But..
You might be thinking, that's nice but how can I actually make use of this in my queries? That is a fair question and we will continue exploring indexes as well as building graph queries that utilize optimization rules and concepts.
To highlight what this means, the following sections will cover these topics:
- Index Utilization
- Choosing a start vertex
- PRUNE vs FILTER
We have already started the conversation on indexes in graph traversals, and it is an important topic for every query. The goal of the following example is to find all the movies directed by James Cameron and then list the associated actors for those movies. This example is a bit more complex as we start nesting FOR loops. It hopefully starts to show the benefits of some of the formatting guidelines, such as whitespace and capitalization.
directorName = "James Cameron"
results = aql.execute(
    """
    WITH imdb_vertices
    FOR director IN imdb_vertices
      FILTER director.name == @name
      LIMIT 1
      FOR movie, edge, path
        IN 1..1
        OUTBOUND director
        GRAPH "IMDB"
        FILTER path.edges[*].`$label` ALL == 'DIRECTED'
        FOR vertex2, edge2, path2
          IN 1..1
          INBOUND movie
          GRAPH "IMDB"
          FILTER path2.edges[*].`$label` ALL == 'ACTS_IN'
          RETURN path2
    """,
    bind_vars={'name': directorName}
)
visualize(results, size='13')
Start Vertex
In AQL it is required that you choose a start vertex as a place where your traversal will actually start. The choice of your start vertex can have a big impact on the performance of your queries.
The rule for your start vertex is that specificity is king.
There are a number of ways I could have attempted to achieve similar results for this query. The query could have started with an actor (perhaps the least specific option) or a movie, but instead it starts with the director. This is of course logical, considering we only want the actors and movies that this director was involved with. It is also a good idea to start with the actual director because this allows us to search on the edges for the DIRECTED label, which can be covered by an index with _from and $label.
Some questions to ask before a query like this are:
- What exactly is the intended result of the query?
- How is our data modeled?
- What is the specificity of our start vertex?
This dataset doesn’t contain any ‘super nodes’, nodes with a very large number of inbound and outbound edges, but that is something to keep in mind.
You never want to use a super node as a start vertex.
If you start at a vertex that has a very high number of connecting edges, your traversal will need to travel down each path and this can result in long running queries. Try to start at the most relevant place possible, such as a single director, and then filter down from there.
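One way to get a feel for which vertices are risky start points is to count how many edges touch each vertex. The sketch below does this in plain Python over a list of edge documents; `edge_counts` is a hypothetical helper, and the edge documents are made up for illustration and are not taken from the IMDB dataset.

```python
from collections import Counter


def edge_counts(edges):
    """Count how many edges touch each vertex (inbound plus outbound)."""
    counts = Counter()
    for edge in edges:
        counts[edge["_from"]] += 1
        counts[edge["_to"]] += 1
    return counts


# Made-up edge documents for illustration only.
edges = [
    {"_from": "v/a", "_to": "v/hub"},
    {"_from": "v/b", "_to": "v/hub"},
    {"_from": "v/c", "_to": "v/hub"},
    {"_from": "v/a", "_to": "v/b"},
]

print(edge_counts(edges).most_common(1))
# [('v/hub', 3)]
```

A vertex whose count dwarfs the others is a candidate super node, and is usually better reached from a more specific start vertex than used as one.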
Matching Documents
For this example we start with Keanu Reeves, look at the movies he has acted in, and then look at the movies his co-stars have acted in, all with the action genre type. If you are going to watch action movies, you might as well watch Keanu Reeves or at least the friends of Keanu Reeves.
cursor = aql.execute(
    """
    WITH imdb_vertices
    FOR actor IN imdb_vertices
      FILTER actor.name == 'Keanu Reeves'
      FOR v,e,p
        IN 1..3
        ANY actor
        GRAPH "IMDB"
        PRUNE p.vertices[*].genre ALL == "Action"
        FILTER v.genre == "Action"
        FILTER p.vertices[*]._key ALL != "action"
        LIMIT 100
        RETURN p
    """
)
visualize(cursor, size='12')
Two common ways of matching documents in a graph traversal are:
- PRUNE
- FILTER
It is easy to think that PRUNE might be used in place of FILTER or vice versa. However, rather than forcing a choice between them, these two keywords make a powerful team. In the example above we use PRUNE as a way to quickly stop traversing along a path if that path doesn’t contain all genre types of ‘Action’.
This PRUNE is combined with a FILTER on the vertex that makes sure the final vertex is also of genre type ‘Action’. The final FILTER is due to the fact that this dataset contains a genre document that can potentially link unrelated films and actors.
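The difference between the two keywords is easier to see in a toy model. The sketch below is plain Python, not ArangoDB's implementation: `traverse`, its `prune` and `keep` callbacks, and the small graph are all made up for illustration. It assumes the documented AQL behavior that a path matching the PRUNE condition is still a candidate result but is not extended any further, while a FILTER only decides whether a path is returned.

```python
def traverse(graph, start, max_depth, prune=None, keep=None):
    """Enumerate paths of 1..max_depth edges from start, depth first.

    prune(path) is True  -> path may be returned, but is not extended.
    keep(path)  is False -> path is not returned, but is still extended.
    """
    results = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) > 1 and (keep is None or keep(path)):
            results.append(path)
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached
        if prune is not None and prune(path):
            continue  # stop exploring below this path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in path:  # no repeated vertices on a path
                stack.append(path + [neighbor])
    return results


# Made-up adjacency list: A -> B -> D and A -> C -> E.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}

stop_at_b = lambda path: path[-1] == "B"   # plays the role of PRUNE
not_c = lambda path: path[-1] != "C"       # plays the role of FILTER

print(sorted(traverse(graph, "A", 3, prune=stop_at_b)))
# [['A', 'B'], ['A', 'C'], ['A', 'C', 'E']]
print(sorted(traverse(graph, "A", 3, prune=stop_at_b, keep=not_c)))
# [['A', 'B'], ['A', 'C', 'E']]
```

In the first call the path to D is never visited because exploration stops at B. In the second call the path ending at C is dropped from the results, yet the traversal still continues through C to reach E.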
This query lets us find very specific and fine-tuned results. However, it is this sort of search criteria that can make following a specific guideline or rule for query performance difficult. These examples, along with the formatting guidelines, can serve as a base for improving the way you think about formatting your graph queries and your overall query performance.