home shape

Pattern Matching

This is an add-on to our Graph Course and uses the same example dataset of airports and flights. We will construct a complex pattern matching query step by step to find the best flights between two airports with the lowest total travel time. For instructions how to import and work with the dataset, as well as an introduction to graphs and graph queries please check out the course!

right blob img min

Why not a Shortest Path Query?

In a shortest path query, it is not possible to apply filters because of the algorithm used under the hood to efficiently determine one shortest path between two vertices. It can optionally take a weight attribute into account to choose one shortest path over the other, but that’s about it. However, you can use a shortest path query to determine the minimal length for such a path.

RETURN LENGTH(
  FOR v IN OUTBOUND
  SHORTEST_PATH 'airports/BIS' TO 'airports/JFK' flights
  RETURN v
)

This length minus 1 can be used as traversal depth in a separate query. If the result of above query is 3, then use a depth of 2. You start the traversal from the same start vertex and add a filter to match the same end vertex (also see Multiple Path Search). You may add additional filters to refine the search. If you traverse a graph to find paths fulfilling complex conditions then we call this kind of search pattern matching.

small left bkgd img

Excursion – Storing Results in Variables

Results of simple expressions as well as of entire subqueries can be stored in variables. To declare a variable, use the LET keyword followed by the variable name, an equal sign and the expression. The code must be in parentheses if the expression is a subquery.

LET h = FLOOR(f.DepTime / 100)
LET m = f.DepTime % 100
LET formatted = CONCAT(h, ':', m)
RETURN { hours: h, minutes: m, formatted }

The time format is like 1245. Hours and minutes can be separated with some math: divide by 100 and round off the decimal places for the hours part and modulo 100 for the division remainder to get the minutes part.

The calculated values are stored in two different variables h and m. These variables are then used to create another variable formatted like „12:45“ and all three are returned as result.

RETURN { myVariable } is a short form of RETURN { myVariable: myVariable } if you wondered.

In below example, the hours and minutes of the departure time are pre-calculated and stored in the variables h and m. They are used further down in the RETURN statement to create an ISO timestamp (DATE_ISO8601() documentation) together with the date attributes of the data – ignoring time zones and daylight saving time; we will later use the included UTC timestamps for time calculations:

FOR f IN flights
  FILTER f._from == 'airports/BIS'
  LIMIT 100
  LET h = FLOOR(f.DepTime / 100)
  LET m = f.DepTime % 100
  RETURN {
    year: f.Year,
    month: f.Month,
    day: f.Day,
    time: f.DepTime,
    iso: DATE_ISO8601(f.Year, f.Month, f.Day, h, m)
  }

Your results in Table view should look like this:

flights let query
right blob img min

Finding the Best Flight Step by Step

The real-world question we try to answer is:

What are the best connections between the airports Bismarck Municipal (BIS) and John F. Kennedy International (JFK) determined by the lowest total travel time?

Answering this question is a bit more complex so let’s go through it step-by-step and extend the query each time.

Step 1 – Flights from BIS to JFK

We just want to get from BIS to JFK – we don’t care about time, day or month yet.

Note: We already know that 2 steps is the shortest path, that’s why IN 2 OUTBOUND is used as minimal and maximal traversal depth.

FOR v, e, p IN 2 OUTBOUND 'airports/BIS' flights
  FILTER v._id == 'airports/JFK'
  LIMIT 5
  RETURN p

The result shows a graph but when switching to JSON view you will see 5 different flight plans from BIS → JFK

Step 2 – Only Flights on January First

We make sure that we fly on the same date on each segment, here on New Year (Month = 1, Day = 1).

FOR v, e, p IN 2 OUTBOUND 'airports/BIS' flights
  FILTER v._id == 'airports/JFK'
  FILTER p.edges[*].Month ALL == 1
  FILTER p.edges[*].Day ALL == 1
  LIMIT 5
  RETURN p

The result now shows two possibilities. In this case we could also fly via MSP (Minneapolis).

By the way: ALL == is an array comparison operator.

flights BIS JFK via DEN or MSP
right blob min

Step 3 – Calculate the Flight Time

We make sure that we pick those flights which have the lowest total flight time. For that, we calculate the difference between the departure time at start airport and the arrival time at the target airport in minutes with the DATE_DIFF() function. We use the result to sort in ascending order. Finally, we return the top 5 flights as well as the flight time.

FOR v, e, p IN 2 OUTBOUND 'airports/BIS' flights
  FILTER v._id == 'airports/JFK'
  FILTER p.edges[*].Month ALL == 1
  FILTER p.edges[*].Day ALL == 1
  LET flightTime = DATE_DIFF(p.edges[0].DepTimeUTC, p.edges[1].ArrTimeUTC, 'i')
  SORT flightTime ASC
  LIMIT 5
  RETURN { flight: p, time: flightTime }

Have a look at the results. You will see that some flight times are negative.

flights negative flight time

We need to make sure that departure and arrival time don’t overlap.

right blob img min

Step 4 – No Overlapping Flights

Let’s put in the final part of the query to get the best flights to JFK. In this case we assume that we need 20 minutes to get our connecting flight to JFK. By adding another filter that ensures that the next plane does not take off before we landed with the previous one, plus time for the transit, we won’t see negative flight times anymore and get viable flight connections:

FOR v, e, p IN 2 OUTBOUND 'airports/BIS' flights
  FILTER v._id == 'airports/JFK'
  FILTER p.edges[*].Month ALL == 1
  FILTER p.edges[*].Day ALL == 1
  FILTER DATE_ADD(p.edges[0].ArrTimeUTC, 20, 'minutes') < p.edges[1].DepTimeUTC
  LET flightTime = DATE_DIFF(p.edges[0].DepTimeUTC, p.edges[1].ArrTimeUTC, 'i')
  SORT flightTime ASC
  LIMIT 5
  RETURN { flight: p, time: flightTime }

Have a look at the results and you will see, that our best flight would not be via Denver (DEN) – which a shortest path query may have suggested – but via Minneapolis (MSP).

Optimization with Vertex-Centric Indexes

There are quite a lot of edges the traverser has to inspect and follow if we run our query against the example data. We can improve the query performance significantly, if we add a vertex-centric index to traverse edges of the relevant day only:

  • Click on COLLECTIONS in the ArangoDB WebUI
  • Open the flights collection
  • Click on the Indexes tab
  • Click the green button in the action column to add a new index
  • Set Type to Hash Index
  • Type _from,Month,Day into Fields
  • Leave the index options Unique and Sparse unticked
  • Click the green Create button

 

 

flights edge collection indexes
flights vertex centric index

When the index is ready, go back to the query editor and re-run the final version of our query. You should see a faster query execution time. Click on Explain and look below the Execution plan to see that the new index is being used:

Indexes used:
By  Type  Collection  Unique  Sparse  Selectivity  Fields                       Ranges
2   hash  flights     false   false   1.47 %       [ `_from`, `Month`, `Day` ]  base OUTBOUND

Without the vertex-centric index, all outgoing edges of the departure airport need to be followed using the standard edge index, then checked if they meet our conditions (on a certain day, arriving at desired destination, with a viable transit).

The extra index allows the traverser to quickly lookup the outgoing edges of the departure airport (_from attribute) for a certain day (MonthDay attributes), which eliminates the need to fetch and filter out all edges with flights on different days. It reduces the number of edges that need checking with a cheap index lookup and saves quite some time.

See our documentation about vertex-centric indexes for more details.

Conclusion

We constructed a complex pattern matching query to find the best flights between two particular airports. From here on, you can experiment further with the query on your own. For example, try other airports or conditions, possibly using some of the other attributes included in the dataset. All attributes are described in the Graph Course.

Can you come up with cool pattern matching queries that you want to share? Or do you still have questions? Let us know, we are happy to help!