Joins

So far we have only dealt with one collection (users) at a time. We also have a collection relations that stores relationships between users. We will now use this extra collection to create a result from two collections.

First of all, we’ll query a few users together with their friends’ ids. For that, we’ll use all relations that have a value of friend in their type attribute. Relationships are established by using the friendOf and thisUser attributes in the relations collection, which point to the userId values in the users collection.

Join tuples

We’ll start with a SQL-ish result set and return each tuple (user name, friends userId) separately. The AQL query to generate such result is:

FOR u IN users
    FILTER u.active == true
    LIMIT 0, 4
    FOR f IN relations
      FILTER f.type == @friend && f.friendOf == u.userId
      RETURN {
        "user" : u.name,
        "friendId" : f.thisUser
      }
Show query results
[
  {
    "user": "Abigail",
    "friendId": 3
  },
  {
    "user": "Abigail",
    "friendId": 2
  },
  {
    "user": "Abigail",
    "friendId": 4
  },
  {
    "user": "Mary",
    "friendId": 4
  },
  {
    "user": "Mary",
    "friendId": 1
  },
  {
    "user": "Mariah",
    "friendId": 2
  },
  {
    "user": "Mariah",
    "friendId": 1
  },
  {
    "user": "Fred",
    "friendId": 2
  },
  {
    "user": "Fred",
    "friendId": 5
  }
]
Hide query results

We iterate over the collection users. Only the ‘active’ users will be examined. For each of these users we will search for up to 4 friends. We locate friends by comparing the userId of our current user with the friendOf attribute of the relations document. For each of those relations found we return the users name and the userId of the friend.

Horizontal lists

Note that in the above result, a user can be returned multiple times. This is the SQL way of returning data. If this is not desired, the friends’ ids of each user can be returned in a horizontal list. This will return each user at most once.

The AQL query for doing so is:

FOR u IN users
  FILTER u.active == true LIMIT 0, 4
  RETURN {
    "user" : u.name,
    "friendIds" : (
      FOR f IN relations
        FILTER f.friendOf == u.userId && f.type == "friend"
        RETURN f.thisUser
    )
  }
[
  {
    "user" : "Abigail",
    "friendIds" : [
      108,
      102,
      106
    ]
  },
  {
    "user" : "Fred",
    "friendIds" : [
      209
    ]
  },
  {
    "user" : "Mary",
    "friendIds" : [
      207,
      104
    ]
  },
  {
    "user" : "Mariah",
    "friendIds" : [
      203,
      205
    ]
  }
]

In this query we are still iterating over the users in the users collection and for each matching user we are executing a subquery to create the matching list of related users.

Self joins

To not only return friend ids but also the names of friends, we could “join” the users collection once more (something like a “self join”):

FOR u IN users
  FILTER u.active == true
  LIMIT 0, 4
  RETURN {
    "user" : u.name,
    "friendIds" : (
      FOR f IN relations
        FILTER f.friendOf == u.userId && f.type == "friend"
        FOR u2 IN users
          FILTER f.thisUser == u2.useId
          RETURN u2.name
    )
  }
[
  {
    "user" : "Abigail",
    "friendIds" : [
      "Jim",
      "Jacob",
      "Daniel"
    ]
  },
  {
    "user" : "Fred",
    "friendIds" : [
      "Mariah"
    ]
  },
  {
    "user" : "Mary",
    "friendIds" : [
      "Isabella",
      "Michael"
    ]
  },
  {
    "user" : "Mariah",
    "friendIds" : [
      "Madison",
      "Eva"
    ]
  }
]

This query will then again in term fetch the clear text name of the friend from the users collection. So here we iterate the users collection, and for each hit the relations collection, and for each hit once more the users collection.

Outer joins

Lets find the lonely people in our database - those without friends.


FOR user IN users
  LET friendList = (
    FOR f IN relations
      FILTER f.friendOf == u.userId
      RETURN 1
  )
  FILTER LENGTH(friendList) == 0
  RETURN { "user" : user.name }
[
  {
    "user" : "Abigail"
  },
  {
    "user" : "Fred"
  }
]

So, for each user we pick the list of their friends and count them. The ones where count equals zero are the lonely people. Using RETURN 1 in the subquery saves even more precious CPU cycles and gives the optimizer more alternatives.

Index usage

Especially on joins you should make sure indices can be used to speed up your query. Please note that sparse indices don’t qualify for joins:

In joins you typically would also want to join documents not containing the property you join with. However sparse indices don’t contain references to documents that don’t contain the indexed attributes - thus they would be missing from the join operation. For that reason you should provide non-sparse indices.

Pitfalls

Since we’re free of schemata, there is by default no way to tell the format of the documents. So, if your documents don’t contain an attribute, it defaults to null. We can however check our data for accuracy like this:

RETURN LENGTH(FOR u IN users FILTER u.userId == null RETURN 1)
[
  10000
]
RETURN LENGTH(FOR f IN relations FILTER f.friendOf == null RETURN 1)
[
  10000
]

So if the above queries return 10k matches each, the result of the Join tuples query will become 100,000,000 items larger and use much memory plus computation time. So it is generally a good idea to revalidate that the criteria for your join conditions exist.

Using indices on the properties can speed up the operation significantly. You can use the explain helper to revalidate your query actually uses them.

If you work with joins on edge collections you would typically aggregate over the internal fields _id, _from and _to (where _id equals userId, _from friendOf and _to would be thisUser in our examples). ArangoDB implicitly creates indices on them.