SmartJoins: boosting cluster join query performance - ArangoDB

Sign up for ArangoDB Cloud

Before signing up, please accept our terms & conditions and privacy policy.

What to expect after you signup
You can try out ArangoDB Cloud FREE for 14 days. No credit card required and you are not obligated to keep using ArangoDB Cloud.

At the end of your free trial, enter your credit card details to continue using ArangoDB Cloud.

If you decide that ArangoDB Cloud is not (yet) for you, you can simply leave and come back later.

SmartJoins: boosting cluster join query performance

SmartJoins: boosting cluster join query performance

The “smart joins” feature available in the ArangoDB Enterprise Edition allows running joins between two sharded collections with performance close to that of a local join operation.

The prerequisite for this is that the two collections have an identical sharding setup.

See our feature video here:

The original Instacart dataset for this video is no longer available but can instead be downloaded from the instacart branch of the interactive tutorials repository.

DATASET: https://github.com/arangodb/interactive_tutorials/tree/instacart

Example setup

Let’s consider an example setup with collections, products and orders. Between these collections there is a 1-to-many relationship, e.g. for each product there can be any number of orders.

To use SmartJoins, the first step is to make the two collections shard their data alike. This can be achieved by making the distributeShardsLike attribute of the orders collection refer to the products collection:

Now both collections will have the same number of shards (3) each. The collection products will be sharded by the values in its _key attribute, and data in collection orders will be sharded by the values in the attribute productId. That latter attribute productId is a foreign key referring to the _key in the products collection.

Now we can populate the collections with some example data:

Data distribution

Due to the identical sharding setup all orders for a product with a given productId value will be located on the same shard as the products they are referring to, which means joins between the two collections that join them via their shard keys can be executed locally.

Let’s check the actual per-shard data distribution of the collections:

It means that on the first shard of products (s2010051) a single product is stored. All orders for this product must be located on the corresponding shard of the orders collection, i.e. s2010055. The second shard of products (s2010049) hosts two products, and its counterpart in the orders collection is responsible for 7 orders for these two products. The third shard of the products collection is not responsible for any data yet, and so is its corresponding shard in the orders collection.

Running a smart join query

Before actually executing the first SmartJoin, let’s create a non-unique index on the productId attribute of the orders collection, so any lookups by product id can be turned into quicker index lookups:

Here is the SmartJoin query, which is just an ordinary join query, with no extra hints added:

The AQL query optimizer will automatically detect that the join is via the two collections’ shard keys, and that the two collections are sharded equal:

As can be seen from the above query execution plan, the query optimizer employed the “smart-joins” optimizer rule, which means it has turned a distributed join into a local join. Thus there is no extra hop via the coordinator necessary to join data of the collections in nodes 3 and 7.

Comparison to regular joins

Compare this to the following execution plan, which is from the same query, just without the SmartJoins optimization applied:

Here we can see that between the two collections (nodes 16 and 7) there is an intermediate hop via the coordinator. This is also the general way of joining two collections with an arbitrary number of shards and unknown data distribution.

Performance comparison

The latter query (the one without SmartJoins) will require 15 network requests for joining the data of the two collections, in the simplest case. The value of 15 stems from these individual requests:

    • coordinator to shard 1 of orders
      • shard 1 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products
    • coordinator to shard 2 of orders
      • shard 2 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products
    • coordinator to shard 3 of orders
      • shard 3 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products

The SmartJoins do not need the extra coordinator hop, and do not need to fan out requests from each shard of one collection to each shard of the other collection. SmartJoins use the fact that the join data will always reside locally, so they do not reach out to the other shards via the network.

A lot of requests can be saved here! For the products and orders example query, the SmartJoin variant can get away with just 3 network requests:

  • coordinator to shard 1 of orders and products
  • coordinator to shard 2 of orders and products
  • coordinator to shard 3 of orders and products

Generally spoken, the performance advantage of SmartJoins compared to regular joins will grow with the number of shards of the underlying collections.

For two collections with n shards each, the minimal number of requests for the general join will be n * (n + 2). The number of network requests increases quadratically with the number of shards. SmartJoins can get away with a minimal number of n requests here, which scales linearly with the number of shards.

SmartJoins will also be especially advantageous for queries that have to ship a lot of data around for performing the join, but that will filter out most of the data after the join. In this case SmartJoins should greatly outperform the general join, as they will eliminate most of the inter-node data shipping overhead.

Learn More About Graph Databases

Read our latest Graph and Beyond  white paper to gain insights into how ArangoDB graph databases can support many use cases.
DOWNLOAD NOW!
close-link
Click Me