Smart Joins: boosting cluster join query performance

SmartJoins: boosting cluster join query performance

The “smart joins” feature available in the ArangoDB 3.5 Enterprise Edition allows running joins between two sharded collections with performance close to that of a local join operation.

The prerequisite for this is that the two collections have an identical sharding setup.

See our feature video here:

Example setup

Let’s consider an example setup with collections, products and orders. Between these collections there is a 1-to-many relationship, e.g. for each product there can be any number of orders.

To use SmartJoins, the first step is to make the two collections shard their data alike. This can be achieved by making the distributeShardsLike attribute of the orders collection refer to the products collection:

Now both collections will have the same number of shards (3) each. The collection products will be sharded by the values in its _key attribute, and data in collection orders will be sharded by the values in the attribute productId. That latter attribute productId is a foreign key referring to the _key in the products collection.

Now we can populate the collections with some example data:

Data distribution

Due to the identical sharding setup all orders for a product with a given productId value will be located on the same shard as the products they are referring to, which means joins between the two collections that join them via their shard keys can be executed locally.

Let’s check the actual per-shard data distribution of the collections:

It means that on the first shard of products (s2010051) a single product is stored. All orders for this product must be located on the corresponding shard of the orders collection, i.e. s2010055. The second shard of products (s2010049) hosts two products, and its counterpart in the orders collection is responsible for 7 orders for these two products. The third shard of the products collection is not responsible for any data yet, and so is its corresponding shard in the orders collection.

Running a smart join query

Before actually executing the first SmartJoin, let’s create a non-unique index on the productId attribute of the orders collection, so any lookups by product id can be turned into quicker index lookups:

Here is the SmartJoin query, which is just an ordinary join query, with no extra hints added:

The AQL query optimizer will automatically detect that the join is via the two collections’ shard keys, and that the two collections are sharded equal:

As can be seen from the above query execution plan, the query optimizer employed the “smart-joins” optimizer rule, which means it has turned a distributed join into a local join. Thus there is no extra hop via the coordinator necessary to join data of the collections in nodes 3 and 7.

Comparison to regular joins

Compare this to the following execution plan, which is from the same query, just without the SmartJoins optimization applied:

Here we can see that between the two collections (nodes 16 and 7) there is an intermediate hop via the coordinator. This is also the general way of joining two collections with an arbitrary number of shards and unknown data distribution.

Performance comparison

The latter query (the one without SmartJoins) will require 15 network requests for joining the data of the two collections, in the simplest case. The value of 15 stems from these individual requests:

    • coordinator to shard 1 of orders
      • shard 1 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products
    • coordinator to shard 2 of orders
      • shard 2 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products
    • coordinator to shard 3 of orders
      • shard 3 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products

The SmartJoins do not need the extra coordinator hop, and do not need to fan out requests from each shard of one collection to each shard of the other collection. SmartJoins use the fact that the join data will always reside locally, so they do not reach out to the other shards via the network.

A lot of requests can be saved here! For the products and orders example query, the SmartJoin variant can get away with just 3 network requests:

  • coordinator to shard 1 of orders and products
  • coordinator to shard 2 of orders and products
  • coordinator to shard 3 of orders and products

Generally spoken, the performance advantage of SmartJoins compared to regular joins will grow with the number of shards of the underlying collections.

For two collections with n shards each, the minimal number of requests for the general join will be n * (n + 2). The number of network requests increases quadratically with the number of shards. SmartJoins can get away with a minimal number of n requests here, which scales linearly with the number of shards.

SmartJoins will also be especially advantageous for queries that have to ship a lot of data around for performing the join, but that will filter out most of the data after the join. In this case SmartJoins should greatly outperform the general join, as they will eliminate most of the inter-node data shipping overhead.

Using smart joins today

SmartJoins are a new feature in the Enterprise Edition of ArangoDB 3.5.

Do you like ArangoDB?
icon-githubStar this project on GitHub.
close-link