home shape

SmartJoins: boosting cluster join query performance

The “smart joins” feature available in the ArangoDB Enterprise Edition allows running joins between two sharded collections with performance close to that of a local join operation.

The prerequisite for this is that the two collections have an identical sharding setup.

See our feature video here:

scroll down line

The original Instacart dataset for this video is no longer available but can instead be downloaded from the instacart branch of the interactive tutorials repository.

DATASET: https://github.com/arangodb/interactive_tutorials/tree/instacart

Example setup

Let’s consider an example setup with collections, products and orders. Between these collections there is a 1-to-many relationship, e.g. for each product there can be any number of orders.

To use SmartJoins, the first step is to make the two collections shard their data alike. This can be achieved by making the distributeShardsLike attribute of the orders collection refer to the products collection:

    db._create("products", { numberOfShards: 3, shardKeys: ["_key"] });
    db._create("orders", { distributeShardsLike: "products", shardKeys: ["productId"] });

Now both collections will have the same number of shards (3) each. The collection products will be sharded by the values in its _key attribute, and data in collection orders will be sharded by the values in the attribute productId. That latter attribute productId is a foreign key referring to the _key in the products collection.

Now we can populate the collections with some example data:

db.products.insert({ _key: "abc-10001", name: "Hyper Booster" });
    db.products.insert({ _key: "hlc-79452", name: "Holographic Horoscope" });
    db.products.insert({ _key: "zxf-19414", name: "Nutty Nunchaku" });
    
    db.orders.insert({ productId: "abc-10001", items: 2, date: "2019-02-17" });
    db.orders.insert({ productId: "abc-10001", items: 1, date: "2019-03-21" });
    db.orders.insert({ productId: "abc-10001", items: 4, date: "2019-03-25" });
    db.orders.insert({ productId: "abc-10001", items: 1, date: "2019-03-26" });
    db.orders.insert({ productId: "hlc-79452", items: 9, date: "2018-12-31" });
    db.orders.insert({ productId: "hlc-79452", items: 1, date: "2019-01-01" });
    db.orders.insert({ productId: "zxf-19414", items: 1, date: "2018-10-25" });
    db.orders.insert({ productId: "zxf-19414", items: 2, date: "2019-03-12" });
    db.orders.insert({ productId: "zxf-19414", items: 7, date: "2019-03-25" });

Data distribution

Due to the identical sharding setup all orders for a product with a given productId value will be located on the same shard as the products they are referring to, which means joins between the two collections that join them via their shard keys can be executed locally.

Let’s check the actual per-shard data distribution of the collections:

 db.products.count(true);
    {
      "s2010051" : 1,
      "s2010049" : 2,
      "s2010050" : 0
    }
 
      db.orders.count(true);
    {
      "s2010055" : 2,
      "s2010053" : 7,
      "s2010054" : 0
    }

It means that on the first shard of products (s2010051) a single product is stored. All orders for this product must be located on the corresponding shard of the orders collection, i.e. s2010055. The second shard of products (s2010049) hosts two products, and its counterpart in the orders collection is responsible for 7 orders for these two products. The third shard of the products collection is not responsible for any data yet, and so is its corresponding shard in the orders collection.

Running a smart join query

Before actually executing the first SmartJoin, let’s create a non-unique index on the productId attribute of the orders collection, so any lookups by product id can be turned into quicker index lookups:

 db.orders.ensureIndex({ type: "hash", fields: ["productId"] });

Here is the SmartJoin query, which is just an ordinary join query, with no extra hints added:

FOR p IN products 
      FOR o IN orders 
        FILTER p._key == o.productId 
        RETURN o

The AQL query optimizer will automatically detect that the join is via the two collections’ shard keys, and that the two collections are sharded equal:

 db._explain("FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o");
 
    Query String:
     FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o
 
    Execution plan:
     Id   NodeType                  Site  Est.   Comment
      1   SingletonNode             DBS      1   * ROOT
      3   EnumerateCollectionNode   DBS      9     - FOR o IN orders   /* full collection scan, 3 shard(s) */
      7   IndexNode                 DBS      0       - FOR p IN products   /* primary index scan, scan only, 3 shard(s) */
     10   RemoteNode                COOR     0         - REMOTE
     11   GatherNode                COOR     0         - GATHER
      6   ReturnNode                COOR     0         - RETURN o
 
    Indexes used:
     By   Name      Type      Collection   Unique   Sparse   Selectivity   Fields       Ranges
      7   primary   primary   products     true     false       100.00 %   [ `_key` ]   (p.`_key` == o.`productId`)
 
    Optimization rules applied:
     Id   RuleName
      1   interchange-adjacent-enumerations
      2   use-indexes
      3   remove-filter-covered-by-index
      4   remove-unnecessary-calculations-2
      5   smart-joins
      6   scatter-in-cluster
      7   remove-unnecessary-remote-scatter

As can be seen from the above query execution plan, the query optimizer employed the “smart-joins” optimizer rule, which means it has turned a distributed join into a local join. Thus there is no extra hop via the coordinator necessary to join data of the collections in nodes 3 and 7.

Comparison to regular joins

Compare this to the following execution plan, which is from the same query, just without the SmartJoins optimization applied:

 db._explain("FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o");
 
    Query String:
     FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o
 
    Execution plan:
     Id   NodeType        Site  Est.   Comment
      1   SingletonNode   DBS      1   * ROOT
     16   IndexNode       DBS      3     - FOR p IN products   /* primary index scan, index only, projections: `_key`, 3 shard(s) */
     14   RemoteNode      COOR     3       - REMOTE
     15   GatherNode      COOR     3       - GATHER
      8   ScatterNode     COOR     3       - SCATTER
      9   RemoteNode      DBS      3       - REMOTE
      7   IndexNode       DBS      3       - FOR o IN orders   /* hash index scan, 3 shard(s) */
     10   RemoteNode      COOR     3         - REMOTE
     11   GatherNode      COOR     3         - GATHER
      6   ReturnNode      COOR     3         - RETURN o
 
    Indexes used:
     By   Name                      Type      Collection   Unique   Sparse   Selectivity   Fields            Ranges
     16   primary                   primary   products     true     false       100.00 %   [ `_key` ]        *
      7   idx_1629018093411893248   hash      orders       false    false        59.52 %   [ `productId` ]   (p.`_key` == o.`productId`)
 
    Optimization rules applied:
     Id   RuleName
      1   use-indexes
      2   remove-filter-covered-by-index
      3   remove-unnecessary-calculations-2
      4   scatter-in-cluster
      5   remove-unnecessary-remote-scatter
      6   reduce-extraction-to-projection
 

Here we can see that between the two collections (nodes 16 and 7) there is an intermediate hop via the coordinator. This is also the general way of joining two collections with an arbitrary number of shards and unknown data distribution.

Performance comparison

The latter query (the one without SmartJoins) will require 15 network requests for joining the data of the two collections, in the simplest case. The value of 15 stems from these individual requests:

    • coordinator to shard 1 of orders
      • shard 1 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products
    • coordinator to shard 2 of orders
      • shard 2 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products
    • coordinator to shard 3 of orders
      • shard 3 of orders to the coordinator
        • coordinator to shard 1 of products
        • coordinator to shard 2 of products
        • coordinator to shard 3 of products

The SmartJoins do not need the extra coordinator hop, and do not need to fan out requests from each shard of one collection to each shard of the other collection. SmartJoins use the fact that the join data will always reside locally, so they do not reach out to the other shards via the network.

A lot of requests can be saved here! For the products and orders example query, the SmartJoin variant can get away with just 3 network requests:

  • coordinator to shard 1 of orders and products
  • coordinator to shard 2 of orders and products
  • coordinator to shard 3 of orders and products

Generally spoken, the performance advantage of SmartJoins compared to regular joins will grow with the number of shards of the underlying collections.

For two collections with n shards each, the minimal number of requests for the general join will be n * (n + 2). The number of network requests increases quadratically with the number of shards. SmartJoins can get away with a minimal number of n requests here, which scales linearly with the number of shards.

SmartJoins will also be especially advantageous for queries that have to ship a lot of data around for performing the join, but that will filter out most of the data after the join. In this case SmartJoins should greatly outperform the general join, as they will eliminate most of the inter-node data shipping overhead.