
Build a Custom BI Connector for ArangoDB

In this tutorial, we will build a custom BI connector for ArangoDB. First, we will write a Foxx service that runs an AQL query to fetch data from ArangoDB, and then extend the service with pagination to allow incremental loading of the data into a BI tool.

This tutorial assumes you have access to a running instance of ArangoDB that already has a collection with data you want to export.

Creating a simple Foxx service

We’ll start by creating a Foxx service from scratch. At its most basic, a Foxx service is just JavaScript code executed by ArangoDB that uses special APIs to define routes other tools can access over HTTP.

Open your favorite code editor and create a new file called index.js.


Setting things up

Start by adding the following code to the top of the file:

"use strict";
const joi = require("joi");
const { aql, db } = require("@arangodb");
const { context } = require("@arangodb/locals");
const createRouter = require("@arangodb/foxx/router");
 
const router = createRouter();
context.use(router);

We’re importing a few things to help us build the service:

  • the joi library will be used for validating request parameters in later parts of this tutorial
  • the aql helper lets us safely write AQL queries without having to worry about bind variables or escaping user input
  • the db object allows us to execute queries and access collections: this is how we access data in ArangoDB
  • the context object represents our service: we need it to be able to actually use the routes we define
  • the router object we create with createRouter lets us attach routes for our service

Generally Foxx services use at least one router and the router may in turn define various routes and/or use other routers itself. In this tutorial we will only define a single route, so one router is all we need.

Returning data

For this tutorial we’ll simply use a query that iterates over a collection called data and returns the first 1000 documents in it. A full introduction to AQL is out of scope for this tutorial, but you can of course substitute a more complex query of your own to adapt the service to your scenario.

Add the following code at the end of the file:

router.get((req, res) => {
  const { query, bindVars } = aql`
    FOR doc IN data
    LIMIT 1000
    RETURN doc
  `;
  const result = db._query(query, bindVars, { fullCount: true });
  const { fullCount } = result.getExtra().stats;
  res.json({
    meta: {
      total_count: fullCount
    },
    records: result.toArray()
  });
});

This defines a GET route for your service that executes the query and returns the results as well as a meta object that for now just contains the total number of documents matched by the query.
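As an illustration, a response from this route might look like the following (the document contents and count are of course made up):

```json
{
  "meta": {
    "total_count": 2
  },
  "records": [
    { "_key": "1", "value": 23 },
    { "_key": "2", "value": 42 }
  ]
}
```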


Installing the service

Our new service can be installed using the ArangoDB web interface or the Foxx CLI.

To install the service using the web interface:

  1. Open the ArangoDB web interface in your browser (e.g. http://localhost:8529 if ArangoDB is running locally).
  2. Enter your ArangoDB credentials and select the database containing your collections.
  3. Select the Services tab on the right, select the Upload tab from the top bar.
  4. Press the Upload File button and select the index.js we just created.
  5. Press the Install button, enter a mount point (e.g. /bi) and press Install.

To install the service using the Foxx CLI use the following command (assuming user root, database _system, mount point /bi and that the index.js file is in the current directory):

foxx install -u root -P -H http://localhost:8529 -D _system /bi index.js

The service should now appear in the Services tab. By clicking on the service, selecting the Settings tab in the top bar and pressing the Replace button you can upgrade the service whenever you make changes to the code.

To upgrade the service using the Foxx CLI you can use the following command:

foxx upgrade -u root -P -H http://localhost:8529 -D _system /bi index.js

To try out our route, click on the service, select the API tab in the top bar, expand the route by clicking on the GET route and then press the Try it out button and the Execute button.

If you have a lot of documents you may have noticed our query only returned the first 1000 documents and we didn’t provide a way of fetching the rest. Because every document we load has to fit into memory, it’s rarely a good idea to fetch every single document at once, especially when you don’t know how many there might be. For this reason most BI tools support pagination or other means of fetching smaller slices of data instead of loading all data at once. This is what we’ll look into in the next steps.

Pagination and slicing

Most BI tools that support incremental or sequential data loading use one of the following three approaches:

  • offset-based pagination where each request specifies an offset in the total list of results
  • page-based pagination where each chunk of results is numbered like pages in a book
  • date ranges restricting results based on a fixed start and end date

In the following steps we’ll implement each one individually. We’ll always start from the simple service we created in the previous section, so feel free to skip ahead if you already know which mechanism your BI tool uses.


Offset-based pagination

Offset-based pagination is the easiest pagination mechanism to implement because it matches exactly how the AQL LIMIT statement works.

We’ll need to add two query parameters to our route:

  • start indicates the offset from which we want to start returning results
  • count indicates the number of results we want to return

To make our route more robust, we’ll want the start offset to default to zero so the first chunk of results is returned if no offset was specified. We’ll also use a low default of 100 records for count as the value can easily be overridden later.

Replace the route definition from earlier with the following:

router.get((req, res) => {
  const { start, count } = req.queryParams;
  const { query, bindVars } = aql`
    FOR doc IN data
    LIMIT ${start}, ${count}
    RETURN doc
  `;
  const result = db._query(query, bindVars, { fullCount: true });
  const { fullCount } = result.getExtra().stats;
  res.json({
    meta: {
      start,
      count,
      total_count: fullCount
    },
    records: result.toArray()
  });
})
.queryParam("start", joi.number().integer().default(0).optional())
.queryParam("count", joi.number().integer().default(100).optional());

Let’s first look at the last two lines. These define the start and count query parameters and ensure that we have meaningful default values when these are omitted and that only whole numbers are allowed. Any other values will be rejected by Foxx with an error.

In the function body we now use req.queryParams to access these two parameters (or their default values) and then we use them in the LIMIT statement instead of just always returning a maximum of 1000 results. Finally we expose the start and count parameters in the meta information. This isn’t strictly necessary but can help for diagnostic purposes.
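To get a feel for how a client would consume this route, here is a minimal sketch (plain JavaScript, independent of the service) of the start offsets successive requests would use, assuming a hypothetical total of 250 matching documents:

```javascript
// Compute the sequence of start offsets a client would request
// to fetch totalCount records in chunks of count records each.
function offsetsFor(totalCount, count) {
  const offsets = [];
  for (let start = 0; start < totalCount; start += count) {
    offsets.push(start);
  }
  return offsets;
}

// A client fetching 250 records 100 at a time issues three requests:
// ?start=0&count=100, ?start=100&count=100, ?start=200&count=100
offsetsFor(250, 100); // [0, 100, 200]
```

The client knows when to stop because total_count is included in the meta object of every response.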

Page-based pagination

Page-based pagination is very similar to offset-based pagination but instead of simply being able to pass query parameters into an AQL LIMIT statement we first need to calculate the starting offset from the page number and requested number of results per page.

We’ll need to add two query parameters to our route:

  • page indicates the page number requested by the user
  • per_page indicates the number of results in each page

To make our route more robust, we’ll want the page offset to default to one so the first page of results is returned if no page was specified. We’ll also use a low default of 100 records for per_page as the value can easily be overridden later.

Replace the route definition from earlier with the following:

router.get((req, res) => {
  const { page, per_page } = req.queryParams;
  const start = (page - 1) * per_page;
  const { query, bindVars } = aql`
    FOR doc IN data
    LIMIT ${start}, ${per_page}
    RETURN doc
  `;
  const result = db._query(query, bindVars, { fullCount: true });
  const { fullCount } = result.getExtra().stats;
  res.json({
    meta: {
      page,
      per_page,
      total_count: fullCount
    },
    records: result.toArray()
  });
})
.queryParam("page", joi.number().integer().default(1).optional())
.queryParam("per_page", joi.number().integer().default(100).optional());

Let’s first look at the last two lines. These define the page and per_page query parameters and ensure that we have meaningful default values when these are omitted and that only whole numbers are allowed. Any other values will be rejected by Foxx with an error.

In the function body we now use req.queryParams to access these two parameters (or their default values) and then we use them to calculate a starting offset we can use in the LIMIT statement instead of just always returning a maximum of 1000 results. Finally we expose the page and per_page parameters in the meta information. This isn’t strictly necessary but can help for diagnostic purposes.
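The page-to-offset calculation is simple but easy to get wrong by one; this standalone sketch mirrors the line in the route handler above:

```javascript
// Translate a 1-based page number into a zero-based start offset,
// mirroring the calculation in the route handler.
function pageToOffset(page, perPage) {
  return (page - 1) * perPage;
}

pageToOffset(1, 100); // 0   (page 1 starts at the beginning)
pageToOffset(3, 50);  // 100 (pages 1 and 2 cover offsets 0-99)
```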


Date ranges

Date ranges can be significantly more complex than simple pagination as they generally rely on dates already being stored in a specific format and don’t explicitly limit the maximum number of results. For this reason we’ll keep the limit of 1000 results in this example to set a hard cap on how many results we are willing to return at a time.

For this example we’ll assume date values are stored in a date property as timestamps, the number of milliseconds elapsed since the start of the UNIX epoch. This allows us to avoid having to convert between different formats. We’ll also assume all data is fairly recent, i.e. that all timestamp values are larger than zero.

We’ll need to add two query parameters to our route:

  • start indicates the (inclusive) minimum timestamp in our results
  • end indicates the (exclusive) maximum timestamp in our results

To make our route more robust, we’ll want the start to default to zero and the end to default to the current timestamp.

Replace the route definition from earlier with the following:

router.get((req, res) => {
  const { start, end } = req.queryParams;
  const { query, bindVars } = aql`
    FOR doc IN data
    FILTER doc.date >= ${start} && doc.date < ${end}
    LIMIT 1000
    RETURN doc
  `;
  const result = db._query(query, bindVars, { fullCount: true });
  const { fullCount } = result.getExtra().stats;
  res.json({
    meta: {
      start,
      end,
      total_count: fullCount
    },
    records: result.toArray()
  });
})
.queryParam("start", joi.number().default(0).optional())
.queryParam("end", joi.number().default(Date.now, "current time").optional());

Let’s first look at the last two lines. These define the start and end query parameters and ensure that we have meaningful default values when these are omitted and that only numbers are allowed. Any other values will be rejected by Foxx with an error. Note that instead of defining a fixed value as the default for end, we use a function that returns the current timestamp.

In the function body we now use req.queryParams to access these two parameters (or their default values) and then we use them in a FILTER statement to restrict which documents will be fetched while still using LIMIT to restrict the maximum number of documents. Finally we expose the start and end parameters in the meta information. This isn’t strictly necessary but can help for diagnostic purposes.
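Note that the FILTER describes a half-open interval: start is inclusive and end is exclusive. This matters for clients that page through time by reusing the previous end as the next start, because no document at the boundary is returned twice. A small sketch of the same comparison:

```javascript
// The route's FILTER uses a half-open interval [start, end):
// start is inclusive, end is exclusive, so consecutive requests
// that share a boundary never return the same document twice.
function inRange(timestamp, start, end) {
  return timestamp >= start && timestamp < end;
}

inRange(1000, 1000, 2000); // true  (start is included)
inRange(2000, 1000, 2000); // false (end is excluded)
```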


Aggregation

In many cases where data is fetched based on specific date ranges it’s preferable to return aggregate values instead of imposing a hard maximum number of results per request or risking overloading the BI tool with too many results.

For this example we’ll assume we’re only interested in a single numeric attribute of our documents called value. We’ll aggregate this value using the AVERAGE function but AQL provides many more aggregation functions if your scenario requires something different.

Because we still want to return multiple values from our query instead of a single aggregate value for the entire date range, we’ll add a millisecond interval query parameter that defaults to 1/1000th of the duration. This means we’ll still get at most 1000 results even if the date range covers millions of documents.

Replace the route definition from earlier with the following:

router.get((req, res) => {
  const { start, end } = req.queryParams;
  const { interval = (end - start) / 1000 } = req.queryParams;
  const { query, bindVars } = aql`
    FOR doc IN data
    FILTER doc.date >= ${start} && doc.date < ${end}
    COLLECT date = FLOOR(doc.date / ${interval}) * ${interval}
    AGGREGATE value = AVERAGE(doc.value)
    RETURN [date, value]
  `;
  const result = db._query(query, bindVars, { fullCount: true });
  const { fullCount } = result.getExtra().stats;
  res.json({
    meta: {
      start,
      end,
      interval,
      total_count: fullCount
    },
    records: result.toArray()
  });
})
.queryParam("start", joi.number().default(0).optional())
.queryParam("end", joi.number().default(Date.now, "current time").optional())
.queryParam("interval", joi.number().optional());

Again starting at the bottom we find the new interval parameter. Note that this time we define the default value in the function body because we need the values of the other parameters to calculate it. We again expose this value in the meta attribute of the response object for diagnostic purposes.

Instead of returning documents we now return tuples for each aggregate consisting of the interval-based timestamp and its aggregate value.
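The COLLECT expression maps every timestamp to the starting timestamp of the interval it falls into. The same bucketing can be sketched in plain JavaScript:

```javascript
// Mirror of the COLLECT expression in the AQL query: every timestamp
// within the same interval maps to that interval's starting timestamp.
function bucket(timestamp, interval) {
  return Math.floor(timestamp / interval) * interval;
}

bucket(1050, 100); // 1000
bucket(1099, 100); // 1000 (same bucket)
bucket(1100, 100); // 1100 (next bucket)
```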

Custom filtering

An easy way to make your custom BI connector more flexible is to provide an option to perform custom filtering with query parameters. In this part of the tutorial we’ll extend the service to provide such a parameter.

Defining the operators

There are a lot of useful operators in AQL but for this example we’ll stick to the basics. Add the following in the source code before the line const router = createRouter():

const OPERATORS = new Map([
  ["lt", aql.literal("<")],
  ["lte", aql.literal("<=")],
  ["gt", aql.literal(">")],
  ["gte", aql.literal(">=")],
  ["eq", aql.literal("==")],
  ["neq", aql.literal("!=")],
  ["in", aql.literal("in")],
  ["nin", aql.literal("not in")]
]);

This gives us a mapping of safe but human-readable names to AQL operators. The aql.literal function converts the strings to something we can use in an AQL query template without having to worry about being misinterpreted as a bind variable.


Extending the query parameters

We want to allow users to specify multiple filters. The easiest way to do this with the existing GET route is by adding a query parameter that takes a JSON value.

If you’ve already added pagination, find one of the lines starting with .queryParam( and add the following immediately before that line. Otherwise remove the semicolon at the end of the call to router.get() and add the following code immediately after it:

.queryParam(
    "filters",
    joi
      .array()
      .items(
        joi
          .object()
          .keys({
            fieldName: joi.string().required(),
            operator: joi.only(...OPERATORS.keys()).required(),
            value: joi.any().required()
          })
          .required()
      )
      .optional()
  )

In plain English this matches an optional JSON array containing objects with three attributes:

  • fieldName: a string value we will use to decide which field to compare
  • operator: one of the operator names we defined earlier
  • value: a value the field will be compared to using the operator.

    For example, this would limit the results to documents with a statusCode field set to either 400 or 500:

[{ "fieldName": "statusCode", "operator": "in", "value": [400, 500] }]

Applying the filter

Find the AQL query in your code. If you haven’t already added pagination it should look like this:

const { query, bindVars } = aql`
  FOR doc IN data
  LIMIT 1000
  RETURN doc
`;

Replace those lines with the following code:

const { filters: rawFilters } = req.queryParams;
 
const filters = rawFilters
  ? rawFilters.map(
      ({ fieldName, operator, value }) =>
        aql`FILTER doc[${fieldName}] ${OPERATORS.get(operator)} ${value}`
    )
  : [];
 
const { query, bindVars } = aql`
  FOR doc IN data
  ${aql.join(filters)}
  LIMIT 1000
  RETURN doc
`;

This code can be a bit difficult to wrap your head around so let’s go through it step by step:

First we take the filters query parameter and call it rawFilters to indicate that it requires some processing before we can put it in the query.

Then we create a filters array. If no filters were specified, this is an empty array. Otherwise we take each item from the parameter and translate it to an AQL FILTER statement.

For the following object:

{"fieldName": "statusCode", "operator": "eq", "value": 500}

this gives us the following statement:

FILTER doc["statusCode"] == 500

Finally the array of FILTER statements is joined into a combined AQL expression and inlined into our query.
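To make the translation concrete, here is a standalone sketch that builds the FILTER statements as plain strings. This is for illustration only: the service itself uses the aql template tag, which additionally turns fieldName and value into bind variables instead of inlining them.

```javascript
// Subset of the operator map, with plain strings instead of aql.literal.
const OPERATORS = new Map([
  ["eq", "=="],
  ["neq", "!="],
  ["in", "in"]
]);

// Build a FILTER statement as a plain string for illustration only;
// the real service lets the aql template tag handle escaping.
function toFilterString({ fieldName, operator, value }) {
  const op = OPERATORS.get(operator);
  return `FILTER doc[${JSON.stringify(fieldName)}] ${op} ${JSON.stringify(value)}`;
}

toFilterString({ fieldName: "statusCode", operator: "eq", value: 500 });
// 'FILTER doc["statusCode"] == 500'
```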

Protecting the service with authentication

Right now everybody who knows the mount point of our service can access its routes. To change this we need to implement the HTTP Basic authentication scheme understood by most BI tools that support HTTP-based connectors. To keep things simple we’ll just define a single set of hardcoded credentials.

Starting with ArangoDB 3.5, Foxx provides a req.auth attribute on the request object that includes parsed credentials if the request was authenticated using HTTP Basic or HTTP Bearer authentication. As this tutorial assumes you are using ArangoDB 3.4, we will need to use the following helper method instead. Add the following in the source code before the line const router = createRouter():

function getAuth(req) {
  const header = req.get("authorization") || "";
  const match = header.match(/^Basic (.*)$/);
  if (match) {
    let credentials = "";
    try {
      credentials = new Buffer(match[1], "base64").toString("utf-8");
    } catch (e) {}
    if (!credentials) return { basic: {} };
    const i = credentials.indexOf(":");
    if (i === -1) {
      return { basic: { username: credentials } };
    }
    return {
      basic: {
        username: credentials.slice(0, i),
        password: credentials.slice(i + 1)
      }
    };
  }
}

This implements the same logic ArangoDB 3.5 uses to parse HTTP Basic authentication, but as a standalone function that also works in previous versions of ArangoDB.
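You can try the same parsing logic outside of ArangoDB. The following is a standalone Node.js version of the helper (using Buffer.from, the modern Node.js equivalent of the Buffer constructor used above), returning null instead of undefined when the header isn’t Basic auth:

```javascript
// Standalone Node.js version of the getAuth helper for experimentation.
function parseBasicAuth(header) {
  const match = (header || "").match(/^Basic (.*)$/);
  if (!match) return null;
  let credentials = "";
  try {
    credentials = Buffer.from(match[1], "base64").toString("utf-8");
  } catch (e) {}
  if (!credentials) return { basic: {} };
  const i = credentials.indexOf(":");
  if (i === -1) return { basic: { username: credentials } };
  return {
    basic: {
      username: credentials.slice(0, i),
      password: credentials.slice(i + 1)
    }
  };
}

// "dXNlcjpwYXNz" is the base64 encoding of "user:pass"
parseBasicAuth("Basic dXNlcjpwYXNz");
// { basic: { username: "user", password: "pass" } }
```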

Now we can use this function to check if the request was authenticated using our hardcoded credentials. Remember to replace YOUR_USERNAME and YOUR_PASSWORD in the following code with strings containing the username and password you want your custom BI connector to use. Find the line containing context.use(router) and add this code after it:

router.use((req, res, next) => {
  const auth = getAuth(req);
  if (!auth || !auth.basic) {
    res.throw(401, "Authentication required");
  }
  const { username, password } = auth.basic;
  if (username !== YOUR_USERNAME || password !== YOUR_PASSWORD) {
    res.throw(403, "Bad username or password");
  }
  next();
});

This adds a middleware function to your router that intercepts all requests and forces Foxx to return an appropriate error response if the request either doesn’t use authentication or uses the wrong credentials.