Crawling GitHub with Promises: ArangoDB Tutorial

The new JavaScript driver no longer imposes any particular promise implementation. It follows the standard callback pattern: every callback receives err and res.

I wanted to give the new driver a try. A GitHub crawler seemed like a good side project, especially because the node-github driver follows the same conventions as the JavaScript driver.

There are a lot of promise libraries out there. The most popular one, according to NPM, was promises. Since it should be possible to use any implementation, I went with this one.

The following source code can be found on GitHub.

Pagination with Promises made easy

The GitHub driver has a function to get all followers of a user. However, the result is paginated. With two helper functions and promises it is straightforward to implement a function that retrieves all followers of a user across all pages.

function extractFollowers (name) {
  'use strict';

  return new Promise(function(resolve, reject) {
    github.user.getFollowers({ user: name }, promoteError(reject, function(res) {
      followPages(resolve, reject, [], res);
    }));
  });
}

The followPages function simply extends the result with the next page until the last page is reached.

function followPages (resolve, reject, result, res) {
  'use strict';

  var i;

  for (i = 0;  i < res.length;  ++i) {
    result.push(res[i]);
  }

  if (github.hasNextPage(res)) {
    github.getNextPage(res, promoteError(reject, function(res) {
      followPages(resolve, reject, result, res);
    }));
  }
  else {
    resolve(result);
  }
}
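
To watch the recursion run without calling the GitHub API, here is a minimal sketch that drives the same two helpers against an in-memory stub. The fakeGithub object, its page data and the index property are made up for this illustration and are not part of node-github; the error helper is reduced to its bare minimum.

```javascript
'use strict';

// Hypothetical page data; each inner array is one "page" of followers.
var pages = [["alice", "bob"], ["carol"], ["dave", "erin"]];

function makePage(index) {
  var page = pages[index].slice();
  page.index = index; // lets the stub find the following page
  return page;
}

// Stand-in for the paginated client (an assumption, NOT node-github itself).
var fakeGithub = {
  getFollowers: function (opts, callback) {
    setTimeout(function () { callback(null, makePage(0)); }, 0);
  },
  hasNextPage: function (res) {
    return res.index + 1 < pages.length;
  },
  getNextPage: function (res, callback) {
    setTimeout(function () { callback(null, makePage(res.index + 1)); }, 0);
  }
};

// Bare-bones callback-to-promise bridge: reject first, resolve second.
function promoteError(reject, resolve) {
  return function (err, res) {
    if (err) { reject(err); } else { resolve(res); }
  };
}

// Collect the current page, then either recurse or resolve with everything.
function followPages(resolve, reject, result, res) {
  Array.prototype.push.apply(result, res);

  if (fakeGithub.hasNextPage(res)) {
    fakeGithub.getNextPage(res, promoteError(reject, function (next) {
      followPages(resolve, reject, result, next);
    }));
  } else {
    resolve(result);
  }
}

function extractFollowers(name) {
  return new Promise(function (resolve, reject) {
    fakeGithub.getFollowers({ user: name }, promoteError(reject, function (res) {
      followPages(resolve, reject, [], res);
    }));
  });
}

extractFollowers("someone").then(function (followers) {
  console.log(followers.join(", ")); // alice, bob, carol, dave, erin
});
```

The promise is resolved exactly once, by whichever call to followPages sees the last page; every earlier call only schedules the next request.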

The promoteError helper is a convenience function that bridges callbacks and promises.

function promoteError (reject, resolve) {
  'use strict';

  return function(err, res) {
    if (err) {
      if (err.hasOwnProperty("message") && /rate limit exceeded/.test(err.message)) {
        rateLimitExceeded = true;
      }

      console.error("caught error: %s", err);
      reject(err);
    }
    else {
      resolve(res);
    }
  };
}

I’ve decided to stick to the sequence reject (aka err) followed by resolve (aka res), like the callbacks. promoteError can be used for the GitHub callbacks as well as for the ArangoDB driver.
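
Because the bridge only relies on the (err, res) convention, it can wrap any callback-style function. The following is a small sketch of that idea: callAsPromise, double and alwaysFails are hypothetical names invented for the example, and promoteError is shown without the rate-limit bookkeeping.

```javascript
'use strict';

// Simplified bridge: reject first, resolve second, like the callbacks.
function promoteError(reject, resolve) {
  return function (err, res) {
    if (err) { reject(err); } else { resolve(res); }
  };
}

// Hypothetical helper: call fn(arg, callback) and hand back a promise.
function callAsPromise(fn, arg) {
  return new Promise(function (resolve, reject) {
    fn(arg, promoteError(reject, resolve));
  });
}

// Two stand-in callback-style functions: one succeeds, one fails.
function double(n, callback) {
  setTimeout(function () { callback(null, n * 2); }, 0);
}

function alwaysFails(n, callback) {
  setTimeout(function () { callback(new Error("rate limit exceeded")); }, 0);
}

callAsPromise(double, 21).then(function (res) {
  console.log("resolved with", res);
});

callAsPromise(alwaysFails, 1).catch(function (err) {
  console.log("rejected with", err.message);
});
```

An error passed to the callback ends up in catch, so a failure deep inside a chain of requests surfaces at the end of the promise chain.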

Queues, Queues, Queues

I only needed a very simple job queue, so queue-it is a good choice. It provides a very simple API for handling job queues:

POST /queue/job
POST /queue/worker
DELETE /queue/job/:key

The new JavaScript driver allows access to arbitrary endpoints. First install a Foxx app implementing the queue microservice in an ArangoDB instance.

foxx-manager install queue-it /queue

Adding a new job from Node.js is now easy:

function addJob (data) {
  'use strict';

  return new Promise(function(resolve, reject) {
    db.endpoint("queue").post("job", data, promoteError(reject, resolve));
  });
}
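
Since addJob returns a promise, several jobs can be queued together and awaited with Promise.all. The sketch below illustrates this under stated assumptions: the db object is a hypothetical stub that only records what would be posted (it is not the ArangoDB driver), addFollowerJobs is an invented helper, and the job shape mirrors the { type, identifier } objects used later in this post.

```javascript
'use strict';

// Hypothetical stand-in for the driver's db object; it records requests
// instead of sending them to a server.
var posted = [];
var db = {
  endpoint: function (mount) {
    return {
      post: function (path, data, callback) {
        posted.push({ url: "/" + mount + "/" + path, data: data });
        setTimeout(function () { callback(null, { queued: true }); }, 0);
      }
    };
  }
};

// Bare-bones callback-to-promise bridge.
function promoteError(reject, resolve) {
  return function (err, res) {
    if (err) { reject(err); } else { resolve(res); }
  };
}

function addJob(data) {
  return new Promise(function (resolve, reject) {
    db.endpoint("queue").post("job", data, promoteError(reject, resolve));
  });
}

// Invented helper: queue one "user" job per follower, wait for all of them.
function addFollowerJobs(followers) {
  return Promise.all(followers.map(function (login) {
    return addJob({ type: "user", identifier: login });
  }));
}

addFollowerJobs(["alice", "bob"]).then(function (results) {
  console.log("queued %d jobs", results.length);
});
```

Promise.all rejects as soon as one of the posts fails, which may or may not be the desired behavior for a crawler that should keep going after a single error.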

Transaction

I wanted to crawl users and their repos. The relations (“follows”, “owns”, “is_member”, “stars”) are stored in an edge collection. I only add an edge if it is not already there. Therefore I check inside a transaction whether the edge exists and add it if it does not.

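createRepoDummy(repo.full_name, data).then(function(dummyData) {
  // Wrap the callback-style call in a promise so that an error rejects the
  // chain instead of being thrown from inside the callback.
  return new Promise(function(resolve, reject) {
    db.transaction(
      "relations",
      String(function(params) {
        // This function runs on the server and only sees params.
        var me = params[0];
        var you = params[1];
        var type = params[2];
        var db = require("org/arangodb").db;

        // Add the edge only if it does not exist yet.
        if (db.relations.firstExample({ _from: me, _to: you, type: type }) === null) {
          db.relations.save(me, you, { type: type });
        }
      }),
      [ meId, "repos/" + data._key, type ],
      promoteError(reject, function() {
        resolve(handleDummy(dummyData));
      }));
  });
})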

Please note that the action function is executed on the server and not in the Node.js client. Therefore we need to pass the relevant data as parameters; closure variables are not available inside the function.
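
The reason is that String(fn) only transports the source text of the function, not the scope it was defined in. A minimal sketch of the effect, runnable in plain Node.js: runElsewhere is a hypothetical stand-in for the server that rebuilds the function from its text, and buildActions and relationType are names invented for the example.

```javascript
'use strict';

function buildActions() {
  var relationType = "follows"; // closure variable, known only on the client

  return {
    // Relies on the closure: breaks once the function travels as text.
    usesClosure: String(function (params) {
      return params[0] + " -" + relationType + "-> " + params[1];
    }),
    // Receives everything through params: safe to ship as a string.
    usesParams: String(function (params) {
      return params[0] + " -" + params[2] + "-> " + params[1];
    })
  };
}

// Stand-in for the server: rebuild the function from its source text in a
// scope that knows nothing about the client's variables.
function runElsewhere(source, params) {
  var fn = new Function("return (" + source + ");")();
  return fn(params);
}

var actions = buildActions();

console.log(runElsewhere(actions.usesParams, ["users/1", "repos/2", "owns"]));

try {
  runElsewhere(actions.usesClosure, ["users/1", "repos/2"]);
} catch (err) {
  console.log("closure version failed with a", err.name);
}
```

The parameter-based version works anywhere because everything it needs travels in the params array, which is the same pattern the transaction above uses for meId, the repo key and the relation type.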

Riding the Beast

Start an ArangoDB instance (e.g. inside a Docker container) and install the simple queue.

foxx-manager install queue-it /queue

Start arangosh and create the collections users, repos and relations.

arangosh> db._create("users");
arangosh> db.users.ensureHashIndex("name");

arangosh> db._create("repos");
arangosh> db.repos.ensureHashIndex("name");

arangosh> db._createEdgeCollection("relations");

Now everything is initialized. Fire up Node.js and start crawling.

node> var crawler = require("./crawler");
node> crawler.github.authenticate({ type: "basic", username: "username", password: "password" })
node> crawler.addJob({ type:"user", identifier:"username" })
node> crawler.runJobs();

Please keep in mind that this is just an experiment. There is no good error handling, and there are no convenience functions for setup and start. It is also not optimized for performance. For instance, it would easily be possible to avoid Node.js / ArangoDB roundtrips by using more transactions.

Sources used in this example:

The source code of this example is available from GitHub: https://github.com/fceller/Foxxmender

If you want to continue with other JavaScript related resources, you should start with ArangoDB NoSQL and JavaScript.

Frank Celler

Frank is both an entrepreneur and a backend developer who has been developing mostly in-memory databases for two decades. He is the CTO and co-founder of ArangoDB. Try to challenge Frank by asking him questions on C, C++ and MRuby. Besides, Frank organizes Cologne’s NoSQL group and is an active member of the NoSQL community.

2 Comments

  1. Paulius Uza on January 8, 2015 at 11:22 am

    There’s a small typo (users -> should be -> repos):
    arangosh> db._create("repos");
    arangosh> db.repos.ensureHashIndex("name");

    • fceller on January 8, 2015 at 1:11 pm

      You are totally right. Fixed the typo, Thanks
