
Nothing performs faster than arangoimport and arangorestore for bulk loading or massive inserts into ArangoDB. However, if you need to do additional processing on each row inserted, this blog will help you build that type of functionality.

If the data source is a streaming solution (such as Kafka, Spark, or Flink) and the data needs to be transformed before being inserted into ArangoDB, this solution provides insight into that scenario as well.

Let’s delve into the specifics.

  • ArangoDB is at version 3.6, and we will assume it has already been installed (I am using a single instance on macOS Catalina; the architecture and performance of the installation are outside the scope of this blog).
  • NodeJS should also be installed. The minimum version for this exercise is v10, which introduced async iteration ("for await (let chunk of readableStream) { … }") over the chunks of a stream. The version used here is v12.13.1.

A good explanation of Async Iteration and the motivation behind it can be found in this article.

The essence of it is that we need an awaitable Promise inside the loop: a regular (non-awaited) for-loop would flood the system with Promise requests faster than they can be resolved, causing the NodeJS heap to blow up. There is obviously an assumption that you also have a reasonably good understanding of the async / await construct in NodeJS.
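To make that concrete, here is a sketch of the difference; insertDocument(), transform(), rows, and readableStream are hypothetical placeholders standing in for the pieces built later in this post:

```js
// Anti-pattern: kicks off one insert Promise per row without waiting.
// On a large file this queues millions of unresolved Promises and the
// NodeJS heap grows until the process falls over.
for (const row of rows) {
  insertDocument(row); // fire-and-forget, nothing is awaited
}

// Awaited version (inside an async function): one insert in flight
// at a time, so memory use stays flat no matter how big the input is.
for await (const chunk of readableStream) {
  await insertDocument(transform(chunk));
}
```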

Preparations:

Create a directory for your NodeJS Application, and another directory under it called datafiles. To keep the demo simple, the data source is a CSV file with only two fields per line:
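For example, the first few rows might look like this (a row number and a short text value; the exact content is arbitrary):

```
1,line_number_1
2,line_number_2
3,line_number_3
```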

I named mine lines_file_100k.csv; as the name suggests, it contains 100k rows. (If you feel adventurous, you can expand this to as many rows as desired: we ran insert tests with some 4.2 million rows while NodeJS memory stayed below 35 MB.)

Below is the bash script used to create the data source file:
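A minimal version of such a script, producing rows in the format shown above (I am calling it gen_lines_file.sh; adjust the row count to taste):

```bash
#!/bin/bash
# gen_lines_file.sh -- writes ROWS lines of "<n>,line_number_<n>"
# into lines_file_100k.csv in the current directory.
ROWS=100000
for ((i = 1; i <= ROWS; i++)); do
  echo "${i},line_number_${i}"
done > lines_file_100k.csv
```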

Place the script in the same datafiles directory created above and enter the following commands at a terminal prompt:
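Assuming the script was saved as gen_lines_file.sh as above:

```bash
cd datafiles
chmod +x gen_lines_file.sh
./gen_lines_file.sh
wc -l lines_file_100k.csv   # sanity check: should print 100000
```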

In ArangoDB, we created the database db0 with the user db0_user and password db0_pass, plus a document-type collection lines_test.
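One way to do that from arangosh, connected as root (the web UI works just as well):

```js
// Create the database together with its user...
db._createDatabase("db0", {}, [
  { username: "db0_user", passwd: "db0_pass", active: true }
]);
// ...then switch into it and create the document collection.
db._useDatabase("db0");
db._create("lines_test");
```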

Now for the NodeJS preparation: you will need some modules set up in your Application directory. For that, copy the package.json file below into the NodeJS Application directory (the parent of datafiles) and then enter the following command at a terminal prompt:
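```bash
npm install
```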

The package.json file:
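A minimal sketch of that file; the only entry the rest of this post actually relies on is "main" (the name, description, and script fields are placeholders, and since the code shown here uses NodeJS core modules only, the dependency list can stay empty):

```json
{
  "name": "arango-stream-insert-test",
  "version": "1.0.0",
  "description": "Streaming row-by-row inserts into ArangoDB over the raw HTTP API",
  "main": "arango_raw_stream_tests.js",
  "scripts": {
    "start": "node arango_raw_stream_tests.js"
  },
  "dependencies": {}
}
```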

Your environment is all set now. After the install you should see a package-lock.json (and a node_modules directory, if any dependencies are listed) under your NodeJS Application directory. We are now ready to start programming.

Caveat: The Basic Authentication for the HTTP POST calls used in this exercise is not advisable for production systems unless the communication is over secure channels.

The premise is to use the bare-bones ArangoDB Web API so that the approach can be translated to the programming language of your choice. The interactive ArangoDB HTTP API documentation can be found in the ArangoDB web interface, under the Rest API tab of the Support section.

Of course, the excellent NodeJS driver for ArangoDB, arangojs, could also be used for this exercise. In fact, the HTTP POST function doing the actual insert here is a tiny, simplified subset of what that driver does.

Create your NodeJS file with the same name used in the "main" entry of the package.json file above (arango_raw_stream_tests.js).

First, we Promisify the HTTP POST function executing the actual document insert:
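A sketch of that function, assuming ArangoDB's default endpoint 127.0.0.1:8529 and the db0 credentials created earlier (the error handling details are illustrative):

```js
const http = require('http');

// Promisified HTTP POST against the document API:
// POST /_db/{database}/_api/document/{collection}
function postDocument(doc) {
  const body = JSON.stringify(doc);
  const options = {
    host: '127.0.0.1',
    port: 8529,
    path: '/_db/db0/_api/document/lines_test',
    method: 'POST',
    auth: 'db0_user:db0_pass', // Basic Auth -- see the caveat above
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body)
    }
  };
  return new Promise((resolve, reject) => {
    const req = http.request(options, (res) => {
      let data = '';
      res.on('data', (chunk) => { data += chunk; });
      res.on('end', () => {
        // 201 = created with waitForSync, 202 = accepted without it
        if (res.statusCode === 201 || res.statusCode === 202) {
          resolve(JSON.parse(data));
        } else {
          reject(new Error(`HTTP ${res.statusCode}: ${data}`));
        }
      });
    });
    req.on('error', reject);
    req.write(body);
    req.end();
  });
}
```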

Our asynchronous HTTP POST is now await-capable and callable from inside an async function.

Next, we create the awaitable NodeJS stream processing part:
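A sketch of that part: the stream is consumed chunk by chunk with for await, a small carry buffer handles lines that straddle chunk boundaries, and the two CSV fields are mapped to hypothetical attributes row_number and row_text:

```js
const fs = require('fs');

async function processFile(path) {
  const stream = fs.createReadStream(path, { encoding: 'utf8' });
  let leftover = '';
  let count = 0;
  // Async iteration over the readable stream: the next chunk is not
  // pulled until every insert for the current chunk has resolved.
  for await (const chunk of stream) {
    const lines = (leftover + chunk).split('\n');
    leftover = lines.pop(); // the last element may be a partial line
    for (const line of lines) {
      if (!line) continue; // skip empty lines
      const [num, text] = line.split(',');
      // Any per-row transformation would go here.
      await postDocument({ row_number: Number(num), row_text: text });
      count++;
    }
  }
  if (leftover) { // flush the final line if the file lacks a trailing newline
    const [num, text] = leftover.split(',');
    await postDocument({ row_number: Number(num), row_text: text });
    count++;
  }
  console.log(`Inserted ${count} documents`);
}
```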

We now have the complete picture of the code: an awaitable stream whose chunks can be processed and transformed, then passed on to an awaitable HTTP POST function.

Data is processed in order, exactly as it is received from the stream.

Here is the code in its entirety (arango_raw_stream_tests.js):
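The complete file is essentially the two sketches above plus an entry point along these lines:

```js
// arango_raw_stream_tests.js -- ties the pieces together:
// postDocument() and processFile() as defined above, then:
(async () => {
  console.time('total insert time');
  try {
    await processFile('./datafiles/lines_file_100k.csv');
  } catch (err) {
    console.error('Insert run failed:', err.message);
    process.exitCode = 1;
  }
  console.timeEnd('total insert time');
})();
```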
