Brainhub: Nested document models with ArangoDB and AQL
As a software development company, we very often work on complex applications that need to handle lots of data. Recently, on one of our projects, we’ve faced a challenge – We had a lot of data on many levels and we had to be able to operate directly on these documents.
Do you want to know how we resolved our problem?
Join us on our journey to find out.
But first, let me introduce our background.
- On the backend:
- NodeJS as the platform
- Express or Koa as an HTTP library/framework
- Various DBMS which fit a particular need, mostly MongoDB and Redis
- RabbitMQ for queuing
- Consul for microservices registration
- GitLab CI or Circle CI as a continuous integration tool
- Google App Engine, Bluemix or own servers for deployment
- On the web frontend:
- Redux Saga
- On the mobile frontend:
- On the desktop frontend:
What kind of problem we have
We have a lot of data on many levels, which means, in a document model, many levels of nested documents. Moreover, we have to be able to operate directly on these nested documents (children, grandchildren, great-grandchildren etc.).
We have to create an API not only for our frontend but also for external integrations. The user should be able to send a JSON schema, which is later used for validation of provided data when creating or updating, and it’s also used to join documents from various collections.
An example of a simple JSON schema:
So there are the following solutions:
- Document database with foreign keys inside the documents:
- Operating on data with many queries
- Operating with a single query if a DBMS permits it
- Relational database:
- Manual serialization/deserialization
- Graph database
- Multi-model document-graph database
What type of DBMS we expect
We would like a DBMS which satisfies:
- Open source
- High performance
- Good community
- Supporting ACID (atomicity, consistency, isolability, durability)
Other useful features are:
- Powerful query language
Potential DBMS to choose
We love open source solutions, so we eliminated DBMS such as Oracle, SQL Server, DB2 and for licensing issues MySQL.
We made a comparison of many No-SQL DBMS (not only for this project but also to have some overview for other projects, the data is as of February 18th, 2018):
We took some DBMS from the top of the rank above but eliminating:
- LevelDB – is a great DBMS but designed for text storage, no document storage, which we need
- PouchDB – though it is higher in this rank than CouchDB, we decided to consider CouchDB instead because PouchDB is generally designed for backend-frontend synchronization, which we don’t need in our app, but CouchDB is a DBMS used typically on the backend side
- Memcached – allows caching data like Redis but after more detailed research, it looks like Redis is the undisputed winner over Memcached. Moreover, we’re looking for a document-based DBMS, so we don’t want to consider more key-value DBMS than Redis
- Firebase Realtime Database – not an Open Source and generally seems not to be enough good to take the risk of having some bugs which we cannot fix
- Neo4j – designed especially for Java but not NodeJS and is not document-based
Moreover, among the SQL DBMS, we decided to include only PostgreSQL in our research because it makes it possible to store JSON-like data, which we need.
Based on the table above and other research, DBMS that seem to suit our needs the most are:
- MongoDB – the most popular document DBMS:
- NodeJS developers have the greatest knowledge of this DBMS
- Great clustering options
- Missing joins – very important for our data; MongoDB Aggregation Framework is very limited and hard to debug
- Missing transactions – we need ACID (though planned in MongoDB 4.0)
- Missing expressive, dedicated query language – queries only in JSON
- Issues can be reported only in Jira but not GitHub, which is less user-friendly for the Open Source community
- RESTful API
- Missing transactions – we need ACID
- Relatively slow performance
- Super fast
- Popular – among the non-relational databases – only MongoDB is more popular
- The most stars on GitHub among all DBMS
- Not designed for durable persistence (as default everything is kept inside RAM)
- No query language
- Very limited queries (only very basic operators like get or incr)
- Very popular
- Supports SQL
- Supports many kinds of data like multi-dimensional arrays and user-defined types
- Is proved to be very mature in production
- Though user-defined functions and data types are very useful, their syntax seems to be from an old epoch like Fortran or Pascal
- PostgreSQL json and jsonb types have low performance in comparison with document databases
Sharding in PostgreSQL is not so easy – it requires either using a fork of the official database engine or an implementation in the application layer
- Like other relational DBMS requires a schema which slows down creation of a simple MVP microservice
Why we chose ArangoDB
ArangoDB seems to be something like MongoDB (we have the most experience in MongoDB) with some extra features. Of course it lacks some MongoDB features like the Aggregation Framework but, in reality, this one is not lacking but replaced with something more user-friendly – AQL + joins.
ArangoDB like MongoDB provides clustering, though the ArangoDB clustering has not proven to work stably on production. One of the key factors was a very active community. It has a very low ratio of open issues to the total number of issues. Moreover, everyone can easily access ArangoDB Slack where the support team is very helpful, and also in Stackoverflow they give adequate responses.
Another reason was that ArangoDB is a multi-model DBMS, which is useful as we were planning to extend our documents with using graphs.
How we have used ArangoDB
We have used the following ArangoDB features:
- Admin UI (only for some debugging purposes)
We are potentially planning to use in the future:
In the ArangoDB shell, we found a very useful feature which doesn’t exist in MongoDB. To learn AQL, no data in the collections was needed because it’s possible to type something like this:
db._query(‘for i in [1,2,3] return i * i’)
Because one of the requirements was to build the data from many collections using the provided JSON schema, we were looking for an ArangoDB query builder.
We found something which was rather unpopular and lacked many features, so we created our own ArangoDB query builder.
We created an abstract interface, so when replacing ArangoDB into another DBMS, only the inner implementation would be changed.
An example code of our query builder:
It was pretty cool that we were able to use some popular libraries like Lodash even on the database server.
ArangoDB provides an HTTP framework named Foxx which simplifies creating microservices that connect to ArangoDB.
However, we decided not to use Foxx because we didn’t want our microservices to be dependent on a database (like it is while using MongoDB + Mongoose or direct connecting between frontend and CouchDB REST API). Instead, we created abstract models which internally use ArangoDB.
This approach proved to be a good choice because later we had to replace ArangoDB into Redis in some critical places of the system in order to remove bottlenecks that were hindering the overall performance.
Our benefits from using ArangoDB
Our drawbacks from using ArangoDB
Unfortunately, ArangoDB was not fast enough to handle a very large number of write/reads in a short period of time (however, even with other DBMS, e.g. MongoDB or MySQL, writing data to the hard disk would be too slow). That’s why in some microservices we decided to replace ArangoDB into Redis, which works in memory.
However, even in this situation, using ArangoDB was a better choice than MongoDB + Mongoose because Mongoose models usually make the entire architecture dependent on DBMS.
On the other hand, we created an abstract model using ArangoDB, so replacing DBMS into another was relatively easy.
Take note that MongoDB can be used with such an abstract model as well – so I just say that ArangoDB + our abstract models can be much better than MongoDB + Mongoose.
Case study originally published on Brainhub Blog
Importance of key characteristics of ArangoDB
|Feature||not important||important||very important|
|AQL / JOINs||x|