Before we started programming the new open source NoSQL database, we reflected on which design objectives to pursue and which to drop. This article summarizes our considerations.

In a nutshell:

  • Schema-free schemas with shapes: Structures inherent in the data are recognized automatically and then optimized.
  • Querying: ArangoDB can perform complex operations on the stored data (query-by-example and a query language).
  • Application server: ArangoDB can act as an application server running JavaScript routines.
  • Mostly memory/durability: ArangoDB is memory-based, with frequent synchronization to the file system.
  • AppendOnly/MVCC: Updates generate new versions of a document; obsolete versions are garbage-collected automatically.
  • ArangoDB is multi-threaded.
  • No indices on file: Only raw data is written to disk.
  • ArangoDB supports single nodes and small, homogeneous clusters with zero administration.

Schema-free schemas with “shapes”

ArangoDB organizes data in documents, storing structure-information/metadata separately from user data.

Structure information is stored only once for all documents that have the same structure. This makes storage efficient and data access fast at the same time: the structure of a document does not have to be determined at access time, because it was already determined when the document was stored, so efficient access code can be generated from the “shape”. These processes run transparently behind the scenes; the developer does not have to deal with them.

The “shapes” concept combines the advantages of schema-free systems with those of fixed schemas.
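
The idea of storing structure once and sharing it across documents can be sketched in a few lines of Python. This is only an illustration of the concept, not ArangoDB's actual implementation; the names ShapeRegistry and Collection, and the choice to define a shape as the sorted attribute names plus value types, are assumptions made for the example.

```python
class ShapeRegistry:
    """Stores every distinct document structure (a 'shape') exactly once."""

    def __init__(self):
        self._ids = {}      # shape -> shape id
        self._shapes = []   # shape id -> shape

    def shape_id(self, doc):
        # Assumption: a shape is the sorted attribute names plus value type names.
        shape = tuple(sorted((name, type(value).__name__) for name, value in doc.items()))
        if shape not in self._ids:
            self._ids[shape] = len(self._shapes)
            self._shapes.append(shape)
        return self._ids[shape]

    def attributes(self, sid):
        return [name for name, _ in self._shapes[sid]]

    def __len__(self):
        return len(self._shapes)


class Collection:
    """Keeps only a shape id and the bare values for each document."""

    def __init__(self):
        self.registry = ShapeRegistry()
        self.rows = []  # list of (shape id, tuple of values)

    def insert(self, doc):
        sid = self.registry.shape_id(doc)
        values = tuple(doc[name] for name in self.registry.attributes(sid))
        self.rows.append((sid, values))
        return len(self.rows) - 1

    def get(self, pos):
        # The attribute layout is looked up from the shape, not re-derived from the data.
        sid, values = self.rows[pos]
        return dict(zip(self.registry.attributes(sid), values))
```

Two documents with the same attributes and types share one shape entry, regardless of the order in which their attributes were written, while a document with a different structure gets its own entry.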

Querying

ArangoDB can run extensive queries. In addition to query-by-example, we provide a query language for this purpose. The language can express queries whose complexity would overburden other approaches syntactically.
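
To make the query-by-example idea concrete: the caller passes a partial document, and every stored document that agrees with it on all given attributes is returned. The following Python function is a minimal sketch of that matching rule; by_example is a hypothetical name and not part of ArangoDB's API.

```python
def by_example(documents, example):
    """Return the documents that match every attribute/value pair in `example`."""
    return [
        doc for doc in documents
        if all(key in doc and doc[key] == value for key, value in example.items())
    ]


people = [
    {"name": "Ada", "city": "London", "age": 36},
    {"name": "Bob", "city": "Berlin", "age": 41},
    {"name": "Eve", "city": "London"},
]
londoners = by_example(people, {"city": "London"})
```

Anything beyond equality matching (ranges, joins, aggregation) is where a query language becomes necessary.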

ArangoDB as application server

ArangoDB can store and execute JavaScript functions within the database as so-called “actions”, independently of user data. Actions are user-defined and therefore highly flexible.

This makes it possible to implement database triggers or even atomic and isolated transactions. More generally, “actions” make it possible to treat documents stored in the database as objects with defined behavior.

Mostly Memory/Durability

Database documents are kept in memory; memory-mapped files are used to store them. This lets the operating system swap rarely used areas out of main memory. By default, these memory-mapped files are synced frequently, so that all documents are stored durably on disk.
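
The mechanism can be illustrated with Python's standard mmap module. This is a sketch of the general technique only; the function names, the fixed file size of 4096 bytes, and the sync-after-every-write policy are assumptions for the example and say nothing about ArangoDB's actual file layout or sync interval.

```python
import mmap


def write_document(path, payload, size=4096):
    """Write `payload` through a memory map and sync it to the file."""
    # Pre-size the file: a mapping cannot extend beyond the file's length.
    with open(path, "wb") as f:
        f.truncate(size)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as mm:
            mm[:len(payload)] = payload  # an ordinary write into mapped memory
            mm.flush()                   # ask the OS to write the changed pages to the file


def read_document(path, length):
    """Read `length` bytes back through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[:length])
```

Between two flushes the data lives only in memory pages that the operating system manages, which is why the sync frequency determines how much could be lost in a crash.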

AppendOnly/MVCC

Instead of overwriting an existing document, a completely new version of the document is generated. This has two benefits:

  1. Objects can be stored coherently and compactly in main memory.
  2. Older versions are preserved, so isolated read and write transactions can access them in parallel.

The system recognizes obsolete versions that are no longer needed and collects them as garbage. Garbage collection is asynchronous and runs in parallel with other processes.
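
A toy version of this scheme is sketched below in Python. It is an illustration under simplifying assumptions, not ArangoDB's implementation: AppendOnlyStore is a hypothetical name, revisions are a simple global counter, lookups scan the log linearly, and garbage collection is called synchronously here although the text describes it as asynchronous.

```python
class AppendOnlyStore:
    """Every update appends a new version; existing entries are never overwritten."""

    def __init__(self):
        self.log = []       # (key, revision, value) in write order
        self.revision = 0   # revision of the most recent write

    def put(self, key, value):
        self.revision += 1
        self.log.append((key, self.revision, value))
        return self.revision

    def get(self, key, snapshot=None):
        """Return the newest version of `key` that is visible at `snapshot`."""
        if snapshot is None:
            snapshot = self.revision
        for k, rev, value in reversed(self.log):
            if k == key and rev <= snapshot:
                return value
        return None

    def collect_garbage(self, oldest_snapshot):
        """Drop versions that no reader at or after `oldest_snapshot` can still see."""
        keep, covered = [], set()
        for k, rev, value in reversed(self.log):
            if rev > oldest_snapshot:
                keep.append((k, rev, value))   # newer than the oldest reader: keep
            elif k not in covered:
                covered.add(k)                 # the version the oldest reader sees
                keep.append((k, rev, value))
        keep.reverse()
        removed = len(self.log) - len(keep)
        self.log = keep
        return removed
```

A reader that remembers the revision at which it started keeps seeing a consistent snapshot while writers append newer versions; only once no reader needs an old version can it be collected.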

Multi-threaded/CPU-bound

ArangoDB is not meant as trivial storage that merely stores and retrieves simple objects. It is designed as a database that enables complex operations, from extensive searches and data aggregation to JavaScript code that is stored and executed in ArangoDB.

We are convinced that an operation should be completed where the data is: in the database. The limiting factor for this approach is the CPU, not memory capacity or network bandwidth.

ArangoDB tries to make the best possible use of the available hardware. With multi-core and multi-processor machines all around, ArangoDB is multi-threaded, of course.

No indices on file/startup on runtime

ArangoDB writes only raw data to disk. All supporting data, such as indices, is kept only in main memory. On the downside, indices have to be rebuilt after a system failure or restart. On the upside, this approach performs better for applications with frequent write accesses.
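
The trade-off can be sketched as follows in Python: a write touches only the raw data file, and the index is a purely in-memory structure that is rebuilt by scanning that file at startup. The file format (one JSON document per line) and the function names are assumptions chosen for the example, not ArangoDB's datafile format.

```python
import json
from collections import defaultdict


def append_document(path, doc):
    """A write appends raw data only; no index file has to be updated."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(doc) + "\n")


def load_documents(path):
    """Read all raw documents back, for example after a restart."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def build_index(documents, attribute):
    """Rebuild an in-memory index: attribute value -> positions of matching documents."""
    index = defaultdict(list)
    for pos, doc in enumerate(documents):
        if attribute in doc:
            index[doc[attribute]].append(pos)
    return index
```

The cost of build_index grows with the amount of stored data, which is the startup price paid for cheaper writes.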

No large cluster/zero administration/synchronous master-master replication

Our design aim is zero administration for consistent clusters of a few servers with synchronous master-master replication. Through synchronous replication, the same data is available on all servers with minimal administrative effort.
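
What "synchronous" means here can be shown with a deliberately simplified Python model: a write is acknowledged only after every server has applied it, so a read from any server returns the same data. Node and Cluster are hypothetical names, and the sketch ignores everything that makes real replication hard (network failures, concurrent conflicting writes, recovery).

```python
class Node:
    """One server holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledgement


class Cluster:
    """A small, homogeneous cluster in which any node may accept writes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def write(self, key, value):
        # Synchronous: the write returns only after every node has acknowledged it.
        acks = [node.apply(key, value) for node in self.nodes]
        if not all(acks):
            raise RuntimeError("write was not applied on all nodes")
        return len(acks)

    def read(self, node_index, key):
        return self.nodes[node_index].data.get(key)
```

Because each write waits for all servers, write latency grows with cluster size, which fits the stated focus on a few servers rather than large clusters.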

We expect that most projects will not become the next Amazon, and that a single node or a small cluster fits 99 percent of use cases.