ArangoDB Enterprise: Datacenter to Datacenter Replication
Every company needs a disaster recovery plan for all important systems. This is true from small units like single processes running in some container to the largest distributed architectures. For databases in particular this usually involves a mixture of fault-tolerance, redundancy, regular backups and emergency plans. The larger a data store, the more difficult it is to come up with a good strategy.
Therefore, it is desirable to be able to run a distributed database in one datacenter and replicate all transactions to another datacenter in some way. Often, transaction logs are shipped over the network to replicate everything in another, identical system in the other datacenter. Some distributed data stores have built-in support for multiple datacenter awareness and can replicate between datacenters in a fully automatic fashion.
ArangoDB 3.3 takes an evolutionary step forward by introducing multi-datacenter support, which is datacenter to datacenter replication. Our solution is asynchronous and scales to arbitrary cluster sizes, provided your network link between the datacenters has enough bandwidth. It is fault-tolerant without a single point of failure and includes a lot of metrics for monitoring in a production scenario.
What it does
This feature allows you to run two ArangoDB clusters in two different datacenters A and B, and set up asynchronous replication from A to B. This means that Cluster A in Datacenter A can be used as usual for read and write operations, and all changes to the data are replicated over the network to the other Cluster B in datacenter B. The replication is asynchronous, that is, changes appear on the other side after a short delay, usually within a few seconds. (Read more about the ArangoDB cluster architecture in our Whitepaper)
In the case of a disaster in Datacenter A like a total loss of network connectivity, one can quickly stop the replication and start using Cluster B in Datacenter B as a drop in replacement for Cluster A. Later, when the disaster is over, one can either use Cluster A as an asynchronous replica of Cluster B, or switch back to A and continue with the replication to Cluster A.
In this section we do not want to bore you with the technical details, we will publish a white paper in due course for that. Rather, we want to highlight the challenges of this approach and sketch the measures we took to overcome them.
A single ArangoDB cluster is a distributed system with good horizontal scalability. Both the data capacity and the query performance (read and write) scales linearly with the number of servers used. Automatic sharding leads to the fact that the actual changes to the data occur all over the place in all servers concurrently. In particular, this means that there is – by design – no single place that establishes a total order of all changes. That is, we are dealing with a distributed mess of simultaneously happening updates to huge amounts of data. The rate of change can vary considerably, and we will have to deal with large write bursts.
At the same time, an ArangoDB cluster is fault tolerant. If, for example, a single server fails in a datacenter, the ArangoDB cluster can easily tolerate this loss and – assuming the user has set the replication factor to at least 2 – there is neither a loss of any data, nor a loss in availability. The system simply switches over to using another server, redistributes the data and moves on without impacting query performance. Therefore, any proper replication solution will have to cater for these transparent fail-overs in Cluster A.
On the other hand, security concerns and firewall maintenance imply that we cannot easily have many different processes talking to lots of different processes in the other datacenter, but likewise, we cannot easily move all updates through a bottleneck of a single network connection between two processes in different datacenters.
Obviously, the whole replication system is a distributed system of distributed systems and thus must be scalable and fault-tolerant without a single point of failure.
All these challenges dictated and informed the design of our solution.
How it works
In Datacenter A the ArangoDB Cluster A runs as usual with no modifications to its code base and API, and serves its usual load. Likewise, in Datacenter B, the second ArangoDB Cluster B is deployed but is initially idle.
In both datacenters, we deploy a Kafka message broker, which is a standard high-performance and fault-tolerant queuing system that is able to buffer a lot of data in its message queues. Individual queues are called “topics” in Kafka. These topics may be consumed from the other data center. Kafka has certain guarantees so that in case of network issues, individual outages etc. no message will be lost and the remote datacenter will always have a consistent state.
Furthermore, in each datacenter, there are several instances of a program called “ArangoDB SyncMaster”. In each datacenter, the SyncMasters elect a leader, which talks to the SyncMaster in the other datacenter to organize the replication. “Organize” means here that it plans the individual tasks that have to be performed in both datacenters to get the replication going. Essentially, one has to replicate both the meta information like which databases, collections and users exist, and the actual data in the sharded collections.
In each datacenter, the leading SyncMaster directs a small army of SyncWorkers which perform the actual replication tasks. For example, for each shard of a collection, there is a “send-shard” task in Datacenter A as well as a “receive-shard” task in Datacenter B, and all these shards are assigned to some SyncWorker by the SyncMaster.
These tasks take care of an initial incremental synchronization phase (running the existing shard synchronization protocol we already have in ArangoDB), as well as the later update phase, in which all updates to the shard are replicated in the other datacenter (using WAL tailing in Datacenter A).
The flow of data is as follows: It starts at some DBserver of the ArangoDB cluster, goes to one of the SyncWorkers in Datacenter A, then into Kafka in Datacenter A. From there it will be consumed by the SyncWorkers of Datacenter B which write it into a coordinator in Datacenter B. Obviously, there are some control messages flowing in the opposite direction. These control messages will be picked up by Datacenter A from the Kafka servers in Datacenter B.
This all means for the administrator, that after initial deployment, one can set up the asynchronous replication with one command by just telling the SyncMaster in Datacenter B that it should start following the Cluster A in Datacenter A. Everything from then on is fully automatic, all databases, collections, users and permissions are automatically replicated to the other datacenter. Obviously, there are monitoring and configuration facilities, but essentially, this is it.
This is the first step towards multi-datacenter awareness, so naturally, it comes with limitations. First of all, the replication is asynchronous, so it is always lagging a bit behind the actual events in Datacenter A. Usually, with good connectivity and a write rate that is smaller than the capacity of the cross datacenter link, this delay is very small. Nevertheless, one should be aware that in the case of an abrupt stop of the replication and a manual switch-over to Cluster B, some recently written updates might be lost.
The whole setup is manually configured and works between two datacenters. It is not allowed to write to the replica cluster at this stage. Nevertheless, a replica cluster can at the same time be a source for another datacenter, and a source cluster can have multiple replicas. That is, you can form trees of datacenters.
Finally, switching off the replication and starting to use a replica is so far a manual operation which needs a decision and an action by an administrator.
How to set it up
Find deployment instructions in our documentation. Note that this feature is only contained in the ArangoDB Enterprise version and that the binary packages we publish contains all the necessary pieces.