Every company needs a disaster recovery plan for all important systems. This is true from small units like single processes running in some container to the largest distributed architectures. For databases in particular this usually involves a mixture of fault-tolerance, redundancy, regular backups and emergency plans. The larger a data store, the more difficult is it to come up with a good strategy.
Therefore, it is desirable to be able to run a distributed database in one datacenter and replicate all transactions to another datacenter in some way. Often, transaction logs are shipped over the network to replicate everything in another, identical system in the other datacenter. Some distributed data stores have built-in support for multiple datacenter awareness and can replicate between datacenters in a fully automatic fashion.
This post gives an overview over the first evolutionary step of ArangoDB towards multi-datacenter support, which is asynchronous datacenter to datacenter replication.
What it does
This feature allows you to run two ArangoDB clusters in two different datacenters A and B, and set up asynchronous replication from A to B. This means, that Cluster A in Datacenter A can be used as usual for read and write operations, and all changes to the data are replicated over the network to the other Cluster B in datacenter B. The replication is asynchronous, that is, changes appear on the other side after a short delay, usually within a few seconds. (Read more about ArangoDB cluster architecture in our Whitepaper)
In the case of a disaster in Datacenter A like a total loss of network connectivity, one can quickly stop the replication and start using Cluster B in Datacenter B as a drop in replacement for Cluster A. Later, when the disaster is over, one can either use Cluster A as an asynchronous replica of Cluster B, or switch back to A and continue with the replication to Cluster A.
In this section we do not want to bore you with the technical details, we will publish a white paper in due course for that. Rather, we want to highlight the challenges of this approach and sketch the measures we took to overcome them.
A single ArangoDB cluster is a distributed system with good horizontal scalability. Both the data capacity and the query performance (read and write) scales linearly with the number of servers used. Automatic sharding leads to the fact that the actual changes to the data occur all over the place in all servers concurrently. In particular, this means that there is – by design – no single place that establishes a total order of all changes. That is, we are dealing with a distributed mess of simultaneously happening updates to huge amounts of data. The rate of change can vary considerably, and we will have to deal with large write bursts.
At the same time, an ArangoDB cluster is fault tolerant. If, for example, a single server fails in a datacenter, the ArangoDB cluster can easily tolerate this loss and – assuming the user has set the replication factor to at least 2 – there is neither a loss of any data, nor a loss in availability. The system simply switches over to using another server, redistributes the data and moves on without impacting query performance. Therefore, any proper replication solution will have to cater for these transparent fail-overs in Cluster A.
On the other hand, security concerns and firewall maintenance imply that we cannot easily have many different processes talking to lots of different processes in the other datacenter, but likewise, we cannot easily move all updates through a bottleneck of a single network connection between two processes in different datacenters.
Obviously, the whole replication system is a distributed system of distributed systems and thus must be scalable and fault-tolerant without a single point of failure.
All these challenges dictated and informed the design of our solution.
How it works
In Datacenter A the ArangoDB Cluster A runs as usual with no modifications to its code base and API, and serves its usual load. Likewise, in Datacenter B, the second ArangoDB Cluster B is deployed but is initially idle.
In both datacenters, we deploy a Kafka message broker, which is a standard high-performance and fault-tolerant queuing system that is able to buffer a lot of data in its message queues. Individual queues are called “topics” in Kafka. With Kafka, one can setup a system called “MirrorMaker” to forward a configurable set of Kafka topics from one datacenter to the other 1 one. Everything that is written into one of these topics appears eventually in the corresponding topic with the same name in the Kafka in the other datacenter. This is our main vehicle to move messages and data between datacenters.
Furthermore, in each datacenter there are several instances of a program called “ArangoDB SyncMaster”. In each datacenter, the SyncMasters elect a leader, which talks to the SyncMaster in the other datacenter to organize the replication. “Organize” means here that it plans the individual tasks that have to be performed in both datacenters to get the replication going. Essentially, one has to replicate both the meta information like which databases, collections and users exist, and the actual data in the sharded collections.
In each datacenter, the leading SyncMaster directs a small army of SyncWorkers which perform the actual replication tasks. For example, for each shard of a collection there is a “send-shard” task in Datacenter A as well as a “receive-shard” task in Datacenter B, and all these shards are assigned to some SyncWorker by the SyncMaster.
These tasks take care of an initial incremental synchronization phase (running the existing shard synchronization protocol we already have in ArangoDB), as well as the later update phase, in which all updates to the shard are replicated in the other datacenter (using WAL tailing in Datacenter A).
The flow of data is as follows: It starts at some DBserver of the ArangoDB cluster, goes to one of the SyncWorkers in Datacenter A, then into Kafka in Datacenter A. From there MirrorMaker moves it to Datacenter B, where it is picked up by some SyncWorker and finally written into a coordinator in Datacenter B. Obviously, there are some control messages flowing in the opposite direction, but they also use the two Kafkas and the MirrorMaker transport.
This all means for the administrator, that after initial deployment, one can set up the asynchronous replication with one command by just telling the SyncMaster in Datacenter B that it should start following the Cluster A in Datacenter A. Everything from then on is fully automatic, all databases, collections, users and permissions are automatically replicated to the other datacenter. Obviously, there are monitoring and configuration facilities, but essentially, this is it.
This is the first step towards multi-datacenter awareness, so naturally, it comes with limitations. First of all, the replication is asynchronous, so it is always lagging a bit behind the actual events in Datacenter A. Usually, with good connectivity and a write rate that is smaller than the capacity of the cross datacenter link, this delay is very small. Nevertheless, one should be aware that in the case of an abrupt stop of the replication and a manual switch-over to Cluster B, some recently written updates might be lost.
The whole setup is manually configured and works as between two datacenters. It is not allowed to write to the replica cluster at this stage. Nevertheless, a replica cluster can at the same time be a source for another datacenter, and a source cluster can have multiple replicas. That is, you can form trees of datacenters.
Finally, switching off the replication and starting to use a replica is so far a manual operation which needs a decision and an action by an administrator.
Where we are
As of this writing, the datacenter to datacenter replication is feature complete and we publish the first milestone release that contains the code. We do not call it “ready”, though. The documentation is lacking, and our deployment strategy has some very rough edges. For example, we only have concrete installation instructions for a deployment using individual machines running RedHat or Centos 7 Linux, and installation of all the moving parts is still a bit of a puzzle game. We know that there are still bugs. You can try it, but please expect some adventures.
Our plan is to finish testing and debugging in October, such that the feature will be released in some generally available version some time in November.
How to set it up
So far, we have generic installation instructions HERE and concrete deployment instructions and scripts for RedHat Enterprise 7 and Centos 7 (see HERE, including a README.me with instructions). Note that this feature is only contained in the ArangoDB Enterprise version and that the RPM package we publish contains all the necessary pieces.