It is often difficult and time-consuming to set up a cluster environment for development or production purposes. For this reason, we decided to make the initial setup as easy as possible for you.
Today we’re introducing the first part of our new deployment tool for cloud computing platforms (Edit: now also available for Amazon Web Services and Google Compute Engine):
Part 1: Digital Ocean
We’ve released our first prototype, which deploys an ArangoDB cluster on Digital Ocean. Just download a single bash script, export your Digital Ocean API token, and watch the tool take care of the rest for you.
chmod 755 DigitalOcean_ArangoDB_Cluster.sh
Export your Digital Ocean API token:
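The script reads the token from the TOKEN environment variable (see the list of variables below); the value shown here is just a placeholder:

```shell
# Placeholder value -- substitute your own Digital Ocean API token.
export TOKEN="your-digital-ocean-api-token"
```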
Create an ArangoDB cluster on Digital Ocean:
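With the token exported, simply run the script:

```shell
# Deploys the cluster using the TOKEN from the environment.
./DigitalOcean_ArangoDB_Cluster.sh
```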
This will create a directory called ./digital_ocean that contains information about the cluster. You can just as easily get rid of the cluster and its virtual machines again.
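A sketch of the removal step — the -r switch shown here is an assumption; check the script's -h output for the actual flag:

```shell
# Hypothetical removal switch; verify with ./DigitalOcean_ArangoDB_Cluster.sh -h
./DigitalOcean_ArangoDB_Cluster.sh -r
```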
Feel free to try it out! See below for information on how to configure things.
Some background information for the curious
This script will use your Digital Ocean access token to deploy a number of VM instances running CoreOS. If you do not already have an SSH keypair, it will first create one for you and deploy it to Digital Ocean and your ssh-agent. Once the machines are running, the script uses Docker images to start up all components of an ArangoDB cluster and link them to each other. In the end, it will print out access information.
No installation of ArangoDB is needed, neither on the VM instances nor on your machine. All deployment issues are taken care of by Docker. You can simply sit back and enjoy.
The whole process will take a few minutes and will print out some status reports.
Some switches to configure a few things
Use the -h switch to get this help page.
The following environment variables are used:
- TOKEN : your Digital Ocean API token
- REGION : region in which the servers are created
- SIZE : size/machine type of each instance
- NUMBER : number of machines to create
- OUTPUT : local output log folder
- SSHID : ID of an existing SSH keypair; if no ID is set, a new keypair will be generated and transferred to the created instances
- PREFIX : prefix for your machine names
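Putting it together, a configured run might look like this — all values below are illustrative placeholders, not defaults:

```shell
# Illustrative configuration; substitute your own values.
export TOKEN="your-digital-ocean-api-token"
export NUMBER=3                # machines to create
export PREFIX="my-arangodb"    # machine-name prefix
./DigitalOcean_ArangoDB_Cluster.sh
```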
Discover more about ArangoDB:
Want to learn more about the possibilities of ArangoDB?
Take a look at our Documentation, Tutorials and Cookbook recipes.
I wonder how a small cluster of large droplets compares to a large cluster of small droplets, in terms of performance, reliability, and cost. There’s probably a sweet spot somewhere.
Can you provide the starting point for this? Where do you run the .sh script from, and where do you export the TOKEN? Is that on an existing droplet, which in turn sets up more droplets? Or is this at the setup phase of a new droplet?
I have not analysed cost so far.
I did some analysis for https://mesosphere.com/blog/2015/11/30/arangodb-benchmark-dcos/ but was essentially only interested in maximizing throughput per vCPU of single document operations. I found that for that the sweet spot was using instances with 8 vCPUs and fast local SSDs, all running a primary DBServer, a secondary DBServer (asynchronous replica) and a coordinator. This was for AWS, but I would expect similar results for DO.
The best way to cut costs is almost certainly by not using local SSDs, since they make the instances expensive with all providers. However, this will almost immediately cost throughput, simply because the combined I/O performance of the instances cannot keep up. Note that we try to avoid write amplification as much as possible, but there is always some overhead, in particular since we have to write every document once to the write ahead log of the primary, once to the actual data file, and then the same for the asynchronous replica. If the combined I/O performance is the bottleneck, then one can often buy instances with less CPU power but the same I/O bandwidth and cut costs in this way.
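As a rough back-of-envelope sketch of that write pattern (illustrative numbers only — real overhead varies):

```shell
# Each document is written twice per copy (write-ahead log + datafile),
# on both the primary and the asynchronous replica.
DOC_MB=100             # illustrative payload size in MB
WRITES_PER_COPY=2      # WAL + datafile
COPIES=2               # primary + asynchronous replica
echo "$(( DOC_MB * WRITES_PER_COPY * COPIES )) MB of raw I/O for ${DOC_MB} MB of documents"
```

So the combined I/O of the instances has to sustain roughly four times the logical write rate, which is why I/O bandwidth tends to become the bottleneck before CPU does.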
As to reliability, Version 3.0 will greatly increase reliability because of the synchronous replication we are currently putting in.
For a real world application, I would first specify the replication and reliability needs, then specify the needed throughput with some reserves, which essentially will tell you whether or not you need SSDs. Then I would simply compare clusters with different sizes and numbers of droplets with the intended load. This should give you the sweet spot w.r.t. costs for your particular needs.