An Introduction to Geo Indexes and their performance characteristics: Part II

Categories: General

Geo Index Implementation

This section covers the MMFiles-based geo index. The algorithm is optimized for in-memory access and good CPU cache utilization. The main goal for our geo queries is to reject as many distant candidate points as quickly as possible.

One limitation of an approach based purely on geostrings shows up when querying for points near a target (see Part I of this blog post). Points that are close together on the surface sometimes end up with entirely different geostring prefixes and cannot be scanned without seeks. We therefore implemented a type of metric tree to optimize nearest-neighbor queries.

To achieve consistently fast queries, the Hilbert geostrings are combined with a binary search tree; the current implementation uses an AVL tree.

An Introduction to Geo Indexes and their performance characteristics: Part I

Categories: Architecture, General

Starting with the mass-market availability of smartphones and continuing with IoT devices and self-driving cars, ever more data is generated with geo information attached to it. Analyzing this data in real time requires clever indexing data structures. Geo data in ArangoDB consists of 2 or more dimensions representing (x, y) coordinates on the earth's surface. Searching on a single number is essentially a solved problem, but searching effectively on multi-dimensional data is more difficult because standard indexing techniques cannot be used.

A variety of indexing techniques exist. In Part I of this blog post, I will introduce some of the background knowledge required to understand the ArangoDB geo index data structure. I will start by introducing quadtrees and then extend this concept to geohashes and space-filling curves like the Hilbert curve. Next week, I will publish Part II, including details about the ArangoDB geo index implementation and performance benchmarking.
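To make the idea of a space-filling curve more concrete, here is a minimal sketch of the textbook conversion from a grid cell to its position along a Hilbert curve. It is not the ArangoDB implementation; the function name `hilbert_index` and the `order` parameter are assumptions chosen for illustration.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map grid cell (x, y) to its position along a Hilbert curve.

    The grid has 2**order cells per side, so valid coordinates are
    0 <= x, y < 2**order. (Hypothetical helper, for illustration only.)
    """
    side = 1 << order
    distance = 0
    step = side >> 1
    while step > 0:
        # Which quadrant of the current square does the cell fall into?
        rx = 1 if (x & step) else 0
        ry = 1 if (y & step) else 0
        distance += step * step * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = side - 1 - x
                y = side - 1 - y
            x, y = y, x
        step >>= 1
    return distance


if __name__ == "__main__":
    # Print the curve position of every cell in a 4x4 grid.
    for row in range(3, -1, -1):
        print(" ".join(f"{hilbert_index(2, col, row):2d}" for col in range(4)))
```

Cells that follow each other on the curve are always direct neighbors in the grid, which is why a one-dimensional sort on this value tends to keep nearby points close together. The reverse does not always hold: two neighboring cells can still be far apart on the curve.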

Performance analysis with pyArango: Part III Measuring possible capacity with usage Scenarios

Categories: General, how to, Performance

So you measured and tuned your system as described in Part I and Part II of this blog post series. Now you want to get some figures on how many end users your system will be able to serve. To do that, you define “scenarios” that are typical of what your users do.
One such user scenario could be, for example:

  • log in
  • do something
  • log out

Since your users won’t nicely queue up and wait for other users to finish their business, the pace at which you need to test your system is “starting n scenarios every second”. Many scenarios simulating different users may be running in parallel. If your scenario requires 10 seconds to finish and you start 1 per second, your system needs to be capable of processing 10 users in parallel. If it can’t handle that, you will see more than 10 sessions running in parallel, and the time required to handle such a scenario will lengthen. You will see the server resource usage go up and up until it finally goes up in flames.
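The arithmetic above can be sketched in a few lines of Python. This is only a toy model and not part of the pyArango test code; the function names are assumptions made for illustration. It counts how many scenarios are still running at a given second when a fixed number of them is started every second.

```python
def expected_parallel_sessions(starts_per_second: int, duration_seconds: int) -> int:
    """Steady-state number of scenarios running at once (start rate times duration)."""
    return starts_per_second * duration_seconds


def active_sessions_at(at_second: int, starts_per_second: int, duration_seconds: int) -> int:
    """Count scenarios still running at `at_second`.

    Assumes scenarios are started at every whole second from 0 on and
    that each one takes exactly `duration_seconds` to finish.
    """
    active = 0
    for start_second in range(at_second + 1):
        if start_second + duration_seconds > at_second:
            active += starts_per_second
    return active


if __name__ == "__main__":
    # Healthy system: 1 start per second, 10 seconds per scenario.
    print(expected_parallel_sessions(1, 10))
    print(active_sessions_at(30, 1, 10))
    # Overloaded system: scenarios slow down to 15 seconds each.
    print(active_sessions_at(30, 1, 15))
```

In this model, a scenario that slows down from 10 to 15 seconds at the same start rate raises the number of parallel sessions from 10 to 15, which is the kind of growth to watch for in the graphs.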

Setting up Datacenter to Datacenter Replication in ArangoDB

Categories: Architecture, cluster, General, how to, Releases, Replication

Please note that this tutorial is valid for the ArangoDB 3.3 milestone 1 version of DC to DC replication!

This milestone release contains data-center to data-center replication as an enterprise feature. This is a preview of the upcoming 3.3 release and is not considered production ready.

In order to prepare for a major disaster, you can set up a backup data center that will take over operations if the primary data center goes down. For a server failure, the resilience features of ArangoDB can be used. Data-center to data-center replication is used to handle the failure of a complete data center.

Data is transported between data centers using a message queue. The current implementation uses Apache Kafka as the message queue. Apache Kafka is a commonly used open source message queue which is capable of handling multiple data centers. However, the ArangoDB replication is not tied to Apache Kafka. We plan to support different message queue systems in the future.

The following contains a high-level description of how to set up data-center to data-center replication. Detailed instructions for specific operating systems will follow shortly.

Performance analysis with pyArango: Part II
Inspecting transactions

Categories: General

Following the previous blog post on performance analysis with pyArango, where we had a look at graphing using statsd for simple queries, we will now dig deeper into inspecting transactions. First, we split the initialization code and the test code.

Initialisation code

We load the collection with simple documents. We create an index on one of the two attributes:

Performance analysis using pyArango Part I

Categories: General

Usually, your application will consist of a set of queries on ArangoDB for one scenario (e.g. displaying your user’s account information). When you want to make your application scale, you fire requests at it and see how it behaves. Depending on internal processes, the execution times of these scenarios vary a bit.

We will take intervals of 10 seconds and graph the values we get for each of them:

  • average – the sum of all times measured during the interval, divided by the count.
  • minimum – the fastest request
  • maximum – the slowest request
  • the time within which “most”, i.e. 95%, of your users may expect an answer – this is called the 95th percentile
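The four values from the list can be computed for one interval as in the following minimal sketch. It is not the statsd implementation; the function name `summarize_interval` is hypothetical, and the percentile uses the simple nearest-rank definition, which is an assumption (other tools may interpolate and report slightly different numbers).

```python
def summarize_interval(durations):
    """Summarize the request times measured during one interval.

    `durations` is a non-empty list of request times (for example in
    milliseconds). The 95th percentile uses the nearest-rank method.
    """
    if not durations:
        raise ValueError("need at least one measurement")
    ordered = sorted(durations)
    count = len(ordered)
    # Smallest rank such that at least 95% of the values are <= that value
    # (integer ceiling of 0.95 * count, avoiding floating point rounding).
    rank = (95 * count + 99) // 100
    return {
        "average": sum(ordered) / count,
        "minimum": ordered[0],
        "maximum": ordered[-1],
        "p95": ordered[rank - 1],
    }


if __name__ == "__main__":
    # 100 made-up request times: 1 ms, 2 ms, ..., 100 ms.
    print(summarize_interval(list(range(1, 101))))
```

The 95th percentile is often more useful than the maximum, because a single slow outlier moves the maximum but barely affects what most users experience.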


Reaching and harnessing consensus with ArangoDB

Categories: Architecture, cluster, General
nihil novi nisi commune consensu
nothing new unless by the common consensus

– law of the Polish-Lithuanian Commonwealth, 1505

A warning up front: this is a rather long post, but hang in there; it might save you a lot of time one day.


Consensus has its etymological roots in the Latin verb consentire, which, unsurprisingly, means to consent or to agree. Just as the verb is old, the concept is old by the standards of the brief history of computer science. It designates a crucial necessity of distributed appliances. More fundamentally, consensus aims to provide a fault-tolerant distributed animal brain to higher-level appliances such as deployed cluster file systems, currency exchange systems, or, in our case, distributed databases.

Updated Sync & Async Java Drivers with ArangoDB 3.1

Categories: Drivers, Java

The upcoming 3.1 release comes with a binary protocol – VelocyStream – to transport VelocyPack (the internal storage format of ArangoDB, introduced with the 3.0 release) data between ArangoDB and client applications. VelocyPack stores a superset of JSON, is more compact, and offers fast attribute lookup. VelocyStream, in turn, allows VelocyPack to be sent over the network in an optimized form. We think this is the right time to modernize our official Java driver and make it the first to fully support VelocyStream.
