After more than 30 years of playing around with 8-bit computers, assembler and scripting languages, Jan decided to move on to database engineering. Jan is now a senior C/C++ developer on the ArangoDB core team, where he has been since version 0.1. He mostly works on performance optimization, storage engines and the querying functionality.
Running complex data queries in a distributed system
As the amount of data keeps growing, it is getting increasingly hard to store it and retrieve it efficiently. While the first distributed databases put the entire burden of sharding on the application code, there are now smarter solutions that handle most of the data distribution and resilience tasks inside the database.
This raises some interesting questions, for example:
- How are queries other than primary-key lookups actually organized and executed in a distributed system so that they run efficiently?
- How do contemporary distributed databases actually achieve transactional semantics for non-trivial operations that affect multiple shards or servers?
This talk will give an overview of these challenges and the solutions that some open-source distributed databases have chosen to address them.
The challenges of running distributed database queries
Writing a database engine for running queries on a single machine is challenging, but doable. Building a distributed database engine is much harder still. It is surprisingly hard to make distributed queries perform efficiently and to make them behave according to transactional semantics. There are also various trade-offs between performance and consistency.
In this talk we will give an overview of some of the approaches that different database products, e.g. Google Spanner, CockroachDB, ArangoDB and MongoDB, have chosen to tackle this problem.
This talk targets developers who are interested in database technology in general and in running distributed databases in particular.