ArangoDB for Machine Learning
ArangoDB, with its native multi-model capabilities, is a great match for your machine learning workloads. ArangoML Pipeline is now available as a cloud service – ArangoML Pipeline Cloud.
For production-grade machine learning infrastructure, ArangoML provides common metadata storage across the entire machine learning lifecycle, and thereby enables reproducibility, monitoring, and auditing of your machine learning models.
ArangoDB offers support for both analytics tasks and multi-model-powered machine learning. It is particularly helpful when dealing with a mixture of structured and unstructured data as ArangoDB can natively and efficiently manage different data models.
ArangoML for Machine Learning Infrastructure
Training data is a well-known prerequisite for training machine learning models. But when building a production-grade machine learning platform, another type of data deserves equal attention: metadata. Production machine learning platforms consist of a number of different steps and components.
Most of those components produce some kind of metadata, for example references to datasets, training runs with their associated train and test accuracies, model serving statistics, provenance information linking trained models to the datasets used for training, and more. Data scientists and DataOps need common metadata storage to answer questions such as: which model was trained with this dataset, or which features result in the best test accuracy?
Here ArangoML offers a simple interface for access across your favorite machine learning frameworks and tools.
As ArangoML is backed by the multi-model capabilities of ArangoDB, it can store unstructured data such as the training statistics of a particular training run (document) as well as the connections (graph) to the associated dataset and the resulting model. Questions like the ones above then become graph traversals.
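To make the "question as graph traversal" idea concrete, here is a minimal in-memory sketch in plain Python. It does not use ArangoML or any ArangoDB driver. The document IDs, attribute names, and the helper functions `outbound_traversal` and `models_trained_on` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical metadata documents, keyed the way ArangoDB document IDs
# are written ("collection/key"). All values are made up.
documents = {
    "datasets/example_v1": {"type": "dataset"},
    "runs/run_1": {"type": "run", "test_accuracy": 0.87},
    "runs/run_2": {"type": "run", "test_accuracy": 0.85},
    "models/model_a": {"type": "model"},
    "models/model_b": {"type": "model"},
}

# Hypothetical edges (source, target): a dataset feeds a training run,
# and a training run produces a model.
edges = [
    ("datasets/example_v1", "runs/run_1"),
    ("datasets/example_v1", "runs/run_2"),
    ("runs/run_1", "models/model_a"),
    ("runs/run_2", "models/model_b"),
]


def outbound_traversal(start, max_depth):
    """Breadth-first traversal along edge direction, up to max_depth hops."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    visited = {start}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                reached.append(neighbor)
                queue.append((neighbor, depth + 1))
    return reached


def models_trained_on(dataset_id):
    """Answer 'which model was trained with this dataset?' by traversal."""
    reachable = outbound_traversal(dataset_id, max_depth=2)
    return sorted(v for v in reachable if documents[v]["type"] == "model")


print(models_trained_on("datasets/example_v1"))
```

Against ArangoDB itself, the same question would typically be phrased as an AQL traversal along the lines of `FOR v IN 1..2 OUTBOUND 'datasets/example_v1' GRAPH 'ml_metadata' FILTER IS_SAME_COLLECTION('models', v) RETURN v`, where the graph and collection names are again assumptions.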
You can also find the associated code here.
ArangoML Pipeline is a powerful yet simple tool that facilitates teamwork between DataOps and Data Science, and it also provides detailed audit trails for auditors and advanced analytics of the whole machine learning environment.
Multi-Model-Powered Machine Learning
ArangoDB offers native support for different data models, including graphs, documents, and key-value, and allows queries across all of them using a single query language.
These multi-model capabilities are especially useful in a machine learning platform for feature engineering as they enable users to combine different data aspects into features which can, in turn, be used by machine learning frameworks such as TensorFlow or PyTorch to train models.
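As a rough illustration of combining data aspects into features, the following plain-Python sketch merges a document attribute, two graph-derived counts, and a key-value lookup into one feature vector. The data, the names `customers`, `purchases`, `country_risk`, and the function `build_features` are all hypothetical. In a real setup these values would come from queries against the database rather than from in-memory dictionaries.

```python
# Hypothetical document data: one record per customer.
customers = {
    "customers/alice": {"age": 34, "country": "DE"},
    "customers/bob": {"age": 51, "country": "US"},
}

# Hypothetical graph data: (customer, product) purchase edges.
purchases = [
    ("customers/alice", "products/p1"),
    ("customers/alice", "products/p2"),
    ("customers/alice", "products/p1"),
    ("customers/bob", "products/p2"),
]

# Hypothetical key-value data: a per-country score.
country_risk = {"DE": 0.1, "US": 0.2}


def build_features(customer_id):
    """Combine document, graph, and key-value aspects into one vector."""
    doc = customers[customer_id]
    bought = [product for source, product in purchases if source == customer_id]
    return [
        float(doc["age"]),             # document attribute
        float(len(bought)),            # graph: number of purchase edges
        float(len(set(bought))),       # graph: number of distinct products
        country_risk[doc["country"]],  # key-value lookup
    ]


# Rows in a stable order, ready to be converted into a tensor or array.
feature_matrix = [build_features(cid) for cid in sorted(customers)]
print(feature_matrix)
```

A nested list like `feature_matrix` can then be handed to a framework such as TensorFlow or PyTorch through its usual tensor constructors.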
Data quantity is crucial, especially for modern deep learning. As a distributed database, ArangoDB can also process datasets (e.g., graphs) that are too large for a single node.
Furthermore, ArangoDB has native support for a large number of graph algorithms, including PageRank, Vertex Centrality, Vertex Closeness, Connected Components, and Community Detection.
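For readers unfamiliar with these algorithms, here is a minimal single-machine sketch of PageRank using plain power iteration. It is not ArangoDB's implementation and says nothing about how ArangoDB distributes the computation. The function name `pagerank`, its parameters, and the example graph are assumptions made for illustration.

```python
def pagerank(edge_list, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edge_list for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edge_list:
        out_links[source].append(target)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly,
        # so the ranks keep summing to 1.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = out_links[node]
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: a three-node cycle plus one extra node pointing into it.
example_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(example_edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

The resulting scores can be stored back on the vertices and used as one more input feature for a model.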
Together, these capabilities make ArangoDB and its native multi-model approach a useful tool for many machine learning use cases.
To get started, we created an interactive demo about building Knowledge Graphs with ArangoDB & TensorFlow.