Multi-Model Machine Learning

This article looks at how a team collaborating on a real-world machine learning project benefits from using a multi-model database for capturing ML meta-data.

The specific points discussed in this article are how:

  • The graph data model is superior to relational for ML meta-data storage.
  • Storing ML experiment objects is natural with multi-model.
  • ArangoML promotes collaboration due to the flexibility of multi-model.
  • ArangoML provides ops logging and performance analysis.

Be sure to check out the other posts in this series:
ArangoML Part 1: Where Graphs and Machine Learning Meet
ArangoML Part 2: Basic Arangopipe Workflow
ArangoML Part 3: Bootstrapping and Bias Variance
ArangoML Part 4: Detecting Covariate Shift in Datasets
ArangoML Series: Intro to NetworkX Adapter

Why is a graph data model useful for meta-data from machine learning?

The Relational Approach

To understand why a graph data model is beneficial for capturing meta-data from machine learning projects, let’s first examine what it is like to use a relational database for this task. Meta-data about machine learning projects changes from project to project within a company, or even within a single group. As we will demonstrate shortly, it can even vary from task to task within a project! Capturing, changing, and updating data models is at the core of machine learning experiments, and with a relational database this frequent updating places a heavy burden on the data scientists and the operations team. Communicating observations and intermediate results effectively is also an important step in making machine learning projects succeed. This communication is cumbersome without tools, or with an ineffective one. ArangoML fulfills this need by offering flexible meta-data storage along with convenient collaboration: all of the details of any experiment can be shared.

Using a relational solution to capture meta-data from machine learning projects would require you to determine a common data model for all projects and define it before capturing any project meta-data. Anytime you discover new meta-data elements, you must first add them to the data model before they can be captured.

These extra steps:

  • Are counterproductive to data science work, which by nature results in discoveries.
  • Require extra time to capture new changes and discoveries.
  • Serve as a distraction from the actual data science tasks.
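
The friction described above can be sketched in a few lines of Python. This is a minimal illustration, not ArangoML code: SQLite stands in for any relational database, plain JSON documents stand in for a document store, and the `runs` table, its columns, and all values are hypothetical placeholders rather than real results.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Relational approach: columns must be declared before any run is stored.
conn.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, model TEXT, rmse REAL)")
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?)",
    ("run-1", "linear_regression", 0.72),  # illustrative value only
)

# A later experiment introduces a new attribute. The insert is rejected
# because the column does not exist yet.
new_run = ("run-2", "random_forest", 0.51, 200)  # illustrative values only
insert_sql = "INSERT INTO runs (run_id, model, rmse, n_estimators) VALUES (?, ?, ?, ?)"
try:
    conn.execute(insert_sql, new_run)
    insert_failed = False
except sqlite3.OperationalError:
    insert_failed = True

# The schema has to be migrated before the new meta-data can be captured.
conn.execute("ALTER TABLE runs ADD COLUMN n_estimators INTEGER")
conn.execute(insert_sql, new_run)

# Document approach: each run describes itself, so a new attribute
# needs no prior schema change.
documents = [
    {"run_id": "run-1", "model": "linear_regression", "rmse": 0.72},
    {"run_id": "run-2", "model": "random_forest", "rmse": 0.51, "n_estimators": 200},
]
stored = [json.dumps(doc) for doc in documents]

print("first insert rejected:", insert_failed)
print("documents stored without migration:", len(stored))
```

In the relational half, the second run cannot be recorded until someone changes the table definition. In the document half, the new `n_estimators` attribute simply travels with the run that has it.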

The Graph Data Model

Data scientists and engineers already use graphs to reason about and construct data transformations because graphs are a natural way to solve and visualize these problems. Consequently, a graph data model makes these ideas easy to implement, and machine learning meta-data is no exception.

You benefit from the graph data model when:

  • Relationships between data elements are dynamic.
  • You need flexible queries; AQL, ArangoDB’s query language, is built around a flexible data model and has graph traversal capabilities.
  • You have graph relevant questions about data, such as:
    • What’s the quickest way to a node?
    • Can I reach a part of my data from this starting point?
    • What are the communities in my data?
  • The queries you need to perform are not known a priori.
  • You need to scale in production.

The queries we need to perform for this use case are not known a priori, as is the case for many data analytics projects. In a relational database, indexes can speed up queries only if the queries are known beforehand. When they are not, you must fall back on slow ad-hoc queries that require multiple joins. In contrast, ad-hoc queries on a graph benefit from the interconnected nature of the graph data model, and you can use built-in graph algorithms to make quick work of large datasets. ArangoDB takes this benefit further with its primary and edge indexes. These indexes are created automatically and make graph operations such as finding neighbors and shortest paths very fast. Even when you cannot create additional indexes because the query is not known in advance, a graph traversal still performs well. You get the most benefit from structuring your data as a graph when you need to find relationships or patterns within the data, such as with Kode Solution’s tweet analytics tool.
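
To make the role of an edge index concrete, here is a small Python sketch. It is an in-memory analogy under stated assumptions, not ArangoDB’s implementation: the vertex names and edges are hypothetical, and a dictionary keyed by the source vertex plays the part of the index. With that lookup in place, neighbor queries and an unweighted shortest path need no join and no query-specific index.

```python
from collections import defaultdict, deque

# Hypothetical meta-data edges, stored as (_from, _to) pairs.
EDGES = [
    ("project/housing", "dataset/cal_housing"),
    ("dataset/cal_housing", "featureset/housing_features"),
    ("featureset/housing_features", "run/linear_regression"),
    ("run/linear_regression", "model/linear"),
    ("project/housing", "model/linear"),
]


def build_edge_index(edges):
    """Map each vertex to its outbound neighbors, like an index on _from."""
    index = defaultdict(list)
    for source, target in edges:
        index[source].append(target)
    return index


def shortest_path(index, start, goal):
    """Breadth-first search over outbound edges.

    Returns the list of vertices from start to goal, or None when the
    goal cannot be reached.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in index.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


index = build_edge_index(EDGES)
print(index["project/housing"])
print(shortest_path(index, "dataset/cal_housing", "model/linear"))
print(shortest_path(index, "model/linear", "project/housing"))
```

The last call answers the reachability question from the list above: following outbound edges only, the project cannot be reached from the model, so the function returns `None`.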

How does multi-model help with storing meta-data from machine learning? 

Model development is intrinsically an experiment-driven, iterative process. The nature of information from a machine learning project activity is widely varied. The information a data science team member wants to communicate to a colleague about a hyperparameter tuning experiment is very different from the information they convey about an exploratory data analysis experiment. Therefore, the solution to capture meta-data must be able to capture information from both of these scenarios in a seamless manner.

ArangoDB provides a graph data model as part of its multi-model capabilities, making it an ideal choice to capture machine learning meta-data.
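
As a rough sketch of what this looks like in practice, the Python below keeps two differently shaped experiment documents side by side and connects them to a project with edges. The `_id`, `_from`, and `_to` attribute names follow ArangoDB’s document and edge conventions. Everything else is a hypothetical stand-in: the dictionaries replace real collections, and the helper functions and document contents are invented for illustration.

```python
documents = {}  # _id -> document; stands in for document collections
edges = []      # stands in for an edge collection


def insert_document(doc):
    """Store a document under its _id, whatever attributes it carries."""
    documents[doc["_id"]] = doc


def link(source_id, target_id):
    """Record a directed edge between two stored documents."""
    edges.append({"_from": source_id, "_to": target_id})


def linked_documents(source_id):
    """Return every document reachable over one outbound edge."""
    return [documents[e["_to"]] for e in edges if e["_from"] == source_id]


insert_document({"_id": "project/housing", "name": "Housing_Price_Estimation_Project"})

# Exploratory data analysis: free-form observations about a dataset.
insert_document({
    "_id": "experiment/eda",
    "kind": "exploratory_data_analysis",
    "dataset": "cal_housing_dataset",
    "observations": ["placeholder note about a feature distribution"],
})

# Hyperparameter tuning: a search space and per-trial results.
insert_document({
    "_id": "experiment/tuning",
    "kind": "hyperparameter_tuning",
    "search_space": {"n_estimators": [50, 100, 200]},
    "trials": [],
})

link("project/housing", "experiment/eda")
link("project/housing", "experiment/tuning")

for doc in linked_documents("project/housing"):
    print(doc["kind"], sorted(doc))
```

Neither experiment had to conform to the other’s shape, yet both are found by starting at the project and following its edges.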

Team Collaboration Scenario

Let’s consider the following hypothetical example:

ACME real estate has house price data for a particular region. In order for their agents to understand how house prices depend on the characteristics of a house, the company is interested in developing a regression model. 

We will look at how the data science team approaches this task. The steps the team might take include: 

  • First, explore the data to understand its characteristics and form some ideas about modeling choices. 
  • Then, run experiments to understand the effectiveness of these choices. 
  • Based on the results, pick a choice that works best, and deploy that model. 
  • Finally, once the model is deployed, evaluate its validity since market conditions may be changing.

Communication of results and findings is a critical aspect of how progress is made in a machine learning project. To underscore this, we will show examples of important conversations between the members of the data science team involved in our example project.

Data Exploration and Analysis

A benefit of using ArangoML for machine learning experiments is how easy it is for fellow data professionals to look up and obtain all of the relevant details for the experiment.

Throughout this email chain, the experiment is referred to with simple descriptors such as a project label. This convenience saves a ton of time for everyone currently involved and makes it easy for future members to find projects and get up to speed.

“Hi Rajiv,

I have done the exploratory data analysis. The project is labeled as “Housing_Price_Estimation_Project” and, if you want to pull the dataset directly from the project, I labeled it as “cal_housing_dataset”.

Based on the data distributions, let’s try a simple model and a sophisticated model. Can you check how linear regression and random forest perform on this data?

Thanks,
Chris”

Here is the notebook that the first researcher (Chris) used to perform his analysis and generate a visualization.

https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ML_Collab_Article/example_output/Arangopipe_Generate_TF_Visualization_output.ipynb

This notebook shows the second researcher (Rajiv) pulling up that dataset and its stats, and visualizing it, simply by using the “cal_housing_dataset” label.

https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ML_Collab_Article/example_output/Arangopipe_View_TF_Visualization_output.ipynb

This shows how easy it is for one data scientist to share their findings with another, thanks to ArangoML’s arangopipe pipeline. 

Note that the first researcher stored a third-party TensorFlow object without any additional steps. 

Hyper-opt Experimentation

Let’s see how this functionality carries through the rest of the workflow.

Hi Joerg,

I have determined the best regression model to use for the California housing project. I am continuing the project Chris started; it is tagged with “Housing_Price_Estimation_Project.” I saved the best result to the project, along with the usual parameters and performance stats.

Best Regards,
Rajiv

Here is the notebook Rajiv used for his experiments to determine candidate modeling choices and the best choice of hyper-parameters.

https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ML_Collab_Article/example_output/ML_Collaboration_Hyperopt_Integration_output.ipynb
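
For readers who want the general shape of such an experiment without opening the notebook, here is a minimal Python sketch. It is not the notebook’s code and does not use the Hyperopt library: it swaps in a plain grid search over a single penalty value for a one-feature ridge regression. The data is synthetic and invented for illustration, and the structure of the trial records is an assumption. The point is only that every trial produces a small record of parameters and performance that can be stored as meta-data.

```python
import math


def fit_ridge_1d(xs, ys, penalty):
    """Closed-form slope for y ~ w * x with an L2 penalty on w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def rmse(xs, ys, slope):
    """Root mean squared error of the fitted line on the given points."""
    return math.sqrt(sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs))


def grid_search(train, valid, penalties):
    """Fit one model per penalty and record each trial as a plain dict."""
    trials = []
    for penalty in penalties:
        slope = fit_ridge_1d(train[0], train[1], penalty)
        trials.append({
            "params": {"penalty": penalty},
            "slope": slope,
            "validation_rmse": rmse(valid[0], valid[1], slope),
        })
    best = min(trials, key=lambda trial: trial["validation_rmse"])
    return trials, best


# Synthetic toy data, invented for illustration only.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
valid = ([5.0, 6.0], [10.1, 11.9])

trials, best = grid_search(train, valid, [0.0, 0.1, 1.0, 10.0])
for trial in trials:
    print(trial)
print("best:", best["params"])
```

Each dictionary in `trials` is the kind of record a teammate would want to look up later: which parameters were tried and how each choice performed.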

Model Building and Deployment

Thanks Rajiv & Chris,

I reviewed the “Housing_Price_Estimation_Project.” The linear model will be the best, and it is ready to ship!

@chris Let’s check back at the end of the quarter when the next batch of data comes through to ensure that the performance is still where we want it.

VG
Joerg

This notebook shows the work involved in building the model and storing all relevant meta-data for the project. This meta-data will be invaluable when the time comes to evaluate the performance of the model.

https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ML_Collab_Article/example_output/ML_Collaboration_Model_Building_output.ipynb

Deployment and Log Serving Performance

This notebook contains experiments that run after the model has been deployed. They evaluate whether the new data is different from the data used to develop the model. This is a necessary step to make sure our model is still accurate and providing the expected results.

https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ML_Collab_Article/example_output/ML_Collaboration_Dataset_Shift_output.ipynb
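
One simple way to picture a check of this kind is to compare the distribution of a single feature in the development data against the same feature in the new batch. The Python sketch below uses a two-sample Kolmogorov-Smirnov statistic for that comparison. This is a swapped-in technique chosen for brevity, and the notebook’s own method may differ. The function names, the threshold, and the sample values are all hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, new):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref_sorted = sorted(reference)
    new_sorted = sorted(new)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for value in ref_sorted + new_sorted:
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        new_cdf = bisect_right(new_sorted, value) / len(new_sorted)
        gap = max(gap, abs(ref_cdf - new_cdf))
    return gap


def has_shifted(reference, new, threshold=0.5):
    """Flag a shift when the gap exceeds a hypothetical, untuned threshold."""
    return ks_statistic(reference, new) > threshold


# Invented feature values, for illustration only.
development_batch = [1.0, 2.0, 3.0, 4.0]
similar_batch = [1.5, 2.5, 3.5, 4.5]
different_batch = [10.0, 11.0, 12.0, 13.0]

print(has_shifted(development_batch, similar_batch))
print(has_shifted(development_batch, different_batch))
```

Storing the resulting statistic next to the model’s other meta-data lets the team compare the new batch against the development data without rerunning the earlier experiments.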

Hi Rajiv & Joerg,

Following up on the “Housing_Price_Estimation_Project”: everything is looking great, with no discernible shifts in the data! I stored the results in the project if you would like to see them for yourself.

Have a good weekend!
Chris

Conclusion

In this article we:

  • Discussed how multi-model makes storing and retrieving data seamless for data scientists.
  • Followed along as Rajiv, Chris, and Joerg shared data for different stages of a machine learning experiment.
  • Showed how the collected meta-data can improve your projects by letting you compare performance stats and perform maintenance tasks, such as checking for dataset shift.

If you would like to learn more about ArangoML and its suite of tools, visit the ArangoML repository and be sure to check out the other posts in the ArangoML series.