
Open In Colab

ArangoML Part 2: Basic Arangopipe Workflow

This post is the second in a series of posts about machine learning and showcasing the benefits ArangoML adds to your machine learning pipelines. In this post we:

  • Introduce machine learning concepts
  • Demonstrate basic model building
  • Log a model building activity with arangopipe

These posts will hopefully appeal to two audiences:

  • The first half of each post is for beginners in machine learning
  • The second half for those already using machine learning

We decided to do it this way to provide a jumping-off point for those interested in machine learning, while still showing useful examples for those who already have a machine learning pipeline.

Intro

The primary objective of these posts is to showcase the benefits of using ArangoML. However, a secondary objective is to introduce machine learning concepts to those just starting their machine learning journey. They aren’t meant to be an exhaustive explanation of every concept but a short intro to the most relevant concepts for each post.

This post focuses on how you can use arangopipe throughout a machine learning project to capture meta-data. Since this example is meant to show how to use arangopipe throughout your entire project, attempting to cover all the concepts mentioned is well outside the scope of this post. Instead, it focuses on model building.

What is model building?

A key goal of machine learning is to build applications that can analyze data and make predictions about data supplied to them in the future, such as the scenario we will use for a few of the posts in this series: predicting house prices. There are machine learning use cases in nearly every aspect of our daily lives:

  • Voice Assistants (Siri, Alexa, etc..)
  • Recommendation Engines
  • Image Recognition
  • Fraud Detection
  • Email Filtering
  • So many more!

There are quite a few steps involved in creating an application that can learn from data and then make accurate predictions about other data. One of the first tasks in developing machine learning models is to understand the characteristics of the data as they pertain to the model development task at hand. Another step involves creating or choosing algorithms and then testing how they behave against your data. For our application to be able to determine what a house will cost, it needs to be trained with example data. This data is usually labeled to provide clues that wouldn't otherwise exist in the data, such as the prices of the houses that the application will eventually learn to predict.

For example, here is a list of the variables included with our data.

In [1]:
import pandas as pd
data_url = "https://raw.githubusercontent.com/arangoml/arangopipe/arangopipe_examples/examples/data/cal_housing.csv"
df = pd.read_csv(data_url, error_bad_lines=False)

df.head() #prints the first 5 rows of data with headers
Out[1]:
lat long housingMedAge totalRooms totalBedrooms population households medianIncome medianHouseValue
0 -122.22 37.86 21 7099 1106 2401 1138 8.3014 358500.0
1 -122.24 37.85 52 1467 190 496 177 7.2574 352100.0
2 -122.25 37.85 52 1274 235 558 219 5.6431 341300.0
3 -122.25 37.85 52 1627 280 565 259 3.8462 342200.0
4 -122.25 37.85 52 919 213 413 193 4.0368 269700.0

Our data has multiple variables that describe the type of house that is being evaluated including:

  • The house configuration & location
  • The median house values and ages
  • The general population & number of households
  • The median income for the area

In order to properly train a model, you need to start by splitting your dataset into a few randomly selected groups:

  • Training Data: A portion of the dataset that will be the data used during model training
  • Test Data: Data used to evaluate the selected model after training

Splitting the dataset is necessary to get an accurate view of how the model performs on completely new data. We split because of a phenomenon called overfitting, where a machine learning model can effectively 'memorize' the data in a dataset; testing the model against held-out test data confirms that it performs as well as it did on the training data. The process of evaluating data to find a representative sample is a topic of its own, but libraries like Scikit-learn offer this functionality as a simple one-liner with the train_test_split function, which we use in our example below.

There is a variable that we want to predict and a set of variables that we can use to predict it. The variable we want to predict is called the target. The variables we use to predict the target are the predictor variables; the model reasons over them to find patterns and to understand how they combine to produce the target. Finally, you have parameters, which are used by the training algorithms to describe how the variables are related to each other.

A common task is to test multiple algorithms against the dataset, passing in a list of parameters for each algorithm to try. Once each algorithm has evaluated the data, their performance can be compared to determine which is the most accurate. A great library to get started with this process is Scikit-learn, specifically their Model selection guide; a minimal sketch of this process follows.
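The sketch below illustrates this comparison with scikit-learn's GridSearchCV. The candidate algorithms, parameter grids, and toy data are illustrative assumptions, not part of the housing example:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy regression data standing in for a real dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Candidate algorithms, each with a grid of parameters to try
candidates = {
    "lasso": (Lasso(), {"alpha": [0.001, 0.01, 0.1]}),
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
}

for name, (estimator, param_grid) in candidates.items():
    # Cross-validate every parameter combination and keep the best
    search = GridSearchCV(estimator, param_grid, cv=5)
    search.fit(X_train, y_train)
    print(name, search.best_params_, search.score(X_test, y_test))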

Here is a great map of the process of finding the right algorithm (referred to as estimator) from scikit-learn. https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Image of machine learning map

The model can be thought of as the answers of the training algorithm. Each entry in the training dataset is passed through the algorithm of choice, in our case LASSO regression, and then the results of those calculations are stored in the model.

So to recap the steps involved in model building:

  1. Split data into train and test samples
  2. Evaluate data to determine target, predictors, and parameters
  3. Evaluate data with algorithms using the defined parameters to determine the best performing algorithm
  4. You now have a model which consists of the best performing algorithm and the parameters unique to your data
  5. Test your model against your test sample data

This was hopefully a helpful breakdown of some of the motivations for machine learning and what exactly a machine learning model is.

The next section is the start of the interactive notebook; feel free to run it yourself by clicking Open In Colab. This notebook covers a simple ML project workflow and shows how arangopipe can be dropped into your existing projects to capture valuable meta-data. It is meant as a super simple “Hello World” of sorts for using arangopipe. By the end of this notebook you will know how to:

  • Create and register an ML project with ArangoML
  • Develop a simple model with Sklearn
  • Log the model building activity with arangopipe

Installation Prerequisites

In [2]:
%%capture
!pip install python-arango
!pip install arangopipe==0.0.6.9.3
!pip install pandas PyYAML==5.1.1 sklearn2
!pip install jsonpickle

Using Arangopipe

Metadata describes the components and actions involved in building the machine learning pipeline. The steps involved in constructing the pipeline are expressed as a graph by most tools, making ArangoDB a natural fit to store and manage machine learning application metadata. Arangopipe is ArangoDB’s tool for managing machine learning pipelines.
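As a hypothetical illustration of why a graph is a natural fit, the lineage of a model could be traced with a single graph traversal through python-arango. This is a sketch only: it assumes a python-arango database handle db (see the 'Try it out!' section below) and a placeholder model document id, with the graph name taken from the enterprise_ml_graph that appears in the configuration output later in this notebook.

# Hypothetical sketch: explore the neighborhood of a model vertex in the metadata graph
cursor = db.aql.execute(
    "FOR v IN 1..3 ANY @start GRAPH 'enterprise_ml_graph' RETURN v",
    bind_vars={"start": "models/example_key"},  # placeholder document id
)
print(list(cursor))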

Creating a Project

To use Arangopipe to track meta-data for projects, projects have to be registered with Arangopipe. For purposes of illustration, we will use the California housing dataset from the UCI Machine Learning Repository. Our project entails developing a regression model with this dataset. We will first register this project with Arangopipe as shown below.

The following code block generates a test database and sets up the arangopipe connection.

*Note: If you receive an error creating the temporary database, please run this code block again.

In [3]:
from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe
from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin
from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig
from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam
mdb_config = ArangoPipeConfig()
msc = ManagedServiceConnParam()
conn_params = { msc.DB_SERVICE_HOST : "arangoml.arangodb.cloud", \
                        msc.DB_SERVICE_END_POINT : "createDB",\
                        msc.DB_SERVICE_NAME : "createDB",\
                        msc.DB_SERVICE_PORT : 8529,\
                        msc.DB_CONN_PROTOCOL : 'https'}
        
mdb_config = mdb_config.create_connection_config(conn_params)
admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)
ap_config = admin.get_config()
ap = ArangoPipe(config = ap_config)
proj_info = {"name": "Housing_Price_Estimation_Project"}
proj_reg = admin.register_project(proj_info)
mdb_config.get_cfg()

# If you receive an error creating the temporary database, please run this code block again.
API endpoint: https://arangoml.arangodb.cloud:8529/_db/_system/createDB/createDB
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Host Connection: https://arangoml.arangodb.cloud:8529
Out[3]:
{'arangodb': {'DB_end_point': 'createDB',
  'DB_service_host': 'arangoml.arangodb.cloud',
  'DB_service_name': 'createDB',
  'DB_service_port': 8529,
  'arangodb_replication_factor': None,
  'conn_protocol': 'https',
  'dbName': 'ML7cfso9exer7rx3jdnsn5t',
  'password': 'MLd1zxgiohnzuxrqc0oyj1a',
  'username': 'ML24boypr78xpfbimhx67vlr'},
 'mlgraph': {'graphname': 'enterprise_ml_graph'}}

Try it out!

Once the previous block has successfully executed you can navigate to https://arangoml.arangodb.cloud:8529 and sign in with the generated credentials to explore the temporary database.
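If you would rather explore programmatically, here is a minimal sketch using the python-arango driver installed earlier. The credential values are placeholders; substitute the dbName, username, and password from the configuration output above.

from arango import ArangoClient

# Connect to the managed service with the generated credentials
client = ArangoClient(hosts="https://arangoml.arangodb.cloud:8529")
db = client.db("<dbName>", username="<username>", password="<password>")

# List the non-system collections arangopipe created for the metadata graph
print([c["name"] for c in db.collections() if not c["name"].startswith("_")])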

Model Building

In this section, we illustrate the procedure for capturing meta-data with Arangopipe as part of the model building activity. Model selection is an important activity for data scientists: many candidate models are considered for a task, and the best performing model is chosen. An example of capturing metadata from a hyper-parameter tuning experiment can be found in the hyperopt guide (see hyperopt). We will use a simpler setting for this notebook and assume model selection has been completed, with a LASSO regression model chosen as the best candidate for the task. Having made this decision, we capture information about the model and its parameters and store it in Arangopipe. The details of performing these tasks are shown below. Before model building, we capture information related to the dataset and the features used to build the model.

Register Dataset

Here we register the dataset that we imported in the intro section. This dataset is available from the arangopipe repo and was originally made available from the UCI ML Repository. The dataset contains data for housing in California, including:

  • The house configuration & location
  • The median house values and ages
  • The general population & number of households
  • The median income for the area
In [4]:
df.head() #prints the first 5 rows of data with headers
Out[4]:
lat long housingMedAge totalRooms totalBedrooms population households medianIncome medianHouseValue
0 -122.22 37.86 21 7099 1106 2401 1138 8.3014 358500.0
1 -122.24 37.85 52 1467 190 496 177 7.2574 352100.0
2 -122.25 37.85 52 1274 235 558 219 5.6431 341300.0
3 -122.25 37.85 52 1627 280 565 259 3.8462 342200.0
4 -122.25 37.85 52 919 213 413 193 4.0368 269700.0

This step registers the dataset we are using with our project. The ability to register information about the project, the dataset used, and other relevant machine learning project metadata is the benefit that arangopipe brings to your workflow. Once your project is complete, you can quickly pull up its data and review or compare it with ease.

There is also an ArangoML custom user interface that provides additional management and visualization options for your ML projects; this is currently only available when running your own projects locally. We have provided a pre-built docker image to make starting your local arangopipe easy; see this guide for getting started with your own local instance.

In [5]:
ds_info = {"name" : "california-housing-dataset",\
            "description": "This dataset lists median house prices in Califoria. Various house features are provided",\
           "source": "UCI ML Repository" }
ds_reg = ap.register_dataset(ds_info)
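With the dataset registered, its metadata can later be pulled up by name. A minimal sketch, assuming arangopipe's lookup_dataset method:

# Retrieve the stored metadata for a registered dataset by name (assumed API)
ds_lookup = ap.lookup_dataset("california-housing-dataset")
print(ds_lookup)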

Register Featureset

Register the features used to develop the model.

  • Note that the response variable has been log transformed
  • Note that when the featureset is registered, it is linked to the dataset
In [6]:
import numpy as np
df["medianHouseValue"] = df["medianHouseValue"].apply(lambda x: np.log(x))
featureset = df.dtypes.to_dict()
featureset = {k:str(featureset[k]) for k in featureset}
featureset["name"] = "log_transformed_median_house_value"
fs_reg = ap.register_featureset(featureset, ds_reg["_key"]) # note that the dataset and featureset are linked here.

Develop a Model

As discussed in the introduction, it is important to have a training set and a test set to be able to evaluate our model with 'new' data. Here we use the train_test_split functionality of sklearn to split the data.

Note that we also set Y to medianHouseValue; Y here is our target.

In [7]:
from sklearn.model_selection import train_test_split
preds = df.columns.to_list()
preds.remove('medianHouseValue')
X = df[preds].values
Y = df['medianHouseValue'].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

Developing the model

Here we have taken some of the guesswork out of model training and decided to go with LASSO regression.

In [8]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
clf = linear_model.Lasso(alpha=0.001)
clf.fit(X_train, y_train)
train_pred = clf.predict(X_train)
test_pred = clf.predict(X_test)
train_mse = mean_squared_error(train_pred, y_train)
test_mse = mean_squared_error(test_pred, y_test)

print(train_mse)
print(test_mse)
0.11656651052933448
0.11375349737667079

To get some insight into what model parameters actually are, here are the basic parameters used in this experiment.

While they won't make much sense to someone not familiar with them, they might offer a starting point if you would like to look more into what exactly model parameters are.

In [9]:
print(clf.get_params())
{'alpha': 0.001, 'copy_X': True, 'fit_intercept': True, 'max_iter': 1000, 'normalize': False, 'positive': False, 'precompute': False, 'random_state': None, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}

Register the Model

  • Note that project and model are linked
  • The notebook associated with the model can be retrieved from GitHub. This can be part of the meta-data associated with the model
In [10]:
import io
import requests
url = ('https://raw.githubusercontent.com/arangoml/arangopipe/master/examples/Arangopipe_Feature_Examples.ipynb')
nbjson = requests.get(url).text

The model information can contain the name you would like to assign to the model, the task, and notebook information.

Once you create the model info properties object, you register it with the project.

In [11]:
model_info = {"name": "Lasso Model for Housing Dataset",  "task": "Regression", 'notebook': nbjson}
model_reg = ap.register_model(model_info, project = "Housing_Price_Estimation_Project")

Log Model Building Activity

In this section we look at the procedure for capturing a consolidated version of this model building activity. The execution of this notebook, or any ML activity, is captured by the 'Run' entity in the Arangopipe schema (see schema). To record the execution, we need to create a unique identifier for it in ArangoDB.

After generating a unique identifier, we capture the model parameters and model performance and then record the details of this experiment in Arangopipe. Each of these steps is shown below.

Note that capturing the 'Run' or execution of this cell captures information that links

  1. The dataset used in this execution (ds_reg)
  2. The featureset used in this execution (fs_reg)
  3. The model parameters used in this execution (model_params)
  4. The model performance that was observed in this execution (model_perf)
In [12]:
import uuid
import datetime
import jsonpickle

ruuid = str(uuid.uuid4().int)
model_perf = {'training_mse': train_mse, 'test_mse': test_mse, 'run_id': ruuid, "timestamp": str(datetime.datetime.now())}

mp = clf.get_params()
mp = jsonpickle.encode(mp)
model_params = {'run_id': ruuid, 'model_params': mp}

run_info = {"dataset" : ds_reg["_key"],\
                    "featureset": fs_reg["_key"],\
                    "run_id": ruuid,\
                    "model": model_reg["_key"],\
                    "model-params": model_params,\
                    "model-perf": model_perf,\
                    "tag": "Housing_Price_Estimation_Project",\
                    "project": "Housing_Price_Estimation_Project"}
ap.log_run(run_info)

Optional: Save the connection information to Google Drive so that it can be used to connect to the instance that was used in this session.

Once you have a database created and a project filled with data, you can save your connection configuration to a file to easily reconnect later.

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
fp = '/content/drive/My Drive/saved_arangopipe_config.yaml'
mdb_config.export_cfg(fp)
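To reconnect in a later session, the saved file can be used to rebuild the configuration. A minimal sketch, assuming ArangoPipeConfig exposes a create_config method that reads the exported YAML (check the arangopipe documentation for the exact API):

# Hypothetical sketch: rebuild the connection from the exported YAML file
from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig
from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin

mdb_config = ArangoPipeConfig()
mdb_config = mdb_config.create_config('/content/drive/My Drive/saved_arangopipe_config.yaml')  # assumed method
admin = ArangoPipeAdmin(reuse_connection = True, config = mdb_config)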

Using Arangopipe with Common Tools in a Machine Learning Stack

This notebook provides the details of working with Arangopipe to capture meta-data from a machine learning project activity. If you would like to see how Arangopipe can be used with some common tools in a machine learning stack:

  1. See TFX for the details of using Arangopipe with TFX
  2. See Pytorch for details of using Arangopipe with Pytorch.
  3. See Hyperopt for details of using Arangopipe with Hyperopt
  4. See MLFlow for details of using Arangopipe with MLFlow.