This post is the fourth in a series of posts introducing ArangoML and showcasing its benefits to your machine learning pipelines. Until now, we have focused on ArangoML’s ability to capture metadata for your machine learning projects, but it does much more.
In this post we:
- Introduce the concept of covariate shift in datasets
- Showcase the built-in dataset shift detection API
Posts in this series:
ArangoML Part 1: Where Graphs and Machine Learning Meet
ArangoML Part 2: Basic Arangopipe Workflow
ArangoML Part 3: Bootstrapping and Bias Variance
ArangoML Part 4: Detecting Covariate Shift in Datasets
ArangoML Series: Intro to NetworkX Adapter
%%capture
!pip install python-arango
!pip install arangopipe==0.0.6.9.3
!pip install pandas PyYAML==5.1.1 sklearn2
!pip install jsonpickle
We use a few terms throughout this notebook that will be helpful to be aware of, and this section explains some of them. Machine learning is frequently applied to perform supervised learning. Supervised learning involves training a model to predict an entity of interest, given other entities that we think can be used to predict it.
The entity of interest is called the target. The entities that we use to predict it are called predictors.
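To make the target/predictor split concrete, here is a minimal sketch using a toy DataFrame with a few of the column names from the California housing dataset used later in this notebook (the values are made up for illustration):

```python
import pandas as pd

# Toy frame shaped like the California housing data:
# "medianHouseValue" is the target; the remaining columns are predictors.
df = pd.DataFrame({
    "medianIncome": [8.3, 5.6, 3.2],
    "housingMedAge": [41, 21, 52],
    "medianHouseValue": [452600.0, 358500.0, 352100.0],
})

y = df["medianHouseValue"]               # target
X = df.drop(columns="medianHouseValue")  # predictors
print(list(X.columns))  # ['medianIncome', 'housingMedAge']
```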
When the target can take a continuous range of values, the learning task is called regression. Examples of targets that could take a continuous range of values include:
- House prices
- Cholesterol level in a blood sample
- A customer's online shopping budget
In contrast, there are applications where the target can only take one of a fixed set of values. In this case, the target is discrete or categorical in contrast to being continuous. Examples of discrete targets would be:
- The threat level of a request that is received (intruder/benign)
- Disease status of an individual (infected/not-infected)
When the target is discrete or categorical, the learning task is called classification. This series started with a regression example: in part 2, we developed a regression model using LASSO regression.
When developing a model, we determine the model’s parameters as part of the training process. The model parameters are estimated from the data and are the coefficients associated with the regression line.
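As a quick sketch of what "estimating parameters from the data" means, the snippet below fits scikit-learn's Lasso (the model family from part 2 of this series) to synthetic data with a known relationship; the fitted coefficients are the model's parameters. The data and settings here are illustrative, not the notebook's actual model:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# True relationship: y = 2*x0 - 1*x1 + noise; x2 is irrelevant.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.05).fit(X, y)
print(model.coef_)       # estimated parameters, close to [2, -1, 0]
print(model.intercept_)  # close to 0
```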
However, the data in the training sample may not be representative of the data the application is receiving now, for reasons such as:
- The data we collected initially may be subject to sampling biases.
- We may have intentionally or unintentionally made mistakes in collecting the data.
- Market and business conditions may have changed, so the same houses are now priced differently.
It is possible to account for variation between years with your predictors, but the model becomes inaccurate if the underlying data distribution changes too much. This potential inaccuracy due to changing data is known as covariate shift. As you can imagine, it is crucial to ensure your model keeps providing accurate results as new data arrives in production. That is why the rest of this notebook looks at detecting covariate shift in datasets.
We will continue with the dataset we have been using in this series, the California housing dataset. The following code block loads the data, draws a random sample of 500 rows, and prints the first few rows to give a feel for the dataset.
This dataset is available from the arangopipe repo and was initially made available from the UCI ML Repository. The dataset contains data for housing in California, including:
- The house configuration & location
- The median house values and ages
- The general population & number of households
- The median income for the area
import pandas as pd
import numpy as np

data_url = "https://raw.githubusercontent.com/arangoml/arangopipe/arangopipe_examples/examples/data/cal_housing.csv"
df = pd.read_csv(data_url, error_bad_lines=False)
df = df.sample(n = 500)
df.head()
Let's Get Started
Ok, so all that is well and good, but how do we detect that the underlying dataset distribution changed? We can employ a simple technique to do that.
This post will illustrate how a dataset shift can occur and how you can identify it. As discussed earlier, a dataset shift can occur due to sampling errors. For this example, we will deliberately make a sampling error when we acquire the training data used to develop the model. After model development, we receive a new batch of data. The question we will answer is:
Is the current batch of data different in a distributional sense from the one used to build the current model?
We will use machine learning to solve this problem!
Here are some of the steps we will take.
- Tag the data from the batch used to build the current production model as 0.
- Tag the batch of data that we have received since then as 1.
- Develop a model to discriminate between these two labels.
- Evaluate the results and adjust the model if necessary.
If the model we develop can discriminate too well between the two batches of data, then a covariate shift has occurred, and we need to revisit modeling.
If the model cannot discriminate well between these two batches (for example, the classifier we develop produces an accuracy of about 0.5), then it is not very discriminatory: it performs only as well as tossing a fair coin. If we observe such a result, we conclude that no significant dataset shift has occurred, and our current model will continue to serve us well.
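The steps above can be sketched by hand with scikit-learn. This is an assumption about how such a detector works in general, not the implementation of ArangoML's built-in API; the function name `shift_score` and the synthetic data are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def shift_score(batch_a: pd.DataFrame, batch_b: pd.DataFrame) -> float:
    """Accuracy of a classifier trained to tell the two batches apart.
    ~0.5 means no detectable shift; ~1.0 means a clear shift."""
    X = pd.concat([batch_a, batch_b], ignore_index=True)
    y = np.concatenate([np.zeros(len(batch_a)), np.ones(len(batch_b))])  # tags 0 and 1
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# Two samples from the same distribution -> score near 0.5
rng = np.random.default_rng(1)
same = pd.DataFrame(rng.normal(size=(400, 3)), columns=list("abc"))
print(shift_score(same.iloc[:200], same.iloc[200:]))

# Second batch shifted by 3 standard deviations -> score near 1.0
shifted = same.iloc[200:] + 3.0
print(shift_score(same.iloc[:200], shifted))
```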
We illustrate this idea with the California housing dataset. The machine learning task associated with the dataset is to predict the median house value given a set of predictors. The rest of the notebook illustrates the idea discussed above.
Exploring the Data
Let's take some time to explore the data a bit. First, we look at the types and shapes of the data to see what is available.
req_cols = df.columns.tolist()
req_cols.remove("medianHouseValue")
df = df[req_cols]
df.dtypes
lat              float64
long             float64
housingMedAge      int64
totalRooms         int64
totalBedrooms      int64
population         int64
households         int64
medianIncome     float64
dtype: object
df["lat"].describe()

count    500.000000
mean    -119.538660
std        2.032065
min     -124.300000
25%     -121.830000
50%     -118.440000
75%     -118.027500
max     -115.370000
Name: lat, dtype: float64
When we plot the histogram of the lat variable, we see two populations (see below):
- A group with lat values less than -119
- A group with lat values greater than -119
import matplotlib.pyplot as plt
%matplotlib inline
df["lat"].hist()
[Histogram of df["lat"], showing the two groups split around -119]
Let’s pretend that the current batch of data used to develop our regression model is the first one, the group with lat values less than -119. By not selecting any houses with lat values greater than -119, we have introduced a sampling error. When selecting data for model development, you would want to use the entire range of lat values. Our second group consists of houses with lat values greater than -119.
df1 = df.query("lat <= -119")
df2 = df.query("lat > -119")
from arangopipe.arangopipe_analytics.rf_dataset_shift_detector import RF_DatasetShiftDetector

rfd = RF_DatasetShiftDetector()
score = rfd.detect_dataset_shift(df1, df2)
print("Dataset shift score : %2.2f" % (score))

Dataset shift score : 1.00
Interpretation of the score reported by the shift detector
The API uses a classifier to discriminate between the datasets provided to it. The score reported by the API is the accuracy of the classifier to discriminate between the datasets. Values close to 0.5 indicate that the classifier is not able to discriminate between the two datasets. This could be interpreted as a situation where no discernable shift has occurred in the data since the last model deployment. Values close to 1 indicate that the dataset shift is detectable, and we may need to revisit modeling.
How a dataset shift affects the performance of the deployed model is problem-dependent, so we must assess the score in the context of a particular application. Usually, we perform experiments to determine a threshold value for the dataset shift score that represents an acceptable level of drift.
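One simple way to calibrate such a threshold, sketched below under the assumption that you can resample your reference batch: repeatedly score random half-splits of the same batch against each other to see what "no shift" looks like, then flag incoming batches that score well above that baseline. The helper `no_shift_baseline` is hypothetical; `detector` stands for any function with the same signature as `rfd.detect_dataset_shift`:

```python
import numpy as np
import pandas as pd

def no_shift_baseline(df: pd.DataFrame, detector, n_trials: int = 20, seed: int = 0):
    """Score random half-splits of one batch against each other.
    The spread of these scores shows what 'no shift' looks like."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        idx = rng.permutation(len(df))
        half = len(df) // 2
        a = df.iloc[idx[:half]].reset_index(drop=True)
        b = df.iloc[idx[half:]].reset_index(drop=True)
        scores.append(detector(a, b))
    return np.mean(scores), np.std(scores)

# e.g. flag an incoming batch when its score exceeds mean + 3 * std of the baseline
```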
This post introduced covariate shifts in datasets and provided an example of how easy it is to test this with ArangoML’s built-in API. The Introduction to ArangoML series will continue, so be sure to sign up for our newsletter to be notified of the next release!
You can also join us on the ArangoML Slack channel if you have any questions or comments.
If you have something cool you are working on with ArangoML or ArangoDB in general, we would love to learn about it. We now have an interactive tutorials repository where you can share your own notebook with the community, so check it out!