Two Hidden Pitfalls in the Machine Learning Lifecycle: Data Leakage and One More!

Addressing two common, hard-to-notice mispractices that can hurt the results of machine learning systems even when everything else was done well!

Almost every step of building a machine learning project offers a chance to do something incorrectly. Often it is a small mispractice that is hard to notice but can completely ruin the results.

Here is what I mean…

The failure of a machine learning project can be caused by many factors, but two common pitfalls are data leakage and inconsistent data preprocessing functions. In this article, I will talk about these two challenges and how to avoid them.

Data Leakage

There are many steps involved in building a machine learning project. Data leakage can happen in any of those steps. What is data leakage?

Let’s understand it with an example. Let’s say that you own an online shopping site and you want to predict the gender of your customers in order to offer them relevant products. That’s a good use of machine learning.

You have many features in the training data, such as the latest purchased product. Among these features, there is one called ‘Group’ that combines gender with a number (something like M-12 or F-45). For every female customer, the Group value starts with F; for every male customer, it starts with M.

You train a model and it works perfectly well on the training and testing data. When moved to production, it fails. What is the reason? Well, you fed the model data that you should not have fed it. The Group feature already included gender, so the model relied on that single feature to predict the gender and, as a result, cannot generalize well to future data. That is data leakage.
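One rough way to spot a feature leak like this is to check whether a feature, or a piece of it, determines the target on its own. The sketch below uses made-up Group and gender values (the data and the helper name target_values_per_key are hypothetical, not taken from a real dataset) and checks whether the Group prefix always maps to a single gender.

```python
from collections import defaultdict


def target_values_per_key(keys, targets):
    """Map each key to the set of target values observed alongside it."""
    seen = defaultdict(set)
    for key, target in zip(keys, targets):
        seen[key].add(target)
    return dict(seen)


# Hypothetical toy data in the 'M-12' / 'F-45' style described above.
group = ["M-12", "F-45", "F-07", "M-33", "F-19"]
gender = ["M", "F", "F", "M", "F"]

# Keep only the part of Group before the dash.
prefixes = [value.split("-")[0] for value in group]

mapping = target_values_per_key(prefixes, gender)

# If every prefix is tied to exactly one target value, the feature gives
# the answer away and should be treated as a suspected leak.
is_suspected_leak = all(len(values) == 1 for values in mapping.values())
print(mapping, is_suspected_leak)
```

A check like this is only meaningful for features with few distinct values. A unique ID would pass it trivially, so treat a positive result as a prompt to inspect the feature, not as proof.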

Data leakage can happen in many other ways, and most of the time we don’t know that we have leaked data. Let’s take an example from data preprocessing, specifically feature scaling. It is true that we should preprocess the test data in the same way that the training data was preprocessed, but we can leak data if we are not careful.

Let’s take an example… I am going to use Sklearn, a classical ML framework, to scale the training and test data. Assume that everything else was done and that this is the last step before training a model, since it’s good practice to scale the numerical features.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

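Have you noticed something wrong above? I fitted the scaler to the test data when I only had to transform it. Fitting again recomputes the mean and standard deviation from the test set, so the test data is scaled with its own statistics instead of the ones learned from the training set. This is another way that data can be leaked. I made this mistake many times :( before I learned that the test set should only be transformed.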

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
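
To make the difference concrete, here is a minimal sketch on synthetic data (the arrays are generated for illustration and the test set is deliberately shifted; none of this comes from a real dataset). It compares transforming the test set with the training statistics against re-fitting on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: the test set is deliberately shifted away from the training set.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 1))
X_test = rng.normal(loc=5.0, scale=1.0, size=(50, 1))

# Correct: learn the mean and standard deviation from the training set only.
scaler = StandardScaler().fit(X_train)
X_test_correct = scaler.transform(X_test)

# Mistake: re-fitting on the test set scales it with its own statistics.
X_test_refit = StandardScaler().fit_transform(X_test)

# The re-fitted version is centred at zero, which hides the shift
# that the correctly transformed version still shows.
print(X_test_correct.mean(), X_test_refit.mean())
```

With the correct version, the model sees test values on the same scale it was trained on. With the re-fitted version, the same raw value maps to a different scaled value in training and in testing.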

These were two examples in which data is leaked. The first is a feature leak. In the second, the leak happened during feature engineering. Leakage can also happen during random partitioning/splitting of the data and during data augmentation: if splitting or augmentation leaves duplicate examples/instances in both the train and test sets, that is also leakage.
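Duplicates across the two sets can be checked with a few lines of plain Python. This is only a sketch: count_shared_rows is a hypothetical helper, the data is made up, and it catches exact duplicates only, not near-duplicates produced by augmentation.

```python
def count_shared_rows(X_train, X_test):
    """Count test rows that also appear, value for value, in the training set."""
    train_rows = {tuple(row) for row in X_train}
    return sum(tuple(row) in train_rows for row in X_test)


# Hypothetical toy data: the row [3.0, 4.0] appears in both sets.
X_train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
X_test = [[3.0, 4.0], [7.0, 8.0]]

shared = count_shared_rows(X_train, X_test)
print(shared)
```

Any count above zero means some test examples were effectively seen during training, so the test score will look better than it should.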

It might seem that data leakage is inevitable. Here is a summary of what we can do to minimize the chance of leaking data:

  • Before any data preprocessing, split the data into training and testing sets, and don’t touch the test set until you have finished improving the model.
  • When using tools like Sklearn, do not call the fit method on the test data. If you have to preprocess the test set, only use the transform method.
  • Remove all features that carry the same information as the target feature. If you’re predicting the annual income of an employee, get rid of the monthly income feature in the training features.
  • Get rid of examples that are duplicated between the training and test sets.
  • As much as possible, use pipelines to transform data.
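
The first, second, and last points above can be combined in one short sketch. It assumes synthetic regression data from make_regression and a StandardScaler plus LinearRegression pipeline, chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# Split first, before any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training data only...
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# ...and predict() reuses those fitted statistics to transform the test data.
mse_test = mean_squared_error(y_test, model.predict(X_test))
print(mse_test)
```

Because the scaler lives inside the pipeline, there is no separate fit_transform call that could accidentally be applied to the test set.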

Inconsistent Data Preprocessing Functions

Machine learning results are unforgiving of any mispractice in data preprocessing. At any point, a small mistake can completely ruin the whole project. Let’s take an example: training a linear regression model. Try to notice what’s going on…

from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Getting the data from Sklearn datasets
X, y = load_boston(return_X_y=True)
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scaling the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Building and training a model
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)
# Evaluating the model on the train set
prediction_train = lin_reg.predict(X_train_scaled)
mse_train = mean_squared_error(y_train, prediction_train)
# Evaluating the model on the test set
prediction_test = lin_reg.predict(X_test)
mse_test = mean_squared_error(y_test, prediction_test)

It’s very likely that you have caught me. I scaled the training data and trained the model on the scaled data, but evaluated it on test data that was not scaled. Comparing mse_train with mse_test reveals a very large gap between the training and test errors, which shows that I messed up. Here is the evaluation with the test data scaled:

X_test_scaled = scaler.transform(X_test) # Note I don't fit_transform on test set
prediction_test = lin_reg.predict(X_test_scaled)
mse_test = mean_squared_error(y_test, prediction_test)

There are times when I don’t know what is wrong with my results, but after checking, it turns out to be a similar mistake. It can happen to many people as well.

The above scenario can happen in many ways. If you have scaled the training data with a given scaling technique (say normalization, a.k.a. min-max scaling), then you should never standardize the test set. It should also be normalized, using the scaler that was fitted on the training data.
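A tiny sketch of that mismatch, using made-up one-column arrays, shows how differently the same test values come out under the two techniques:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data with a single feature.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[2.0], [4.0]])

# The training data is normalized with min-max scaling.
minmax = MinMaxScaler().fit(X_train)
X_train_scaled = minmax.transform(X_train)

# Consistent: the test set goes through the same fitted min-max scaler.
X_test_consistent = minmax.transform(X_test)

# Inconsistent: the test set is standardized instead, so its values land
# on a different scale from the one the model was trained on.
X_test_mismatched = StandardScaler().fit(X_train).transform(X_test)

print(X_test_consistent.ravel(), X_test_mismatched.ravel())
```

The consistent version stays inside the 0 to 1 range seen during training, while the standardized version contains negative values that a model trained on min-max scaled inputs never saw.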

Did you see a programming error above? I didn’t. Forgetting to scale the test set raised no error at all; the only symptom was poor results. And this is why ML is hard. Understanding things like this becomes helpful when it comes to diagnosing the results of machine learning models.
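The silent failure can be reproduced on synthetic data. The sketch below assumes made-up features with a large mean and spread, so that skipping the scaling step matters. Both predictions run without any exception, and only the error values reveal the problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features far from zero mean and unit variance.
X = rng.normal(loc=50.0, scale=10.0, size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train on scaled training data.
scaler = StandardScaler()
lin_reg = LinearRegression().fit(scaler.fit_transform(X_train), y_train)

# Mistake: predicting on the raw test set raises no error.
mse_unscaled = mean_squared_error(y_test, lin_reg.predict(X_test))

# Correct: transform the test set with the scaler fitted on the training set.
mse_scaled = mean_squared_error(y_test, lin_reg.predict(scaler.transform(X_test)))

print(mse_unscaled, mse_scaled)
```

Since nothing fails loudly, comparing the training and test errors after every preprocessing change is a cheap way to catch this kind of slip.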

This is the end of the article. The key takeaways are:

  • Although data leakage can seem inevitable, its effect can be minimized. Always remember to split your data into train and test sets, and don’t use the test set in model training. Also, remove leaky features and duplicates.
  • Use consistent data preprocessing functions across training and test sets. Using data preprocessing pipelines can help not only to have seamless data transformations but also to avoid data leakage.

Thank you for reading!
