ML Model is 5% — What should we be doing?
A new way of diagnosing machine learning systems!
The standard way of doing machine learning has focused on choosing the best learning algorithm and tuning its hyperparameters to get good accuracy. There is no denying that this approach yields some gains in accuracy or other desired metrics. However, a recent trend in the ML community suggests something different: recognizing that the model is a small fraction of what it takes to build an effective, working machine learning system, and spending the time on improving the data instead. In other words, the data-centric approach.
In this article, I want to talk about the data-centric approach and tie it to the ML workflow. At the end, I will point to resources to learn more.
Overview
- Data-centric approach vs model-centric approach
- ML code is 5%, Zooming into a typical ML system
- Starting the model development quickly
- Iterating on reducing the error
- Iterating on the data improvement
- Conclusion
Data-centric approach vs model-centric approach
The standard approach (most visible in competitions and academic research) is to build a model that learns well from the available training data, with any improvement coming from tuning hyperparameters. That is the model-centric approach. The data is kept fixed, and the rest of the work is to find the best learning algorithm. Whether the algorithm is sophisticated or not does not matter, as long as it does well on the available data.
On the flip side, the data-centric approach keeps the model fixed (for example, the baseline you started with) and puts the rest of the work into improving the data.
To summarize, an ML system is made of code and data. The model-centric approach improves the code, while the data-centric approach improves the data.
ML code is 5%, Zooming into a Typical ML System
To emphasize the point that ML code is 5% of the whole system, I will borrow this image here.
As you can see, the model is a tiny fraction of a real-world ML system. The other 95% is things like data collection, labeling, cleaning, model evaluation, deployment, maintenance, and monitoring. There is additional engineering work too, as you can see in the above workflow.
Rushing any part of the 95% can introduce technical debt. But of course, it will not hurt to get the 5% (the model) done quickly…
Starting the model development quickly
Moving into the data-centric approach, we start the modeling work with the data we have at the moment, or a fraction of it, in order to build a simple model (a baseline). With real-world datasets, there is no guarantee that the first model will be good anyway, so it is better to start simple. A baseline is not limited to something you build yourself. It can also be something that already works well, such as a pretrained network. The key idea is to start development quickly, if only to see whether the work is worth pursuing.
Model building is an iterative process, and this is only a single step; you can always come back to improve the model. Once we have established a baseline, the next goal is to iterate on reducing the error.
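As a rough illustration of starting simple, here is a minimal sketch of a baseline trained on a fraction of the data. The nearest-centroid classifier and the helper names (`sample_fraction`, `fit_centroids`, `predict`) are my own choices for illustration, not something prescribed by the data-centric approach; any simple model would serve the same purpose.

```python
import random
from collections import defaultdict


def sample_fraction(examples, fraction, seed=0):
    """Pick a random fraction of (features, label) pairs to start with."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)


def fit_centroids(examples):
    """Compute the mean feature vector (centroid) of each class."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical toy data: two features per example, two classes.
toy_data = [
    ([0.0, 0.0], "small"), ([0.0, 2.0], "small"), ([1.0, 1.0], "small"),
    ([9.0, 9.0], "large"), ([10.0, 10.0], "large"), ([8.0, 9.0], "large"),
]
baseline = fit_centroids(toy_data)
print(predict(baseline, [1.0, 0.5]))
```

On a real dataset you would fit on `sample_fraction(data, 0.1)` or similar first, look at the result, and only then decide whether a heavier model is worth the effort.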
Iterating on reducing the error
Once you have a reasonable model, it’s time to train it on the available dataset. The goal at this point is not to reach 99% accuracy, nice as that would be. We mainly want to see the errors the learning algorithm is making. As noted earlier, a learning algorithm often will not generalize well after the first round of training, which again underlines that machine learning is an iterative process.
Here are questions that we can use to guide us in the error reduction phase:
- Is the model doing poorly on all classes?
- Is the model doing poorly on one specific class?
- Is it because there are not enough data points for that particular class compared to other classes?
- There are trade-offs and limits on how much you can do to reduce the error. Is there room for improvement on what you are trying to improve?
These questions are worth answering, and you will probably have more domain-specific questions to add. Answering them tells you whether there is still room to improve your performance metric.
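One way to start answering the first three questions is to break the errors down by class. The sketch below is a minimal example under my own assumptions: `per_class_report` is a hypothetical helper, and the labels are made up for illustration.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """For each true class, return its example count and its error rate."""
    support = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {
        label: {"support": n, "error_rate": errors[label] / n}
        for label, n in support.items()
    }


# Hypothetical labels and predictions from a first training run.
y_true = ["cat", "cat", "cat", "cat", "dog", "dog"]
y_pred = ["cat", "cat", "cat", "dog", "cat", "cat"]

for label, stats in per_class_report(y_true, y_pred).items():
    print(label, stats)
```

A class with a small `support` and a high `error_rate` is a hint that the problem may be too few examples of that class rather than the learning algorithm itself.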
To summarize, do not fall into thinking that 99% accuracy is the best indicator that your learning algorithm is doing well. A quick example to wrap up: if you train a cat and dog classifier on 1000 images of cats and 10 images of dogs, it can reach 99% accuracy simply by always predicting a cat (1000 of 1010 images correct), yet the classifier is worthless because it predicts a cat even when the input image is a dog.
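The arithmetic behind this example can be checked with a few lines of code. This is only a sketch: the "classifier" here is simulated as a list of predictions that are always "cat", and evaluating it on the same 1000/10 split is my assumption for the sake of the illustration.

```python
# Simulate the skewed dataset and a classifier that always predicts "cat".
y_true = ["cat"] * 1000 + ["dog"] * 10
y_pred = ["cat"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all examples that were predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the examples of one class that were predicted correctly."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


acc = accuracy(y_true, y_pred)            # 1000 / 1010, roughly 0.99
cat_recall = recall(y_true, y_pred, "cat")  # every cat is found
dog_recall = recall(y_true, y_pred, "dog")  # no dog is ever found
balanced = (cat_recall + dog_recall) / 2    # average of per-class recall

print(f"accuracy={acc:.3f} dog_recall={dog_recall:.1f} balanced={balanced:.1f}")
```

Overall accuracy hides the failure, while the per-class recall and their average expose it immediately.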
Iterating on the data improvement
Taking a step back, most of the things we have talked about point to data problems: class imbalance or skewed datasets, or not enough training examples. That may reflect the bias of this article, but as mentioned earlier, its goal is to look at the ML workflow with a focus on data. “A good model is good data” is a notion that has been stated a lot in the machine learning community.
There are two ways to improve the data. We can either add more data or improve the quality of the existing data. One technique for creating or synthesizing data without having to hunt for it is data augmentation. Although data augmentation can produce many more training examples, it’s important to be strategic. A quick example with the cat and dog classifier we used earlier: adding more cat images won’t do much, since we already have enough of them. Instead, we would want to create more dog images.
Here are more ideas when doing data augmentation:
- Create realistic-looking images, realistic enough that you too can recognize them without extra effort.
- Keep the classes balanced, and aim to create images like the ones the model did poorly on.
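The ideas above can be sketched as targeted augmentation that only adds images to underrepresented classes. This is a minimal illustration under my own assumptions: images are plain nested lists of grayscale pixel values, and `hflip`, `brighten`, and `augment_underrepresented` are hypothetical helpers, not functions from any library.

```python
from collections import defaultdict
from itertools import product


def hflip(image):
    """Mirror an image left to right; a flipped dog still looks like a dog."""
    return [row[::-1] for row in image]


def brighten(image, delta=10):
    """Raise every pixel by delta, capped at the maximum value 255."""
    return [[min(255, px + delta) for px in row] for row in image]


def augment_underrepresented(dataset, transforms):
    """Add transformed copies only to classes smaller than the largest one.

    Each (transform, image) pair is used at most once, so a class may stay
    below the target if there are too few originals or transforms.
    """
    by_label = defaultdict(list)
    for image, label in dataset:
        by_label[label].append(image)
    target = max(len(images) for images in by_label.values())

    extra = []
    for label, images in by_label.items():
        needed = target - len(images)
        for transform, image in product(transforms, images):
            if needed <= 0:
                break
            extra.append((transform(image), label))
            needed -= 1
    return list(dataset) + extra


# Hypothetical 2x2 "images": four cats and one dog.
cat = [[0, 1], [2, 3]]
dog = [[10, 20], [30, 40]]
dataset = [(cat, "cat")] * 4 + [(dog, "dog")]
augmented = augment_underrepresented(dataset, [hflip, brighten])
```

Note that the cat class is left untouched, and the dog class only grows as far as the available transforms allow. With real images you would pick transforms that keep the result recognizable to a human.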
Data augmentation will often improve your accuracy. For unstructured datasets (images, sounds, etc.), this is widely accepted. For structured datasets (usually in a table-like format), however, you may want to do conventional feature engineering, where you derive new features from existing ones. This often works better than adding new data points.
Once we have expanded the training set, we can train the learning algorithm again, check the errors again, and see whether the data needs further improvement.
To summarize, a good model comes from good data. When creating more training examples, quality matters more than quantity.
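For the structured case, here is a small sketch of deriving new features from existing columns. The column names (`total_price`, `quantity`, `signup_date`, `order_date`) and the derived features are hypothetical, chosen only to illustrate the idea.

```python
from datetime import date


def add_derived_features(row):
    """Return a copy of the row with features computed from existing columns."""
    enriched = dict(row)
    # Ratio of two existing columns; guard against division by zero.
    enriched["unit_price"] = (
        row["total_price"] / row["quantity"] if row["quantity"] else None
    )
    # Difference between two existing dates, in days.
    enriched["days_since_signup"] = (row["order_date"] - row["signup_date"]).days
    # Day of the week of the order (Monday is 0, Sunday is 6).
    enriched["order_weekday"] = row["order_date"].weekday()
    return enriched


# Hypothetical table row.
row = {
    "total_price": 30.0,
    "quantity": 3,
    "signup_date": date(2021, 1, 1),
    "order_date": date(2021, 1, 11),
}
features = add_derived_features(row)
print(features["unit_price"], features["days_since_signup"])
```

No new rows were collected here; the same data simply exposes more of the signal it already contains.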
Conclusion
Building an effective machine learning system is an iterative process. Error analysis is iterative, data improvement is iterative, and model training is iterative too (by default). If you always aim to improve the ingredients (data), there will be less to worry about in the recipe (model), and the results will show it.
References and Further Learning
- Machine Learning Engineering for Production (MLOps) Specialization, DeepLearning.AI.
- A Chat with Andrew on MLOps: From Model-centric to Data-centric AI.
- Hidden Technical Debt in Machine Learning Systems.
Extra Notes
- In this article, the terms machine learning model, ML code, and learning algorithm are used interchangeably.