Data labeling often turns out to be the biggest obstacle in machine learning: collecting large amounts of data, processing it, and labeling it to build a sufficiently performant model can take weeks or even months. Active learning lets you train machine learning models on far less labeled data. The best AI companies like … We believe that you need it too.
Traditional vs. Active Learning: Building a Spam Filter
Imagine you need to create a spam filter for your email. The traditional approach consists of collecting a large number of emails, labeling them as "spam" or "not spam", and then training a machine learning classifier to distinguish the two classes. The traditional approach assumes that all data is equally valuable, but most datasets suffer from class imbalance, noise, and severe redundancy.
In the traditional approach, time is wasted labeling data that does not improve your model's performance. And you won't even know whether your model works until the labeling is complete.
People don't need thousands of randomly labeled examples to tell spam from regular mail. If you were teaching a person to solve this problem, you would expect to show them a few examples of what you need, have them learn quickly, and have them ask questions when they are unsure.
Active learning follows the same principle: it uses the model being trained to find and label only the most valuable data.
In active learning, you first provide a small number of labeled examples. The model is trained on this "seed" dataset. The model then "asks questions" by picking the unlabeled examples it is least confident about, so that humans can "answer" by labeling them. The model is updated and the process repeats until good enough accuracy is achieved. Because a human trains the model iteratively, it can be improved in less time and with far less labeled data.
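The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: it uses scikit-learn, a synthetic dataset, and the pool's hidden labels standing in for a human annotator's answers.

```python
# Minimal sketch of a pool-based active learning loop (assumed setup:
# synthetic data, logistic regression, least-confidence querying).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Start from a small labeled "seed" set; the rest is the unlabeled pool.
labeled = list(range(10))
pool = list(range(10, len(X)))

model = LogisticRegression(max_iter=1000)
for _ in range(20):
    model.fit(X[labeled], y[labeled])
    # "Ask a question": pick the pool example the model is least sure about.
    probs = model.predict_proba(X[pool])
    least_confident = int(np.argmin(probs.max(axis=1)))
    idx = pool.pop(least_confident)
    labeled.append(idx)  # a human annotator would supply y[idx] here

print(f"labeled examples used: {len(labeled)}")
print(f"accuracy on full set: {model.score(X, y):.2f}")
```

After 20 rounds the model has seen only 30 labeled examples, yet each one was chosen because it was maximally informative at the time.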
How does the model find the next examples that need labeling? Here are the most common strategies:
- choosing the example whose predictive distribution has the highest entropy
- choosing the example where the model's top prediction has the lowest confidence
- training multiple models and selecting the examples on which they disagree.
We use our own methods, based on Bayesian deep learning, to obtain better uncertainty estimates.
Three benefits of using active learning
1. You spend less time and money on data markup
Active learning has been shown to deliver large savings in data labeling across a wide range of tasks and datasets, from computer vision to NLP. Since data labeling is one of the most expensive parts of training modern machine learning models, this alone is a significant factor!
Using active learning leads to higher model accuracy with less labeled data
2. You get much faster feedback on model performance.
Typically, data labeling is completed before any model training begins or any feedback is received. It often takes days or weeks of reworking the annotation and labeling guidelines, only to discover that the model underperforms or that differently annotated data is required. Because active learning trains the model many times during the labeling process, you get feedback early and can fix problems that would otherwise surface much later.
3. Ultimately, the accuracy of the model turns out to be much higher.
People are often surprised that models trained through active learning not only learn faster, they also converge to a better final model (with less data). We are often told that more data is better, so it's easy to forget that data quality matters just as much as quantity. If a dataset contains conflicting examples that are hard to label accurately, they can degrade the final model's performance.
The order in which the model sees examples also matters. There is a whole subfield of machine learning called curriculum learning that studies improving model performance by teaching simple concepts first and more complex ones later. It's like learning arithmetic before algebra. Active learning naturally creates a curriculum for models and helps them achieve higher accuracy.
If active learning is so good, why isn’t everyone using it?
Most tools and processes for building machine learning models were developed without active learning in mind. In many companies, different teams handle data labeling and model training, and active learning requires combining these processes. Even if you manage to get those teams working together, you still need substantial infrastructure connecting model training to the annotation interfaces. Most software libraries assume all data is already labeled before model training begins, so applying active learning means rewriting a lot of boilerplate. You also need to figure out how best to host the model, connect it to the annotator team's interfaces, and keep it updated as labels arrive asynchronously from different annotators.
In addition, modern deep learning models are very slow to update, so frequent retraining is painful. Nobody wants to label a hundred examples and then wait 24 hours for the model to fully retrain before labeling the next hundred. Moreover, deep learning models typically have millions or billions of parameters, and obtaining quality uncertainty estimates from such models is still an open research problem.
If you read scientific papers on active learning, you might conclude that it saves only a little on labeling while requiring a lot of work. However, these papers can be misleading, since they work with academic datasets that are usually balanced and cleaned. They almost always label one example at a time, and the authors overlook that not every example is equally easy to label. For more realistic problems with large class imbalance, noisy data, and varying labeling costs, the benefits can be far greater than the literature suggests. In some cases, labeling costs can drop by as much as tenfold.
How to use active learning today
One of the most significant barriers to active learning is having the right infrastructure. Just as tools like Keras and PyTorch have greatly reduced the pain of computing gradients, new tools are emerging to make active learning much easier.
There are open-source Python libraries, such as modAL, that take over most of the boilerplate. modAL is built on top of scikit-learn and lets you combine different models with whatever query strategy you want. Its advantages are the wide range of methods it implements, its modular structure, and its open code. The disadvantage of libraries like modAL is that they do not include annotation interfaces, leaving it to developers to host the model and connect it to annotation tooling.
This brings us to the question of annotation interfaces:
Prodigy is probably the most popular tool for solo data scientists. It is an annotation interface created by the authors of spaCy, and it naturally integrates with their excellent NLP library for simple active learning. Its source code is closed, but you can download it as a pip wheel and install it locally. While Prodigy works well for individuals, it is not designed for annotator teams and implements only the most basic forms of active learning. To make it work with a team, you still have to host Prodigy yourself, and building such a system can be a lot of work.
Labelbox provides interfaces for many image annotation tasks and recently added text support. Unlike Prodigy, Labelbox was designed with annotator teams in mind and has more tools for verifying label correctness. It does not natively support active learning or model training, but it does allow loading model predictions into the annotation interface via its API. This means that if you have implemented example selection and are training a model, you can build an active learning loop. However, the bulk of the work will still fall to you.
In summary, active learning:
- Reduces the amount of data that needs labeling, significantly cutting one component of cost.
- Provides faster feedback on model performance.
- Produces models with higher accuracy.