Positive-Unlabeled learning and where to find it

Who am I
My name is Dmitry Ivanov. I am a second-year postgraduate student in economics at HSE St. Petersburg. I work in the research group “Agent Systems and Reinforcement Learning” at JetBrains Research, as well as in the international laboratory “Game Theory and Decision Making” at HSE. Besides PU learning, I am interested in reinforcement learning and research at the intersection of machine learning and economics.
Preamble

Fig. 1a. Positive data
Figure 1a shows a set of points generated by some distribution, which we will call positive. All other possible points that do not belong to the positive distribution will be called negative. Try to mentally draw a line separating the positive points presented from all possible negative points. By the way, this task is called “anomaly detection”.

Fig. 1b. Ellipse separates positive data from possible negative data
You may have imagined something like Figure 1b: the data enclosed in an ellipse. In fact, this is exactly how many anomaly detection methods work.
Now let’s change the problem a little: suppose we additionally know that a straight line should separate the positive points from the negative ones. Try to draw it in your mind.



Fig. 1c. Straight lines separate positive data from possible negative data (the lines are predicted by the One-Class SVM anomaly detection algorithm)
With a straight dividing line there is no single answer: it is not clear in which direction to draw it.
Now, in Figure 1d, I will add some unlabeled points that contain a mixture of positives and negatives. For the last time, I will ask you to make a mental effort and imagine a line separating the positive and negative points. But this time you can use the unlabeled data!

Fig. 1d. Positive (blue) and unlabeled (red) points. The unlabeled points are a mixture of positives and negatives

Fig. 1e. The straight line separates positive points from negative points based on the unlabeled data
It became easier! Although we do not know in advance how each particular unlabeled point was generated, we can roughly label them by comparing them with the positive ones. Red points that look like the blue ones are probably positive; conversely, dissimilar ones are probably negative. As a result, even though we have no “clean” negative data, information about it can be extracted from the unlabeled mixture and used for a more accurate classification. This is exactly what the Positive-Unlabeled learning algorithms I want to talk about do.
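To make the preamble concrete, here is a minimal toy sketch (my own illustration, not the code behind the figures): it generates positive and unlabeled 2D points similar to Fig. 1d and fits the simplest PU baseline, a linear classifier that naively treats every unlabeled point as negative.

```python
# A minimal toy sketch of the preamble: positive vs. unlabeled 2D points.
# The "naive" PU baseline below simply treats unlabeled points as negative;
# the blob parameters are arbitrary and serve only as an illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Positive points: one Gaussian blob (analogous to the blue points in Fig. 1d).
positive = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))

# Unlabeled points: a mixture of the same positive blob and a negative blob
# (analogous to the red points in Fig. 1d).
unlabeled = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(150, 2)),  # hidden positives
    rng.normal(loc=[2.5, 2.5], scale=0.5, size=(150, 2)),  # hidden negatives
])

# Naive PU baseline: labeled positives = 1, unlabeled = 0.
X = np.vstack([positive, unlabeled])
y = np.concatenate([np.ones(len(positive)), np.zeros(len(unlabeled))])

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)  # roughly the straight line from Fig. 1e
```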
Introduction
Positive-Unlabeled (PU) learning means, quite literally, learning from positive and unlabeled data. In essence, PU learning is an analogue of binary classification for the case when labeled data is available for only one of the classes, while an unlabeled mixture of data from both classes is also at hand. In general, we do not even know what fraction of the mixture corresponds to the positive class and what fraction to the negative one. From such a dataset we want to build a binary classifier: the same kind we would build if we had clean data of both classes.
As a toy example, consider a classifier for pictures of cats and dogs. We have some pictures of cats and many more pictures of both cats and dogs. As output, we want a classifier: a function that predicts for each picture the probability that it depicts a cat. Such a classifier can also label our existing mixture of cat and dog pictures. At the same time, manually labeling pictures to train an ordinary classifier can be difficult, expensive, time-consuming, or outright impossible, and we would like to do without it.
PU learning is a fairly natural task. Data found in the real world is often dirty. The ability to learn from dirty data while keeping relatively high quality looks like a useful life hack. Despite this, I have met surprisingly few people who have heard about PU learning, and even fewer who know about any specific methods. So I decided to write about this area.

Fig. 2. Jürgen Schmidhuber and Yann LeCun, NeurIPS 2020
Applications of PU learning
I will informally divide the cases in which PU learning can be useful into three categories.
The first category is perhaps the most obvious: PU learning can be used instead of ordinary binary classification. In some tasks the data is inherently slightly dirty. In principle, this contamination can be ignored and a conventional classifier used. But you can instead slightly modify the loss function during training (Kiryo et al., 2017), or even correct the predictions after training (Elkan & Noto, 2008), and the classification will become more accurate.
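To make the “modified loss” idea concrete, here is a minimal sketch in the spirit of the non-negative PU risk from (Kiryo et al., 2017). It is only an illustration under simplifying assumptions: the positive class prior pi_p is taken as known, the sigmoid surrogate loss is one common choice rather than the only one, and clipping the negative risk at zero is a simplification of the paper’s full update rule.

```python
# A minimal sketch of a non-negative PU risk in the spirit of (Kiryo et al., 2017).
# pi_p (the positive class prior in the unlabeled data) is assumed known here;
# in practice it has to be estimated (see the final remarks).
import torch

def sigmoid_loss(scores, target_sign):
    # l(z, y) = sigmoid(-y * z), a common surrogate loss in PU learning papers.
    return torch.sigmoid(-target_sign * scores)

def nn_pu_risk(scores_pos, scores_unl, pi_p):
    """scores_pos: model outputs on labeled positives,
       scores_unl: model outputs on unlabeled points."""
    risk_p_pos = sigmoid_loss(scores_pos, +1.0).mean()  # positives treated as positive
    risk_p_neg = sigmoid_loss(scores_pos, -1.0).mean()  # positives treated as negative
    risk_u_neg = sigmoid_loss(scores_unl, -1.0).mean()  # unlabeled treated as negative
    # The negative-class risk estimate is clipped at zero so it cannot become
    # negative due to overfitting (the "non-negative" correction).
    risk_neg = torch.clamp(risk_u_neg - pi_p * risk_p_neg, min=0.0)
    return pi_p * risk_p_pos + risk_neg

# Usage: scores = model(x); loss = nn_pu_risk(scores[is_pos], scores[~is_pos], pi_p=0.4)
```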
As an example, consider the identification of new genes responsible for the development of diseases (disease genes). The standard approach is to treat the disease genes that have already been found as positive and all other genes as negative. Obviously, disease genes that have not yet been found may be present in this “negative” data. Moreover, the task itself is to find such disease genes among the “negative” data. This internal contradiction was noticed in (Yang et al., 2012). The researchers departed from the standard approach, treated the genes not yet linked to any disease as an unlabeled mixture, and then applied PU learning methods.
Another example is reinforcement learning from expert demonstrations. The challenge is to train an agent to act the same way as an expert. This can be achieved with generative adversarial imitation learning (GAIL). GAIL uses a GAN-like (generative adversarial network) architecture: the agent generates trajectories such that the discriminator (classifier) cannot distinguish them from the expert’s demonstrations. Recently, researchers from DeepMind noticed that in GAIL the discriminator actually solves a PU learning problem (Xu & Denil, 2019). Usually, the discriminator treats the expert data as positive and the generated data as negative. This approximation is accurate enough at the beginning of training, when the generator is incapable of producing data that looks positive. However, as training progresses, the generated data looks more and more like the expert data, until it becomes indistinguishable to the discriminator at the end of training. For this reason, (Xu & Denil, 2019) treat the generated data as unlabeled data with a variable mixing proportion. Later, GANs for image generation were modified in a similar way (Guo et al., 2020).
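Here is a hedged sketch of what such a PU-style discriminator objective could look like, reusing the same non-negative correction as in the previous snippet. The mixing proportion eta (the fraction of generated data assumed to already look expert-like) and the binary cross-entropy formulation are my illustrative choices, not the exact losses from (Xu & Denil, 2019).

```python
# Sketch of a PU-style discriminator loss for a GAIL/GAN setting: expert data
# is positive, generated data is an unlabeled mixture in which a (growing)
# fraction eta is assumed to already be expert-like.
import torch
import torch.nn.functional as F

def pu_discriminator_loss(d_expert, d_generated, eta):
    """d_expert, d_generated: discriminator logits; eta in [0, 1]."""
    ones = torch.ones_like(d_expert)
    zeros_g = torch.zeros_like(d_generated)
    zeros_e = torch.zeros_like(d_expert)

    loss_pos = F.binary_cross_entropy_with_logits(d_expert, ones)            # expert as positive
    loss_unl_neg = F.binary_cross_entropy_with_logits(d_generated, zeros_g)  # generated as negative
    loss_pos_neg = F.binary_cross_entropy_with_logits(d_expert, zeros_e)     # expert as negative

    # Correct the "generated = negative" term for the eta fraction of
    # expert-like samples hiding in the generated batch, clipped at zero.
    neg_term = torch.clamp(loss_unl_neg - eta * loss_pos_neg, min=0.0)
    return eta * loss_pos + neg_term
```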

Fig. 3. PU learning as an analogue of the standard PN classification
In the second category, PU learning can be used for anomaly detection as an analogue of one-class classification (OCC). We already saw in the Preamble exactly how unlabeled data can help. All OCC methods, without exception, are forced to make an assumption about the distribution of the negative data. For example, it is reasonable to enclose the positive data in an ellipse (a hypersphere in the multidimensional case) outside of which all points are considered negative. In doing so, we implicitly assume that the negative data is distributed uniformly (Blanchard et al., 2010). Instead of making such assumptions, PU learning methods can estimate the distribution of the negative data from the unlabeled data. This is especially important when the distributions of the two classes overlap strongly (Scott & Blanchard, 2009). One example of anomaly detection with PU learning is fake review detection (Ren et al., 2014).
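For comparison, here is a minimal sketch of the one-class approach described above, using scikit-learn’s EllipticEnvelope to “circle” the positive data with an ellipse; the contamination value is an arbitrary guess, and, unlike a PU method, no unlabeled data is used to shape the boundary.

```python
# One-class baseline from the text: fit an ellipse around the positive data
# and call everything outside it negative. No unlabeled data is used, so the
# boundary implicitly assumes negatives are spread out around it.
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
positive = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))

occ = EllipticEnvelope(contamination=0.05).fit(positive)  # contamination is a guess

new_points = np.array([[0.1, -0.2], [3.0, 3.0]])
print(occ.predict(new_points))  # +1 = inside the ellipse (positive), -1 = outside (anomaly)
```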

Fig. 4. An example of a fake review
Detecting corruption in Russian public procurement auctions
In the third category, PU learning can be applied to tasks in which neither binary nor one-class classification can be used. As an example, I will describe our project on detecting corruption in Russian public procurement auctions (Ivanov & Nesterov, 2019).
According to the legislation, complete data on all public procurement auctions is publicly available to anyone willing to spend a month parsing it. We collected data on more than a million auctions held from 2014 to 2018. Who bid, when and how much, who won, how long the auction lasted, who organized it, what was being purchased: all of this is in the data. But, of course, there is no “corruption happened here” label, so an ordinary classifier cannot be built. Instead, we applied PU learning. The basic assumption: a participant with an unfair advantage will always win. Under this assumption, the losers of the auctions can be treated as fair (positive), and the winners as potentially dishonest (unlabeled). Set up this way, PU learning methods can find suspicious patterns in the data based on subtle differences between winners and losers. Of course, difficulties arise in practice: careful feature design for the classifier, analysis of the interpretability of its predictions, and statistical verification of the assumptions about the data are all required.
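Purely for illustration, the labeling scheme could be set up along these lines; the column names are hypothetical and do not correspond to the actual dataset from (Ivanov & Nesterov, 2019).

```python
# Hypothetical sketch of the PU labeling scheme described above; the column
# names ("auction_id", "is_winner", ...) are made up for illustration only.
import pandas as pd

bids = pd.DataFrame({
    "auction_id": [1, 1, 2, 2, 2],
    "participant": ["A", "B", "C", "D", "E"],
    "is_winner":   [True, False, True, False, False],
})

# Losers are treated as fair (labeled positive), winners as unlabeled,
# following the assumption that an unfairly advantaged participant always wins.
bids["pu_label"] = bids["is_winner"].map({False: "positive", True: "unlabeled"})
print(bids)
```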
According to our very conservative estimate, about 9% of the auctions in the data are corrupt, as a result of which the state loses about 120 million rubles a year. The losses may not seem very large, but the auctions we study make up only about 1% of the public procurement market.


Fig. 5. The share of corrupt public procurement auctions in different regions of Russia (Ivanov & Nesterov, 2019)
Final remarks
In order not to create the impression that PU learning is the solution to all of humanity’s troubles, I will mention the pitfalls. In ordinary classification, the more data we have, the more accurately we can build a classifier; by increasing the amount of data to infinity, we can approach the ideal (Bayes) classifier. The main catch of PU learning is that it is an ill-posed problem, that is, a problem that cannot be solved unambiguously even with an infinite amount of data. The situation improves if the proportion of the two classes in the unlabeled data is known, but determining this proportion is itself an ill-posed problem (Jain et al., 2016). The best we can determine is an interval that the proportion lies in. Moreover, PU learning methods often do not offer a way to estimate this proportion and simply assume it is known. There are separate methods that estimate it (the task is called Mixture Proportion Estimation), but they are often slow and/or unstable, especially when the two classes are very unevenly represented in the mixture.
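For completeness, here is a hedged sketch of one classical estimate in the spirit of (Elkan & Noto, 2008), which works under their assumption that the labeled positives are selected completely at random: train a classifier to distinguish labeled from unlabeled points, average its output over held-out labeled positives to estimate the label frequency, and derive the class prior from it. The logistic regression choice and the helper names are mine.

```python
# Sketch of the label-frequency estimate from (Elkan & Noto, 2008), under their
# "labeled positives are selected completely at random" assumption. The choice
# of logistic regression and the variable names are for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def estimate_class_prior(X, s):
    """X: feature array; s: numpy array, 1 if the point is labeled positive, 0 if unlabeled."""
    X_tr, X_val, s_tr, s_val = train_test_split(X, s, test_size=0.3, random_state=0)
    # g(x) approximates p(s = 1 | x): "is this point labeled?"
    g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
    # c = p(s = 1 | y = 1) is estimated by averaging g over held-out labeled positives.
    c = g.predict_proba(X_val[s_val == 1])[:, 1].mean()
    # Then p(y = 1) = p(s = 1) / c.
    return s.mean() / c

# Usage: pi_hat = estimate_class_prior(X, s)
```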
In this post, I talked about the intuitive definition and applications of PU learning. I can also cover the formal definition of PU learning, with formulas and their explanations, as well as classical and modern PU learning methods. If this post arouses interest, I will write a sequel.