- Multidimensional data – what is it?
- Why visualize it, and what can we learn from the visualization?
- How can you reduce the dimension while preserving the main structure of the data, and which properties should you take into account when doing so?
At the Innopolis ML Meetup, we discussed the answers to these questions with Elizaveta Batanina, ML Engineer at Provectus. Before starting work, she studied Data Science and completed courses in Machine Learning, Deep Learning and Computer Vision.
So let’s start simple: What is Data?
In machine learning, data is represented as arrays of numbers, for example the pixels of an image. It is convenient to treat these arrays as vectors in a multidimensional space, so that a dataset becomes a cloud of points in a high-dimensional space. Unfortunately, in the vast majority of machine learning problems the dimension is much larger than 3; usually it is several hundred. It is very hard for a person to imagine a 512-dimensional space, so we would like to reduce the dimension. But before we do that, let’s figure out why you need to visualize data at all.
Visualization is part of exploring the training data. It shows which classes are well separated and which are not. It also helps to spot samples that are poorly or incorrectly labeled, or that are better removed from the training set.
It is useful to visualize not only training data but also production data. If the production data shifts a little, that is, data drift occurs, the boundaries that the model uses to separate the classes will no longer be correct, and we will get very different predictions. This is especially relevant for seasonal data.
Another reason to visualize data is to find outliers. For example, suppose something went wrong in the backend, and the review classification model receives not customer reviews but words from sales receipts intended for a completely different model. The input still consists of words, but the domain is different, and some statistical metrics may not reveal the difference.
With the motivation sorted out, the question arises: how do we reduce the dimension from several hundred to 2D or 3D so that the main structure of the data is preserved?
Of course, answers to this question already exist: there are many algorithms, including linear ones. However, for data of very high dimension, linear algorithms do not work well, since it is very difficult to interpret the different dimensions. Therefore, nonlinear dimensionality reduction algorithms, also known as manifold learning algorithms, are used. These include MDS, t-SNE, LargeVis, UMAP, TriMap and others.
With this variety of algorithms and hyperparameters, projections of the same data onto a lower dimension can look completely different, and each one tells a different story. A good example is projecting the Earth onto a plane. The map that we all know from school geography lessons is heavily distorted toward the edges, and in reality Russia is not as big as our teachers told us.
How can you avoid this? The answer is simple: introduce a metric that shows how faithful the projection of the data onto the lower-dimensional space is. Based on this metric, we can pick the hyperparameters that we need.
Projection quality metrics
Before choosing a metric, you need to understand which properties of the projection matter to us.
First, we would like the visualization, or projection, to display the global structure as clearly as possible. To assess the global structure, a metric called the Global Score was proposed.
Its authors argue that principal component analysis (PCA) can be taken as an ideal method for preserving the global structure, since PCA uses nothing other than the global structure.
If we take PCA as a benchmark, we can compare all other algorithms with it. How? We calculate the error of reconstructing the original high-dimensional data from the low-dimensional projection, and we calculate the same error for the data reduced with PCA. From these two errors we compute the Global Score: it reaches its maximum for the PCA projection, and all other methods score somewhat lower.
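The comparison with PCA can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the reconstruction error is taken as the mean squared error of the best linear map from the projection back to the original space, and the exponential form of the score is an assumption based on how the metric is usually defined, since the talk does not spell out the formula. The function names `reconstruction_error`, `pca_project` and `global_score` are hypothetical.

```python
import numpy as np

def reconstruction_error(X, Y):
    """Mean squared error of the best linear map from the projection Y back to X."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Least-squares solution A of Yc @ A ~= Xc
    A, *_ = np.linalg.lstsq(Yc, Xc, rcond=None)
    return float(np.mean((Xc - Yc @ A) ** 2))

def pca_project(X, n_components):
    """Project X onto its first principal components (plain SVD-based PCA)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def global_score(X, Y):
    """Score in (0, 1]; equals 1 when Y reconstructs X as well as PCA does."""
    err_pca = reconstruction_error(X, pca_project(X, Y.shape[1]))
    err_emb = reconstruction_error(X, Y)
    # Assumed form: exponential of the relative excess error over PCA
    return float(np.exp(-(err_emb - err_pca) / err_pca))
```

Because PCA gives the best linear reconstruction for a given number of dimensions, no projection can score above 1 under this definition.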
The second property is the local structure, that is, how well the relationships between points within small neighborhoods are preserved. What can we use for this? The most straightforward method is to run KNN on the reduced-dimension data and choose the projection that gives the highest test accuracy. The problem is that this method is supervised: we must have access to the labels of our data. In addition, KNN favors visualizations that separate the clusters, and the stronger the separation, the higher the score.
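A minimal sketch of the KNN check, assuming Euclidean distance and majority voting. To stay self-contained it uses leave-one-out accuracy instead of a separate test split, which is a simplification rather than the procedure from the talk; `knn_accuracy` is a hypothetical helper name.

```python
import numpy as np

def knn_accuracy(Y, labels, k=5):
    """Leave-one-out k-nearest-neighbor accuracy computed in the projection Y."""
    labels = np.asarray(labels)
    diff = Y[:, None, :] - Y[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point must not vote for itself
    neighbors = np.argsort(dist, axis=1)[:, :k]
    correct = 0
    for i, idx in enumerate(neighbors):
        # Majority vote among the k nearest neighbors
        values, counts = np.unique(labels[idx], return_counts=True)
        correct += int(values[np.argmax(counts)] == labels[i])
    return correct / len(labels)
```

Running this for each candidate projection and comparing the results gives the ranking described above, with the caveat that labels are required.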
Another straightforward method is to calculate Sammon’s error. The idea is very simple: for each pair of points we calculate the distance between them in the original space and in the reduced space, and measure the difference between these distances. Ideally, the distances in the reduced space are the same as in the original.
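As an illustration, here is a small sketch of Sammon's error using the standard stress formula, where each squared difference is weighted by the original distance. It assumes Euclidean distances and no duplicate points in the original data; the helper names are hypothetical.

```python
import numpy as np

def pairwise_distances(Z):
    """Full matrix of Euclidean distances between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_error(X, Y):
    """Sammon's stress between original data X and its projection Y."""
    iu = np.triu_indices(len(X), k=1)  # count each pair of points once
    d_high = pairwise_distances(X)[iu]
    d_low = pairwise_distances(Y)[iu]
    # Assumes d_high > 0 everywhere, i.e. no duplicate points in X
    return float(np.sum((d_high - d_low) ** 2 / d_high) / np.sum(d_high))
```

An error of 0 means that all pairwise distances are preserved exactly; larger values mean stronger distortion.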
The problem with this method is that it does not take into account the curse of dimensionality: in high dimensions, the distances between points are much greater than in low dimensions.
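The effect is easy to reproduce on synthetic data. The sketch below, which assumes points drawn uniformly from the unit cube, compares the average pairwise distance in 2 and in 512 dimensions; the sample size and the dimensions are arbitrary choices for illustration.

```python
import numpy as np

def mean_pairwise_distance(n_points, dim, seed=0):
    """Average Euclidean distance between random points in the unit cube."""
    rng = np.random.default_rng(seed)
    Z = rng.random((n_points, dim))
    # Squared distances via the Gram matrix, to avoid a large 3D array
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    d = np.sqrt(np.clip(d2, 0.0, None))
    iu = np.triu_indices(n_points, k=1)
    return float(d[iu].mean())

low = mean_pairwise_distance(200, 2)
high = mean_pairwise_distance(200, 512)
print(f"mean distance in 2D: {low:.2f}, in 512D: {high:.2f}")
```

For uniform coordinates the expected squared distance grows linearly with the dimension, so raw distances in the two spaces are not directly comparable.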
An alternative is to take into account not the distance as a number but the rank, that is, the order of the points by distance to a selected sample. We want the neighborhood of each point after dimensionality reduction to be exactly the same as in the original space. At the same time, we penalize points that intrude into a local neighborhood to which they do not belong.
For this, we introduce a metric called Trustworthiness. It measures, for any chosen sample, whether its nearest points in the projection are also its nearest points in the original space.
The complementary metric is Discontinuity (usually called continuity in the literature), which checks that points from a local neighborhood do not end up too far away, at some completely different edge of the visualization.
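Both rank-based metrics can be sketched with NumPy as follows. The normalization follows the commonly used definition of trustworthiness and assumes that the neighborhood size `k` is less than half the number of samples; continuity is obtained here by swapping the roles of the two spaces. The function names are hypothetical, and scikit-learn ships a ready-made `sklearn.manifold.trustworthiness` that can be used instead.

```python
import numpy as np

def _ranks(Z):
    """ranks[i, j] = position of j when points are sorted by distance to i (self is 0)."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, -1.0)  # force every point to rank first for itself
    order = np.argsort(dist, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(len(Z))[:, None]
    ranks[rows, order] = np.arange(len(Z))[None, :]
    return ranks

def trustworthiness(X, Y, k=5):
    """1.0 when no point enters a k-neighborhood in Y that it was outside of in X."""
    n = len(X)
    rank_x, rank_y = _ranks(X), _ranks(Y)
    # Intruders: among the k nearest in the projection, but ranked beyond k originally
    intruders = (rank_y >= 1) & (rank_y <= k) & (rank_x > k)
    penalty = np.sum((rank_x - k)[intruders])
    return float(1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1)))

def continuity(X, Y, k=5):
    """1.0 when no original k-neighbor is pushed out of the neighborhood in Y."""
    return trustworthiness(Y, X, k)
```

Both values lie between 0 and 1, and higher is better; reporting them side by side shows whether a projection creates false neighbors, tears true neighbors apart, or both.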
Another metric, IMD, is interesting in that it works on both the global structure and the local structure. It is used to compare manifolds with each other, and also to compare the same manifold in different dimensions (the original and the reduced one).
Stability is a less obvious property that also needs to be taken into account. When we run an algorithm on our data several times, approximately the same picture should come out. To assess stability, the Procrustes distance is used.
For example, we have initial data in three-dimensional space. We project all of it and get a projection Y. Then we project not all of the data but only a sample, and get a second projection Y'.
Then we try to map Y' onto the original projection. This problem is called the Orthogonal Procrustes Problem. We use Procrustes transformations: translation, rotation and uniform scaling. The goal is to understand how similar or how different the two projections are.
Next, we find the optimal transformation and measure the distance between each pair of corresponding points; the average of these distances over the dataset is the Procrustes distance.
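The alignment step can be sketched with a singular value decomposition. The version below is an assumption-laden illustration: both projections are centered and scaled to unit norm first, the optimal orthogonal map may include a reflection, and the two inputs must contain the same points in the same order (for a subsample, pass only the matching rows of the full projection). `procrustes_distance` is a hypothetical helper name.

```python
import numpy as np

def procrustes_distance(Y_ref, Y_other):
    """Mean distance between corresponding points after optimal alignment."""
    # Translation: move both point clouds to the origin
    A = Y_ref - Y_ref.mean(axis=0)
    B = Y_other - Y_other.mean(axis=0)
    # Scaling: normalize both clouds to unit Frobenius norm
    A = A / np.linalg.norm(A)
    B = B / np.linalg.norm(B)
    # Rotation (or reflection): solve the orthogonal Procrustes problem via SVD
    U, s, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    B_aligned = s.sum() * (B @ R)  # s.sum() is the optimal uniform scale
    return float(np.mean(np.linalg.norm(A - B_aligned, axis=1)))
```

A value close to 0 means that the two projections differ only by translation, rotation and scale, that is, the algorithm is stable on this data.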
The example below shows very clearly how stability affects the display of data. We visualize one dataset using UMAP and t-SNE. With UMAP, the visualization turned out to be quite stable: you can see approximately the same structure of the manifold and approximately the same clusters.
Thus, we have identified several metrics that can be used in practice to assess a visualization: Global Score, KNN accuracy, Trustworthiness & Discontinuity, IMD.
The material was prepared together with the IT library Innevia.