Why and how to combine Gaussian Mixture and DBSCAN?

Clustering is one of the most common applications of unsupervised machine learning. The clustering task arises when you need to divide data (objects, each described by a certain set of parameters/coordinates) into groups of objects that are similar or close to each other.

There are many kinds of clustering methods, but two main families are worth highlighting: metric methods, based on the notion of distance between objects, and model-based methods, which build a model of the distribution of object parameters in the parameter space.

Another common way to divide clustering methods is into flat methods (which produce a single partition into clusters) and hierarchical methods (which produce a whole family of nested partitions). Hierarchical methods, in turn, are mainly divided into agglomerative and divisive ones.

Each clustering method is based on some model of the mutual arrangement of objects (although this model is not always stated explicitly). Therefore, applying different clustering methods to the same data, you will get different partitions of the same objects into clusters. An interesting (and often cited) comparison picture can be seen here: https://scikit-learn.org/stable/modules/clustering.html.

Among the simplest and most intuitive are two flat algorithms: Gaussian Mixture and DBSCAN. Let's look at them in a bit more detail to understand why they are used so often, what their pros and cons are, and why I wanted to combine them.

Most clustering methods have at least one hyperparameter that determines how the algorithm works. By changing the hyperparameter value, you can control the number and shape of the resulting clusters. One of the most common hyperparameters is the number of clusters.

There are various methods that can be used to find the value of the clustering hyperparameter. You can simply choose its value based on your understanding of what objects and how many types there are in your dataset. This is called an external criterion or using an expert. Or you can choose the hyperparameter value in such a way as to optimize (minimize or maximize) the value of a certain function – the clustering quality metric. There are also many different quality metrics, and each of them is based on a certain model of the mutual arrangement of objects. Optimization of the clustering quality metric is the so-called internal criterion for assessing quality.

Distance-based methods: DBSCAN+Silhouette

The basis of the DBSCAN method is the distance between objects (points in N-dimensional space). The scikit-learn library provides two interfaces to this algorithm – the DBSCAN estimator class and the lower-level dbscan function – and both can work either on the coordinates of the objects or on a precomputed matrix of distances between them (metric='precomputed').
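
A quick illustration of the two call styles (the toy data here is arbitrary and only serves to show the signatures):

import numpy as np
from sklearn.cluster import DBSCAN, dbscan
from sklearn.metrics import pairwise_distances
X=np.random.rand(100,2) # toy data, just to show the calls
labels_a=DBSCAN(eps=0.2).fit_predict(X) # estimator class on raw coordinates
D=pairwise_distances(X) # pairwise Euclidean distance matrix
labels_b=DBSCAN(eps=0.2,metric='precomputed').fit_predict(D) # same algorithm on a distance matrix
core_idx,labels_c=dbscan(X,eps=0.2) # functional interface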

One of the most common metrics is the Silhouette coefficient. Roughly speaking, it compares how close each point is to the points of its own cluster with how far it is from the points of the nearest neighboring cluster. The more compact the clusters are relative to the gaps between them, the closer the silhouette is to one. It usually works well when the clusters are spherical (i.e. the linear size of a cluster does not depend on the direction in which it is measured) or when the cluster sizes are significantly smaller than the distances between clusters (so that the clusters can be enclosed in non-intersecting spheres).
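
For reference, the silhouette of a single point is s = (b - a) / max(a, b), where a is the mean distance from the point to the other points of its own cluster and b is the mean distance to the points of the nearest neighboring cluster; the overall score returned by silhouette_score is the average of s over all points.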

Let's see how DBSCAN+Silhouette works:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN 
S,N=1000,5
X,Y=make_blobs(n_samples=S, n_features=2, centers=N, cluster_std=0.3, random_state=42)
fig,axs=plt.subplots(1,3,figsize=(7,2.5),constrained_layout=True)
axs[0].set_title('source clusters')
axs[0].scatter(X[:,0],X[:,1],c=Y,s=1)
arr=[]
for eps in np.linspace(0.1,10,100):
 clf=DBSCAN(eps=eps)
 Yp=clf.fit_predict(X)
 if np.unique(Yp).shape[0]>1: # silhouette_score needs at least two distinct labels
  ssc=silhouette_score(X,Yp)
  arr.append([eps,ssc])
arr=np.array(arr)
axs[1].set_title('Silhouette')
axs[1].set_xlabel('eps')
axs[1].set_ylabel('Silhouette')
axs[1].plot(arr[:,0],arr[:,1])
eps=arr[np.argmax(arr[:,1]),0]
clf=DBSCAN(eps=eps)
Yp=clf.fit_predict(X)
axs[2].set_title('Silh-optimal clusters')
axs[2].scatter(X[:,0],X[:,1],c=Yp,s=1);

As a result we get the following picture:

Fig. 1. The case with isolated clusters

On the left are the original dataset points (colors indicate the different clusters), in the center is the dependence of the silhouette coefficient on eps (the DBSCAN clustering hyperparameter), and on the right is the clustering result for the optimal eps value corresponding to the maximum of the silhouette coefficient. Clearly, the clustering separates isolated clusters quite well.

Let's complicate the task: increase the size of the clusters (replace cluster_std=0.3 with cluster_std=0.7 in the make_blobs call) and run the code again:

Fig.2. DBSCAN+Silhouette case with touching clusters

It can be seen that the two closely located (touching) clusters are no longer separated.

This problem can be demonstrated even more effectively by adding noise:

X,Y=make_blobs(n_samples=S, n_features=2, centers=N, cluster_std=0.3, random_state=42)
Xn=(np.random.rand(S,2)-0.5)*(X.max()+abs(X.min())) # uniform noise spanning roughly the data range
Yn=np.array([Y.max()+1]*S) # noise points get a separate label for plotting
X=np.concatenate([X,Xn])
Y=np.concatenate([Y,Yn])
fig,axs=plt.subplots(1,3,figsize=(7,2.5),constrained_layout=True)
axs[0].set_title('source clusters')
axs[0].scatter(X[:,0],X[:,1],c=Y,s=1)
arr=[]
for eps in np.linspace(0.1,10,100):
 clf=DBSCAN(eps=eps)
 Yp=clf.fit_predict(X)
 if np.unique(Yp).shape[0]>1:
  ssc=silhouette_score(X,Yp)
  arr.append([eps,ssc])
arr=np.array(arr)
axs[1].set_title('Silhouette')
axs[1].set_xlabel('eps')
axs[1].set_ylabel('Silhouette')
axs[1].plot(arr[:,0],arr[:,1])
eps=arr[np.argmax(arr[:,1]),0]
clf=DBSCAN(eps=eps)
Yp=clf.fit_predict(X)
axs[2].set_title('Silh-optimal clusters')
axs[2].scatter(X[:,0],X[:,1],c=Yp,s=1);

In this case, the clusters are not separated at all:

Fig.3. DBSCAN+Silhouette case with noise

Methods based on the distribution model: GM+BIC

The basis of the GM (Gaussian Mixture) method is a model of the distribution of points within a single cluster: the multidimensional normal (i.e. Gaussian) distribution. The model assumes that the probability density of points in space is described by a weighted sum of normal distributions, each taken with its own weight and its own parameters (center, widths, orientation of the axes). Each such Gaussian distribution in a multidimensional coordinate system can intuitively be pictured as an ellipsoid.
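
In formula form, the mixture density is p(x) = \sum_{k=1}^{K} w_k \mathcal{N}(x \mid \mu_k, \Sigma_k), where the weights w_k sum to one and each component has its own center \mu_k and covariance matrix \Sigma_k, which defines the widths and orientation of the corresponding ellipsoid.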

One of the common metrics of the quality of a Gaussian mixture is the BIC (Bayesian information criterion). Essentially, it is built on the negative log-likelihood – the total probability with which the model describes all the experimental points (the higher this probability, the lower the BIC) – so the minimum of this function corresponds to the best model. Obviously, if the number of model parameters is comparable to the number of objects, such a model can describe the objects with arbitrary accuracy. Therefore, the BIC criterion has an additive supplement (penalty): the more parameters the model has, the greater the penalty for using it. This supplement can be interpreted as Occam's razor: it does not allow the number of model parameters to be increased unnecessarily. The penalty can be chosen in various ways, and there is even a scientific article [?] where it is proved that, under certain constraints, all such penalties lead to the same optimal number of clusters.
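
For reference, the standard form of the criterion (the one computed by GaussianMixture.bic in scikit-learn) is BIC = k \ln n - 2 \ln \hat{L}, where \hat{L} is the maximized likelihood of the model, n is the number of points, and k is the number of free parameters of the mixture; the k \ln n term is exactly the penalty discussed above.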

Let's see how GM+BIC works:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
from sklearn.mixture import GaussianMixture 
S,N=1000,5
X,Y=make_blobs(n_samples=S, n_features=2, centers=N, cluster_std=0.3, random_state=42)
fig,axs=plt.subplots(1,3,figsize=(7,2.5),constrained_layout=True)
axs[0].set_title('source clusters')
axs[0].scatter(X[:,0],X[:,1],c=Y,s=1)
arr=[]
for n_cl in range(1,20):
 clf=GaussianMixture(n_components=n_cl)
 clf.fit(X)
 bic=clf.bic(X)
 arr.append([n_cl,bic])
arr=np.array(arr)
axs[1].set_title('BIC')
axs[1].set_xlabel('n_components')
axs[1].set_ylabel('BIC')
axs[1].plot(arr[:,0],arr[:,1])
n_cl=arr[np.argmin(arr[:,1]),0]
clf=GaussianMixture(n_components=int(n_cl))
Yp=clf.fit_predict(X)
axs[2].set_title('BIC-optimal clusters')
axs[2].scatter(X[:,0],X[:,1],c=Yp,s=1)

The GM hyperparameter is the number of clusters, and in the case of simple isolated clusters we get a good result:

Fig.4. GM+BIC case with isolated clusters

In the case of touching clusters or noisy data, it also gives good results:

Fig.5. GM+BIC case with adjacent and noisy clusters

If GM is so stable, why not use it all the time?

Because it only works with elliptical clusters. If the clusters are not elliptical, problems begin. Let's generate a more complex dataset:

def MakeCircles():
 # three concentric structures: an inner disk (r2), a middle ring (r) and an outer ring (r3)
 NN=1000
 r=np.random.rand(NN)*10+50
 r2=np.random.rand(NN)*10
 r3=np.random.rand(NN)*10+80
 phi=np.random.rand(NN)*6.28
 phi2=np.random.rand(NN)*6.28
 phi3=np.random.rand(NN)*6.28
 X1=np.zeros((NN,2))
 Y1=np.zeros(NN)
 X1[:,0]=r*np.cos(phi)
 X1[:,1]=r*np.sin(phi)
 Y1[:]=1

 X2=np.zeros((NN,2))
 Y2=np.zeros(NN)
 X2[:,0]=r2*np.cos(phi2)
 X2[:,1]=r2*np.sin(phi2)
 Y2[:]=0

 X3=np.zeros((NN,2))
 Y3=np.zeros(NN)
 X3[:,0]=r3*np.cos(phi3)
 X3[:,1]=r3*np.sin(phi3)
 Y3[:]=0

 X=np.concatenate([X1,X2,X3],axis=0)
 Y=np.concatenate([Y1,Y2,Y3],axis=0)
 idx=np.array(range(X.shape[0]))
 np.random.shuffle(idx)
 X=X[idx]
 Y=Y[idx]

 # add uniform noise over the bounding box of the data, with its own label
 Nn=300
 Xn1=np.random.rand(Nn,1)*(X[:,0].max()-X[:,0].min())+X[:,0].min()
 Xn2=np.random.rand(Nn,1)*(X[:,1].max()-X[:,1].min())+X[:,1].min()
 Xn=np.concatenate([Xn1,Xn2],axis=1)
 Yn=np.zeros(Nn)
 Yn[:]=Y.max()+1
 X=np.concatenate([X,Xn],axis=0)
 Y=np.concatenate([Y,Yn],axis=0)
 return X,Y

And let's try to cluster it:

S,N=1000,5
X,Y=MakeCircles()
fig,axs=plt.subplots(1,3,figsize=(7,2.5),constrained_layout=True)
axs[0].set_title('source clusters')
axs[0].scatter(X[:,0],X[:,1],c=Y,s=1)
arr=[]
for n_cl in range(1,30):
 clf=GaussianMixture(n_components=n_cl)
 clf.fit(X)
 bic=clf.bic(X)
 arr.append([n_cl,bic])
arr=np.array(arr)
axs[1].set_title('BIC')
axs[1].set_xlabel('n_components')
axs[1].set_ylabel('BIC')
axs[1].plot(arr[:,0],arr[:,1])
n_cl=arr[np.argmin(arr[:,1]),0]
clf=GaussianMixture(n_components=int(n_cl))
Yp=clf.fit_predict(X)
axs[2].set_title('BIC-optimal clusters')
axs[2].scatter(X[:,0],X[:,1],c=Yp,s=1)

We get a very strange clustering:

Fig.6 GM+BIC case – nested rings with noise

Because the shape of the clusters (rings) differs from the model (ellipsoids), we ended up with more than 20 clusters.

The result of DBSCAN+Silhouette is no better:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN 
S,N=1000,5
X,Y=MakeCircles()
fig,axs=plt.subplots(1,3,figsize=(7,2.5),constrained_layout=True)
axs[0].set_title('source clusters')
axs[0].scatter(X[:,0],X[:,1],c=Y,s=1)
arr=[]
for eps in np.linspace(0.1,10,100):
 clf=DBSCAN(eps=eps)
 Yp=clf.fit_predict(X)
 if np.unique(Yp).shape[0]>1:
  ssc=silhouette_score(X,Yp)
  arr.append([eps,ssc])
arr=np.array(arr)
axs[1].set_title('Silhouette')
axs[1].set_xlabel('eps')
axs[1].set_ylabel('Silhouette')
axs[1].plot(arr[:,0],arr[:,1])
eps=arr[np.argmax(arr[:,1]),0]
clf=DBSCAN(eps=eps)
Yp=clf.fit_predict(X)
axs[2].set_title('Silh-optimal clusters')
axs[2].scatter(X[:,0],X[:,1],c=Yp,s=1);

But here the problem is caused by the noise and by the poor performance of the silhouette on nested clusters (the distance between the clusters is much smaller than their own sizes):

Fig.7 DBSCAN+Silhouette case – nested rings with noise

The disappointing conclusion: if your clusters have complex shapes and are noisy, neither DBSCAN+Silhouette nor GM+BIC will help you. But what if your data is exactly like that? It turns out that in this case you can try to combine GM and DBSCAN and get a good result.

Superclustering with the GMsDB method

We have seen that the quality metric and the clustering method are tied to each other: both the method and the effectiveness of the metric used to choose its hyperparameter depend on the expected shape of the clusters.

An interesting problem is to build a clustering algorithm that is, on the one hand, resistant to noise (like GM) and, on the other hand, able to separate clusters of complex shape (like DBSCAN).

The idea of such a method is quite simple and consists in applying these two methods one after the other: first we split the data into clusters with GM, and then, treating each GM cluster as a point in some new space, we merge these clusters into superclusters with DBSCAN.

The difficulties here are the following:

  1. how to choose the hyperparameter for GM;

  2. how to choose the new space in which each GM cluster becomes a point, and how to define the distance between such points;

  3. how to choose the hyperparameter for DBSCAN.

The first problem is solved very simply – we use the BIC criterion. From Fig. 6 it is clear that in the case of complex clusters we will get more Gaussian clusters than we need, and they can later be merged into superclusters.

The second problem is somewhat more complicated, but it can be solved using the well-known Mahalanobis distance (https://ru.wikipedia.org/wiki/Mahalanobis_distance). The Mahalanobis distance can be thought of as the distance from a point to the center of some ellipsoid divided by the width of the ellipsoid in the direction of this point. For those familiar with the theory of statistical hypothesis testing, this distance has a very interesting interpretation: in the one-dimensional case, 95% of the points of a Gaussian cluster lie within a Mahalanobis distance of 1.96 from its center. This means that the Mahalanobis distance can be used as a statistical measure of whether a point belongs to a given cluster: if the distance from the point to the cluster is below a given threshold, the point belongs to it; if above, it does not.

Based on this principle, we can introduce a distance between clusters: construct the distribution of Mahalanobis distances between all pairs of points of the two clusters and take the lower limit of the 95% confidence interval – this will be the distance between the clusters. This distance is non-Euclidean, since it is associated not with the Cartesian distance between points but with the probabilities of these points belonging to Gaussian clusters. By symmetrizing this distance (taking the larger of the two distances between two ellipsoids), we can build a distance matrix and try to merge the original Gaussian clusters of points into superclusters using DBSCAN.

Thus, we have moved to a new space, using not coordinates (we do not know them) but distances between points. Each point of this space is a whole GM cluster, and the distance between two points reflects the probability that the two clusters intersect (in the sense of the Mahalanobis distance).
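
To make this more concrete, here is a minimal sketch of how such a symmetrized inter-cluster distance matrix could be assembled from a fitted GaussianMixture. This is not the gmsdb implementation: the helper names, the use of hard labels instead of posterior probabilities, and the 5th percentile as the "lower limit of the 95% confidence interval" are my assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def mahalanobis_to_component(points,mean,cov):
 # Mahalanobis distance from each point to the center of one Gaussian component
 inv_cov=np.linalg.inv(cov)
 diff=points-mean
 return np.sqrt(np.einsum('ij,jk,ik->i',diff,inv_cov,diff))

def intercluster_distances(gm,X,labels,q=5):
 # D[i,j]: "distance from cluster i to cluster j", taken as the q-th percentile of the
 # Mahalanobis distances of cluster j's points measured in cluster i's metric;
 # the matrix is then symmetrized by keeping the larger of the two directions
 K=gm.n_components
 D=np.zeros((K,K))
 for i in range(K):
  for j in range(K):
   if i==j:
    continue
   pts=X[labels==j]
   if len(pts)==0:
    D[i,j]=np.inf
    continue
   D[i,j]=np.percentile(mahalanobis_to_component(pts,gm.means_[i],gm.covariances_[i]),q)
 return np.maximum(D,D.T)

# usage sketch: gm=GaussianMixture(n_components=n_cl).fit(X); labels=gm.predict(X)
# D=intercluster_distances(gm,X,labels)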

The third task is even harder. As we have already seen, every quality metric depends on the expected shape of the clusters. Since our distance is non-Euclidean, we cannot use the Silhouette, and since the original clusters are Gaussian and were built using the BIC criterion, it would not be very reasonable to reuse BIC. A new criterion based on probabilistic reasoning is therefore needed – all the more so because the distance in our new space has a completely transparent probabilistic meaning.

Here we can return to the magic number 1.96, which defines the boundary distance at which two clusters can be considered touching. This magic number depends on two quantities:

– what fraction of the data should fall within the confidence interval (this is usually set by the statistical significance level \alpha, traditionally taken in statistics to be 0.05; those same 95% of points correspond to 1 – \alpha);

– what the dimension D of the original space is (how many coordinates are used to describe an object).

If we take these two values into account, then the threshold R for Mahalanobis distances for given \alpha and D can be calculated with a relatively simple formula through the quantiles of the chi-square distribution:

R=\sqrt{2\cdot Q_{1-\alpha}}, \qquad Q_p:\ \int_0^{Q_p}\chi^2(x,\ df=D)\,dx=p

where Q_{1-\alpha} is the corresponding (1-\alpha) quantile of the chi-square distribution with D degrees of freedom (Q_p denotes the p-quantile defined by the integral above).
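
For example, under these assumptions the threshold can be computed with scipy (the helper name here is mine; the formula simply follows the expression above):

import numpy as np
from scipy.stats import chi2

def mahalanobis_threshold(alpha=0.05,D=2):
 # R = sqrt(2*Q_{1-alpha}), where Q_{1-alpha} is the (1-alpha) quantile of the
 # chi-square distribution with D degrees of freedom
 return np.sqrt(2.0*chi2.ppf(1.0-alpha,df=D))

# sanity check of the "magic number": the point-to-center version (without the factor of 2)
# reproduces the familiar one-dimensional value for alpha=0.05
print(np.sqrt(chi2.ppf(0.95,df=1))) # ~1.96
print(mahalanobis_threshold(0.05,D=2)) # point-to-point threshold for two-dimensional data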

To merge the clusters into superclusters we use DBSCAN, taking as its distance matrix the Mahalanobis inter-cluster distance matrix computed in the previous step. By sweeping the DBSCAN hyperparameter (the distance eps) from the smallest value (giving the maximum number of superclusters) to the largest (giving a single supercluster), we obtain different numbers of superclusters. At what value of the hyperparameter should we stop? It is best to stop when we have found the maximum number of non-intersecting superclusters.

If we assume that we have somehow merged the clusters into superclusters, then for the superclusters not to intersect, they must all be pairwise separated by distances greater than R, calculated for the chosen significance level \alpha and data dimension D.

This criterion was called the matrix criterion (MC):

MC = (number of disjoint superclusters) / (total number of superclusters)
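
Before turning to the real package, here is a hedged sketch of this search as I read the description above. This is not the gmsdb code: the choice of min_samples=1, the candidate eps values and the stopping rule are my assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

def find_superclusters(D,R):
 # D: symmetrized inter-cluster Mahalanobis distance matrix, R: threshold from above.
 # Sweep eps and keep the partition with the largest number of superclusters that are
 # all pairwise farther apart than R (i.e. the matrix criterion MC equals 1).
 best_labels,best_count=None,0
 for eps in np.unique(D[(D>0)&np.isfinite(D)]):
  labels=DBSCAN(eps=eps,min_samples=1,metric='precomputed').fit_predict(D)
  groups=[np.where(labels==g)[0] for g in np.unique(labels)]
  disjoint=sum(all(D[np.ix_(gi,gj)].min()>R for j,gj in enumerate(groups) if j!=i)
               for i,gi in enumerate(groups))
  mc=disjoint/len(groups) # the matrix criterion
  if mc==1.0 and len(groups)>best_count:
   best_count,best_labels=len(groups),labels
 return best_labels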

The code and the result of the algorithm are shown below (requires installation of the gmsdb package via pip).

import matplotlib.pyplot as plt
import numpy as np
from gmsdb import GMSDB 
X,Y=MakeCircles()
fig,axs=plt.subplots(1,2,figsize=(5,2.5),constrained_layout=True)
axs[0].set_title('source clusters')
axs[0].scatter(X[:,0],X[:,1],c=Y,s=1)
arr=[]
clf=GMSDB(n_components=40)
Yp=clf.fit_predict(X)
axs[1].set_title('GMSDB clusters')
axs[1].scatter(X[:,0],X[:,1],c=Yp,s=1);

Fig.9. GMsDB results

The disadvantages of the method are its low speed and the probabilistic nature of the resulting partitions, but it solves the problems I need it for (I myself deal with clustering experimental data).

A more detailed description of the method and examples of its use can be found here: https://arxiv.org/abs/2309.02623, and the code itself is on GitHub: https://github.com/berng/GMSDB
