# Cluster analysis of the text corpus

Sometimes you need to analyze a large amount of text data without knowing the content of the texts in advance. In this case, you can try to split the texts into clusters and generate a description for each cluster. This gives, as a first approximation, an idea of what the texts are about.

## Test data

As test data, we took a fragment of the RIA news dataset, from which only the news headlines were used.

## Getting embeddings

The texts were vectorized with the LaBSE model from @cointegrated; the model is available on Hugging Face.

## Vectorization code

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")

sentences = ['мама мыла раму']  # sample sentence: "mom washed the window frame"
embeddings_list = []
for s in sentences:
    encoded_input = tokenizer(s, padding=True, truncation=True,
                              max_length=64, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**encoded_input)
    # The pooler output serves as the sentence embedding
    embedding = model_output.pooler_output
    embeddings_list.append(embedding[0].numpy())
embeddings = np.asarray(embeddings_list)
```
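One optional step the code above skips: k-means measures Euclidean distance, while sentence embeddings are usually compared by cosine similarity. L2-normalizing each embedding makes Euclidean distance on the normalized vectors monotonically related to cosine distance, so k-means behaves more like cosine-based clustering. A minimal sketch with hypothetical vectors (real ones would come from the model above):

```python
import numpy as np

# Hypothetical 2D embeddings; in practice these come from the LaBSE model
emb = np.array([[3.0, 4.0], [1.0, 0.0]])

# L2-normalize each row so Euclidean distance behaves like cosine distance
norms = np.linalg.norm(emb, axis=1, keepdims=True)
emb_normed = emb / norms

print(np.linalg.norm(emb_normed, axis=1))  # each row now has unit length
```

Whether this helps depends on the data; it is worth trying both variants.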

## Clustering

The k-means algorithm was chosen for clustering. It was chosen mainly for clarity: in practice you often have to experiment with both the data and the algorithms to get adequate clusters.

To find the optimal number of clusters, we use a function that implements the "elbow" method.

## Search function for the optimal number of clusters:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def determine_k(embeddings):
    k_min = 10
    clusters = [x for x in range(2, k_min * 11)]
    metrics = []
    for i in clusters:
        # Inertia: sum of squared distances to the closest centroid
        metrics.append(KMeans(n_clusters=i).fit(embeddings).inertia_)
    k = elbow(k_min, clusters, metrics)
    return k

def elbow(k_min, clusters, metrics):
    score = []
    for i in range(k_min, clusters[-3]):
        # Fit one line to the left of the candidate elbow and one to the right
        y1 = np.array(metrics)[:i + 1]
        y2 = np.array(metrics)[i:]
        df1 = pd.DataFrame({'x': clusters[:i + 1], 'y': y1})
        df2 = pd.DataFrame({'x': clusters[i:], 'y': y2})
        reg1 = LinearRegression().fit(np.asarray(df1.x).reshape(-1, 1), df1.y)
        reg2 = LinearRegression().fit(np.asarray(df2.x).reshape(-1, 1), df2.y)
        y1_pred = reg1.predict(np.asarray(df1.x).reshape(-1, 1))
        y2_pred = reg2.predict(np.asarray(df2.x).reshape(-1, 1))
        # The elbow is where the two-line fit has the smallest total error
        score.append(mean_squared_error(y1, y1_pred) + mean_squared_error(y2, y2_pred))
    # Map the best split index back to the actual number of clusters
    return clusters[np.argmin(score) + k_min]

k_opt = determine_k(embeddings)
```
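To see the elbow idea in isolation, here is a compact, self-contained variant of the same two-line fit on synthetic data with a known number of clusters (`make_blobs` and all parameter values here are assumptions for the demo, not part of the article's pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)

ks = list(range(2, 12))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

def two_line_score(xs, ys, i):
    # Fit one line to the left of split point i and one to the right;
    # return the combined error of the two fits
    x1, y1 = np.array(xs[:i + 1]).reshape(-1, 1), ys[:i + 1]
    x2, y2 = np.array(xs[i:]).reshape(-1, 1), ys[i:]
    e1 = mean_squared_error(y1, LinearRegression().fit(x1, y1).predict(x1))
    e2 = mean_squared_error(y2, LinearRegression().fit(x2, y2).predict(x2))
    return e1 + e2

scores = [two_line_score(ks, inertias, i) for i in range(1, len(ks) - 1)]
k_elbow = ks[int(np.argmin(scores)) + 1]
print(k_elbow)
```

On clean blobs like these, the split with the lowest combined error lands at or near the true number of clusters; on real embeddings the inertia curve is much smoother and the choice is less clear-cut.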

## Extraction of information about the resulting clusters

After clustering the texts, for each cluster we take several texts that lie closest to the cluster center.

## Search function for texts close to the center of the cluster:

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

kmeans = KMeans(n_clusters=k_opt, random_state=42).fit(embeddings)
kmeans_labels = kmeans.labels_

data = pd.DataFrame()
data['text'] = sentences
data['label'] = kmeans_labels
data['embedding'] = list(embeddings)

kmeans_centers = kmeans.cluster_centers_
top_texts_list = []
for i in range(k_opt):
    cluster = data[data['label'] == i]
    cluster_embeddings = list(cluster['embedding'])
    texts = list(cluster['text'])
    # Distance from each text in the cluster to its own center
    distances = [euclidean_distances(kmeans_centers[i].reshape(1, -1),
                                     e.reshape(1, -1))[0][0]
                 for e in cluster_embeddings]
    scores = list(zip(texts, distances))
    # Keep the three texts closest to the center
    top_3 = sorted(scores, key=lambda x: x[1])[:3]
    top_texts = list(zip(*top_3))[0]
    top_texts_list.append(top_texts)
```
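To make the closest-to-center selection concrete, here is a toy version of the same logic with 2D points instead of real embeddings (the points and labels are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Toy 2D "embeddings"; assume the cluster center is at the origin
center = np.array([[0.0, 0.0]])
points = np.array([[0.1, 0.0], [2.0, 2.0], [0.0, 0.3], [5.0, 1.0]])
texts = ['a', 'b', 'c', 'd']

# Distance from the center to every point, then keep the two nearest
distances = euclidean_distances(center, points)[0]
top_2 = sorted(zip(texts, distances), key=lambda x: x[1])[:2]
print([t for t, _ in top_2])  # → ['a', 'c']
```

The same sort-by-distance pattern scales to any number of clusters and any top-n.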

## Summarization of the central texts

The resulting central texts can then be condensed into an overall description of the cluster using a text summarization model. I used the ruT5 model from @cointegrated for this; the model is available on Hugging Face…

## Summarization code:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = 'cointegrated/rut5-base-absum'
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

def summarize(
    text, n_words=None, compression=None,
    max_length=1000, num_beams=3, do_sample=False, repetition_penalty=10.0,
    **kwargs
):
    """
    Summarize the text.
    The following parameters are mutually exclusive:
    - n_words (int) is an approximate number of words to generate.
    - compression (float) is an approximate length ratio of summary and original text.
    """
    if n_words:
        text = '[{}] '.format(n_words) + text
    elif compression:
        text = '[{0:.1g}] '.format(compression) + text
    # GPU variant: x = tokenizer(text, return_tensors="pt", padding=True).to(model.device)
    x = tokenizer(text, return_tensors="pt", padding=True)
    with torch.inference_mode():
        out = model.generate(
            **x,
            max_length=max_length, num_beams=num_beams,
            do_sample=do_sample, repetition_penalty=repetition_penalty,
            **kwargs
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)

summ_list = []
for top in top_texts_list:
    summ_list.append(summarize(' '.join(top)))
```

## Conclusion

The presented approach does not work on all domains. News is a pleasant exception here: texts of this kind separate quite well with conventional methods. But something like Twitter will take more tinkering: an abundance of grammatical errors, slang, and missing punctuation can become a nightmare for any analyst. Comments, corrections, and additions are welcome!

## Links

You can play with the notebook in Colab; the link is in the repository on GitHub…