Cluster analysis of a text corpus

Sometimes you need to analyze a large amount of text without knowing what the texts contain. In that case, you can try splitting the texts into clusters and generating a description of each cluster. This gives a first approximation of what the corpus is about.

Test data

The test data is a fragment of a news dataset from RIA; only the news headlines were processed.

Getting embeddings

The LaBSE model by @cointegrated was used to vectorize the text. The model is available on Hugging Face.

Vectorization code
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")

# Replace with your own list of texts (here: news headlines)
sentences = ['мама мыла раму']

embeddings_list = []

for s in sentences:
    encoded_input = tokenizer(s, padding=True, truncation=True, max_length=64, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**encoded_input)
    # pooler_output has shape (1, hidden_size); keep it as a 1-D NumPy vector
    embedding = model_output.pooler_output[0].numpy()
    embeddings_list.append(embedding)

# Final matrix of shape (n_texts, hidden_size)
embeddings = np.asarray(embeddings_list)


The k-means algorithm was chosen for clustering, mainly for clarity. In practice, you often have to experiment with both the data and the algorithms to get adequate clusters.
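As a rough illustration of the clustering step, here is a minimal sketch on synthetic vectors rather than the RIA data. The random data, the dimensions, and the L2-normalization step are my assumptions: scikit-learn's k-means works with Euclidean distance, and normalizing the vectors first is one common way to make that distance behave like cosine similarity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Synthetic stand-in for sentence embeddings (an assumption, not real data):
# three groups of 50 vectors in 16 dimensions, each around its own direction
toy_centers = np.eye(16)[:3] * 10
toy_embeddings = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 16)) for c in toy_centers]
)

# Scale every vector to unit length so that Euclidean distance
# between vectors is a monotonic function of their cosine similarity
unit_embeddings = normalize(toy_embeddings)

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(unit_embeddings)
toy_labels = toy_kmeans.labels_
print(np.bincount(toy_labels))  # cluster sizes
```

On real embeddings the groups are rarely this clean, so it is worth checking the cluster sizes before trusting the result.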

To find the optimal number of clusters, we will use a function that implements the "elbow method": run k-means for a range of cluster counts and look for the point where the quality curve bends.

Function for finding the optimal number of clusters:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def determine_k(embeddings):
    k_min = 10
    clusters = [x for x in range(2, k_min * 11)]
    metrics = []
    for i in clusters:
        # Assumption: inertia (within-cluster sum of squared distances) is the elbow metric
        metric = KMeans(n_clusters=i, random_state=42).fit(embeddings).inertia_
        metrics.append(metric)
    k = elbow(k_min, clusters, metrics)
    return k

def elbow(k_min, clusters, metrics):
    score = []

    # Try every split index i: fit one line to the left part of the curve
    # and another to the right part, then score the split by the total MSE
    for i in range(k_min, clusters[-3]):
        y1 = np.array(metrics)[:i + 1]
        y2 = np.array(metrics)[i:]
        df1 = pd.DataFrame({'x': clusters[:i + 1], 'y': y1})
        df2 = pd.DataFrame({'x': clusters[i:], 'y': y2})
        reg1 = LinearRegression().fit(np.asarray(df1.x).reshape(-1, 1), df1.y)
        reg2 = LinearRegression().fit(np.asarray(df2.x).reshape(-1, 1), df2.y)

        y1_pred = reg1.predict(np.asarray(df1.x).reshape(-1, 1))
        y2_pred = reg2.predict(np.asarray(df2.x).reshape(-1, 1))
        score.append(mean_squared_error(y1, y1_pred) + mean_squared_error(y2, y2_pred))

    # The best split is an index into clusters; return the cluster count stored there
    best_index = np.argmin(score) + k_min
    return clusters[best_index]

k_opt = determine_k(embeddings)
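The idea behind the two-line fit is easier to see on a synthetic curve. Below is a minimal NumPy-only sketch; the function name `two_segment_elbow` and the toy inertia values are hypothetical and only illustrate the technique. The curve drops steeply up to k = 8 and flattens afterwards, and the split with the lowest combined fitting error lands on that bend.

```python
import numpy as np

def two_segment_elbow(xs, ys, min_points=2):
    """Return the x where splitting the curve into two fitted lines gives the lowest total MSE."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    best_x, best_err = None, np.inf
    # Both segments share the split point and must contain at least min_points points
    for i in range(min_points - 1, len(xs) - min_points + 1):
        err = 0.0
        for seg_x, seg_y in ((xs[:i + 1], ys[:i + 1]), (xs[i:], ys[i:])):
            slope, intercept = np.polyfit(seg_x, seg_y, 1)
            err += np.mean((seg_y - (slope * seg_x + intercept)) ** 2)
        if err < best_err:
            best_x, best_err = xs[i], err
    return int(best_x)

# Toy "inertia" curve (made-up numbers): steep decline up to k = 8, nearly flat after it
ks = np.arange(2, 31)
toy_inertia = np.where(ks <= 8, 1000 - 100 * (ks - 2), 400 - 5 * (ks - 8))

print(two_segment_elbow(ks, toy_inertia))  # 8 for this exact piecewise-linear curve
```

Real inertia curves are smoother than this, so the bend found this way is a starting point for experiments rather than a final answer.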

Extraction of information about the resulting clusters

After clustering, we take from each cluster the few texts that lie closest to its center.

Code for finding the texts closest to the cluster center:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

kmeans = KMeans(n_clusters=k_opt, random_state=42).fit(embeddings)
kmeans_labels = kmeans.labels_

data = pd.DataFrame()
data['text'] = sentences
data['label'] = kmeans_labels
data['embedding'] = list(embeddings)

kmeans_centers = kmeans.cluster_centers_
top_texts_list = []
for i in range(0, k_opt):
    cluster = data[data['label'] == i]
    # Use a separate name so the full embeddings matrix is not overwritten
    cluster_embeddings = list(cluster['embedding'])
    texts = list(cluster['text'])
    # Distance from each text in the cluster to the center of this cluster (index i)
    distances = [euclidean_distances(kmeans_centers[i].reshape(1, -1), e.reshape(1, -1))[0][0] for e in cluster_embeddings]
    scores = list(zip(texts, distances))
    # Keep the three texts closest to the center
    top_3 = sorted(scores, key=lambda x: x[1])[:3]
    top_texts = list(zip(*top_3))[0]
    top_texts_list.append(top_texts)
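The same selection can be written in a more compact, vectorized way. This is only a sketch: the helper name `closest_texts` and the toy points are hypothetical, and it assumes the texts, vectors, and labels are aligned by position.

```python
import numpy as np

def closest_texts(texts, vectors, labels, centers, top_n=3):
    """For every cluster, return the top_n texts whose vectors are nearest to its center."""
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)
    result = {}
    for cluster_id, center in enumerate(centers):
        # Positions of the texts that belong to this cluster
        member_idx = np.flatnonzero(labels == cluster_id)
        # Euclidean distance from every member to the cluster center
        dists = np.linalg.norm(vectors[member_idx] - center, axis=1)
        nearest = member_idx[np.argsort(dists)[:top_n]]
        result[cluster_id] = [texts[j] for j in nearest]
    return result

# Toy data (made up): two clusters in 2-D
toy_texts = ["a", "b", "c", "d", "e"]
toy_vectors = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 5.5]])
toy_point_labels = np.array([0, 0, 0, 1, 1])
toy_cluster_centers = np.array([[0.0, 0.0], [5.0, 5.0]])

nearest = closest_texts(toy_texts, toy_vectors, toy_point_labels, toy_cluster_centers, top_n=2)
print(nearest)  # {0: ['a', 'b'], 1: ['d', 'e']}
```

With a fitted model, the call would look like `closest_texts(sentences, embeddings, kmeans.labels_, kmeans.cluster_centers_)`.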

Summarization of central texts

You can then try to condense the central texts into a general description of the cluster using a text summarization model. I used the ruT5 model by @cointegrated for this. The model is available on Hugging Face.

Summarization code:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = 'cointegrated/rut5-base-absum'
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

def summarize(
    text, n_words=None, compression=None,
    max_length=1000, num_beams=3, do_sample=False, repetition_penalty=10.0,
    **kwargs
):
    """
    Summarize the text
    The following parameters are mutually exclusive:
    - n_words (int) is an approximate number of words to generate.
    - compression (float) is an approximate length ratio of summary and original text.
    """
    if n_words:
        text = "[{}] ".format(n_words) + text
    elif compression:
        text = "[{0:.1g}] ".format(compression) + text
    # On a GPU, move the inputs to the model's device instead:
    # x = tokenizer(text, return_tensors="pt", padding=True).to(model.device)
    x = tokenizer(text, return_tensors="pt", padding=True)
    with torch.inference_mode():
        out = model.generate(
            **x,
            max_length=max_length, num_beams=num_beams,
            do_sample=do_sample, repetition_penalty=repetition_penalty,
            **kwargs
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Summarize the central texts of each cluster
summ_list = []
for top in top_texts_list:
    summ_list.append(summarize(' '.join(list(top))))


The presented approach does not work in every domain. News is a pleasant exception: texts of this type separate quite well with conventional methods. Something like Twitter takes much more tinkering, since the abundance of grammatical errors, jargon, and missing punctuation can become a nightmare for any analyst. Comments, corrections and additions are welcome!


You can play with the notebook in Colab; the link is in the repository on GitHub.
