News analysis using time series segmentation and clustering

At Otus I took the ML Advanced course and discovered interesting topics related to time series analysis, namely their segmentation and clustering. I decided to apply the knowledge I gained to my university thesis on event analysis of social phenomena and events, and to describe part of that research in this article.

Step 1: Data Collection

As a data source I took the news site Lenta.ru, since it is easy to parse, the news is varied, and it is updated in large volumes daily. For the test, I parsed the news for the last year (March 2023 – March 2024) using Python's BeautifulSoup and requests.

The code below collects the title, date and rubric of each news item:

import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup

requests_session = requests.Session()

dates = []
topics = []
titles = []
for day in range(1, 366):
  num_pages = 1
  while True: # there is no way to get all the news for a date at once, or to find out the number of pages, hence the while loop
    date = (datetime.datetime.today() - datetime.timedelta(days=day)).strftime('%Y/%m/%d')
    link = requests_session.get('https://lenta.ru/{}/page/{}/'.format(date, num_pages))
    soup = BeautifulSoup(link.text, 'lxml')
    news_list = soup.find_all('a', {"class": "card-full-news _archive"}, href=True)
    if len(news_list) == 0: # if there are no news on the page, move on to the next date
      break

    for news in news_list: # collect the data for each news item
      dates.append(date)
      title = news.find_all('h3', {"class": "card-full-news__title"})
      titles.append(title[0].get_text() if len(title) > 0 else 'None')
      topic = news.find_all('span', {"class": "card-full-news__info-item card-full-news__rubric"})
      topics.append(topic[0].get_text() if len(topic) > 0 else 'None')
    num_pages += 1

df = pd.DataFrame({'dates': dates, 'topics': topics, 'titles': titles})

In total, after removing unnecessary categories, we got a dataset of ~93,000 news items:

Fragment of a dataset with news
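
For reference, the rubric filtering mentioned above comes down to one line; a minimal sketch, where the list of rubrics to drop is illustrative rather than the exact one I used (on the site itself the rubric names are in Russian):

# drop rubrics that are not useful for the analysis
# (the rubric names below are illustrative, not the exact list that was removed)
unwanted_topics = ['Travel', 'Self-care', 'Home']

df = df[~df['topics'].isin(unwanted_topics)].reset_index(drop=True)
print(f'News items left: {len(df)}')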

Step 2: Forming the Dataset

I’ll digress a little into the event analysis itself. As a method of political science, it originated in the 1960s in the scientific writings of Charles McClelland. Event analysis is a qualitative research method that is used to describe and explain social behavior and interactions.

The first step of the classical methodology, collecting the data, I have already done. In the second step, a system of so-called classifiers is defined: in other words, the categories into which this data must be sorted for further analysis. At first I took the ready-made news rubrics from Lenta.ru as classifiers, but looking at their list you can see that they are not particularly representative of the social sphere, and it is unlikely that interesting results could be obtained from them: Economy, Science and technology, Travel, Law enforcement agencies, National projects, Habitat, Self-care, Sports, Internet and media, Russia, Business, Former USSR, Values, World, Culture, Home, Weapons.

Therefore, I decided to create my own classifiers. First, I obtained lists of preliminary, narrower categories for the basic rubrics by training Word2Vec on the news headlines; this helped identify the topics actually covered by this news resource rather than pick them at random (a minimal training sketch follows the list below). A fragment of the results (original category: top words with the closest vectors):

Economy: crisis, industry, inflation, oil, industry, energy, default, market, energy crisis, budget, unemployment, gdp, public debt, poverty, crop failure, gas, investor, recession, export, birth rate

Politics: strategy, disagreement, democracy, ally, crisis, sanctions, Russophobia, sovereignty, impeachment, constitution, alliance, collapse, reform, conspiracy, default, revolution, war, state, crime

Mass media: TV channel, journalist, publication, oligarch, hacker, politician, special service, diplomat, Roskomnadzor, propaganda, telegram, facebook, opposition, censorship, website, meta, Ministry of Justice, foreign agent, blocking, social network

Technologies: innovation, intelligence, science, algorithm, import substitution, corporation, neural network, medicine, industry, investment, tourism, space, research, education, monitoring, ecosystem, Skolkovo, modernization, virtual, microelectronics

Culture: national, folk, student, musical, youth, patriotic, architecture, art, festival, literature, heritage, society, theater, museum, film festival, language, revival, concert, photo exhibition, digitalization
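
A minimal sketch of how such lists can be obtained with gensim; the tokenization here is deliberately naive and the seed word is illustrative (in practice the headlines should first be lemmatized and stripped of stop words):

import re
from gensim.models import Word2Vec

# naive tokenization of the headlines; real preprocessing should also
# lemmatize the words and remove stop words
sentences = [re.findall(r'\w+', str(title).lower()) for title in df['titles']]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=10, workers=4, epochs=20, seed=0)

# top-20 words with the vectors closest to a seed word of a basic rubric
for word, score in w2v.wv.most_similar('экономика', topn=20):
  print(word, round(score, 3))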

Next, I formed a list of topics from the obtained words so that they had a pronounced emotional coloring; this makes the analysis more interesting and informative. Then I ran the pretrained zero-shot classification model mDeBERTa-v3-base-mnli-xnli several times and semi-automatically adjusted the lists of new classifier categories until the average model score for each classifier was above 0.8.

From this we can conclude that I more or less covered a basic set of categories that can be linked to the social sphere and that their formulations were clear to the model; there were no obvious "orphan" news items for which no suitable class could be found. Of course, some classifiers turned out broader and more comprehensive, while others were more narrowly focused, but there is still room for growth)
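
For reference, a minimal sketch of the classification itself with the transformers pipeline; the exact checkpoint path and the candidate labels below are assumptions (a small subset of my classifiers), not the exact configuration I ran:

from transformers import pipeline

# zero-shot classification of a headline against the candidate classifiers
classifier = pipeline('zero-shot-classification', model='MoritzLaurer/mDeBERTa-v3-base-mnli-xnli')

candidate_labels = ['unemployment and poverty', 'sanctions and Russophobia',
                    'innovation and import substitution', 'war and weapons']  # illustrative subset

result = classifier(df['titles'][0], candidate_labels, multi_label=False)
print(result['labels'][0], round(result['scores'][0], 2))  # top class and its score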

For the final list of classifiers, I obtained the following frequency distribution:

Frequency distribution of classifier categories

As you can see, the news still prefers to cover problems) But one should make allowances for the fact that the topic classification model does not always capture the context perfectly, and some messages have a neutral emotional connotation; still, for getting the general picture this is a good and quick option.
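
For reference, once each news item has a predicted class, the distribution itself is a one-liner; a minimal sketch, assuming the predictions are stored in a category column (the column name is an assumption):

import matplotlib.pyplot as plt

# number of news items per classifier category (assumes a 'category' column with the predicted class)
df['category'].value_counts().plot(kind='bar', figsize=(15, 5))
plt.ylabel('Number of news items')
plt.tight_layout()
plt.show()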

Step 3: Time Series Segmentation

Let's move on to the time series themselves, namely to their segmentation. I built a time series separately for each classifier, where the x-axis is the date and the y-axis is the number of news items for that classifier on that date. The PELT algorithm was used for segmentation. It searches for a set of "inflection" (change) points of a given time series such that their number and positions minimize the given segmentation "cost".

The main steps of the algorithm are to define a "cost" function for a segment, then iterate over all possible segment start and end points and test whether splitting into new segments reduces the value of the cost function compared to the segment without the split.

The general form of the loss function is:

\sum_{i=1}^{m+1} C\left(y_{(t_{i-1}+1):t_i}\right) + \beta f(m) \rightarrow \min

Here C is the segment "cost" function, t_i are the "inflection" points, m is the total number of "inflection" points, and βf(m) is a regularization term that prevents overfitting.

It turned out that the minimum points and the periods of reduced news activity in the time series coincide with holidays and weekends. Therefore, the data for these days was removed from the sample so that the time series would be smoother and no false correlations would be found.
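
A minimal sketch of that filter, assuming the dates column can be parsed with pandas; the holiday list here is illustrative and incomplete:

import pandas as pd

# drop weekends and (an illustrative, incomplete list of) public holidays
holidays = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-02-23', '2024-03-08'])

df['dates'] = pd.to_datetime(df['dates'])
mask = (df['dates'].dt.dayofweek < 5) & (~df['dates'].isin(holidays))
df = df[mask].reset_index(drop=True)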

To solve the problem I used the ruptures library.

  • For the model parameter, the value "l1" was selected, since the results looked better visually; the other options are "l2" and "rbf".

  • The min_size parameter is the minimum segment length, in our case the minimum number of days that can be combined into one segment. Accordingly, the larger it is, the fewer segments you get. In this example, segments of at least a week were used.

  • The pen parameter is a regularization penalty that is selected experimentally to prevent the algorithm from overfitting. One popular approach is to take the penalty as the double logarithm of the length of the original series. The smaller the regularization value, that is, the smaller the "penalty", the more segments are produced.

Part of the code for preparing the dataset is omitted; you just need to end up with the following columns:

  • date (dates)

  • category (category)

  • number of news items for this date in this category (news_count)

import numpy as np
import ruptures as rpt
import matplotlib.pyplot as plt
%matplotlib inline

for category in category_list:
  points = np.array(df_time_series[df_time_series['category'] == category]['news_count'])
  algo = rpt.Pelt(model="l1", min_size=7).fit(points)
  result = algo.predict(pen=np.log(np.log(len(points))))  # the last breakpoint returned by predict() is already len(points)
  fig, ax_arr = rpt.display(points, result, result, figsize=(35, 3))
  plt.title(f"{category}", size=20, fontweight="bold")
  plt.xticks(result, [df_time_series[df_time_series['category'] == category]['dates'].tolist()[i-1] for i in result], rotation=90, fontsize=20)
  plt.show()

Here you can see which segments appeared in which categories; a detailed analysis of the resulting data with conclusions is more a matter for the scientific work itself). In the future, it will be interesting to extract keywords for the segments, the inflection points and the extreme values.

Time series for all categories with segmentation

This task also has its pitfalls, which can introduce noise and lead to false correlations. All events can be divided into two global groups:

  1. The first group consists of events related to each other, links of one chain with a life cycle, such as, for example, events around the special military operation.

  2. The second group consists of, so to speak, "one-off" events that are not particularly significant for the analysis: they depend only weakly on other events and can introduce randomness into the frequency distribution by category. Ideally, one should learn to identify and remove messages about such events, as well as those that cannot be given a pronounced emotional connotation.

In general, the method showed interesting results, although one should still experiment with the parameters. I looked into some of the events that produced the outlier segments. For example, the sharp jump in the "business and trade development" category after November 2023 was mainly due to the start of sales of new Chinese cars, which generated reports about the development of the automotive industry in Russia.

To see whether this event correlates with messages from other categories, and in general which categories correlate with each other, we need another, no less interesting method: time series clustering.

Step 4: Clustering Time Series

For clustering, it is better to take the time series not by day but by month, since the correlations are more visible. First, we transform the original dataset into a form suitable for analysis and normalize the time series:

from sklearn.preprocessing import StandardScaler
from tslearn.clustering import TimeSeriesKMeans, silhouette_score

df_clust = pd.DataFrame()
for category in category_list:
  df_tmp = df_time_series[df_time_series['category'] == category][['news_count', 'year_month']].reset_index(drop=True)
  df_reshaped = pd.DataFrame(df_tmp['news_count'].values.reshape(1, -1), columns=df_tmp['year_month'])
  df_reshaped = df_reshaped.rename_axis(None, axis=1)
  df_reshaped.insert(0, 'category', category)
  df_clust = pd.concat([df_clust, df_reshaped], axis=0, ignore_index=True)

scaler = StandardScaler()
df_clust_scaled = scaler.fit_transform(df_clust.iloc[:, 1:].T).T

Dataset converted for time series clustering

For clustering I took classic k-means. For this task it is a better fit, since what matters to us is closeness of the monthly frequency distributions, whereas the more advanced DTW metric looks for series with similar shapes. With k-means, the Euclidean distance between the normalized time series is computed, centroids are found for them, and the clusters are finally determined as the centroids move over a number of iterations.

Since the optimal number of clusters is not known in advance, this value must be determined using the elbow method and the silhouette metric.

  • The elbow method suggests the optimal number of clusters according to the following principle: the error, the sum of squared distances from objects to their cluster center, is computed for each number of clusters; the point after which the error stops falling sharply (the visual "elbow" on the graph) is taken as optimal, since with many clusters the error keeps shrinking but the clustering itself loses its meaning.

  • With the silhouette method, the optimal number of clusters corresponds to the peak value on the graph, after which there is a sharp decline. For each object the metric computes the average distance to the objects within its cluster (a) and to the objects in the nearest other cluster (b). The larger the normalized difference (b - a) / max(a, b), the better.

def optimal_clusters(df_clust_scaled, metric):
  distortions = []
  silhouette = []
  n_clusters = range(2, 10)
  for n in n_clusters:
    kmeanModel = TimeSeriesKMeans(n_clusters=n, metric=metric, n_jobs=6, max_iter=10, random_state=0)
    kmeanModel.fit(df_clust_scaled)
    distortions.append(kmeanModel.inertia_)
    silhouette.append(silhouette_score(df_clust_scaled, kmeanModel.labels_, metric=metric, random_state=0))  # inside the loop: one value per number of clusters
  fig, ax1 = plt.subplots()
  ax2 = ax1.twinx()
  ax1.plot(list(n_clusters), distortions, 'b-')
  ax2.plot(list(n_clusters), silhouette, 'r-')
  ax1.set_xlabel('K clusters')
  ax1.set_ylabel('Elbow Method', color="b")
  ax2.set_ylabel('Silhouette', color="r")
  plt.show()

optimal_clusters(df_clust_scaled, 'euclidean')

Selecting the optimal number of clusters using the "elbow" method and the silhouette metric

For both metrics, the best fit is six clusters. We train the clustering model on our time series to split them into 6 clusters, and also build the averaged time series for each cluster.

n_clusters = 6
ts_kmeans = TimeSeriesKMeans(n_clusters=n_clusters, metric="euclidean", max_iter=5)
ts_kmeans.fit(df_clust_scaled)

plt.figure(figsize=(20, 10))
for n in range(n_clusters):
  plt.plot(ts_kmeans.cluster_centers_[n, :, 0].T, label=n)

plt.xticks(range(len(df_clust.columns[1:].tolist())), labels=df_clust.columns[1:].tolist(), rotation=90, fontsize=10)
plt.legend()
plt.show()

Averaged time series for clusters

Next, we visualize the resulting clusters and see which categories are in the same group.

df_clust['cluster_kmeans'] = ts_kmeans.predict(df_clust_scaled)

def plot_clusters(current_cluster):
  n_rows = int(np.ceil(current_cluster.shape[0] / 3))  # 3 plots per row
  fig, ax = plt.subplots(n_rows, 3, figsize=(15, 3 * n_rows), sharex=True)
  fig.autofmt_xdate(rotation=90)
  ax = ax.reshape(-1)
  for index, (_, row) in enumerate(current_cluster.iterrows()):
    ax[index].plot(row.iloc[1:-1])
    ax[index].set_title(f"{row.category}")
    plt.xticks(rotation=90)
  if current_cluster.shape[0] % 3 == 2: # a workaround so that empty subplots are not drawn
    fig.delaxes(ax[-1])
  if current_cluster.shape[0] % 3 == 1:
    fig.delaxes(ax[-1])
    fig.delaxes(ax[-2])
  plt.tight_layout()
  plt.show()

for n in range(n_clusters):
  print(f"Cluster No. {n+1}")
  plot_clusters(df_clust[df_clust['cluster_kmeans']==n])

Cluster No. 1

Cluster No. 2

Cluster No. 3

Cluster No. 4

Cluster No. 5

Cluster No. 6

Most of the clusters look quite logical, and the relationships are visible precisely in the time series with a monthly rather than daily period. Let's recall our example with the Chinese cars, which belongs to the "business and trade development" category, and we will see that a sharp surge after November 2023 also occurred in the "economic development" and "innovation and import substitution" categories. Whether these events are related is another question, but there are definitely correlations in the time series of these topics.

Clusters No. 2, 3 and 4 are also interesting, since the series grouped in them are logically similar: "unemployment and poverty" with "social problems"; "sanctions and Russophobia", "crimes" and "political problems"; "disasters and cataclysms" with "war and weapons"; "propaganda and agitation" with "international conflicts and disagreements".

There are, of course, time series that fall out of the general logic, but this can most likely be attributed to ordinary coincidence. In any case, to obtain more interesting results, additional methods are needed, including analysis of the linguistic component of the news.
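
For instance, the similarity within a cluster can be cross-checked directly with a correlation matrix of the normalized monthly series; a purely illustrative sketch on top of the already built df_clust and df_clust_scaled, not part of the original pipeline:

import numpy as np
import pandas as pd

# pairwise correlations between the normalized monthly series of the categories
corr = pd.DataFrame(df_clust_scaled, index=df_clust['category']).T.corr()

# mask the diagonal and print the most correlated pair of distinct categories
corr = corr.where(~np.eye(len(corr), dtype=bool))
print(corr.stack().idxmax(), round(corr.stack().max(), 2))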

Conclusion

So, I have looked at the problems of time series segmentation and clustering as applied to news analysis. The results turned out to be quite interesting and logical, but to obtain more informative conclusions in the task of analyzing social phenomena and processes, it is necessary to collect more data, define more categories, and add extra checks and stages, in particular ones related to NLP.
