Topology in data analysis?

Often, when you hear about mathematics in ML, you hear only about Bayesian methods, derivatives, interpolation, and sometimes tensors… But the mathematical apparatus of machine learning can reach deep into the roots of even the most fundamental and abstract areas of the science.

Today we'll touch a little on TDA, topological data analysis. Let's try to keep it simple, so that even the most inexperienced student can follow. The purpose of the article is to spark interest, because TDA is an avant-garde thing. But we need to start from the very basics: why, what for, and what is… this topology of yours?

Topology studies the properties of spaces that are preserved under continuous deformations. The first association that comes to mind is the theory of gravity and the distortion of space-time… Unlike geometry, where sizes and shapes matter, topology concentrates on properties that survive changes involving no tearing or gluing.

To understand what this means, imagine a rubber sheet that can be bent, stretched, and compressed, but not torn or glued together: the properties preserved under such changes are exactly what interests topologists. A classic example.

The important concept here is continuity. If two objects can be transformed into one another through continuous deformations, they are said to be topologically equivalent. It's that simple: a plasticine ball is equivalent to a plasticine cube.

A topological space is the basic concept of topology, one that generalizes our intuitive understanding of space; this is also where the difficulty of topology lies. It is a structure in which the notions of “proximity” or “neighborhood” of points are defined, although this proximity may differ from what we are used to in Euclidean geometry.

For example, on a sphere or torus, the concept of neighborhood of points will be different from what we see in flat space. Moreover, in topological space we can talk about such concepts as open and closed sets.

Open sets are those in which every point is surrounded only by other points of the same set (it has a neighborhood lying entirely inside the set), while closed sets include their own boundaries. An abstract level, of course…

The difficulty is that it is hard for us to imagine a ball as a space in its own right: we can only picture a ball sitting inside three-dimensional space. This is the key difference, and it is why mathematics at this level lives mostly in formulas, which we will try to avoid.

Homotopy is a concept that describes the deformation of one object into another. Strictly speaking, two objects are homotopic if one can be continuously transformed into the other. Imagine a loop on the plane: if we can contract it to a single point without breaks or knots, then such a loop is homotopic to a point.

In more complex cases, such as on a torus (donut), the loop going around the hole can no longer be contracted to a point, and this is a fundamentally important difference between spaces with different topological structures. Homotopy helps us classify objects and spaces according to their “deformability.”

Homology is a more abstract way of studying topological spaces that connects algebra to topology. It allows you to study “holes” of different dimensions in a space.

Homology groups are algebraic structures that describe the number and dimension of such holes.

For example, a two-dimensional sphere has no holes that a loop could catch on: any loop on its surface contracts to a point. A torus, however, has one hole running through its center and another running around the tube of the donut.

Thus, homology allows us to reduce topological problems to algebraic ones, making it possible to calculate topological invariants – characteristics that do not depend on the shape of the object.
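To make this concrete, here is a minimal sketch (my own illustration, assuming rational coefficients and a hand-built boundary matrix) of how homology reduces to linear algebra. We compute the Betti numbers of the boundary of a triangle, which is topologically a circle: b0 counts connected components, b1 counts loops.

```python
import numpy as np

# Boundary of a triangle: 3 vertices (0, 1, 2), 3 oriented edges (01, 12, 02),
# no filled face. Boundary matrix d1: rows = vertices, columns = edges.
d1 = np.array([
    [-1,  0, -1],  # vertex 0 is the tail of edges 01 and 02
    [ 1, -1,  0],  # vertex 1 is the head of 01 and the tail of 12
    [ 0,  1,  1],  # vertex 2 is the head of 12 and 02
])
rank_d1 = np.linalg.matrix_rank(d1)  # rank of the boundary map: 2

n_vertices, n_edges = d1.shape
b0 = n_vertices - rank_d1        # Betti 0: number of connected components
b1 = (n_edges - rank_d1) - 0     # Betti 1: dim ker d1 minus rank d2 (no triangles, so 0)
print(b0, b1)  # 1 1: one component, one loop
```

Adding the filled triangle as a 2-simplex would make rank d2 = 1 and kill the loop, giving b1 = 0, which matches the intuition that a filled disk has no hole.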

With the development of computer technology, topology has found its place in data analysis. The application of topology to data rests on the fact that topological methods can reveal hidden structures in high-dimensional or otherwise complex data sets, i.e. spaces that traditional analytical methods may simply fail to capture.

Topological data analysis

We will briefly go over the basic principles and methodologies of data analysis, without coding nuances and jokes for now…

Topological data analysis (TDA) stands in contrast to traditional statistical and geometric approaches. Everything comes down to identifying connected components, holes or voids, and their persistence as the scale or resolution of the data changes.

For example, persistent homology (the study of how the topological features of a space, here our multidimensional data, survive as the scale changes) is the fundamental TDA method and serves to analyze topological invariants at different scales.

How do topological structures such as connected components, cycles, and voids change as the scale is gradually increased? That is the main question.

Of course, we're talking about computational topology, so we need tools that speak the language of discrete geometry: points, line segments, triangles…

Therefore, the computation proceeds through a filtration of the data, in which a sequence of simplicial complexes is created: abstract, discrete objects representing connections between data points. This is a simplification, of course.

For example, several triangles can be glued together to form a polygon or an even more complex shape. We abstract a given space or data collection into understandable, discrete, quantifiable objects that can be counted.
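As an illustrative sketch (not tied to any particular library), the 1-skeleton of a Vietoris-Rips complex at a given scale can be listed directly: an edge appears between every pair of points closer than the scale parameter.

```python
import math
from itertools import combinations

def rips_edges(points, eps):
    """Edges of the Vietoris-Rips complex at scale eps:
    all pairs of points within distance eps of each other."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= eps
    ]

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(rips_edges(points, 1.5))  # [(0, 1), (0, 2), (1, 2)]: the far point stays isolated
```

Higher-dimensional simplices follow the same rule: a triangle enters the complex when all three of its edges are present, and so on in every dimension.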

If the question arises: “The data is already discrete and quantifiable, so why translate anything?”, the answer is that topology itself works with continuous sets, without any discretization.

Therefore, to bring data and topology together, you need to turn to computational topology, which allows you to work with this discrete data while remaining within the topological logic.

At each filtration step, the simplicial complex has certain topological characteristics, but as the scale changes, some structures appear and others disappear.

These “events” are recorded in so-called persistence diagrams, where the birth and death times of topological invariants are plotted on the axes. Roughly speaking, we simply track these events, certain artifacts, as we gradually increase the scale of our data.

The diagram visualizes the life cycle of each invariant, allowing you to distinguish significant (persistent) structures from noisy (short-lived) ones.

Thus, persistent homology helps to identify the most stable topological characteristics of the data, which are preserved across different scales and, therefore, are the most informative.
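The idea can be illustrated for connected components with a small union-find sketch (a toy example of my own, not a library implementation): as the scale grows, nearby points merge and the number of components drops, which is exactly the zero-dimensional part of the story.

```python
import math
from itertools import combinations

def components_at_scale(points, eps):
    """Number of connected components of the Rips graph at scale eps,
    computed with a simple union-find over all close-enough pairs."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(len(points))})

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
for eps in (0.5, 1.5, 7.0):
    print(eps, components_at_scale(points, eps))  # 4, then 2, then 1 components
```

Persistent homology records exactly these merge events (and their higher-dimensional analogues for loops and voids) instead of just the counts.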

Persistence diagrams can also be visualized as barcodes: collections of segments, each indicating the interval of scales over which a particular topological structure exists.

Longer segments indicate more persistent and significant topological invariants, while shorter segments can be interpreted as noise fluctuations in the data.

Barcodes provide a visual representation of which topological structures dominate at different resolution levels and help researchers gain a deeper understanding of the internal organization of the data.

Visualizing persistence diagrams through barcodes is a powerful tool because it allows you to easily interpret complex multidimensional data structures, reducing them to simple and intuitive visual images.
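For the zero-dimensional case, the barcode can even be computed by hand. A toy illustration of my own: every point is born at scale 0, and a component dies at the scale where it merges into another, which is exactly the merge distance of single-linkage clustering.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """0-dimensional persistence barcode of a point cloud: process pairwise
    distances in increasing order, and record a death each time two
    components merge. One component survives forever (the infinite bar)."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    dists = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in dists:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # a component dies at this merge scale
    # n - 1 finite bars plus one infinite bar for the surviving component
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(h0_barcode(points))
```

The long finite bar (the isolated point merging only at a large scale) is exactly the kind of persistent feature a barcode makes visible, while the two short bars read as noise.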

This is all great, but how can we actually compute it, transparently and clearly?

The closure theorem, which is also used in TDA, describes an important topological characteristic of data related to the density and compactness of spaces.

In the context of data analysis, the theorem helps determine which parts of the data form closed sets and which remain open. This is important for understanding how structured the data is and how susceptible its topological form is to change.

The closure theorem plays a key role in identifying cluster boundaries and connectivity components, allowing for more accurate interpretation of the topological properties of data sets.

The Mapper method is another significant TDA tool that serves to reduce the dimensionality of data while preserving the topological and geometric properties of the original space.

Mapper creates a simplified topological model of data by breaking it into clusters and displaying it as a graph, where nodes represent clusters and edges represent connections between them. This method is useful for visualizing and understanding the structure of complex data sets because it allows you to see the relationships between different clusters, their internal connections and holes.
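A toy sketch of the idea (my own, heavily simplified: a real Mapper additionally runs a clustering algorithm inside each interval of the cover, rather than taking each preimage as a single node) might look like this:

```python
import numpy as np

def mapper_sketch(points, lens, n_intervals=4, overlap=0.3):
    """Toy 1-D Mapper: cover the range of the lens (filter) function with
    overlapping intervals, treat the points falling into each interval as
    one graph node, and connect nodes that share points."""
    lo, hi = float(lens.min()), float(lens.max())
    length = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        a = lo + k * length - overlap * length        # interval stretched left
        b = lo + (k + 1) * length + overlap * length  # and right, so neighbors overlap
        members = frozenset(np.flatnonzero((lens >= a) & (lens <= b)))
        if members:
            nodes.append(members)
    edges = {
        (i, j)
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
        if nodes[i] & nodes[j]  # shared points produce an edge
    }
    return nodes, edges

rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 2))
lens = pts[:, 0]  # lens function: projection onto the first coordinate
nodes, edges = mapper_sketch(pts, lens)
print(len(nodes), sorted(edges))
```

With a projection lens like this, the graph is a chain of overlapping slices; a more interesting lens (density, eccentricity) and a real clustering step are what let Mapper reveal branches and loops in the data.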

Libraries for topological analysis

To compute persistent homology or perform topological clustering, well-understood algorithms are used. We have already noted that solving topological problems requires simplicial complexes.

Simplicial complexes are built for different scales (filtering levels), which makes it possible to study data at different levels of detail. Next, homology groups are calculated for each scale, which helps to identify stable topological structures.

One popular algorithm is the construction of Vietoris-Rips complexes, which starts by creating simplices (groups of data points that can be thought of as generalizations of triangles to higher dimensions) and continues by filtering them.

As the scale increases, simplices are added to the complex, and topological structures (connected components or cycles) begin to appear.

The algorithm records the moments of appearance and disappearance of these structures, which are then interpreted through persistence diagrams.

This approach is widely used for analyzing high-dimensional data because it is robust to noise and scales to many dimensions.

Libraries such as GUDHI, Dionysus, and Ripser provide convenient tools for implementing these algorithms.

An example code using GUDHI might look like this:

import gudhi as gd
import matplotlib.pyplot as plt
import numpy as np

# data_points is assumed to be your (n, d) point cloud; here, a random example
data_points = np.random.rand(100, 2)

# Build a Vietoris-Rips simplicial complex
rips_complex = gd.RipsComplex(points=data_points, max_edge_length=2.0)
simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)

# Compute persistent homology (persistence() both computes and returns the pairs)
diag = simplex_tree.persistence()

# Plot the persistence barcode
gd.plot_persistence_barcode(diag)
plt.show()

Here we first create a simplicial complex based on data_points, then calculate its persistent homology and visualize barcodes that show the life cycles of topological invariants at different filtering levels.

Dionysus is another library that provides flexible tools for computing persistent homology.

Unlike GUDHI, Dionysus focuses on algorithms related to data filtering and management of simplicial complexes. It supports working with big data and multidimensional spaces, providing opportunities to customize algorithms for specific tasks.

Ripser is a specialized library for fast and memory-efficient computation of persistent homology. Lean enough for mobile phones 🙂

To speed up calculations in the context of big data, optimization methods are often used, such as reducing the size of simplicial complexes and using special data structures: KD-trees and grid methods.

For example, to avoid computing simplexes for all possible combinations of points, you can use sampling methods or approximate algorithms to filter the data.

KD-trees allow you to efficiently search for nearest neighbors, which is important for constructing Rips complexes.
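The grid methods mentioned above can be sketched in a few lines (a hand-rolled illustration, not a library routine): with cell size equal to the search radius, each point only needs to be compared against points in its own and neighboring cells instead of all other points.

```python
import math
from collections import defaultdict

def grid_neighbor_pairs(points, r):
    """All pairs of 2-D points within distance r, found via a uniform grid
    with cell size r: candidates can only live in adjacent cells."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // r), int(y // r))].append(idx)

    pairs = set()
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) <= r:
                            pairs.add((i, j))
    return pairs

points = [(0.0, 0.0), (0.05, 0.0), (0.5, 0.5), (0.52, 0.5)]
print(sorted(grid_neighbor_pairs(points, 0.1)))  # [(0, 1), (2, 3)]
```

For n roughly uniformly spread points, this replaces the quadratic all-pairs check with work proportional to the number of truly close pairs, which is what makes Rips construction feasible at scale.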

This is especially true for high-dimensional data, where even small optimizations can significantly reduce computation time.

In addition, parallel computing and distributed systems are used, which makes it possible to process very large data sets by breaking them into parts and performing calculations in parallel.

For example, the Ripser library supports time and memory optimization by not storing all simplexes in memory at the same time, which allows you to process data sets that would not fit in memory with a conventional approach.

Specific example of analysis

For example, we will use a dataset of stock prices over a certain period. We will apply the persistent homology method to identify stable topological structures in the data.

At the beginning of the program, the necessary libraries for working with financial data and their subsequent analysis are imported.

We load historical data about a company's stock prices from a CSV file and use the pandas library to process the data.

We specify that the date column will be used as an index, and select only the column containing the closing stock price information.

The data is then resampled to monthly values using the resample function, which averages the values for each month. Gaps in the data are removed using the dropna method to avoid errors in further analysis.

import pandas as pd
import numpy as np
import gudhi as gd
import matplotlib.pyplot as plt

# Load the stock price data
# For example, Apple's stock prices over the last 5 years
data = pd.read_csv('AAPL.csv', parse_dates=['Date'], index_col="Date")
data = data[['Close']]  # Use only the closing-price column
data = data.resample('M').mean()  # Resample to monthly intervals
data.dropna(inplace=True)  # Drop missing values

Now we will reduce the data to a format suitable for constructing simplicial complexes.

To do this, we slide a fixed-length window along the series, creating a sequence of overlapping time windows; each window becomes a point in multidimensional space.

The window size is set to, say, 12 months, and for each window an array of closing-price values is generated. This transforms the time series into a set of points in a multidimensional space.

This approach is useful because topological analysis works with multidimensional data, and the use of windows allows you to extract information about price movements over different periods.

# Define the window length
window_size = 12  # 12 months
data_windows = []

for i in range(len(data) - window_size):
    window = data['Close'].iloc[i:i + window_size].values
    data_windows.append(window)

data_windows = np.array(data_windows)  # Convert to a NumPy array

In the next step, we use the GUDHI library to create a Rips complex from the obtained points. The Rips complex is created with the RipsComplex class, which uses the pairwise distances between points to construct simplicial complexes.

The max_edge_length parameter specifies the maximum distance between points at which they are considered connected, and the max_dimension parameter indicates that we will consider simplices up to dimension two, that is, vertices, edges, and triangles.

# Build the Rips complex
rips_complex = gd.RipsComplex(points=data_windows, max_edge_length=5.0)  # max_edge_length can be tuned
simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)  # consider simplices up to dimension 2

We will now calculate the persistent homology for the resulting simplicial complex and visualize the results.

# Compute persistent homology
persistence = simplex_tree.persistence()

# Plot the persistence diagram
gd.plot_persistence_diagram(persistence)
plt.title("Persistence Diagram")
plt.show()

# Plot the barcode
gd.plot_persistence_barcode(persistence)
plt.title("Persistence Barcode")
plt.show()

After running the above code, we get persistence diagrams and barcodes that visualize persistent topological structures in the stock price data.

Persistence diagrams show which topological invariants were found at different filtration levels. Each segment in the barcode represents the interval of scales over which a certain topological structure existed.

If, for example, we see long stretches in the barcode, this could indicate persistent trends or patterns in the stock price data.

If such patterns appear on different time scales, this may signal the presence of fundamental factors influencing price dynamics.

Through topological analysis, we can identify important structures such as stable price trends, cycles and anomalies.

For example, if we observe that certain trends persist over time (seen as long bars on a persistence chart), this may indicate fundamental factors affecting the market, such as changes in supply or demand, economic events, or changes within the company.

Anomalies that may appear in short bursts may indicate market volatility or short-term price fluctuations caused by speculation.

Where else can topological analysis be used?

In bioinformatics, TDA helps analyze biological data, which is usually complex and high-dimensional in nature.

For example, protein structures, DNA, or transcriptomics data can be represented as multidimensional objects. Using persistent homology methods, it is possible to study how the topological characteristics of these objects change as filtering changes, which makes it possible to identify stable structures such as loops or voids.

In genomics, topological analysis helps to study the relationships between genes, which can have a complex network nature, for example, genes responsible for certain biological processes can form stable topological structures.

In image analysis, topological analysis helps recognize complex shapes and patterns in data.

Imagine an image as a set of points in a multidimensional space, each of which is characterized not only by its coordinates, but also by color, texture and other features. TDA helps to highlight the main topological properties: the contours of objects, their boundaries, as well as stable structures that are preserved when the scale or noise in the image changes.

Persistent homology allows you to estimate how long certain patterns exist in an image when filtering changes, which helps improve image recognition or object classification algorithms.
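As a toy illustration of a sublevel-set filtration on an image (a hand-rolled flood fill of my own, not a library call), we can count the dark connected components as the brightness threshold grows; persistent homology tracks exactly these births and merges of regions.

```python
import numpy as np

def components_below(img, t):
    """Connected components (4-neighborhood) of pixels with value <= t:
    the 0-dimensional part of a sublevel-set filtration of a grayscale image."""
    mask = img <= t
    seen = np.zeros(img.shape, dtype=bool)
    h, w = img.shape
    count = 0
    for x in range(h):
        for y in range(w):
            if mask[x, y] and not seen[x, y]:
                count += 1            # found a new component; flood-fill it
                stack = [(x, y)]
                seen[x, y] = True
                while stack:
                    cx, cy = stack.pop()
                    for nx, ny in ((cx - 1, cy), (cx + 1, cy), (cx, cy - 1), (cx, cy + 1)):
                        if 0 <= nx < h and 0 <= ny < w and mask[nx, ny] and not seen[nx, ny]:
                            seen[nx, ny] = True
                            stack.append((nx, ny))
    return count

# Two dark blobs on a bright background
img = np.full((8, 8), 9.0)
img[1:3, 1:3] = 1.0
img[5:7, 5:7] = 2.0
for t in (0.5, 1.5, 2.5, 9.5):
    print(t, components_below(img, t))  # 0, 1, 2, then 1 component
```

The first blob is born at threshold 1, the second at 2, and both merge into the background component once the threshold covers the whole image: a tiny barcode with two finite bars.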

For example, in medical diagnostics, TDA helps to accurately identify boundaries between different tissue types or pathological formations.

Financial data also has a complex, multidimensional nature. Stock prices, exchange rates, trading volumes are all time series that can be analyzed using topological methods.

Using TDA, you can identify hidden trends, cycles or anomalies in price movements that may go undetected using traditional methods.

Topological analysis, for example, can identify consistent price patterns that may indicate long-term trends in the market, as well as anomalous structures that may indicate volatility or sudden price changes.

Persistent homology helps analyze how these patterns change over time scales, which can be useful when developing trading strategies or managing risk.

And where would we be without research?

Neural networks have a complex topological structure, especially in the context of deep learning, where the model may contain many hidden layers and parameters.

Using TDA methods, it is possible to study the topological properties of data that passes through the network, analyzing how neural activations form complex structures at various levels of abstraction.

For example, in image classification problems, persistent homology can be used to analyze the output layers of a neural network to understand how the network interprets input data, as well as to identify errors or gaps in the generalization of the model…

Additional literature for the most curious

  • “Topological Data Analysis for Machine Learning: A Survey”

  • “Topological Deep Learning: A Review of an Emerging Paradigm”

  • “Explaining the Power of Topological Data Analysis in Graph Machine Learning”
    The article explores the role of TDA in graph analysis, demonstrating how alpha complexes and filtering can improve understanding of the shape and connectivity of data in graph structures. This is useful for analyzing social networks and protein interaction networks.

  • “A Topological Machine Learning Pipeline for Classifying Disease Subtypes”
    The article shows how topological data analysis can be used in bioinformatics to classify disease subtypes by examining the persistence of topological features in biological data.

  • “Towards Efficient Machine Learning with Persistence Diagrams”
    The research focuses on optimizing the use of persistence diagrams, a key tool in TDA, to improve machine learning algorithms.
