The mysterious world of information geometry. Introduction



Image created by the author using artificial intelligence

For many of us, the high-school relationship with geometry was an unhappy love that turned into hatred. Coordinates and volumes were especially hard, and even simple geometric calculations put us off. Then came the boom in information technology and the hype around machine learning, AI, and data science, which pushed many of us to dive into the dark depths of mathematics, where, among other disciplines, geometry was waiting once again. Information geometry is used in statistical manifold learning, which has recently proven itself for unsupervised learning on high-dimensional datasets. It also lets you compute the distance between two probability measures, which is useful for pattern matching, for constructing alternative loss functions when training neural networks, for belief propagation, and for solving optimization problems.

Information geometry is a mathematical tool for exploring the world of data through geometry. It is also called Fisherian geometry.

Information geometry is a geometric approach to decision making that can also include pattern matching, modeling, and more. But why the geometric approach? Geometry makes it possible to study invariance within a coordinate-free framework, provides a tool for intuitive thinking, and allows one to study equivariance. For example, the centroid of a triangle is equivariant under an affine transformation.
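The centroid example can be checked numerically. The sketch below is only an illustration: the triangle, the matrix A, and the offset b are arbitrary values I picked, and the helper names are hypothetical. It compares "take the centroid, then transform" with "transform the vertices, then take the centroid".

```python
def centroid(points):
    """Arithmetic mean of a list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def affine(point, A, b):
    """Apply the affine map x -> A x + b to a 2D point (A is 2x2)."""
    x, y = point
    return (A[0][0] * x + A[0][1] * y + b[0],
            A[1][0] * x + A[1][1] * y + b[1])


# Arbitrary illustrative values (assumptions, not taken from the article).
triangle = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]
A = ((2.0, 1.0), (-1.0, 3.0))
b = (5.0, -2.0)

# Equivariance: both orders of operations give the same point.
centroid_then_map = affine(centroid(triangle), A, b)
map_then_centroid = centroid([affine(p, A, b) for p in triangle])
```

Up to floating-point rounding, the two results coincide, which is what equivariance of the centroid under affine maps means.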

Let’s discuss some basics to understand the essence of information geometry.

To understand differential geometry, and therefore information geometry, we need to understand what a manifold is. I already explained what a manifold is in my previous article on manifold alignment, but it bears repeating:

An n-dimensional manifold is the most general mathematical space with notions of limits, continuity, and regularity that also admits, at least locally, a continuous map with a continuous inverse to n-dimensional Euclidean space. Manifolds resemble Euclidean space locally, but may not do so globally. Essentially, a manifold is a generalization of Euclidean space.

Topological space

Let the space X be given by a set of points x ∈ X together with a collection of subsets of X, called neighborhoods N(x), for every point x. The neighborhoods have the following properties:

  1. If U is a neighborhood of x (so x ∈ U), V ⊂ X, and U ⊂ V, then V is also a neighborhood of x.
  2. The intersection of two neighborhoods of x is also a neighborhood of x.
  3. Any neighborhood U of x contains a neighborhood V of x such that U is a neighborhood of every point of V.

Any space X that satisfies these properties is called a topological space.

Homeomorphism

In the article on manifold alignment I touched on homeomorphism, which can be defined as follows:

Consider a function f: X → Y between two topological spaces. X and Y are homeomorphic if f is continuous and bijective (one-to-one and onto) and the inverse function f⁻¹ is also continuous. Now consider a manifold ℳ. If every point x ∈ ℳ has a neighborhood U that is homeomorphic to ℝⁿ for some integer n, then n is the dimension of the manifold.

Chart

The homeomorphism κ: U → κ(U) is called a chart, where U is an open subset of ℳ. There are many ways to construct charts that describe ℳ. A collection of charts that covers ℳ is called an atlas. This idea is shown graphically in Figure 1, and an atlas is defined mathematically by Equation 1. A specific example of a chart is a coordinate system, which can be viewed as a function that assigns coordinates to points on the manifold.



Fig. 1. Manifold and charts (image courtesy of the author)



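$$\mathcal{A} = \{(U_\alpha, \kappa_\alpha)\}_{\alpha \in I}, \qquad \bigcup_{\alpha \in I} U_\alpha = \mathcal{M}$$

Equation 1. Atlas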

At this stage it is easy to define a differentiable manifold as a manifold for which the transition maps are differentiable.

That is all we need from the abstract mathematics of manifolds. How does this connect to statistics and data science? In statistics, as we know, we deal with probabilities, which leads to the concept of a statistical manifold. In a statistical manifold, each point p ∈ ℳ corresponds to a probability distribution over some domain. A concrete example is the manifold formed by the family of normal distributions.
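To make the normal-family example concrete, here is a minimal sketch in which a point on the manifold is simply a coordinate pair (mu, sigma) and the distribution attached to it is the corresponding normal density. The function names and the two sample points are my own illustrative choices, and the numerical integral is only a sanity check that each point really is a probability distribution.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution labelled by the point (mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def total_mass(mu, sigma, half_width=10.0, steps=20000):
    """Midpoint-rule integral of the density over mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    return sum(normal_pdf(lo + (k + 0.5) * dx, mu, sigma)
               for k in range(steps)) * dx


# Two hypothetical points on the manifold, in (mu, sigma) coordinates.
point_p = (0.0, 1.0)
point_q = (2.0, 0.5)

mass_p = total_mass(*point_p)
mass_q = total_mass(*point_q)
```

Moving from point_p to point_q in the (mu, sigma) half-plane changes the whole distribution, which is why distances on this manifold are distances between distributions.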

Vectors and tangents on a manifold, their definition in curved spaces

In ordinary geometry, a vector is a directed straight-line segment connecting two points. This does not carry over to curved space. There, vectors are tangents to a curve at a given point on the manifold. If a parameter u varies along the curve, the curve can be written as x(u); the argument u is often omitted and we simply write x. A vector in curved space is then expressed as follows:



$$v = \left.\frac{dx(u)}{du}\right|_{u=0}$$

Equation 2. Vector in curved spaces

The vector is defined locally at the point p, where u = 0. Note that the vector itself does not live in the manifold; the notion is carried over from Euclidean geometry. As with charts, many tangents can be constructed at the point p. This is hard to picture for a curve in a two-dimensional plane, but a three-dimensional object such as a sphere clearly has many tangents at a point. We can therefore speak of a tangent plane at a point on the sphere (Fig. 3). Similarly, we can speak of a tangent space at a point p of the manifold.



Fig. 3. Sphere with a tangent plane (image created by the author)
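The sphere picture can be reproduced numerically. In this sketch, which rests on my own choice of two curves on the unit sphere and a finite-difference step, each tangent vector is estimated as the derivative of a curve x(u) at u = 0. Both vectors turn out to be perpendicular to the radius at p, so they lie in the tangent plane.

```python
import math


def latitude_curve(u, a=1.0, b=0.5):
    """Curve on the unit sphere that moves along a circle of latitude."""
    return (math.sin(a) * math.cos(b + u),
            math.sin(a) * math.sin(b + u),
            math.cos(a))


def meridian_curve(u, a=1.0, b=0.5):
    """Curve on the unit sphere that moves along a meridian."""
    return (math.sin(a + u) * math.cos(b),
            math.sin(a + u) * math.sin(b),
            math.cos(a + u))


def tangent_at_zero(curve, h=1e-5):
    """Central-difference estimate of dx/du at u = 0."""
    forward, backward = curve(h), curve(-h)
    return tuple((f - w) / (2.0 * h) for f, w in zip(forward, backward))


def dot(u, v):
    """Euclidean dot product in the ambient 3D space."""
    return sum(ui * vi for ui, vi in zip(u, v))


p = latitude_curve(0.0)               # both curves pass through p at u = 0
v1 = tangent_at_zero(latitude_curve)  # tangent along the latitude circle
v2 = tangent_at_zero(meridian_curve)  # tangent along the meridian
```

Two different curves through the same point give two different tangent vectors, and together they span the tangent plane at p.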

Moving from one chart to another is like moving from, say, the Cartesian to the polar coordinate system. Suppose we have a transition function ϕ that translates x from one chart to another. Then we can write x′ = ϕ(x).
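As a minimal sketch of x′ = ϕ(x), here is the Cartesian-to-polar transition together with its inverse. The names phi and phi_inverse are mine, and the map is only well behaved away from the origin, where the angle is undefined.

```python
import math


def phi(point):
    """Transition from Cartesian (x, y) to polar (r, theta) coordinates."""
    x, y = point
    return (math.hypot(x, y), math.atan2(y, x))


def phi_inverse(point):
    """Transition back from polar (r, theta) to Cartesian (x, y)."""
    r, theta = point
    return (r * math.cos(theta), r * math.sin(theta))


x = (1.0, 1.0)                # a point in Cartesian coordinates
x_prime = phi(x)              # the same point in polar coordinates
x_back = phi_inverse(x_prime)
```

The round trip returns the original coordinates up to rounding, which is what we expect when a transition function has an inverse.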

Dual space

The dual space V* of a vector space V contains all linear functionals on V. That is, it consists of all linear maps T: V → F, where F is the field over which V is defined; in other words, the dual space contains all linear transformations from V to F. In search of a better explanation of the dual space, I came across this example online³:

Imagine a 2D real vector space. Let's define two functions: one returns the x coordinate of any vector, and the other returns the y coordinate. Let's call the first function f1 and the second f2. But let's look at things more broadly.

Consider these two functions as vectors. Let them be basis vectors in some peculiar vector space. These vectors can be added: f1 + f2 is a function that returns the sum of the x and y coordinates of any vector. They can also be multiplied by numbers: for example, 7*f1 is a function that returns seven times the x coordinate of any vector. You can form linear combinations in the same way: for example, the function 3.5*f1 - 5*f2 returns 3.5 times the x coordinate minus 5 times the y coordinate. This is how the dual space works.
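The quoted example translates almost word for word into code. In the sketch below, f1 and f2 are the two coordinate functions, and add and scale are hypothetical helpers I introduce to build new functionals out of old ones.

```python
def f1(v):
    """Functional that returns the x coordinate of a 2D vector."""
    return v[0]


def f2(v):
    """Functional that returns the y coordinate of a 2D vector."""
    return v[1]


def add(f, g):
    """Sum of two functionals, itself a functional."""
    return lambda v: f(v) + g(v)


def scale(c, f):
    """Scalar multiple of a functional."""
    return lambda v: c * f(v)


# The linear combination 3.5*f1 - 5*f2 from the text.
h = add(scale(3.5, f1), scale(-5.0, f2))

v = (2.0, 1.0)
value = h(v)  # 3.5 * 2.0 - 5.0 * 1.0
```

Because h is built only from additions and scalar multiples of f1 and f2, it is again a linear functional, that is, another element of the dual space.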

Tensor

Tensors are a key concept in the mathematics of manifolds. So what are these beasts, and what do they eat? The beasts are multilinear: they eat vectors from the tangent space and from its dual space and spit out real numbers. The total number of vectors eaten, from the tangent space and its dual together, is called the rank of the tensor. The number of vectors eaten from the dual space is called the contravariant rank, and the number eaten from the tangent space is called the covariant rank.
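Here is a small sketch of such a beast, under the assumption that we work in fixed 2D coordinates so that vectors and dual vectors are just tuples of components. The component table and the helper name make_tensor_1_1 are made up for illustration. The tensor eats one dual vector and one tangent vector, so its rank is 2, with contravariant rank 1 and covariant rank 1.

```python
def make_tensor_1_1(components):
    """Build T(omega, v) = sum over i, j of omega[i] * components[i][j] * v[j]."""
    def T(omega, v):
        return sum(omega[i] * components[i][j] * v[j]
                   for i in range(len(omega))
                   for j in range(len(v)))
    return T


# Hypothetical component table of the tensor.
T = make_tensor_1_1([[1.0, 2.0], [0.0, -1.0]])

omega = (1.0, 3.0)                 # a dual vector (eaten by the first slot)
u, w = (2.0, 1.0), (-1.0, 4.0)     # tangent vectors (eaten by the second slot)
a, b = 2.0, -0.5

# Multilinearity in the vector slot: T(omega, a*u + b*w) = a*T(omega, u) + b*T(omega, w).
combo = tuple(a * ui + b * wi for ui, wi in zip(u, w))
lhs = T(omega, combo)
rhs = a * T(omega, u) + b * T(omega, w)
```

The same check works in the first slot, which is what "multilinear" means: linear in each argument separately.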

Let us omit a more formal description of tensors. To be honest, I myself do not fully understand it yet. But in essence, manifolds are geometric constructions, and tensors are their corresponding algebraic constructions.

Metric

A metric is a tensor field that induces an inner product on the tangent space at each point of the manifold. Any symmetric, positive-definite tensor field of covariant rank 2 can be used to define a metric. Some sources call such a metric a Riemannian metric⁷.
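A familiar case is the flat plane in polar coordinates, where the metric has components diag(1, r²). The sketch below, with point and vector values chosen arbitrarily, checks that the inner product computed with this metric agrees with the ordinary dot product after the same two tangent vectors are converted to Cartesian components.

```python
import math


def polar_metric(r):
    """Components of the Euclidean metric in polar coordinates (r, theta)."""
    return ((1.0, 0.0), (0.0, r * r))


def inner(g, u, v):
    """Inner product g(u, v) = sum over i, j of g[i][j] * u[i] * v[j]."""
    return sum(g[i][j] * u[i] * v[j] for i in range(2) for j in range(2))


def to_cartesian_components(r, theta, vec):
    """Convert tangent-vector components (dr, dtheta) to (dx, dy)."""
    dr, dtheta = vec
    return (math.cos(theta) * dr - r * math.sin(theta) * dtheta,
            math.sin(theta) * dr + r * math.cos(theta) * dtheta)


# Arbitrary illustrative point and tangent vectors (assumptions).
r, theta = 2.0, 0.7
u, v = (1.0, 0.5), (-0.3, 2.0)

polar_inner = inner(polar_metric(r), u, v)

cu = to_cartesian_components(r, theta, u)
cv = to_cartesian_components(r, theta, v)
cartesian_inner = cu[0] * cv[0] + cu[1] * cv[1]
```

The metric components depend on the point (through r), yet the inner product itself does not depend on which coordinate system we used to compute it.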

Now, after a long and confusing list of terms, let’s look at something really useful in information geometry.

Information geometry arose at the intersection of statistics and differential geometry. It lets us study probability distributions in geometric terms.

To do this, we need an information metric, also called the Fisher information metric.

Fisher information metric

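To find a suitable metric tensor at a point θ, where θ labels a distribution from the family p(x|θ), we need to determine the distance between p(x|θ) and its infinitesimal perturbation p(x|θ + dθ). To first order in dθ, and with summation over the repeated index i, the relative difference is given by Equation 3:

$$\Delta = \frac{p(x \mid \theta + d\theta) - p(x \mid \theta)}{p(x \mid \theta)} = \frac{\partial \log p(x \mid \theta)}{\partial \theta^i}\, d\theta^i$$

Equation 3. Relative difference of perturbation at a point θ.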

The relative difference depends on the random variable x. A direct calculation shows that its expectation vanishes, that is, $\mathbb{E}[\Delta] = 0$. What about the variance? As it turns out, the variance is non-zero, so we can define $dl^2 = \mathbb{E}[\Delta^2]$. From first principles, the length of an infinitesimal displacement between θ and θ + dθ for a metric g is given by $dl^2 = g_{ij}\, d\theta^i\, d\theta^j$. Solving $dl^2 = \mathbb{E}[\Delta^2] = g_{ij}\, d\theta^i\, d\theta^j$ for $g_{ij}$ gives the following expression:



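$$g_{ij}(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}\left[\frac{\partial \log p(x \mid \theta)}{\partial \theta^i}\, \frac{\partial \log p(x \mid \theta)}{\partial \theta^j}\right]$$

Equation 4. Fisher information metric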
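To see the Fisher information metric at work, here is a sketch for the simplest family I could think of, a Bernoulli distribution with a single parameter θ. This family is my choice for illustration and is not discussed in the article. The derivative of log p is approximated by finite differences, and the expectation is exact because x takes only two values.

```python
import math


def log_p(x, theta):
    """Log-probability of a Bernoulli outcome x in {0, 1}."""
    return math.log(theta) if x == 1 else math.log(1.0 - theta)


def score(x, theta, h=1e-6):
    """Central-difference estimate of d/dtheta log p(x | theta)."""
    return (log_p(x, theta + h) - log_p(x, theta - h)) / (2.0 * h)


def expectation(fn, theta):
    """Exact expectation of fn(x) over x ~ Bernoulli(theta)."""
    return theta * fn(1) + (1.0 - theta) * fn(0)


theta = 0.3

# The expected relative difference vanishes, while its second moment does not.
mean_score = expectation(lambda x: score(x, theta), theta)
fisher = expectation(lambda x: score(x, theta) ** 2, theta)

# Known closed form for the Bernoulli family, used here as a cross-check.
closed_form = 1.0 / (theta * (1.0 - theta))
```

The metric grows without bound as θ approaches 0 or 1: near the edges of the parameter range, a small change in θ is much easier to detect from data.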

This quantity is called the Fisher information metric (FIM). It measures how much information an observation of the random variable x carries about θ on average if x ∼ p(x|θ). There is another way to derive equation 4, up to a factor of 1/2; see reference 1 for details. If you are curious, the rapid development of quantum computer science has also found applications for the quantum Fisher information metric. I plan a new series of articles on this very soon; subscribe if you want to hear about them. Another use is for noninformative priors in Bayesian inference. But maybe I got carried away and wandered too deep into the jungle.
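The factor of 1/2 can be illustrated numerically if we assume that the alternative route goes through the Kullback-Leibler divergence. That is my reading and is not spelled out in the text above. For a small perturbation dθ, the KL divergence between p(x|θ) and p(x|θ + dθ) is approximately half of dl². The sketch again uses a Bernoulli family as an illustrative assumption.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def fisher_bernoulli(theta):
    """Closed-form Fisher information of the Bernoulli family."""
    return 1.0 / (theta * (1.0 - theta))


theta, d_theta = 0.3, 1e-3

kl = kl_bernoulli(theta, theta + d_theta)
half_dl2 = 0.5 * fisher_bernoulli(theta) * d_theta ** 2

# Close to 1 for small d_theta; the gap shrinks as d_theta goes to 0.
ratio = kl / half_dl2
```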

To conclude, the Fisher information matrix G(θ), which is the matrix form of the metric from Equation 4 when there are several parameters, can be used in optimization, similarly to gradient descent, with an update rule of the form:

$$\theta_{t+1} = \theta_t - \eta\, G(\theta_t)^{-1}\, \nabla J(\theta_t)$$

Equation 5. Update Rule for Differential Manifold Optimization

where η is the learning rate and ∇J is the gradient of the scalar field J.
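A minimal sketch of this kind of update for a one-parameter problem is shown below, assuming that J is the average negative log-likelihood of Bernoulli data and that the gradient is rescaled by the inverse Fisher information. The sample mean, the starting point, and the step size are hypothetical values. With one parameter the Fisher information matrix is a single number, so the inverse is just a division.

```python
def grad_J(theta, m):
    """Gradient of the average Bernoulli negative log-likelihood.

    m is the fraction of ones observed in the data.
    """
    return -(m / theta) + (1.0 - m) / (1.0 - theta)


def fisher(theta):
    """Fisher information of the Bernoulli family (a 1x1 'matrix')."""
    return 1.0 / (theta * (1.0 - theta))


def natural_gradient_descent(theta0, m, eta=0.5, steps=50):
    """Repeat theta <- theta - eta * grad_J(theta) / fisher(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad_J(theta, m) / fisher(theta)
    return theta


sample_mean = 0.7  # hypothetical fraction of ones in the data
theta_hat = natural_gradient_descent(theta0=0.2, m=sample_mean)
```

In this toy case the rescaled step simplifies to η(θ - m), so the iterate moves straight toward the sample mean at a rate that does not depend on where it started, whereas the plain gradient blows up near θ = 0 and θ = 1.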

In this article I tried to give a brief overview of information geometry and its terminology. For the sake of clarity, I have omitted many of the intermediate derivations, so the reader is invited to consult the additional materials linked below.

  1. http://www.robots.ox.ac.uk/~lsgs/posts/2019-09-27-info-geom.html
  2. https://math.stackexchange.com/questions/240491/what-is-a-covector-and-what-is-it-used-for
  3. https://qr.ae/pv34JS
  4. https://franknielsen.github.io/SPIG-LesHouches2020/Geomstats-SPIGL2020.pdf
  5. https://www.cmu.edu/biolphys/deserno/pdf/diff_geom.pdf
  6. https://math.ucr.edu/home/baez/information/information_geometry_1.html
  7. https://mathworld.wolfram.com/RiemannianMetric.html

Please note that the links at the end are archived. http://web.archive.org.
