Basics of probability theory for future programmers and data analysts

Probability theory is an important tool that helps create random numbers for simulations and cryptography, improve algorithms and data structures, and develop accurate models for machine learning.

Together with Ilya Kotov, a Data Science specialist at MTS Big Data, we examine the basic concepts of probability theory and tell you where and how it is used.

Content

  1. What is probability theory

  2. Probability

  3. Random event

  4. Probability distributions

  5. Random variable

  6. What programming languages are used to work with probability theory?

  7. How are random numbers generated?

  8. Where is probability theory used?

  9. How to study probability theory

What is probability theory

Probability theory is a branch of mathematics that studies random events and looks for patterns in them.

It uses mathematical models to describe random processes, such as lottery draws or weather changes, allowing experts to analyze and predict outcomes under uncertainty.

Let's look at the basic concepts of probability theory.

Probability

Probability is a numerical measure of how likely a particular event is to occur. It is expressed as a number from 0 to 1, where 0 means the event is impossible and 1 means the event will definitely happen.

For example, when tossing a coin, the probability of landing on each side is 0.5, or 50%.

Random event

A random event is a possible outcome of a random experiment: it may or may not happen. A random experiment is an action whose outcome cannot be predicted exactly because it depends on random factors.

For example, precipitation tomorrow is a random event, and meteorologists estimate its probability by analyzing statistical data.

Weather Widget from Microsoft


Probability distributions

A probability distribution shows how likely each of the different outcomes of a random experiment is.

There are different mathematical models, or types of distributions, which describe this probability. The choice of a particular type depends on what data is being analyzed or what situation is being modeled. They are divided into two groups: discrete and continuous.

Discrete distributions

Discrete distributions describe events with a finite or countable number of possible outcomes, that is, cases where the results can be listed one by one.

Example: A roll of a die can produce one of six possible results, from 1 to 6. Each result can be clearly defined and counted.

Examples of discrete distributions:

Uniform distribution – one of the simplest distributions in which all outcomes have the same probability.

For example, when tossing a coin, the probability of getting heads or tails is 0.5.

Uniform distribution histogram.  Source

Binomial distribution – used to model the number of successful outcomes in a series of independent trials with two possible outcomes: success or failure.

Example: the probability of getting heads exactly 3 times in 5 coin tosses.
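The coin example can be checked with the binomial formula P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). This is a minimal sketch that assumes a fair coin (p = 0.5); the function name binomial_pmf is our own and does not come from a library.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 3 heads in 5 tosses of a fair coin (assumed p = 0.5)
prob_three_heads = binomial_pmf(3, 5, 0.5)
print(prob_three_heads)  # 0.3125
```

So three heads out of five tosses is expected in roughly 31% of such series.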

Histogram of the binomial distribution. Source


Poisson distribution – used to determine the probability of how many times a certain event will occur in a fixed period of time or in a given space.

This distribution is used when events occur independently of each other and with a constant average frequency.

Example: how many calls a call center receives per hour.
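To make the call center example concrete, the sketch below evaluates the Poisson formula P(X = k) = λ^k · e^(−λ) / k!. The average of 10 calls per hour is a made-up figure for illustration, and poisson_pmf is our own helper name.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average count per interval is lam."""
    return lam**k * exp(-lam) / factorial(k)

# Hypothetical call center: 10 calls per hour on average
average_calls = 10
for calls in (5, 10, 15):
    print(calls, round(poisson_pmf(calls, average_calls), 4))
```

The probabilities peak near the average rate and fall off on both sides of it.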

Histogram of the Poisson distribution. Source


Geometric distribution – describes how many attempts are required to achieve the first success. The probability of success must be the same in every attempt.

For example, how many times do you have to toss a coin before it lands on heads?
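For the coin question, the probability that the first success arrives on attempt k is (1 − p)^(k − 1) · p. The sketch below assumes a fair coin; geometric_pmf is a name we introduce for illustration.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Probability that the first success happens on attempt k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p

# Fair coin: chance that the first heads appears on toss 1, 2 or 3
first_heads = [geometric_pmf(k, 0.5) for k in (1, 2, 3)]
print(first_heads)  # [0.5, 0.25, 0.125]

# The expected number of tosses until the first heads is 1 / p
print(1 / 0.5)  # 2.0
```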

Histogram of geometric distribution.  Source


Hypergeometric distribution – describes the probability of finding an exact number of specific objects (K) in a random sample of fixed size (n) drawn without replacement from a finite population (N).

Example: what is the probability of drawing exactly two aces if we select five cards from a deck?
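The card example can be computed by counting combinations. The sketch below assumes a standard 52-card deck with 4 aces, which the text does not state explicitly; the function and parameter names are our own.

```python
from math import comb

def hypergeom_pmf(hits: int, population: int, marked: int, sample: int) -> float:
    """Probability of exactly `hits` marked objects in a sample drawn
    without replacement from a population containing `marked` marked objects."""
    return comb(marked, hits) * comb(population - marked, sample - hits) / comb(population, sample)

# Assumed standard deck: 52 cards, 4 aces, 5 cards drawn
prob_two_aces = hypergeom_pmf(hits=2, population=52, marked=4, sample=5)
print(round(prob_two_aces, 4))  # 0.0399
```

Under these assumptions, exactly two aces appear in about 4% of five-card hands.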

Histogram of hypergeometric distribution.  Source


Continuous distributions

Continuous distributions describe events that can have an infinite number of outcomes. The outcomes cannot be listed one by one because they are not limited to specific values and can be any number within a given interval.

These distributions are used to analyze data and model situations where outcomes are measured by continuous quantities such as weight, height, time, temperature, etc.

Examples of continuous distributions:

Normal (Gaussian) distribution – the data is distributed in a bell shape: most values are clustered around the average.

For example, if we take the height of people, then the majority will be about average, and there will be few very tall or very short people.
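The height example can be quantified with Python's standard statistics.NormalDist class. The mean of 170 cm and the standard deviation of 10 cm are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical parameters: mean height 170 cm, standard deviation 10 cm
heights = NormalDist(mu=170, sigma=10)

# Share of people within one standard deviation of the mean (160 to 180 cm)
near_average = heights.cdf(180) - heights.cdf(160)
# Share of people taller than 200 cm (three standard deviations above the mean)
very_tall = 1 - heights.cdf(200)

print(round(near_average, 3))  # 0.683
print(round(very_tall, 4))     # 0.0013
```

About two thirds of the values fall close to the average, while extreme values are rare.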

Normal distribution histogram.  Source


Exponential distribution – describes the time between events that occur independently at a constant average rate. Unlike the normal distribution, it is not bell-shaped: short intervals are the most likely, and the probability decreases as the interval grows.

For example, the waiting time until the next bus arrives at a stop.

Exponential distribution histogram. Source


Chi-square distribution – used to analyze differences between expected and observed data.

Example: A company surveys customers to find out which of two product packaging designs they prefer and assumes that 70% of customers prefer the new design. If the differences between expected and actual values are minimal, the company will release the product only in new packaging.
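A rough sketch of how such a comparison could be computed is shown below. The survey size (100 customers) and the observed split (64 versus 36) are invented numbers; the statistic is Pearson's chi-square, and 3.841 is the standard critical value for one degree of freedom at the 5% significance level.

```python
# Hypothetical survey of 100 customers
observed = {"new": 64, "old": 36}
# Expected counts under the assumption that 70% prefer the new design
expected = {"new": 70, "old": 30}

# Pearson's chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum(
    (observed[k] - expected[k]) ** 2 / expected[k] for k in observed
)

# Critical value for 1 degree of freedom at the 5% significance level
CRITICAL_VALUE = 3.841

print(round(chi_square, 3))  # 1.714
print("consistent with 70%" if chi_square < CRITICAL_VALUE else "differs from 70%")
```

With these made-up numbers the statistic stays below the critical value, so the survey does not contradict the 70% assumption.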

The probability density function shows the relative likelihood of each of the values on the horizontal axis. Source


Fisher distribution (F-distribution) – helps to compare the variability of groups of data and check whether the differences between the groups are significant.

Example: A university uses two different teaching methods and wants to find out whether there are statistically significant differences between the results of two groups of students, each of whom studied using a different method.

F-distribution density function.  Source



Random variable

A variable that takes on a value depending on the outcome of a random experiment. It can be of two types: discrete and continuous.

Characteristics of a random variable

Expected value

The average value of a random variable that it takes over the long term when the experiment is repeated many times.

  1. For a discrete random variable

    Formula:

    $$M(X) = \sum_{i=1}^{n} x_i \cdot P(X = x_i)$$

    Where:
    $M(X)$ is the mathematical expectation of the random variable $X$,
    $\sum_{i=1}^{n}$ is the sum over all possible values $x_i$ of the random variable,
    $P(X = x_i)$ is the probability that the random variable $X$ takes the value $x_i$,
    $n$ is the number of different values of the random variable.

    Example: Let's say we have a six-sided die. The mathematical expectation of a roll on the die is equal to the average of all possible rolls.

    It looks like this:

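    $$M(X) = \frac{1}{6}\cdot 1 + \frac{1}{6}\cdot 2 + \frac{1}{6}\cdot 3 + \frac{1}{6}\cdot 4 + \frac{1}{6}\cdot 5 + \frac{1}{6}\cdot 6 = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$$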

    Since each roll is independent and the probabilities of each value being rolled are the same, then in the long run over a large number of rolls we expect the average value to tend towards 3.5.

    In this example, the probabilities of getting each value are equal, since the distribution is uniform. In general, P(X = x_i) must be calculated based on the type of probability distribution.

  2. For a continuous random variable

    The formula for calculating the mathematical expectation is slightly different. Instead of a sum, an integral is used, since continuous random variables take values on a continuous interval.

    Formula:

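    $$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

    Where:
    $E(X)$ is the mathematical expectation of the random variable $X$ (written $M(X)$ in the discrete case),
    $x$ is the value of the random variable,
    $f(x)$ is the probability density function (it depends on the type of probability distribution),
    $dx$ denotes an infinitesimal increment (change) of $x$.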

    Example: we study the waiting time for a bus at a bus stop. Let's assume that the waiting time until the next bus arrives is described by an exponential distribution with intensity (rate) parameter λ = 0.2 per minute. In other words, on average a bus arrives every 5 minutes.

    The probability density function for the exponential distribution is:

    $$f(x; \lambda) = \lambda e^{-\lambda x} = 0.2 e^{-0.2x}, \quad x \ge 0$$

    Now let’s calculate the mathematical expectation E(X), where X is the waiting time until the next bus arrives:

    $$E(X) = \int_{0}^{\infty} x \cdot 0.2 e^{-0.2x}\,dx = \frac{1}{0.2} = 5$$

    The mathematical expectation of time until the next bus arrives is 5 minutes.
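Both expectation examples can be verified numerically. The sketch below sums over the die faces for the discrete case and approximates the integral for the bus example with a simple midpoint rule; the step size and the upper limit of 200 minutes are our own choices, made because the tail beyond that point is negligible.

```python
from fractions import Fraction
from math import exp

# Discrete case: fair six-sided die, M(X) = sum of x_i * P(X = x_i)
die_expectation = sum(x * Fraction(1, 6) for x in range(1, 7))
print(float(die_expectation))  # 3.5

# Continuous case: E(X) = integral of x * f(x) dx with f(x) = 0.2 * exp(-0.2 * x)
RATE = 0.2

def density(x: float) -> float:
    """Exponential probability density with rate RATE."""
    return RATE * exp(-RATE * x)

# Midpoint rule on [0, 200]; the tail beyond 200 minutes is negligible
STEP = 0.01
bus_expectation = sum(
    (i + 0.5) * STEP * density((i + 0.5) * STEP) * STEP for i in range(20_000)
)
print(round(bus_expectation, 3))  # 5.0
```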

Variance

A measure of how far the values of a random variable are spread around its mean value.

$$D(X) = M\left[(X - M(X))^2\right]$$

Example: we have a random variable – the number of visitors to the store per day. On average, 100 people come to the store per day, and the number of visitors can vary from 80 to 120 per day.

If we imagine that the probability of attendance for 80, 100 and 120 people is the same, then we can calculate the variance as follows:

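$$D(X) = (80-100)^2 \cdot P(80) + (100-100)^2 \cdot P(100) + (120-100)^2 \cdot P(120) = 800 \cdot \frac{1}{3} \approx 266.67$$

Here each probability equals 1/3. The weighted sum is not divided by the number of values, because the probabilities already act as weights.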
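The store example can be reproduced in a few lines using the definition D(X) = M[(X − M(X))²], where every squared deviation is weighted by its probability. This is a minimal sketch with the three equally likely values from the example; the variable names are our own.

```python
# Three equally likely visitor counts
values = [80, 100, 120]
probabilities = [1 / 3, 1 / 3, 1 / 3]

# M(X): probability-weighted mean
mean = sum(x * p for x, p in zip(values, probabilities))
# D(X): probability-weighted squared deviation from the mean
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probabilities))

print(round(mean, 2))             # 100.0
print(round(variance, 2))         # 266.67
print(round(variance ** 0.5, 2))  # standard deviation: 16.33
```

The square root of the variance, the standard deviation, is expressed in the same units as the data: about 16 visitors per day.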

What programming languages are used to work with probability theory

Python

Python is one of the most popular languages for working with probability theory. It is easy to use and has many libraries and frameworks that simplify work with probabilistic and statistical models.

Main libraries:

  • NumPy — basic mathematical and statistical functions;

  • SciPy – advanced mathematical and statistical functions;

  • Pandas — working with tabular data and time series;

  • Matplotlib and Seaborn – data visualization;

  • Statsmodels — statistical modeling;

  • Scikit-learn – machine learning and modeling.

R

R is specifically designed for statistical analysis and data visualization.

Main libraries:

MATLAB

MATLAB is used in engineering and scientific research for mathematical calculations. It contains various functions for probability calculation, statistical analysis and machine learning. It is a commercial platform that requires a paid license, so beginner programmers may want to consider more affordable options.

Main tool: Statistics and Machine Learning Toolbox, a library for statistical analysis and machine learning.


How random numbers are generated

To represent the result of a random variable, random numbers are needed. The process of creating them is called generation. Random number generation is divided into two main types: truly random and pseudo-random numbers.

Truly random numbers

They are created using special devices that use unpredictable physical processes: atmospheric noise, radioactive decay, and so on.

Examples of true random number generators (TRNG):

  • Intel RDRAND is a hardware instruction built into Intel processors that returns random numbers derived from an on-chip entropy source based on thermal noise.

  • Random.org is a web service that uses atmospheric noise to generate truly random numbers. It provides an API for accessing its services.

Example of using Random.org via API in Python

Let's get 5 truly random numbers in the range from 1 to 100:

import requests

def get_true_random_numbers(count, min_value, max_value):
    """Fetch `count` random integers from min_value to max_value via Random.org."""
    url = f"https://www.random.org/integers/?num={count}&min={min_value}&max={max_value}&col=1&base=10&format=plain&rnd=new"
    # A timeout keeps the call from hanging if the service is unreachable
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        # The plain-text response contains one integer per line
        numbers = list(map(int, response.text.strip().split()))
        return numbers
    else:
        raise Exception("Error fetching random numbers")

true_random_numbers = get_true_random_numbers(5, 1, 100)
print(true_random_numbers)

Where TRNGs are used: in areas with very high security requirements, such as data encryption or precision experiments in scientific research.

Pseudorandom numbers

Pseudorandom numbers are generated by algorithms. They appear random, but the sequence can be reproduced if the initial value, or “seed”, is known.

Examples of pseudorandom number generators (PRNG):

  • Linear congruential generator (LCG): one of the simplest generators, and a fast one.

    Useful for modeling and mathematical statistics. Not safe for use in cryptography. Cryptanalysis specialists will be able to reconstruct the sequence of random numbers.

    Example usage in Python:

    In Python, you can implement an LCG manually using the basic syntax of the language. You can define a function that will generate a sequence of pseudo-random numbers.

    from typing import Generator

    def lcg(modulus: int, a: int, c: int, initial_value: int) -> Generator[int, None, None]:
        """Linear congruential generator (LCG)."""
        current_value = initial_value
        while True:
            # Next state: X_{n+1} = (a * X_n + c) mod m
            current_value = (a * current_value + c) % modulus
            yield current_value

    # Generate 10 pseudorandom numbers
    lcg_generator = lcg(modulus=2**32, a=1664525, c=1013904223, initial_value=42)
    for _ in range(10):
        print(next(lcg_generator))

    NumPy also provides the class numpy.random.RandomState. It is based on the Mersenne Twister algorithm rather than an LCG, but it illustrates the same idea of a reproducible sequence driven by a seed:

    import numpy as np

    # Create a generator instance with a fixed seed
    rng = np.random.RandomState(seed=42)

    # Generate 10 random numbers in the interval [0, 1)
    random_numbers = rng.rand(10)
    print(random_numbers)

  • Mersenne Twister: one of the most popular generators, with good randomness quality and a long period before the sequence repeats.

    Also not intended for use in cryptography, but effective in modeling, simulation, testing.

    How to use in Python: Mersenne Twister is provided in the random standard library. The functions available are random.random(), random.randint(), random.uniform() and others.
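The reproducibility of a seeded generator can be seen with the standard random module. In this small sketch the seed value 42 and the dice-roll scenario are arbitrary choices for illustration.

```python
import random

# Two generators seeded with the same value produce identical sequences
first = random.Random(42)
second = random.Random(42)

first_rolls = [first.randint(1, 6) for _ in range(5)]
second_rolls = [second.randint(1, 6) for _ in range(5)]

print(first_rolls == second_rolls)  # True
```

This is why a fixed seed is convenient for tests and simulations, and also why such generators are unsuitable for secrets: anyone who learns the seed can replay the sequence.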

Where PRNGs are used: in computer games, data science, modeling, simulation and other fields.

Cryptographically secure random number generators

Cryptography uses special cryptographically secure pseudorandom number generators (CSPRNGs) to create strong keys and passwords.

CSPRNGs undergo various tests to ensure their reliability and security:

  • Analysis of randomness: statistical tests check that the numbers are uniformly distributed, uncorrelated and unpredictable.

  • Evaluation in cryptographic protocols: generators are tested in the context of cryptographic protocols to ensure their resistance to attacks.

  • Data encryption and decryption: the reliability of a CSPRNG is further verified through data encryption and decryption to assess whether the integrity and confidentiality of information are preserved.

  • Entropy analysis: checks the randomness and unpredictability of the numbers generated by a CSPRNG. The higher the entropy, the harder it is to predict the next number in the sequence, making the system more secure.

Examples of CSPRNGs:

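  • Yarrow algorithm – uses multiple entropy pools and a combining mechanism to ensure a high degree of randomness and security.

    Yarrow is not part of the Python standard library. The snippet below is only a sketch of how such a generator could be used: the PythonCryptographyToolkit module, the Yarrow class and its next_bit() method are illustrative names, not a verified API.

    We create a Yarrow random number generator, then generate and output random bits in an infinite loop:

    # Illustrative sketch: the module, class and method names are hypothetical
    from PythonCryptographyToolkit import Yarrow

    y = Yarrow()
    while True:
        print(y.next_bit())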

  • Fortuna – addresses some of Yarrow's shortcomings and offers a more flexible and secure architecture.

    The example below uses the Python library Fortuna. Note that this package appears to be a general-purpose random value toolkit that shares the algorithm's name, so check its documentation before relying on it for cryptographic purposes:

    from Fortuna import RandomValue, front_linear, back_linear

    random_car = RandomValue(("Toyota", "Honda", "Ford", "BMW", "Audi"), zero_cool=front_linear)
    random_vehicle = RandomValue((lambda: f"Car, {random_car()}", "Bicycle", "Motorcycle", "Bus", "Train"), zero_cool=back_linear)
    print(random_vehicle())

    This code will output a random value from a set of vehicles and car brands. For example, possible outputs could be “Car, Toyota”, “Bicycle”, “Motorcycle”, “Bus”, or “Train”.
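In everyday Python code, cryptographically secure random values are usually taken from the operating system's generator through the standard secrets module. The sketch below is not an implementation of Yarrow or Fortuna; it only shows that interface, and the token length and alphabet are arbitrary choices.

```python
import secrets

# 128-bit random token as 32 hexadecimal characters, e.g. for a session key
token = secrets.token_hex(16)
# Random integer in the range [0, 100)
number = secrets.randbelow(100)
# Random choice that is safe to use for security-sensitive decisions
symbol = secrets.choice("ABCDEFGHJKLMNPQRSTUVWXYZ23456789")

print(token, number, symbol)
```

Unlike the seeded generators above, these values cannot be reproduced by setting a seed, which is the desired property for keys and passwords.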

Briefly: what are the differences between a TRNG and a PRNG?

| Criterion | Truly random | Pseudo-random |
|---|---|---|
| Source of randomness | Physical processes | Mathematical algorithms |
| Predictability | Unpredictable and non-reproducible | Reproducible with a known “seed” |
| Performance | Slower and less productive | Generates faster and requires fewer computing resources |
| Reproducibility | Cannot reproduce a specific scenario | You can repeat the same scenarios |
| Degree of randomness | Cannot be controlled | Can be controlled for greater predictability |


Where is probability theory used?

Cryptography

Probability theory helps protect data from hacking. Here are a few ways it is used:

Encryption and decryption. Source

Machine learning

Probability theory is used in reinforcement learning to find the best ways to act in an environment where results are unpredictable.

An example is the CoinRun training environment, in which the model tries out different strategies to learn how to achieve goals, such as collecting coins.

An example of how the model works in the CoinRun learning environment.  Source


Probabilistic models can also predict the results of certain actions based on input data, as well as assess whether an object belongs to a certain class and recognize speech.

How it works: For example, speech recognition systems model a sequence of sounds as a series of states with probabilistic transitions between them. This way the system learns to predict the next sound based on the previous ones.
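The idea of states with probabilistic transitions can be sketched as a tiny first-order Markov chain. The sounds and transition probabilities below are invented for illustration; real speech recognition systems use far richer models.

```python
# Hypothetical transition probabilities between sounds (each row sums to 1)
transitions = {
    "h": {"e": 0.7, "a": 0.2, "o": 0.1},
    "e": {"l": 0.6, "n": 0.3, "r": 0.1},
    "l": {"l": 0.5, "o": 0.4, "e": 0.1},
}

def most_likely_next(sound: str) -> str:
    """Return the sound with the highest transition probability."""
    options = transitions[sound]
    return max(options, key=options.get)

def sequence_probability(sounds: str) -> float:
    """Probability of a whole sequence, given its first sound."""
    prob = 1.0
    for current, following in zip(sounds, sounds[1:]):
        prob *= transitions[current][following]
    return prob

print(most_likely_next("h"))                    # e
print(round(sequence_probability("hello"), 3))  # 0.084
```

Multiplying the transition probabilities along a path gives the probability of the whole sequence, which lets a system compare competing interpretations of the same audio.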

Data Analysis and Data Science

Analysis of user data on social networks. Probability theory helps us study people's behavior online to understand how they interact with each other and what content they will be interested in.

For example, using a random walk model, you can understand how information moves from one user to another.

Random walk simulator.  Source

Time series forecasting in business analytics. These are data analysis methods that predict future prices or values of instruments such as stocks, bonds or currencies based on their past performance. Examples include the ARIMA model and the Monte Carlo method.

ARIMA model for making forecasts.  Source


Analysis of customer activity in online retail. Tracking customer activity on your website helps improve your website experience and increase sales. For example, a logistic regression model estimates the likelihood that a customer will buy a certain product.

Video games

Many games use random numbers to make things unpredictable, such as enemies spawning, items dropping, or random gameplay scenarios. These numbers usually come from a pseudorandom number generator.

Such events can be random even before the player makes a decision. For example, the location of objects in sandbox games. Other random events occur after the player has made a choice. For example, the chance to hit an enemy or dodge an attack.

Construction simulator RimWorld with random location of objects.  Source



How to learn probability theory

Advanced topics such as stochastic processes and modeling are usually taught in engineering and economics departments. During their studies, students not only learn the basics of theory, but also learn to work with data, which is important for working in the field of artificial intelligence.

Also, studying at a university is an opportunity to participate in scientific projects. This helps to better understand the subject and gain experience for a future career in science and technology.

Useful materials on the topic:

  • Probability and Statistics, Khan Academy – a free course from Khan Academy with interactive videos and quizzes. It consists of 16 modules and covers topics such as conditional probability, combinatorics, probability distributions and descriptive statistics.

  • Introduction to Probability, Dimitri P. Bertsekas and John N. Tsitsiklis – an introductory textbook in English that is used at the Massachusetts Institute of Technology. It covers both the basics of probability theory (probabilistic models, discrete and continuous random variables) and more advanced topics such as Bayesian inference and the Bernoulli and Poisson processes.

  • “Probability theory and mathematical statistics”, N. Sh. Kremer – a textbook on the fundamentals of probability theory and mathematical statistics with guidance on solving problems. It includes examples of use in financial market models and production processes.


Deep knowledge of probability theory can be obtained in the new Master's program in Business Informatics from RUDN and Skillfactory. Students will learn to apply mathematical models to analyze data and solve real problems, and to independently design corporate systems and customize them for clients' business processes.
