Obtaining the probability density curve of a random process … faster and more accurately

Given: a set of N values of a random variable X, sampled from some continuous distribution. The task is to estimate the probability density function of the random variable from this sample. Only native Python functions may be used to solve the problem; matplotlib is used for plotting, and a pandas DataFrame serves as the data container, following the original article (although in general you could do without it).

Data preparation

We will test the methods on data generated from four distributions with known analytical density expressions, as in the original article: Rayleigh, gamma, Weibull, and exponential. For this we use the code of MilashchenkoEA with minor changes (for convenience of further use, the signatures of the functions that return the values of the analytical probability density function on a given array of values of the random variable have been changed):

Code
import random
import math
import matplotlib.pyplot as plt
import pandas
import pandas as pd
import numpy as np  # needed to compute the metrics
from time import perf_counter  # needed to measure execution time


def rel_pdf(rel_sigma: float, X: list) -> pandas.DataFrame:
    """
    Вычисляет Релеевскую кривую плотности распределения вероятности по известной формуле
    :param rel_sigma: среднеквадратическое отклонение
    :param X: координаты по оси абсцисс
    :return: pandas.DataFrame
    """
    pdf_y = []  # Координаты по оси ординат
    for x in X:
        pdf_y.append(((2 * x) / rel_sigma) * math.exp((-x ** 2) / rel_sigma))
    return pd.DataFrame({'x': X, 'y': pdf_y})


def rel_rand(n: int, rel_sigma: float) -> list:
    """
    Генерирует случайные числа с Релеевской плотностью распределения вероятности
    :param n: количество отсчетов
    :param rel_sigma: среднеквадратическое отклонение
    :return: list
    """
    rel_list = []
    for i in range(n):
        rel_list.append((rel_sigma / math.sqrt(2)) * math.sqrt(-2 * math.log(random.uniform(0, 1))))
    return rel_list


def gam_pdf(v: float, b: float, X: list) -> pandas.DataFrame:
    """
    Вычисляет кривую плотности Гамма распределения вероятности по известной формуле
    :param v: параметр формы
    :param b: масштабный коэффициент
    :param X: координаты по оси абсцисс
    :return: pandas.DataFrame
    """
    pdf_y = []  # Координаты по оси ординат
    for x in X:
        pdf_y.append(((b ** v) / math.gamma(v) * (x ** (v - 1)) * math.exp(-b * x)))
    return pd.DataFrame({'x': X, 'y': pdf_y})


def gam_rand(n: int, v: float, b: float) -> list:
    """
    Генерирует случайные числа с гамма распределением плотности вероятности
    :param n: количество отсчетов
    :param v: параметр формы
    :param b: масштабный коэффициент
    :return: list
    """
    gam_list = [random.gammavariate(v, 1 / b) for i in range(n)]
    return gam_list


def weib_pdf(a: float, b: float, X: list) -> pandas.DataFrame:
    """
    Вычисляет кривую плотности распределения вероятности Вейбулла по известной формуле
    :param a: масштабный коэффициент
    :param b: параметр формы
    :param X: координаты по оси абсцисс
    :return: pandas.DataFrame
    """
    pdf_y = []  # Координаты по оси ординат
    for x in X:
        pdf_y.append((b / a) * ((x / a) ** (b - 1)) * math.exp(-(x / a) ** b))
    return pd.DataFrame({'x': X, 'y': pdf_y})


def weib_rand(n: int, a: float, b: float) -> list:
    """
    Генерирует случайные числа с распределением плотности вероятности Вейбулла
    :param n: количество отсчетов
    :param a: масштабный коэффициент
    :param b: параметр формы
    :return: list
    """
    wei_list = [random.weibullvariate(a, b) for i in range(n)]
    return wei_list


def exp_pdf(l: float, X: list) -> pandas.DataFrame:
    """
    Вычисляет кривую экспоненциальной плотности распределения вероятности по известной формуле
    :param l: обратный коэффициент масштаба
    :param n: количество рассчитанных точех
    :param X: координаты по оси абсцисс
    :return: pandas.DataFrame
    """
    pdf_y = []  # Координаты по оси ординат
    for x in X:
        pdf_y.append(l * math.exp(-l * x))
    return pd.DataFrame({'x': X, 'y': pdf_y})


def exp_rand(n: int, l: float) -> list:
    """
    Генерирует случайные числа с экспоненциальным распределением плотности вероятности
    :param n: количество отсчетов
    :param l: обратный коэффициент масштаба
    :return: list
    """
    exp_list = [random.expovariate(l) for i in range(n)]
    return exp_list

Let’s generate sets of random variables for each distribution, and organize them into a dictionary:

random_series = {
    'rrand': rel_rand(100000, 1),  # random numbers with a Rayleigh distribution
    'grand': gam_rand(100000, 0.5, 0.5),  # random numbers with a gamma distribution
    'wrand': weib_rand(100000, 1, 5),  # random numbers with a Weibull distribution
    'exprand': exp_rand(100000, 1.5)  # random numbers with an exponential distribution
}

Further on, these datasets will be used for testing, either whole or as slices containing N values of the random variable.

Optimizing the algorithm based on data binning

As the algorithm for estimating the distribution density, the author takes a fairly widespread method based on dividing the range of values of the variable X into a given number k of bins of equal width and counting the number of occurrences of sample values in each bin. The method is implemented in a number of libraries (numpy, scipy, pandas); in accordance with the problem conditions, the author also provides his own implementation (original comments preserved):

def pdf_original(k: int, rnd_list: list) -> pandas.DataFrame:
    """
    Получает кривую плотности распределения вероятности
    :param k: количечиво интервалов разбиения гистограммы
    :param rnd_list: случайный процесс
    :return: pandas.DataFrame
    """
    pdf_x = []  # Координаты по оси абсцисс
    pdf_y = []  # Координаты по оси ординат
    n = len(rnd_list)  # количество элементов в рассматриваемой выборке
    h = (max(rnd_list) - min(rnd_list)) / k  # ширина одного интервала
    a = min(rnd_list)  # минимальное значение в рассматриваемой выборке
    for i in range(0, k):  # проход по интервалам
        count = 0
        for j in rnd_list:  # подсчет количества вхождений значений из выборки в данный интервал
            if (a + i * h) < j < (a + (i * h) + h):
                count = count + 1
        pdf_x.append(a + i * h + h / 2)  # координата по оси абсцисс полученной кривой плотности распределения
        # вероятности
        pdf_y.append(count / (n * h))  # координата по оси ординат полученной кривой плотности распределения
        # вероятности
    d = {'x': pdf_x, 'y': pdf_y}
    return pd.DataFrame(d)

It is easy to see that the complexity of this algorithm is O(N · k): the outer loop runs over the k bins, and the inner one, in the worst case, runs over all N values. Let us roughly estimate the execution time of this function for various values of k and N:

for k in [100, 1000, 10000]:
    tic = perf_counter()
    _ = pdf_original(k, random_series['rrand'])
    print('%.3f s' % (perf_counter() - tic))

for N in [10000, 20000, 30000]:
    tic = perf_counter()
    _ = pdf_original(1000, random_series['rrand'][:N])
    print('%.3f s' % (perf_counter() - tic))

Output:

0.726 s
7.006 s
69.418 s

0.783 s
1.429 s
2.145 s

Of course, such discouraging results rule out using this implementation of the method while ready-made library implementations exist. However, it is possible to make the counting pass linear in N and independent of k (leaving only the O(N log N) preliminary sort), thereby significantly reducing the computation time. It suffices to sort the values beforehand and replace the inner loop over all values of the random variable with a loop over only those values that fall into the current bin:

def pdf_optimized(k: int, rnd_list: list) -> pandas.DataFrame:
    """
    Получает кривую плотности распределения вероятности
    :param k: количечиво интервалов разбиения гистограммы
    :param rnd_list: случайный процесс
    :return: pandas.DataFrame
    """
    pdf_x = []  # Координаты по оси абсцисс
    pdf_y = []  # Координаты по оси ординат
    n = len(rnd_list)  # количество элементов в рассматриваемой выборке
    h = (max(rnd_list) - min(rnd_list)) / k  # ширина одного интервала
    a = min(rnd_list)  # минимальное значение в рассматриваемой выборке
    rnd_list = sorted(rnd_list)  # сортируем значения
    j = 0  # индекс значения левой границы интервала 
    for i in range(0, k):  # проход по интервалам
            count = 0
        while j < n and (a + i * h) <= rnd_list[j] < (a + (i * h) + h):  # подсчитываем количество значений в k-м интервале
                count = count + 1
                j += 1
        pdf_x.append(a + i * h + h / 2)  # координата по оси абсцисс полученной кривой плотности распределения
        # вероятности
        pdf_y.append(count / (n * h))  # координата по оси ординат полученной кривой плотности распределения
        # вероятности
    d = {'x': pdf_x, 'y': pdf_y}
    return pd.DataFrame(d)

Let us run a similar timing test of the optimized function for different values of k and N, this time averaging over a number of repetitions:

N_repeats = 100
for k in [100, 1000, 10000]:
    tic = perf_counter()
    for _ in range(N_repeats):
        _ = pdf_optimized(k, random_series['rrand'])
    print('%.3f s' % ((perf_counter() - tic) / N_repeats))

for N in [25000, 50000, 100000]:
    tic = perf_counter()
    for _ in range(N_repeats):
        _ = pdf_optimized(10000, random_series['rrand'][:N])
    print('%.3f s' % ((perf_counter() - tic) / N_repeats))

Output:

0.035 s
0.034 s
0.039 s

0.013 s
0.021 s
0.038 s

Such results are much more pleasant to deal with, and a pure-Python implementation like this can be used when installing additional packages is not an option.
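When numpy is available anyway, the optimized function can be cross-checked against numpy.histogram, one of the library implementations mentioned above. A minimal sketch of such a check (mine, not part of the original article's comparison):

# Density estimate via numpy.histogram with density=True; the returned bin
# edges are converted to bin centers to match the 'x' column of pdf_optimized.
counts, edges = np.histogram(random_series['rrand'], bins=1000, density=True)
centers = (edges[:-1] + edges[1:]) / 2  # midpoints of neighboring edges

Up to the boundary effects discussed in the note below, the (centers, counts) pairs should closely match the DataFrame returned by pdf_optimized(1000, random_series['rrand']).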

Note:
The implementations above compare floating-point numbers with strict inequalities, which, generally speaking, is unreliable. Boundary effects may occur, since the minimum and maximum values of the random variable are not guaranteed to land in the first and last bins. The implementation should be slightly modified so that the minimum and maximum values are assigned to the first and last bins by default, and only the remaining values are binned by comparison.
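A minimal sketch of such a boundary-safe variant (mine; the function name pdf_binned_safe and its exact structure are assumptions, not code from the original article):

def pdf_binned_safe(k: int, rnd_list: list) -> pandas.DataFrame:
    """
    Boundary-safe variant of pdf_optimized: the minimum and maximum are
    assigned to the first and last bins unconditionally; only the remaining
    values are binned by comparison with the bin edges.
    """
    n = len(rnd_list)
    rnd_list = sorted(rnd_list)
    a, b = rnd_list[0], rnd_list[-1]
    h = (b - a) / k  # width of one bin
    counts = [0] * k
    counts[0] += 1   # the minimum always lands in the first bin
    counts[-1] += 1  # the maximum always lands in the last bin
    j = 1  # start after the minimum
    for i in range(k - 1):  # the last bin simply collects everything left over
        while j < n - 1 and rnd_list[j] < a + (i + 1) * h:  # below the right edge of bin i
            counts[i] += 1
            j += 1
    counts[-1] += (n - 1) - j  # remaining values (the maximum was already counted)
    return pd.DataFrame({
        'x': [a + i * h + h / 2 for i in range(k)],
        'y': [c / (n * h) for c in counts]})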

Computing the distribution density via the distribution function

Next, we consider a slightly different approach to estimating the distribution density of a random variable. Recall that the distribution density of a random variable is the first derivative of its distribution function, which in turn is the probability that the random variable takes a value less than or equal to a given one. To obtain an estimate of the distribution function, and from it the distribution density, the following steps are proposed:

  1. sort the values of the variable X in ascending order, obtaining the set of sorted values $\hat X$;

  2. assign to each value in the sorted array its ordinal number $i$, starting from zero; up to a constant factor, $i(\hat X_i)$ is an estimate of the distribution function of the random variable (Figure 1, blue curve);

  3. build a uniform grid of $k + 2$ values in the range from $X_{min}$ to $X_{max}$, where $k$ is the desired number of points on the density curve ($k < N$), the analogue of the number of histogram bins;

  4. interpolate the ordinal numbers from the grid of the ordered array of the variable's values onto the uniform grid obtained in step 3 (Figure 1, orange curve);

  5. numerically differentiate the interpolated function using neighboring points (this is why $k + 2$ values were needed) and divide by $N$, obtaining the desired density estimate (see the central-difference formula after this list).
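Written out explicitly (my restatement of step 5, matching the implementation below), the central difference over the uniform grid $x_0, \dots, x_{k+1}$ with step $\Delta x$ yields the density estimate at the $k$ interior points:

$$\hat f(x_j) \approx \frac{i(x_{j+1}) - i(x_{j-1})}{2\,\Delta x\,N}, \qquad j = 1, \dots, k,$$

where $i(x)$ is the interpolated ordinal number from step 4, and the division by $N$ normalizes the distribution function estimate to one.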

Figure 1. Blue curve: the estimate of the distribution function of a random variable, given on the original set of observations X containing N = 100 points; orange curve: the estimate of the distribution function obtained by interpolation onto a uniform grid containing k = 21 points.

Below is the implementation of the method:

def pdf_custom(k: int, rnd_list: list) -> pandas.DataFrame:
    """
    Computes the probability density curve
    :param k: desired number of points on the density curve (the analogue of the number of histogram bins)
    :param rnd_list: the random process
    :return: pandas.DataFrame
    """
    X = sorted(rnd_list)  # sort the values of the random variable
    N = len(X)
    i = 0
    dx = (X[-1] - X[0]) / (k + 2)  # argument step on the uniform grid
    result = []
    result_x = []
    for j in range(k + 2):  # iterate over the grid points
        x = X[0] + j * dx  # the corresponding argument value on the uniform grid
        result_x.append(x)
        while True:  # find the index i such that x lies between X[i] and X[i+1]
            if x > X[i + 1]:
                i += 1
            else:
                break
        result.append(i + (x - X[i]) / (X[i + 1] - X[i]))  # linear interpolation of the ordinal number
    norm = 0.5 / dx / N  # central-difference denominator and normalization by N
    d = {
        'x': result_x[1:-1],
        'y': [(right - left) * norm for right, left in zip(result[2:], result[:-2])]}
    return pd.DataFrame(d)

The complexity of the presented algorithm is likewise linear in N (not counting the preliminary sort). Timing it as before, we get the following values for k = [100, 1000, 10000] at N = 100,000, and for N = [25000, 50000, 100000] at k = 10,000:

0.020 s
0.020 s
0.024 s

0.009 s
0.014 s
0.024 s

The new method turned out to be somewhat faster, with the same algorithmic complexity.

Comparing the accuracy of the presented methods

To compare the accuracy of the presented methods, we will use the Kullback-Leibler divergence between the a priori density function P and the obtained estimate Q:

$$KL = \mathbb{E}\left[P \cdot \ln\frac{P}{Q}\right]$$

To calculate the metric, without further ado, we will use numpy:

def KL_dist(P: list, Q: list):
    assert len(P) == len(Q)
    eps = 1e-9  # guards against taking the logarithm of zero
    P = np.asarray(P)
    Q = np.asarray(Q)
    return np.mean(P * (np.log(P + eps) - np.log(Q + eps)))
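A quick sanity check of the metric (mine, not from the article): the divergence of a density with itself is exactly zero, and it grows as the estimate departs from the true density:

P = [0.2, 0.5, 0.3]
print(KL_dist(P, P))                # 0.0: identical distributions
print(KL_dist(P, [0.3, 0.4, 0.3]))  # a small positive value, ~0.01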

We obtain density estimates for the different prior distributions at different ratios k / N and compare the resulting values of the KL divergence:

from functools import partial

pdf_function = {
    'rrand': partial(rel_pdf, 1),
    'grand': partial(gam_pdf, 0.5, 0.5),
    'wrand': partial(weib_pdf, 1, 5),
    'exprand': partial(exp_pdf, 1.5)
} 

N = 10000
k_values = np.logspace(2, 4, 10, dtype=int)

metrics_optimized = {}
for key, val in random_series.items():
    for k in k_values:
        estimated_pdf = pdf_optimized(k, val[:N])
        theoretical_pdf = pdf_function[key](estimated_pdf['x'])
        metrics_optimized.setdefault(key, []).append(
            KL_dist(theoretical_pdf['y'], estimated_pdf['y']))

metrics_custom = {}
for key, val in random_series.items():
    for k in k_values:
        estimated_pdf = pdf_custom(k, val[:N])
        theoretical_pdf = pdf_function[key](estimated_pdf['x'])
        metrics_custom.setdefault(key, []).append(
            KL_dist(theoretical_pdf['y'], estimated_pdf['y']))

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
for key, val in metrics_optimized.items():
    plt.scatter(k_values / N, val)
plt.legend(metrics_optimized.keys())
plt.xlabel('k / N')
plt.ylabel('KL divergence')
plt.title('Histogram-based method')

plt.subplot(1, 2, 2)
for key, val in metrics_custom.items():
    plt.scatter(k_values / N, val)
plt.legend(metrics_custom.keys())
plt.xlabel('k / N')
plt.ylabel('KL divergence')
plt.title('Distribution-function-based method')
plt.show()

As expected, in both cases we obtain a monotonically increasing dependence of the metric on the ratio k / N (Figure 2). However, the method based on numerical differentiation of the distribution function estimate yields an order of magnitude smaller discrepancy between the density estimate and the theoretical curve than the histogram-based method, which is especially noticeable at large values of k / N.

Figure 2. Dependence of the Kullback-Leibler divergence between the prior distribution and its estimate on the k / N ratio, for the method based on data binning (left) and the method based on differentiating the estimate of the distribution function (right).

Conclusion

The proposed alternative method for estimating the distribution density of a random variable from a given set of observations is more accurate (and, as shown above, somewhat faster) than the histogram-based method.

I am grateful to MilashchenkoEA for discussing these problems and for persuading me to write this short note 🙂
