Refinement of percentiles using polynomial approximation

When a customer asks to compute percentiles for discrete values but wants exact results in the form of continuous quantities, the question arises whether this is possible. The answer is yes, it is, provided we use approximation.

Problem — with discrete data we cannot directly obtain continuous values that would reflect the final percentiles more accurately than the discrete ones.

Let's look at the problem with an example. Suppose we have generated a data set of 100 observations consisting of discrete values distributed from 1 to 10. We then calculate percentiles at three levels: 0.25, 0.50 and 0.75.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# fix the pseudo-random seed
np.random.seed(45)

# generate data distributed from 1 to 10
data = pd.DataFrame(data=list(np.random.randint(1, 11, size=100)), columns=['values'])

# find the length of the X axis
x_length = len(data['values'].value_counts())

# visualize the generated data
plt.hist(data, lw=2.5, edgecolor="#e5d5ff", color="#fbf9ff", bins=x_length)
plt.xticks(np.arange(1, x_length + 1, step=1))

# visualize the percentiles
for percentile in [0.25, 0.50, 0.75]:
    plt.axvline(
        x=data['values'].quantile(percentile), label=f'{percentile} percentile (not exact)', lw=2.5, color="#ffdb4d", linestyle="--")

# turn off the grid
plt.grid(visible=False)

# show the legend
plt.legend()

# show the plot
plt.show()

As we can see, the percentiles took the values 4, 7 and 9. But do they reflect the real probability distribution? No: the original values were discrete, so the percentiles also come out in discrete form.
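
For reference, the same discrete percentiles can be read directly from pandas without the plot; a quick check on the data generated above:

# discrete percentiles straight from pandas (same data and seed as above)
print(data['values'].quantile([0.25, 0.50, 0.75]))
# for this seed they land on the discrete values 4, 7 and 9 shown on the plot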

Solution — we will take advantage of approximation. It helps simplify complex mathematical expressions and processes so they are easier to study. To do this, we will find the polynomial that best describes the original probability distribution. That gives us an area under a curve, from which we can obtain refined percentile values.

But how do we find the best polynomial? We use the initial data set and the R² metric, which measures how well a polynomial model fits the data. The score can be interpreted as follows:

  • R² = 0: the model does not explain the variability of the data at all, i.e. its predictions do not correlate with the real data and the model is no better than a simple average.

  • R² = 1: the model perfectly explains the variability of the data, i.e. the predicted values coincide with the actual ones.

  • 0 < R² < 1: the model explains some of the variability in the data; the closer R² is to 1, the better the model fits the data.
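
For reference, the R² used here is the coefficient of determination written through the mean squared error of the predictions and the variance of the observed values, which is exactly what the r2() helper below computes:

R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)}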

# create a metric for evaluating the discrepancy
def r2(x, y):
    return 1 - (np.mean((x - y) ** 2) / np.var(x))

# find the best polynomial according to the metric
def get_best_polynome(x, y):
    
    best_result = 0
    best_polynome = 0
    
    # start from 1, because a polynomial of degree 0 makes no sense
    for degree in range(1, 11):
        
        # build a polynomial of the given degree
        polynome = np.poly1d(np.polyfit(x, y, degree))
        
        # compute the metric for the current polynomial
        results = r2(y, polynome(x))
        
        # update the best score and polynomial if we found a better option
        if results > best_result and results < 1:
            best_result = results
            best_polynome = polynome
    
    return best_polynome, best_result

# build the X axis
x_data = range(1, x_length + 1)

# build the Y axis (sort the index in case the values come out of order)
y_data = list(data['values'].value_counts().sort_index())

# find the polynomial that describes the probability distribution best
best_polynome, best_score = get_best_polynome(x_data, y_data)

print(f'Best polynomial degree: {best_polynome.order}\nBest score: {str(best_score)[:4]}')

# Best polynomial degree: 8
# Best score: 0.99

As you can see, the polynomial that describes the available data best turned out to be of degree 8. Put simply, the degree of a polynomial determines how many bends the curve can have, and each extra bend lets it follow the data more closely. If the polynomial were of degree one, the result would be a straight line from point A to point B without a single bend.
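
To make the intuition concrete, here is a short sketch (reusing x_data, y_data and the r2() helper defined above) that compares a straight-line fit with the chosen degree-8 fit; the exact scores depend on the generated data:

# compare a straight line (degree 1) with the selected degree-8 polynomial
for degree in (1, 8):
    fit = np.poly1d(np.polyfit(x_data, y_data, degree))
    print(f'degree {degree}: R^2 = {r2(np.array(y_data), fit(np.array(x_data))):.3f}')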

Let's plot a graph to visually evaluate how well the polynomial found earlier describes the initial data. To make the line as smooth as possible and the percentile results more accurate, it is better to use a large number of points: with only 10 points, for example, you would get a broken line, and the percentile results would be more likely to land on its break points.

# compute the polynomial coefficients and create a polynomial object from them
coefficients = np.poly1d(np.polyfit(x_data, y_data, best_polynome.order))

# generate 10000 evenly spaced points from 1 to 10
polyline = np.linspace(1, x_length, 10000)

# visualize the generated data
plt.hist(data, lw=2.5, edgecolor="#ffffff", color="#fbf9ff", bins=x_length)
plt.xticks(np.arange(1, x_length + 1, step=1))

# scatter plot showing the count of each value
plt.scatter(x=x_data, y=y_data, label='Original data', color="#ffdb4d")

# line showing the polynomial approximation
plt.plot(polyline, coefficients(polyline), lw=2.5, label='Fitted polynomial', color="#c2a4f4")

# turn off the grid
plt.grid(visible=False)

# show the legend
plt.legend()

# show the plot
plt.show()

The 8th-degree polynomial describes the initial data very well, which means we can move on to the final stage: finding the percentiles. Let's look at the algorithm in more detail:

  1. First, we need to compute the value of the polynomial at a point x; this is done by function(). Inside a for loop it sums the contribution of each term of the polynomial, using the coefficients object, which holds the fitted coefficients.

  2. Next, we normalize function() and wrap the result in normalized_function(), so that the integral of the function over the given interval equals 1. First, function() is integrated from lower_limit to upper_limit to find the normalizing constant; then function() is divided by this constant, so that the total probability (the integral of the function) equals 1.

  3. We compute the cumulative distribution function (CDF) at a point x, which is the probability that the random variable is less than or equal to x.

  4. Finally, we use a numerical method (Brent's algorithm, brentq) to find the x at which CDF(x) equals the requested percentile. In other words, we find the x at which the integral of the normalized function from lower_limit to x equals the percentile level. The formulas after this list summarize these steps.
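
In formula form, with f(x) the fitted polynomial and [a, b] the interval [lower_limit, upper_limit], the three quantities are:

\hat{f}(x) = \frac{f(x)}{\int_a^b f(t)\,dt}, \qquad \mathrm{CDF}(x) = \int_a^x \hat{f}(t)\,dt, \qquad \text{find } x_p \text{ such that } \mathrm{CDF}(x_p) = p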

from scipy.integrate import quad
from scipy.optimize import brentq

# compute the value of the polynomial at point x
def function(x):
    
    result = 0
    
    # sum the contribution of each term of the polynomial
    for i in range(len(coefficients) + 1):
        
        result = result + coefficients[i] * x ** i
    
    return result

# normalize the function so that its integral over the given interval equals 1
def normalized_function(x):
    
    integration, error = quad(function, lower_limit, upper_limit)
    
    result = (1 / integration) * function(x)
    
    return result

# compute the cumulative distribution function at point x
def CDF(x):
    
    integration, error = quad(normalized_function, lower_limit, x)
    
    return integration

# find the value corresponding to the requested percentile
def get_percentile(percentile):
    
    result = brentq(
        lambda x: CDF(x) - percentile, lower_limit, upper_limit)
    
    return result

lower_limit = 1
upper_limit = 10

print(f'25th percentile: {get_percentile(0.25)}') # 4.148070830469738
print(f'50th percentile: {get_percentile(0.50)}') # 6.635723812105114
print(f'75th percentile: {get_percentile(0.75)}') # 8.16891812557873
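
As a quick sanity check (reusing the functions defined above), the normalized polynomial should carry all of the probability mass on the interval, and plugging a refined percentile back into the CDF should return the requested level:

# the full integral of the normalized polynomial should be (close to) 1
print(CDF(upper_limit))           # expected to be approximately 1.0
# and the refined median should map back to the 0.50 level
print(CDF(get_percentile(0.50)))  # expected to be approximately 0.50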

As a result, from the initial discrete percentiles of 4, 7, and 9 we obtained more accurate continuous values of roughly 4.15, 6.64, and 8.17 by using approximation. Let's plot a graph to visually compare the initial percentiles with the refined ones.

# visualize the generated data
plt.hist(data, lw=2.5, edgecolor="#ffffff", color="#fbf9ff", bins=x_length)
plt.xticks(np.arange(1, x_length + 1, step=1))

# line showing the polynomial approximation
plt.plot(polyline, coefficients(polyline), lw=2.5, color="#c2a4f4")

for percentile in [0.25, 0.50, 0.75]:
    plt.axvline(
        x=get_percentile(percentile), label=f'{percentile} percentile (exact)', lw=2.5, color="#ade6b6", linestyle="-")
    plt.axvline(
        x=data['values'].quantile(percentile), label=f'{percentile} percentile (not exact)', color="#ffdb4d", linestyle="--")

# turn off the grid
plt.grid(visible=False)

# show the legend
plt.legend()

# show the plot
plt.show()

Conclusion

We explored how polynomial approximation can be used to build a smooth function that closely describes the data and allows for more accurate percentile calculations. We went through several key steps along the way:

  1. Polynomial Construction: We learned how to determine which polynomial best fits the data using the R² coefficient of determination.

  2. Normalizing a Polynomial: In order for a polynomial to be used as a probability density function (PDF), we normalized it, which ensured that the integral over a given interval was equal to 1.

  3. Calculating Percentiles: We used a numerical search method to find exact percentile values based on the distribution function (CDF) obtained by integrating the normalized polynomial.

As a result, we have obtained a method that allows us to refine percentile values for complex and noisy data, where standard methods may not be accurate enough. This approach can be useful in many areas of data analysis where high accuracy in calculations is required, especially when working with large and heterogeneous samples.
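
To wrap up, here is a minimal sketch of the whole pipeline collected into one hypothetical helper, refined_percentiles(); it is not part of the code above, it simply packages the same steps (value counts, R²-based degree selection, normalization via quad, and CDF inversion via brentq) under the same assumptions:

import numpy as np
import pandas as pd
from scipy.integrate import quad
from scipy.optimize import brentq


def refined_percentiles(values, percentiles, max_degree=10):
    """Fit a polynomial to the value counts and invert its normalized CDF."""
    counts = pd.Series(values).value_counts().sort_index()
    x = counts.index.to_numpy(dtype=float)
    y = counts.to_numpy(dtype=float)
    lower, upper = x.min(), x.max()

    # pick the polynomial degree with the best R^2 (excluding exact overfits)
    best_poly, best_score = None, 0
    for degree in range(1, max_degree + 1):
        poly = np.poly1d(np.polyfit(x, y, degree))
        score = 1 - np.mean((y - poly(x)) ** 2) / np.var(y)
        if best_score < score < 1:
            best_poly, best_score = poly, score

    # normalize the polynomial so that it integrates to 1 over [lower, upper]
    area, _ = quad(best_poly, lower, upper)
    cdf = lambda t: quad(lambda u: best_poly(u) / area, lower, t)[0]

    # invert the CDF numerically for each requested level
    return [brentq(lambda t, p=p: cdf(t) - p, lower, upper) for p in percentiles]


# usage on the same kind of data as in the article
np.random.seed(45)
sample = np.random.randint(1, 11, size=100)
print(refined_percentiles(sample, [0.25, 0.50, 0.75]))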
