Typical tasks of an analyst. Part 2. Is there a trend?

Introduction

In the first part of the article on Habr, we looked at classical approaches to assessing a change in a metric under the assumption that it is stationary. In that setting, the statistical tests used in A/B testing have proven very effective.

However, if there is a stable trend (for example, the average monthly audience grows year over year), estimating the difference in averages between two adjacent time periods may be incorrect. In this case, the average of the previous period will always differ from the average of the post-period, and this difference is often unrelated to the feature under study.

One of the reasons is that a trend does not always depend on the company’s actions and is often a consequence of external conditions. For example, audience growth may be associated with an increase in the welfare of the population, business scaling, or seasonal factors.

Thus, the presence or absence of a trend is an important aspect of data analysis. Let’s look at a few successful and unsuccessful approaches that can be used to solve this problem.

Visual analysis

In the context of time series analysis, visual analysis plays an important role in trend determination, but it has obvious disadvantages and limitations. Subjectivity of perception, uncertainty of results and limitations in processing large volumes of data are just some of the aspects that should be taken into account when using this method.

Visual analysis is useful for initially assessing data, identifying common patterns and trends, and selecting an appropriate method for further analysis. It helps to understand the type of dependence in the data, highlight the main characteristics of the time series and formulate preliminary hypotheses.

Let’s look at a few examples below:

Python code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def generate_data(
    n_samples: int,
    loc: int = 100,
    scale: int = 3,
    trend_max: int = 20,
    hetero: float = 1,
    n_outliers: int = 0,
) -> pd.DataFrame:
    time = pd.date_range("2024-01-01", periods=n_samples, freq="1H")
    # base level + noise (with optionally growing variance) + linear trend
    metric_values = (
        loc
        + np.random.normal(scale=scale, size=n_samples)
        * np.linspace(1, hetero, n_samples)
        + np.linspace(0, trend_max, n_samples)
    )

    for i in range(n_outliers):
        metric_values[i * n_samples // n_outliers] = np.max(metric_values) * 1.02

    return pd.DataFrame({"time": time, "metric": metric_values})


def rolling_data(data, window_size=10):
    return data.rolling(window=window_size).mean()


ONE_MONTH = 24 * 30
test_data_1 = generate_data(ONE_MONTH, hetero=1, trend_max=20, scale=3)
test_data_2 = generate_data(ONE_MONTH, hetero=1, trend_max=20, scale=10)
smoothed_data = rolling_data(test_data_2["metric"])

fig, axes = plt.subplots(3, 1, figsize=(10, 12))

sns.lineplot(x="time", y="metric", data=test_data_1, ax=axes[0], alpha=0.5)
axes[0].set_ylim(0, 200)
axes[0].set_title("Low variance")

sns.lineplot(x="time", y="metric", data=test_data_2, ax=axes[1], alpha=0.5)
axes[1].set_ylim(0, 200)
axes[1].set_title("High variance")

axes[2].plot(test_data_2["time"], smoothed_data, alpha=0.5, color="red")
axes[2].set_ylim(0, 200)
axes[2].set_title("High variance and rolling window")

plt.subplots_adjust(hspace=0.5)
plt.show()

When the variance in the data is low, trend detection is relatively easy: the data fluctuates less, so the trend is easier to see. However, in more typical scenarios where the data is noisy, as in the second graph, isolating a specific trend becomes problematic. In such situations, you can try to reduce the impact of noise with smoothing techniques such as a moving average or exponential smoothing. These methods have their own drawbacks and can lead to errors in the analysis. For example, a moving average based on previous values may miss a recent trend or change, making it less sensitive to new data.
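
To reduce the lag of a plain moving average, you can also try exponential smoothing, which weights recent observations more heavily. Below is a minimal sketch; it assumes test_data_2 and the imports from the previous code block.

Python code
# exponentially weighted moving average: recent points get more weight,
# so the smoothed line reacts to new observations faster than a rolling mean
smoothed_ewm = test_data_2["metric"].ewm(span=10, adjust=False).mean()

plt.figure(figsize=(10, 4))
plt.plot(test_data_2["time"], test_data_2["metric"], alpha=0.3, label="raw")
plt.plot(test_data_2["time"], smoothed_ewm, color="red", label="EWM, span=10")
plt.legend()
plt.show()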

It is worth remembering that visual analysis should be used in combination with other analysis methods, since drawing final conclusions on its basis alone may not be reliable enough.

Linear regression

Linear regression is one of the most popular and effective methods for detecting and assessing trends. It is an integral part of courses in mathematical statistics and data analysis.

The main advantages of linear regression are its simplicity, interpretability, low computational requirements, and effectiveness for linear relationships.

However, this last advantage is also a drawback: linear regression may not be flexible enough to approximate complex relationships in the data well. If the trend is nonlinear, linear regression may produce an incorrect model.

Linear regression assumes that the errors are independent and have constant variance (homoscedasticity). Violating these assumptions may lead to incorrect conclusions.

Building the regression usually proceeds in the following stages:

  • Dataset preparation.

  • Application of linear regression.

  • Testing for homoscedasticity.

    • Homoscedasticity means that the variance of the model errors is constant for all values of the predictors. In other words, the spread of errors does not depend on the level of the predicted variable. You can read more about why this matters here. In this example the relationship is linear, so we will use the Breusch-Pagan test.

  • Calculating R² and the confidence interval of the slope.

    • R² in this case shows the quality of the model fit. The closer it is to 1, the more of the data’s variation is explained by the model. But a low R² does not always mean the model has failed: when the data is very noisy and contains outliers, R² will usually be quite low, yet this does not rule out a linear dependence or a clear trend. We will look at this in more detail below.

  • Checking for error independence

    • The independence of errors in a linear regression model is important for obtaining correct parameter estimates. It can be checked with statistics such as the Durbin-Watson test, which looks for autocorrelation in the model residuals. You can find out more about it here. Typically, a Durbin-Watson statistic between 1.5 and 2.5 is considered an acceptable indicator of the absence of autocorrelation. For a more precise interpretation, you can use critical value tables for the Durbin-Watson test or statistical packages that calculate a p-value directly (see the sketch after this list).

  • Calculating slope confidence intervals using bootstrap.

    • Bootstrap was described in detail in a previous article. Here we will only note that it is needed to avoid random results on data with a wide spread. Building 1000 models on bootstrap resamples lets us estimate the interval within which the slope of the trend can vary.
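
As a complement to the Durbin-Watson rule of thumb, the Ljung-Box test from statsmodels reports a p-value for residual autocorrelation directly. Here is a minimal sketch; it assumes residuals holds the residuals of the fitted regression, as computed in the full example that follows.

Python code
# Ljung-Box test: the null hypothesis is "no autocorrelation up to the given lag"
from statsmodels.stats.diagnostic import acorr_ljungbox

lb_result = acorr_ljungbox(residuals, lags=[1])
print(lb_result)  # lb_stat and lb_pvalue; a small p-value signals autocorrelation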

Let’s look at several examples of using linear regression to detect a trend with Python and SciPy, as well as the main difficulties that can arise when building a linear regression.

Let us build, as an example, a model in which all conditions are met.

Python code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm


def generate_data(
    n_samples: int,
    loc: int = 100,
    scale: int = 10,
    trend_max: int = 10,
    hetero: float = 2,
) -> pd.DataFrame:
    time = pd.date_range("2024-01-01", periods=n_samples, freq="1H")

    metric_values = (
        loc
        + np.random.normal(scale=scale, size=n_samples)
        * np.linspace(1, hetero, n_samples)
        + np.linspace(0, trend_max, n_samples)
    )

    return pd.DataFrame({"time": time, "metric": metric_values})


ONE_MONTH = 24 * 30
test_data = generate_data(ONE_MONTH)

# plot data

sns.lineplot(x="time", y="metric", data=test_data, alpha=0.5)
# fit linear regression
slope, intercept, rvalue, pvalue, stderr = stats.linregress(
    test_data.index, test_data.metric
)

# test for Homoscedasticity of the residuals


def test_homoscedasticity(test_data: pd.DataFrame):
    # prepare y and X for statsmodels
    y = test_data.metric
    X_sm = test_data.index
    X_sm = sm.add_constant(X_sm)
    # fit linear regression using statsmodels
    mod_sm = sm.OLS(y, X_sm)
    res_sm = mod_sm.fit()
    # test for Homoscedasticity
    bp_lm, bp_lm_pvalue, bp_fvalue, bp_f_pvalue = sm.stats.diagnostic.het_breuschpagan(
        res_sm.resid, res_sm.model.exog
    )

    return bp_f_pvalue


bp_f_pvalue = test_homoscedasticity(test_data)
if bp_f_pvalue < 0.05:
    print("Residuals are not Homoscedastic. Linear regression may be invalid")
else:
    print("Residuals are Homoscedastic")


# test autocorrelation
def test_autocorrelation(residuals):
    dw_statistic = sm.stats.durbin_watson(residuals)
    return dw_statistic


# compute residuals of the fitted line and run the Durbin-Watson test
residuals = test_data.metric - (intercept + slope * test_data.index)
dw_statistic = test_autocorrelation(residuals)

if 1.5 < dw_statistic < 2.5:
    print("Errors are independent (no signs of autocorrelation)")
else:
    print(
        "Autocorrelation in the residuals. Independence of errors is not satisfied."
    )

# bootstrap linear regression slope
n_iterations = 1000
slopes = np.zeros(n_iterations)
intercepts = np.zeros(n_iterations)
r_square = np.zeros(n_iterations)
for i in range(n_iterations):
    sample = test_data.sample(frac=1, replace=True)
    # use separate names so the original fit (slope, intercept) is not overwritten
    boot = stats.linregress(sample.index, sample.metric)
    slopes[i] = boot.slope
    intercepts[i] = boot.intercept
    r_square[i] = boot.rvalue**2
# calculate 95% confidence interval
slopes_quantiles = np.percentile(slopes, [2.5, 97.5])
intercept_quantiles = np.percentile(intercepts, [2.5, 97.5])
r_square_quantiles = np.percentile(r_square, [2.5, 97.5])
print(f"r_square is between {r_square_quantiles}")
print(f"slope is between {slopes_quantiles}")


# plot two lines for the confidence interval slopes
plt.plot(
    test_data.time,
    intercept_quantiles[0] + slopes_quantiles[0] * test_data.index,
    label="lower bound",
    color="green",
    linestyle="--",
    linewidth=2,
)

plt.plot(
    test_data.time,
    intercept_quantiles[1] + slopes_quantiles[1] * test_data.index,
    label="upper bound",
    color="green",
    linestyle="--",
    linewidth=2,
)

# plot linear regression

plt.plot(
    test_data.time,
    intercept + slope * test_data.index,
    label="fitted line",
    color="red",
    linewidth=2,
)

plt.legend()
plt.show()

The results indicate the presence of a trend in the data, with a slope between roughly 0.012 and 0.018. Since the observations are hourly, the metric grows by about 0.012 to 0.018 units per observation, i.e. per hour (see the quick check below). The homoscedasticity test confirms that the conditions for building the regression are met, and the Durbin-Watson check shows no signs of autocorrelation in the residuals.

Thus, the results give an idea of the nature of the trend and its magnitude, which is difficult to obtain with visual analysis alone.
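
As a quick sanity check of the magnitude (a sketch with a hypothetical midpoint of the bootstrap interval above), the per-observation slope can be converted to daily and monthly growth, since the observations are hourly:

Python code
slope_per_hour = 0.015  # hypothetical midpoint of the ~0.012..0.018 interval
print(f"growth per day:   {slope_per_hour * 24:.2f}")
print(f"growth per month: {slope_per_hour * 24 * 30:.1f}")  # roughly trend_max = 10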

Testing for homoscedasticity

High heteroscedasticity creates obstacles for model construction. Let’s look at the consequences of ignoring this check when building a linear regression.

Python code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm


def generate_data(
    n_samples: int,
    loc: int = 100,
    scale: int = 10,
    trend_max: int = 10,
    hetero: float = 30,
) -> pd.DataFrame:
    time = pd.date_range("2024-01-01", periods=n_samples, freq="1H")

    metric_values = (
        loc
        + np.random.normal(scale=scale, size=n_samples)
        * np.linspace(1, hetero, n_samples)
        + np.linspace(0, trend_max, n_samples)
    )

    return pd.DataFrame({"time": time, "metric": metric_values})


ONE_MONTH = 24 * 30
test_data = generate_data(ONE_MONTH)

# plot data

sns.lineplot(x="time", y="metric", data=test_data, alpha=0.5)
# fit linear regression
slope, intercept, rvalue, pvalue, stderr = stats.linregress(
    test_data.index, test_data.metric
)

# test for Homoscedasticity of the residuals


def test_homoscedasticity(test_data: pd.DataFrame):
    # prepare y and X for statsmodels
    y = test_data.metric
    X_sm = test_data.index
    X_sm = sm.add_constant(X_sm)
    # fit linear regression using statsmodels
    mod_sm = sm.OLS(y, X_sm)
    res_sm = mod_sm.fit()
    # test for Homoscedasticity
    bp_lm, bp_lm_pvalue, bp_fvalue, bp_f_pvalue = sm.stats.diagnostic.het_breuschpagan(
        res_sm.resid, res_sm.model.exog
    )

    return bp_f_pvalue


bp_f_pvalue = test_homoscedasticity(test_data)
if bp_f_pvalue < 0.05:
    print("Residuals are not Homoscedastic. Linear regression may be invalid")
else:
    print("Residuals are Homoscedastic")


# test autocorrelation
def test_autocorrelation(residuals):
    dw_statistic = sm.stats.durbin_watson(residuals)
    return dw_statistic


# compute residuals of the fitted line and run the Durbin-Watson test
residuals = test_data.metric - (intercept + slope * test_data.index)
dw_statistic = test_autocorrelation(residuals)

if 1.5 < dw_statistic < 2.5:
    print("Errors are independent (no signs of autocorrelation)")
else:
    print(
        "Autocorrelation in the residuals. Independence of errors is not satisfied."
    )

# bootstrap linear regression slope
n_iterations = 1000
slopes = np.zeros(n_iterations)
intercepts = np.zeros(n_iterations)
r_square = np.zeros(n_iterations)

for i in range(n_iterations):
    sample = test_data.sample(frac=1, replace=True)
    # use separate names so the original fit (slope, intercept) is not overwritten
    boot = stats.linregress(sample.index, sample.metric)
    slopes[i] = boot.slope
    intercepts[i] = boot.intercept
    r_square[i] = boot.rvalue**2
# calculate 95% confidence interval
slopes_quantiles = np.percentile(slopes, [2.5, 97.5])
intercept_quantiles = np.percentile(intercepts, [2.5, 97.5])
r_square_quantiles = np.percentile(r_square, [2.5, 97.5])
print(f"r_square is between {r_square_quantiles}")
print(f"slope is between {slopes_quantiles}")


# plot two lines for the confidence interval slopes
plt.plot(
    test_data.time,
    intercept_quantiles[0] + slopes_quantiles[0] * test_data.index,
    label="lower bound",
    color="green",
    linestyle="--",
    linewidth=2,
)

plt.plot(
    test_data.time,
    intercept_quantiles[1] + slopes_quantiles[1] * test_data.index,
    label="upper bound",
    color="green",
    linestyle="--",
    linewidth=2,
)


# plot linear regression

plt.plot(
    test_data.time,
    intercept + slope * test_data.index,
    label="fitted line",
    color="red",
    linewidth=2,
)

plt.legend()
plt.show()

The Breusch-Pagan test signals that the spread of errors depends on the level of the predicted variable.

The results might look something like this:

The results of such an analysis may be inconsistent, and the same code may show either a positive or a negative trend. In such cases, conclusions about the presence of a trend are left to chance. More information on how to deal with heteroscedasticity can be found here; however, solving this problem is often complex, and determining whether there is a clear trend in the data can be difficult. One common remedy, weighted least squares, is sketched below.
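
For reference, here is a minimal sketch of weighted least squares with statsmodels. It assumes the test_data frame from the heteroscedastic example above and uses a deliberately crude variance model in which the error spread grows with time.

Python code
# weighted least squares: downweight observations with larger estimated error variance
X = sm.add_constant(np.asarray(test_data.index, dtype=float))
y = test_data.metric.values

ols_res = sm.OLS(y, X).fit()
# crude variance model: regress absolute OLS residuals on the same predictors
abs_resid_fit = sm.OLS(np.abs(ols_res.resid), X).fit()
est_sd = np.clip(abs_resid_fit.fittedvalues, 1e-6, None)

wls_res = sm.WLS(y, X, weights=1.0 / est_sd**2).fit()
print(wls_res.params)      # intercept and slope
print(wls_res.conf_int())  # confidence intervals for both coefficients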

R² estimate

An attentive reader might have noticed that in the first example the estimate of the coefficient of determination R² was omitted. Coming back to it: the analysis shows that R² lies between 0.06 and 0.14. This indicates that the model describes the data rather poorly, which generally matches our expectation for data with a high level of noise.

Below we will consider two additional examples illustrating why a low R² does not always indicate poor model quality, but may be a consequence of data features.

Python code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm


def generate_data(
        n_samples: int,
        loc: int = 100,
        scale: int = 30,
        trend_max: int = 100,
        hetero: float = 1
) -> pd.DataFrame:
    time = pd.date_range("2024-01-01", periods=n_samples, freq="1H")

    metric_values = (
            loc
            + np.random.normal(scale=scale, size=n_samples)
            * np.linspace(1, hetero, n_samples)
            + np.linspace(0, trend_max, n_samples)
    )

    return pd.DataFrame({"time": time, "metric": metric_values})


ONE_MONTH = 24 * 30
test_data = generate_data(ONE_MONTH)

# plot data

sns.lineplot(x="time", y="metric", data=test_data, alpha=0.5)
# fit linear regression
slope, intercept, rvalue, pvalue, stderr = stats.linregress(
    test_data.index, test_data.metric
)


# test for Homoscedasticity of the residuals

def test_homoscedasticity(test_data: pd.DataFrame):
    # prepare y and X for statsmodels
    y = test_data.metric
    X_sm = test_data.index
    X_sm = sm.add_constant(X_sm)
    # fit linear regression using statsmodels
    mod_sm = sm.OLS(y, X_sm)
    res_sm = mod_sm.fit()
    # test for Homoscedasticity
    bp_lm, bp_lm_pvalue, bp_fvalue, bp_f_pvalue = sm.stats.diagnostic.het_breuschpagan(
        res_sm.resid, res_sm.model.exog
    )

    return bp_f_pvalue


bp_f_pvalue = test_homoscedasticity(test_data)
if bp_f_pvalue < 0.05:
    print("Residuals are not Homoscedastic. Linear regression may be invalid")
else:
    print("Residuals are Homoscedastic")


# test autocorrelation
def test_autocorrelation(residuals):
    dw_statistic = sm.stats.durbin_watson(residuals)
    return dw_statistic


# compute residuals of the fitted line and run the Durbin-Watson test
residuals = test_data.metric - (intercept + slope * test_data.index)
dw_statistic = test_autocorrelation(residuals)

if 1.5 < dw_statistic < 2.5:
    print("Errors are independent (no signs of autocorrelation)")
else:
    print("Autocorrelation in the residuals. Independence of errors is not satisfied.")

# bootstrap linear regression slope
n_iterations = 1000
slopes = np.zeros(n_iterations)
intercepts = np.zeros(n_iterations)
r_square = np.zeros(n_iterations)
for i in range(n_iterations):
    sample = test_data.sample(frac=1, replace=True)
    # use separate names so the original fit (slope, intercept) is not overwritten
    boot = stats.linregress(sample.index, sample.metric)
    slopes[i] = boot.slope
    intercepts[i] = boot.intercept
    r_square[i] = boot.rvalue ** 2
# calculate 95% confidence interval
slopes_quantiles = np.percentile(slopes, [2.5, 97.5])
intercept_quantiles = np.percentile(intercepts, [2.5, 97.5])
r_square_quantiles = np.percentile(r_square, [2.5, 97.5])
print(f'r_square is between {r_square_quantiles}')
print(f'slope is between {slopes_quantiles}')

# plot two lines for the confidence interval slopes
plt.plot(
    test_data.time,
    intercept_quantiles[0] + slopes_quantiles[0] * test_data.index,
    label="lower bound",
    color="green",
    linestyle="--",
    linewidth=2
)

plt.plot(
    test_data.time,
    intercept_quantiles[1] + slopes_quantiles[1] * test_data.index,
    label="upper bound",
    color="green",
    linestyle="--",
    linewidth=2
)

# plot linear regression

plt.plot(
    test_data.time,
    intercept + slope * test_data.index,
    label="fitted line",
    color="red",
    linewidth=2
)

plt.legend()
plt.show()

Over the period the average value roughly doubled (trend_max = 100), yet because of the noise and apparent outliers in the data, R² is fairly low. In such cases it is important to understand that, contrary to a common misconception, R² by itself does not tell us whether the model is useless; it only shows how closely the line approximates the data, and with heavy noise and outliers it will always be fairly low. Here it is more important to look at the other tests and at the slope.

We also have not yet mentioned an important weakness of linear regression: its sensitivity to outliers. It shows up as wide confidence intervals for the slope coefficient. Improving model quality often requires removing outliers or using more robust methods that cope better with anomalies in the data. Let’s look at some of them below.

Mann-Kendall test

The Mann-Kendall method is a nonparametric test that determines the presence or absence of a trend based on the ranks of the data. It does not depend on the specific metric values, only on their order, that is, on the signs of the differences between pairs of observations.

Among the advantages of this test are its robustness to outliers and its ability to detect monotonic dependencies even when the relationship is nonlinear. However, caution is needed when applying it to practical problems, since even a very small trend may come out as statistically significant. This can lead to incorrect conclusions from a business standpoint, where minor changes may not be meaningful.

Let’s look at the following example for clarity.

Python code
import numpy as np
import pandas as pd
import scipy.stats as stats


def generate_data(
        n_samples: int,
        loc: int = 100,
        scale: float = 0.1,
        trend_max: float = 0.5,
        hetero: float = 1,
        outlier: bool = False,
) -> pd.DataFrame:
    time = pd.date_range("2024-01-01", periods=n_samples, freq="1H")
    metric_values = (
            loc
            + np.random.normal(scale=scale, size=n_samples)
            * np.linspace(1, hetero, n_samples)
            + np.linspace(0, trend_max, n_samples)
    )
    if outlier:
        metric_values[n_samples // 2] = -np.max(metric_values) * 1000
    return pd.DataFrame({"time": time, "metric": metric_values})


ONE_MONTH = 24 * 30
test_data = generate_data(ONE_MONTH)
# test mann-kendall trend
n_iterations = 100
p_values = np.zeros(n_iterations)
taus = np.zeros(n_iterations)
for i in range(n_iterations):
    test_data = generate_data(ONE_MONTH)
    tau, pvalue = stats.mstats.kendalltau(test_data.index, test_data.metric)
    taus[i] = np.round(tau, 2)
    p_values[i] = pvalue

tau_quantiles = np.percentile(taus, [2.5, 97.5])
# number of times we reject the null hypothesis
print(f'trend is significant in {np.sum(p_values < 0.05)} cases from {n_iterations}')
print(f'tau_quantiles interval {tau_quantiles}')

Thus, the Mann-Kendall test applied to this example will show the presence of a trend 100 times out of 100, even though the trend is negligible.

In this example the Mann-Kendall test will indicate a trend every time, no matter how often we repeat the experiment. In addition to the p-value, the test returns the tau statistic, which ranges from -1 to 1: a value of 1 indicates positive monotonicity, -1 negative monotonicity, and 0 no monotonicity. However, tau is hard to interpret because it says nothing about the magnitude of the trend. In the current example tau is about 0.6, which indicates positive monotonicity but gives no information about how large the trend actually is.

Thus, even after the Mann-Kendall test concludes that a trend is present, the slope still needs to be estimated with other methods. Linear regression on the same data shows that the slope is minimal.
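
For illustration, a minimal sketch of this second step (it assumes test_data from the Mann-Kendall example above, where trend_max = 0.5 over roughly 720 hourly points):

Python code
# ordinary least squares on the same data: the slope turns out to be tiny
res = stats.linregress(test_data.index, test_data.metric)
print(f"slope per hour: {res.slope:.5f}")  # on the order of 0.0007
print(f"total change over the month: {res.slope * len(test_data):.2f}")  # about 0.5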

But if the slope has to be estimated in a second step anyway, much of the value of the Mann-Kendall test itself is lost. Moreover, its frequent triggering on negligible trends can lead the analyst to incorrect conclusions, which raises questions about its applicability in real-world scenarios.

Despite this, the strengths of the Mann-Kendall test include its ability to quickly detect nonlinear trends, which can be useful in some situations.

Spearman’s Rank Correlation Test

Just like the Mann-Kendall test, the Spearman test is based on rank analysis. The Spearman test calculates the Spearman correlation coefficient, which measures the degree of monotonic relationship between two variables.

Python code
import numpy as np
import pandas as pd
import scipy.stats as stats


def generate_data(
        n_samples: int,
        loc: int = 100,
        scale: float = 0.1,
        trend_max: float = 0.1,
        hetero: float = 1,
        num_outliers: int = 50,
) -> pd.DataFrame:
    time = pd.date_range("2024-01-01", periods=n_samples, freq="1H")
    metric_values = (
            loc
            + np.random.normal(scale=scale, size=n_samples)
            * np.linspace(1, hetero, n_samples)
            + np.linspace(0, trend_max, n_samples)
    )
    if num_outliers > 0:
        indices_outliers = np.random.choice(n_samples, num_outliers, replace=False)
        metric_values[indices_outliers] = np.max(metric_values) * 100

    return pd.DataFrame({"time": time, "metric": metric_values})


ONE_MONTH = 24 * 30
test_data = generate_data(ONE_MONTH)

# compare the Mann-Kendall and Spearman tests on repeatedly generated data
n_iterations = 100

p_values_mk = np.zeros(n_iterations)
taus = np.zeros(n_iterations)

p_values_sp = np.zeros(n_iterations)
spearman_coefficients = np.zeros(n_iterations)

for i in range(n_iterations):
    test_data = generate_data(ONE_MONTH)
    tau, p_value_mk = stats.mstats.kendalltau(test_data.index, test_data.metric)
    spearman_coefficient, p_value_sp = stats.spearmanr(test_data.index, test_data.metric)
    taus[i] = np.round(tau, 2)
    p_values_mk[i] = p_value_mk

    spearman_coefficients[i] = round(spearman_coefficient, 2)
    p_values_sp[i] = p_value_sp

tau_quantiles = np.percentile(taus, [2.5, 97.5])
spearman_quantiles = np.percentile(spearman_coefficients, [2.5, 97.5])
# number of times we reject the null hypothesis

print(f'MK trend is significant in {np.sum(p_values_mk < 0.05)} cases from {n_iterations}')
print(f'SP trend is significant in {np.sum(p_values_sp < 0.05)} cases from {n_iterations}')

print(f'tau_quantiles interval {tau_quantiles}')
print(f'spearman_quantiles interval {spearman_quantiles}')

The advantages and disadvantages of the Mann-Kendall test carry over to the Spearman test, including sensitivity to even very small trend slopes, which likewise casts doubt on its applicability in many tasks where the goal is to determine whether a trend exists and to understand its nature.

Repeated median regression (RMR)

The instability of linear regression in the presence of outliers has led to the development of a whole family of methods that are robust to rare extreme values. The idea is simple: for each point in the sample, the slopes of the lines connecting it to every other point are computed, and the median of these slopes is taken. This is repeated for every point, and the final slope estimate is the median of all the per-point medians.
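
To make the idea concrete, here is a from-scratch sketch of the repeated median slope. In practice, SciPy’s stats.siegelslopes implements the same estimator, as in the example further below.

Python code
import numpy as np


def repeated_median_slope(x: np.ndarray, y: np.ndarray) -> float:
    # assumes all x values are distinct (true for a time index)
    n = len(x)
    per_point_medians = np.empty(n)
    for i in range(n):
        # slopes of the lines through point i and every other point
        dx = np.delete(x, i) - x[i]
        dy = np.delete(y, i) - y[i]
        per_point_medians[i] = np.median(dy / dx)
    # the final estimate is the median of the per-point medians
    return float(np.median(per_point_medians))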

When it works well:

  • With outliers in the data: The method is robust to the presence of outliers and anomalies, which makes it useful in situations where classical linear regression may produce distorted results due to outliers.

  • Nonlinear dependencies: This method can be effective in approximating nonlinear relationships because it is not limited by the assumption that the relationship between variables is linear.

  • Working with heterogeneous data: The method can be effective on data containing noise or heterogeneous structure, since the median-based slope estimate is less affected by uneven spread in the data.

It is important to note that this method may require additional computing resources, especially when processing large amounts of data.

Python code
import numpy as np
import pandas as pd
import scipy.stats as stats


def generate_data(
    n_samples: int,
    loc: int = 100,
    scale: int = 10,
    trend_max: int = 20,
    hetero: float = 10,
    n_outliers: int = 30,
) -> pd.DataFrame:
    time = pd.date_range("2024-01-01", periods=n_samples, freq="1H")
    metric_values = (
        loc
        + np.random.normal(scale=scale, size=n_samples)
        * np.linspace(1, hetero, n_samples)
        + np.linspace(0, trend_max, n_samples)
    )

    for i in range(n_outliers):
        metric_values[i * n_samples // n_outliers] = -np.max(metric_values) * 10
    return pd.DataFrame({"time": time, "metric": metric_values})


ONE_MONTH = 24 * 30
test_data = generate_data(ONE_MONTH)


# bootstrap linear regression slope
n_iterations = 100
slopes_RMR = np.zeros(n_iterations)
intercepts_RMR = np.zeros(n_iterations)
slopes_LR = np.zeros(n_iterations)
intercepts_LR = np.zeros(n_iterations)
for i in range(n_iterations):
    sample = test_data.sample(frac=1, replace=True)
    # repeated median (Siegel) estimator: robust to outliers
    res = stats.siegelslopes(sample.metric, sample.index)
    slopes_RMR[i] = res.slope
    intercepts_RMR[i] = res.intercept
    # ordinary linear regression for comparison (x = index, y = metric)
    res_reg = stats.linregress(sample.index, sample.metric)
    slopes_LR[i] = res_reg.slope
    intercepts_LR[i] = res_reg.intercept
# calculate 95% confidence interval
slopes_quantiles_RMR = np.percentile(slopes_RMR, [2.5, 97.5])
intercept_quantiles_RMR = np.percentile(intercepts_RMR, [2.5, 97.5])

slopes_quantiles_LR = np.percentile(slopes_LR, [2.5, 97.5])
intercept_quantiles_LR = np.percentile(intercepts_LR, [2.5, 97.5])

# results RMR and LR
print(f"RMR  95% confidence interval for slope: {slopes_quantiles_RMR}")
print(f"RMR  95% confidence interval for intercept: {intercept_quantiles_RMR}")

print(f"LR 95% confidence interval for slope: {slopes_quantiles_LR}")
print(f"LR 95% confidence interval for intercept: {intercept_quantiles_LR}")

The results show that this method is not only resistant to outliers but also works well in combination with the bootstrap. Another advantage is that it is less sensitive to violations of homoscedasticity and can produce good results even when the error variance is not constant across all levels of the predictor. Keep in mind, though, that it is only less sensitive: at very high variance the quality of the model also deteriorates. Below is the result of running the code with hetero=30.

As you can see, RMR produces values close to the true ones, while for classical linear regression the interval becomes extremely wide.

Conclusion

We looked at several popular methods for determining the trend. Each of them has its own strengths and weaknesses, and the choice of a specific method depends on the purpose of the analysis, the nature of the data and the required accuracy of trend estimation.
In this article we barely touched on methods used to analyze trends in nonlinear relationships, such as polynomial regression, exponential regression, logistic regression and others, or on more complex algorithms such as ARIMA, SARIMA and others.

Additional tests for autocorrelation and heteroscedasticity can be found at https://www.statsmodels.org/stable/diagnostic.html.

You can read about the consequences of careless outlier removal in the article https://habr.com/ru/articles/781060/

Authors

Roman Rudnitskiy

Marat Yuldashev

Mikhail Tretyakov
