Who will win the battle for the forecast? Chapter two. The beginning

The first chapter of this study described the weather time series dataset that we will use for the temperature forecasting task and walked through the steps to preprocess it.

In this chapter we will consider autoregressive integrated moving average (ARIMA) processes following the Box-Jenkins methodology. Let us examine why the ARIMA process yields a broad class of stationary and non-stationary models that adequately describe many time series encountered in practice. Then we will apply this methodology to find a suitable subclass of models from the general ARIMA family for adequately forecasting future temperature values. Let me recall the overall plan of the study:

  1. Description of the climate dataset and its preliminary preparation.

  2. Statistical analysis of the data, including estimation of the seasonal component, to develop models from the ARIMA family for temperature forecasting.

  3. Development of Keras deep learning models with an LSTM layer for the same task.

  4. Comparison and evaluation of the forecasting results.

  5. Consideration of an approach to predicting the shifted temperature using a Keras deep learning model with LSTM layers.


Machine learning and deep learning methods are often treated as universal tools for solving all kinds of problems, including predictive modeling, natural language processing, and image recognition. And artificial intelligence in general is perceived as a set of magic bells whose ringing instantly solves any problem. However, in time series forecasting, a specific area of predictive modeling, the situation is not so clear. Far from clear.

A study published back in 2018 [link] assessed and compared the effectiveness of both traditional statistical forecasting methods and modern machine learning methods, including recurrent neural networks, on time series forecasting problems. The experiments covered forecasting both one step ahead and over horizons of different lengths. The authors of the study state: “By comparing the post-sample accuracy of popular ML methods [machine learning – Ed.] with the accuracy of eight traditional statistical methods [among them ARIMA – Ed.], we found that the latter dominate on both accuracy measures [sMAPE and MASE – Ed.] and for all forecasting horizons considered. In addition, we noted that their computational requirements are significantly higher than those of the statistical methods.”

The authors also identified problems with forecasts produced by machine learning models that raise doubts about their quality and interpretation: “Getting numbers out of a ‘black box’ is unacceptable to practitioners who need to know how forecasts are generated and how they can be influenced or adjusted to produce workable predictions.”

The final message of the study comes down to the need to use traditional statistical methods as a baseline solution for time series forecasting problems: “The problem with the academic literature on ML forecasting is that most published studies provide forecasts and claim satisfactory accuracy without comparing them to simple statistical methods or even simplified benchmarks [kaggle is a vivid example of this – Ed.]. This creates expectations that ML methods provide accurate predictions, but without any empirical evidence that this is the case.”

The study is very interesting; I recommend reading it. Based on its findings, we will set the bar very high for our temperature forecasting task, using the statistical model SARIMAX as the baseline solution, and we will try to match its results with a Keras deep learning model based on LSTM. Along the way we will find out whether things are really so bad for deep learning in time series forecasting.

So, let's begin.

Let's first clarify the following.

The Box-Jenkins model, or ARIMA (short for AutoRegressive Integrated Moving Average; АРПСС in the Russian literature), is first and foremost a methodology, not one specific model. Let's figure it out.

ARIMA consists of three main components:

  • AutoRegressive (AR): This part of the model is responsible for the dependence of the current value of the time series on its previous values. This means that the current value of the series can be described as a linear combination of its past values.

  • Integrated (I): This part of the model deals with transforming the time series to achieve stationarity. If the time series is not stationary (e.g. has a trend or seasonal variations), it can be made stationary by a transformation, mainly by taking successive differences (differencing) or logarithms [link].

  • Moving Average (MA): This part of the model takes into account the influence of random errors (or “noise”) on the current value of the time series. This means that the current value depends on previous forecast errors. In other words, the moving average model assumes that the errors of the model in previous periods contain information about the entire history of the time series.

Each of the components above is denoted by a letter (a compact formula tying them together follows the list):

  • p – order of autoregression;

  • d – order of differencing;

  • q – order of moving average.
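For reference, these three components can be tied together in the standard textbook notation (consistent with Box and Jenkins) via the backshift operator B, where B X_t = X_{t-1}; in LaTeX form, the ARIMA (p, d, q) model reads:

\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d X_t = \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t

Here \phi_i are the autoregressive coefficients (the AR part), (1 - B)^d applies d-fold differencing (the I part), \theta_j are the moving average coefficients (the MA part), and \varepsilon_t is white noise.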

The parameters p, d, and q can take different values, which leads to multiple implementations of ARIMA (that is, different models). Below are examples of configurations; after the list, a short sketch shows how they map onto statsmodels:

  • AR (2, 0, 0) or AR (p=2):

    a model with only an autoregressive process using the two previous values (lags).

  • MA (0, 0, 1) or MA (q=1):

    a model with only a one-lag moving average process.

  • ARMA (1, 0, 1) or ARMA (p=1, q=1):

    a mixed autoregressive-moving average model: one lag for the autoregressive part and one for the moving average.

  • ARIMA (2, 1, 2):

    a more complex model combining all three components: two-lag autoregression, first-order differencing, and a two-lag moving average.
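To make the notation concrete, here is a minimal sketch of how these configurations map onto the order parameter of the ARIMA class from statsmodels, which we import later in this chapter (the Series y is a hypothetical stand-in for a time series):

from statsmodels.tsa.arima.model import ARIMA

# y is a hypothetical pandas Series with the observations
ar_2 = ARIMA(y, order=(2, 0, 0))         # AR(2): pure autoregression, two lags
ma_1 = ARIMA(y, order=(0, 0, 1))         # MA(1): pure moving average, one lag
arma_1_1 = ARIMA(y, order=(1, 0, 1))     # ARMA(1, 1): mixed process
arima_2_1_2 = ARIMA(y, order=(2, 1, 2))  # ARIMA(2, 1, 2): adds first-order differencing

result = arima_2_1_2.fit()  # estimate the coefficients
print(result.summary())     # inspect the fitted model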

The specific configuration is selected during analysis of the characteristics of the target time series. For example, if the time series is already stationary according to statistical tests, then differencing (or taking logarithms) is usually not necessary: in this case, the parameter d is taken to be 0 (the stages of determining the components p, d, and q will be discussed further on). In this regard, stationarity of the time series is necessary for a better, more accurate determination of the parameters of ARMA (p, q).

The methodology is as follows: to construct an ARIMA model for a realization of a random process (when future values of a time series can only be described by a probability distribution), its authors, G. Box and G. Jenkins, proposed dividing the work into several stages (a rough code sketch of the four stages follows the list):

  1. Identify the order of differencing d, that is, achieve stationarity of the series by taking a sufficient number of successive differences [from the book: Time Series Analysis: Forecasting and Control / G. Box, G. Jenkins. Vol. 1. Mir, Moscow, 1974, p. 194]. A heuristic criterion can be applied to determine the value of d.

  2. Construct the ACF [autocorrelation function – Ed.] and the PACF [partial autocorrelation function – Ed.] for the resulting stationary time series [to determine the closeness of the statistical relationship between the lagged values (lags) of the series – Ed.]. By studying their behavior, put forward hypotheses about the values of the parameters p and q, that is, select an ARMA (p, q) model. At this stage a basic set of models is formed, comprising one, two, or even more models [from the book: Time Series Analysis and Forecasting: a textbook / V.N. Afanasyev, M.M. Yuzbashev. 2nd ed., revised. Finansy i Statistika, Moscow, 2010, p. 290].

  3. Estimate the model parameters and perform a “diagnostic check” by comparing the model's results with the actual data to identify fit flaws and diagnose their causes [from the book: Time Series Analysis: Forecasting and Control / G. Box, G. Jenkins. Vol. 1. Mir, Moscow, 1974, pp. 36, 232, 313]. Graphical methods and statistical tests are applied to the model residuals (the differences between actual and predicted values) to ensure that they behave like white noise and that the model adequately describes the data. If problems are found, the model is adjusted.

  4. Forecasting: once an accurate model has been defined and fitted to the actual data, it can be used to forecast future values of the target time series [from the book: Time Series Analysis: Forecasting and Control / G. Box, G. Jenkins. Vol. 1. Mir, Moscow, 1974, p. 144]. Forecasts based on the information contained in the model can be both short-term and long-term.
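As a rough illustration only, the four stages can be lined up in code using the statsmodels tools imported in this chapter (the temperature series is prepared below; the order values are placeholders, not a recommendation):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller, kpss

# 1. Choose d: difference until the series passes the stationarity tests
diff = temperature.diff().dropna()
print(adfuller(diff)[1], kpss(diff)[1])  # p-values of both tests

# 2. Hypothesize p and q from the ACF/PACF of the stationary series
plot_acf(diff, lags=48)
plot_pacf(diff, lags=48)

# 3. Estimate and diagnose: residuals should look like white noise
fit = ARIMA(temperature, order=(1, 1, 1)).fit()  # (p, d, q) are placeholders
fit.plot_diagnostics()

# 4. Forecast with the accepted model
forecast = fit.get_forecast(steps=24).predicted_mean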

Thus, the Box-Jenkins methodology consists of analyzing a time series to determine the optimal parameters (p, d, q) of a forecasting model of the ARIMA process. In turn, an ARIMA model is a model from the ARIMA family (or class) in which the differencing operation (I) is applied at least once to make the series stationary, and the terms of the autoregressive (AR) and moving average (MA) processes are combined.

We will also add that there is an extension of ARIMA called SARIMA: a model with a seasonal component (S), which adds even more possibilities for time series modeling.

In conclusion, let us define what “endogenous” and “exogenous” variables are, quoting the definitions from the book [Practical Time Series Analysis: Prediction with Statistics and Machine Learning. Russian ed. St. Petersburg: Dialektika, 2021, p. 238]: “From the point of view of statistical research, the variables of a model that influence each other are called endogenous: their values are described by the behavior of the model. In contrast to them, exogenous variables are not explained by the model: they are not described by the model's assumptions, and therefore their values are taken as they are and are not subject to dynamic change.”
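In statsmodels terms, the endogenous series is passed as endog and any exogenous regressors as exog. A minimal sketch, assuming the dataset df loaded below and using the pressure column 'p (mbar)' purely as an illustration of an exogenous variable:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Endogenous: the temperature itself is described by the model.
# Exogenous: e.g. air pressure, taken "as is" and not modeled.
model = SARIMAX(endog=df['T (degC)'],
                exog=df[['p (mbar)']],  # illustrative choice of regressor
                order=(1, 1, 1))        # placeholder (p, d, q)
fitted = model.fit(disp=False)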

Now that we have thoroughly refreshed our knowledge of the ARIMA methodology, I suggest we return to our time series and analyze them with statistical methods to select the appropriate parameters p, d, and q.

First, we import the necessary modules and attributes that we will use in the second chapter of the study, and also change some default parameters of the pyplot module of the matplotlib library to custom ones:

from matplotlib import pyplot
from numpy import arange
from numpy.random import seed
from pandas import read_parquet, Series
from sklearn.metrics import mean_absolute_percentage_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tools.sm_exceptions import InterpolationWarning
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller, kpss
from termcolor import colored
from warnings import simplefilter

simplefilter('ignore', InterpolationWarning)

pyplot.rcParams["figure.autolayout"]=True
pyplot.rcParams["figure.figsize"]=(14, 7)

seed(1)

Using the read_parquet() function of the pandas library, we import the dataset prepared in the first chapter of the study, passing it the file name along with its extension. Since there is an unresolved issue (an open issue on GitHub) about loss of the data's frequency information when saving to the parquet format, we additionally restore the hourly frequency with the asfreq('h') method.

df = read_parquet('jena_climate_2009_2016_hour_grouped_data.parquet').asfreq('h')

Let's create a separate variable holding the target time series for forecasting, of type pandas Series. Let me remind you: a Series object “is a one-dimensional, homogeneous, labeled array containing values and an index” [Pandas in Action / Boris Paskhaver. St. Petersburg: Piter, 2023, p. 89].

temperature = df['T (degC)']

So, the first thing we need to do according to the methodology above is to check the existing time series for stationarity (that is, determine the parameter d). This can be done visually from plots (empirically), but the most reliable way is to resort to statistical tests specially created for this purpose. In practice, a combination of two tests is often used: the augmented Dickey-Fuller test (ADF test) and the Kwiatkowski-Phillips-Schmidt-Shin test (KPSS test) [link]. Performing both tests gives a more thorough check of a time series for stationarity. The purpose of the first is to check for the presence of a unit root; however, this test has a number of significant drawbacks [link] that the KPSS test lacks; the purpose of the latter is to check for the presence of trend and seasonality in the data.

In this respect, the methodological approach of the KPSS test differs fundamentally from that of the augmented Dickey-Fuller test; the key difference is that the null and alternative hypotheses are swapped:

  • ADF test: the null hypothesis is that the series has a unit root (is non-stationary); the alternative is that the series is stationary.

  • KPSS test: the null hypothesis is that the series is stationary; the alternative is that the series is non-stationary.

Usually, for both tests, the significance level is taken to be 0.05. By comparing the p-values calculated by the tests with this level, we determine whether the time series is stationary according to each test.

So, let's run the augmented Dickey-Fuller test and extract from its results only the p-value (result[1]), which is then compared with the significance level to reject either the null or the alternative hypothesis. For easier reading of the output, a time series determined to be stationary will be highlighted in green and a non-stationary one in red. If non-stationary time series are detected during testing, their names will also be listed.

def adf_test_columns_stationarity(dataframe):
    """Check the time series for stationarity with the augmented Dickey-Fuller test"""
    msg = "Running the augmented Dickey-Fuller test:"
    print('\n' + msg)
    print('-' * len(msg))
    adf_non_stationary_columns = []

    for column in dataframe.columns:
        result = adfuller(dataframe[column])
        if result[1] < 0.05:
            h_text = colored('stationary', 'grey', 'on_green')
            print(f"Series {column} {h_text}: p-value = {result[1]:.5f}")
        else:
            h_text = colored('non-stationary', 'grey', 'on_red')
            print(f"Series {column} {h_text}: p-value = {result[1]:.5f} > 0.05")
            adf_non_stationary_columns.append(column)

    if len(adf_non_stationary_columns) > 0:
        msg = "Non-stationary time series:"
        print('\n' + msg)
        print('-' * len(msg))
        print(*adf_non_stationary_columns, sep='\n')


adf_test_columns_stationarity(dataframe=df)

The test results are presented below:

As can be seen, all time series are determined to be stationary by the augmented Dickey-Fuller test. Next, we check the time series for stationarity using the KPSS test:

def kpss_test_columns_stationarity(dataframe):
    """Check the time series for stationarity with the KPSS test"""
    msg = "Running the Kwiatkowski-Phillips-Schmidt-Shin test:"
    print('\n' + msg)
    print('-' * len(msg))
    kpss_non_stationary_columns = []

    for column in dataframe.columns:
        result = kpss(dataframe[column])
        if result[1] > 0.05:
            h_text = colored('stationary', 'grey', 'on_green')
            print(f"Series {column} {h_text}: p-value = {result[1]:.5f}")
        else:
            h_text = colored('non-stationary', 'grey', 'on_red')
            print(f"Series {column} {h_text}: p-value = {result[1]:.5f} < 0.05")
            kpss_non_stationary_columns.append(column)

    if len(kpss_non_stationary_columns) > 0:
        msg = "Non-stationary time series:"
        print('\n' + msg)
        print('-' * len(msg))
        print(*kpss_non_stationary_columns, sep='\n')


kpss_test_columns_stationarity(dataframe=df)

The test results are presented below:

In contrast to the augmented Dickey-Fuller test, the KPSS test revealed non-stationarity in most of the time series, including the target one. The case where the augmented Dickey-Fuller test indicates stationarity while the KPSS test indicates non-stationarity is characterized by the statsmodels developers [link] as follows: “the series is difference-stationary: differencing is required to make the series stationary.” In our case, we can conclude that the time series are not strictly stationary due to the presence of seasonality/trend.

Please note that by default the KPSS test in the statsmodels library checks the stationarity of a time series around a constant (regression='c'), that is, around the mean of the series. If the null hypothesis of stationarity is rejected, this may indicate the presence of seasonal fluctuations. With regression='ct', the test checks the stationarity of the series around a linear trend [link], which allows for the presence of a trend in the series. In this study we will use the KPSS test with its default settings, and acceptance of its alternative hypothesis will be taken to mean that the series is non-stationary due to seasonal fluctuations.
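A quick check of both formulations on the same series might look like this (a sketch using the temperature series defined above):

# H0: the series is stationary around a constant (the default, regression='c')
print('KPSS (c):  p-value =', kpss(temperature, regression='c')[1])

# H0: the series is stationary around a linear trend (regression='ct')
print('KPSS (ct): p-value =', kpss(temperature, regression='ct')[1])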

But let's take another look at the KPSS test results. Do they remind you of anything? No, I'm not talking about the Communist Party of the Soviet Union, whose Russian abbreviation is also KPSS (as you can see above, understanding how the KPSS test works does not require a party card). (“ha” three times) Remember the plot of the time series over time prepared in the first chapter of our study. The KPSS test directly confirmed our guesses about the periodicity of the data for most of the time series.

[Link to the image from the first chapter]

Looking at the results obtained, I am pleased to report that our case is the most interesting in terms of how the time series are distributed by stationarity. Both the target time series (temperature) and most of the other series that we planned to use as exogenous are non-stationary in seasonality/trend. We will therefore have to run many experiments, and based on their results we will try to answer some questions that statisticians and data researchers have not yet settled. For example: when using time series as exogenous regressors, is it necessary to difference them beforehand if tests show them to be non-stationary? An amusing oddity here is that some books and articles cited as arguments by the authors of such questions on stats.stackexchange.com state one thing, while other articles on forecasting state the opposite. Another example: does setting the parameter d > 0 in the SARIMAX model imply automatic differencing not only of the endogenous regressor but also of the exogenous ones? These are interesting questions, and to resolve them we will use a powerful and proven tool: the experiment, which, as we know, is the most important thing in our research. We will perform a series of experiments and then analyze the results.

So, given that our case is not only the most interesting but also expensive in terms of the number of fascinating experiments, it will be correct, as with any complex problem, to break it down into several steps and move from the simplest solution to the complex one. To solve the temperature prediction problem, we will therefore first consider only one time series, the temperature itself, which will act as the endogenous variable, that is, as we remember, one described by the model based on historical information. In English terminology this corresponds to univariate time series forecasting. Once we have found the optimal parameters p, d, and q for this model and obtained good results, we will then try to improve them: first we will add a seasonal component, then exogenous variables, and see whether they improve the result.

Let's get to work!

For convenient reading of the values obtained from the statistical tests, the following function reduces the number of digits in the fractional part of a number to num (3 by default) and leaves integers unchanged:

def round_series_floats(series, num=3):
    """Round floats to num digits after the decimal point"""
    return series.apply(lambda x: x if isinstance(x, int) else round(x, num))
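For example, on a small hand-made Series with mixed types (dtype=object preserves the integer as an int), the function rounds only the floats:

print(round_series_floats(Series([3, 3.14159, 2.71828], dtype=object)))
# 0        3
# 1    3.142
# 2    2.718
# dtype: object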

So, let's perform an in-depth stationarity check of the temperature time series and print the full results of both tests. Additionally, let's build a plot showing the original time series along with its moving average and standard deviation over an annual period of 8760 hours:

def plot_movingAverage(timeseries, adf_p_value, kpss_p_value, window=24*365):
    """Plot for identifying the trend"""
    movingAverage = timeseries.rolling(window=window).mean().dropna()
    movingSTD = timeseries.rolling(window=window).std().dropna()

    x_start, x_end = movingAverage.index[0], movingAverage.index[-1]
    y_start, y_end = movingAverage.iloc[0], movingAverage.iloc[-1]

    pyplot.figure()
    pyplot.plot(timeseries, color="lightgrey", label="Original series")
    pyplot.plot(movingAverage, color="coral", label="Moving average")
    pyplot.plot(movingSTD, color="black", label="Standard deviation")

    pyplot.scatter(x_start, y_start, color="lightsalmon", marker="o")
    pyplot.scatter(x_end, y_end, color="lightsalmon", marker="o")

    pyplot.text(x_start, y_start, f'{y_start:.3f}', ha="right")
    pyplot.text(x_end, y_end, f'{y_end:.3f}', ha="left")

    pyplot.plot([x_start, x_end], [y_start, y_end], color="blue",
                dashes=[4, 3], label="Trend")

    text_1 = f"ADF p-value: {adf_p_value:.5f}"
    text_2 = f"KPSS p-value: {kpss_p_value:.5f}"
    pyplot.text(0.845, 0.2, s=text_1 + '\n' + text_2,
                ha="left", va="bottom", transform=pyplot.gca().transAxes,
                bbox=dict(facecolor="blue", alpha=0.2))

    pyplot.xlabel('Observation date', fontsize=12)
    pyplot.ylabel('Temperature (deg C)', fontsize=12)
    pyplot.legend(loc="lower right")
    pyplot.title(f"\"{timeseries.name}\": moving average and standard deviation (window = {window})")
    pyplot.grid(True)
    pyplot.show()

def test_stationarity(timeseries, plot=True):
    """Run an in-depth stationarity check of a time series"""
    print(f"\nRunning an in-depth stationarity check of the series \"{timeseries.name}\":")
    adf_test = adfuller(timeseries)
    adf_result = Series(adf_test[0:4],
                        index=['Test statistic',
                               'p-value',
                               'Lags used',
                               'Number of observations used']
                        )
    for key, value in adf_test[4].items():
        adf_result[f"Critical value {key}"] = value
    adf_result = round_series_floats(adf_result)

    kpss_test = kpss(timeseries)
    kpss_result = Series(kpss_test[0:3], index=['Test statistic',
                                                'p-value',
                                                'Lags used'])
    for key, value in kpss_test[3].items():
        kpss_result[f"Critical value {key}"] = value
    kpss_result = round_series_floats(kpss_result)

    msg = "Dickey-Fuller test results (ADF test):"
    print('\n' + msg)
    print('-' * len(msg))
    for index, value in adf_result.items():
        sep = ' ' * (40 - len(index))
        print(f'{index}: {sep} {value}')

    msg = "Kwiatkowski-Phillips-Schmidt-Shin test results (KPSS test):"
    print('\n' + msg)
    print('-' * len(msg))
    for index, value in kpss_result.items():
        sep = ' ' * (40 - len(index))
        print(f'{index}: {sep} {value}')

    if plot:
        plot_movingAverage(timeseries=timeseries,
                           adf_p_value=adf_result['p-value'],
                           kpss_p_value=kpss_result['p-value'])

        
test_stationarity(timeseries=temperature)

The code may seem difficult to understand, but it lets us output all the information needed to draw correct conclusions about the stationarity of the time series. The test_stationarity() function performs the statistical tests described above and prints the full results in an easy-to-read format. The plot_movingAverage() function builds a plot that helps to visually verify the presence of a positive trend in the data: it shows the original time series as a background together with a moving average and a standard deviation with a window of 8760. We will mark the starting and ending points of the moving average on the plot, determine their coordinates from its data, and for convenience label them with the corresponding temperature values. We will draw a segment through these points, by which we will judge the trend in the data. Additionally, we will show the p-values of both tests on the plot.

Agree, it’s not that difficult.

The results of executing the functions are shown below.

If we did not draw the additional segment together with the plot grid, we might assume that both the trend and the standard deviation are constant over time (= stationary). However, as we can see, this is not the case: the plot shows a subtle upward trend in temperature over time.

So, we've sorted out the plot: we've confirmed that the temperature time series tends to increase. Next, let's look at the results of the statistical tests. For convenience, we'll combine them together with their analysis into a common table.

Thus, from the above results and their analysis it follows that the time series “Temperature” is stationary according to the augmented Dickey-Fuller test and non-stationary according to the KPSS test. Using the additionally constructed plot of the original time series together with the moving average and standard deviation over a period of 8760, as well as the segment drawn through the endpoints of the moving average, we have verified that the series has an upward temperature trend. We therefore accept that the time series is non-stationary both in seasonality and in trend, and hence the operation of taking successive differences is necessary to achieve stationarity.

Let's apply first-order differencing [link] to the temperature (periods=1) and run the stationarity check using the following function, whose core is already familiar to you:

def series_p_value_verification(series, periods=None):
    """Check a time series for stationarity with both tests"""
    msg = "Running the augmented Dickey-Fuller test:"
    print('\n' + msg)
    print('-' * len(msg))

    if periods:
        name = series.name
        series = series.diff(periods=periods).dropna()
        series.rename(name + ' diff=' + str(periods), inplace=True)

    adf_result = adfuller(series)
    if adf_result[1] < 0.05:
        h_text = colored('stationary', 'grey', 'on_green')
        print(f"Series {series.name} {h_text}: p-value = {adf_result[1]:.5f}")
    else:
        h_text = colored('non-stationary', 'grey', 'on_red')
        print(f"Series {series.name} {h_text}: p-value = {adf_result[1]:.5f} > 0.05")

    msg = "Running the Kwiatkowski-Phillips-Schmidt-Shin test:"
    print('\n' + msg)
    print('-' * len(msg))

    kpss_result = kpss(series)
    if kpss_result[1] > 0.05:
        h_text = colored('stationary', 'grey', 'on_green')
        print(f"Series {series.name} {h_text}: p-value = {kpss_result[1]:.5f}")
    else:
        h_text = colored('non-stationary', 'grey', 'on_red')
        print(f"Series {series.name} {h_text}: p-value = {kpss_result[1]:.5f} < 0.05")


series_p_value_verification(series=temperature, periods=1)

The results are presented below.

The results show that a first-order difference was sufficient to achieve stationarity, and no further differencing is required. Thus, we take the parameter d for the target time series equal to 1. Let me remind you that there is no need to difference the endogenous time series manually before training the model: the ARIMA model will automatically apply the successive-difference operation according to the specified parameter d. Keep in mind that “in practice d is usually 0, 1, or at most 2,” as G. Box and G. Jenkins note in their book.
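In other words, it is enough to pass the original, undifferenced series and specify d in the order; a sketch with placeholder p and q values:

# The original series goes in as-is; d=1 tells the model
# to apply first-order differencing internally
model = ARIMA(temperature, order=(1, 1, 1))  # p and q are placeholders here
fitted = model.fit()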

To determine the parameters p and q, we will use the following function, which plots the stationary first-order differenced series, the autocorrelation function (ACF) and partial autocorrelation function (PACF) with up to 120 lags, and also the ACF with the number of lags equal to the number of observations in the series, for a better view of the long-term relationship between the levels (lags) of the stationary series. We will also use the ACF plots to analyze our data for seasonality.

def plot_target_analysis(series, periods=None, lags=None):
    """Plots of the target series' characteristics for determining the parameters p and q"""
    if periods:
        series = series.diff(periods=periods).dropna()

    fig, axs = pyplot.subplots(2, 3)

    axs[0, 0].plot(series)
    axs[0, 0].set_title(f"Series with {periods}-order difference")
    plot_acf(series, ax=axs[0, 1], lags=lags)
    axs[0, 1].set_title(f"ACF ({periods}-order difference)")
    plot_pacf(series, ax=axs[0, 2], lags=lags)
    axs[0, 2].set_title(f"PACF ({periods}-order difference)")

    for ax in axs[1, :]:
        ax.remove()

    axs[1, 0] = fig.add_subplot(212)

    plot_acf(series, ax=axs[1, 0], lags=arange(len(series)), marker="")
    axs[1, 0].set_title(f"Autocorrelation plot of the {periods}-order differenced series")

    pyplot.show()


plot_target_analysis(series=temperature, periods=1, lags=120)

Hmm… Dear friends, here we are!

In addition to the annual seasonality, which appears in the lower ACF plot as a sinusoid decaying as the number of lags grows (further on we will display this plot separately for better understanding), a strong daily seasonality with a period of 24 was also found, clearly visible in the upper ACF plot. The difficulty is that the SARIMA model does not directly support more than one seasonality, and implementing it in such a case is generally considered a poor idea. I note that special models have been developed for cases with multiple seasonalities in the data, one of which is TBATS. TBATS is an abbreviation of its built-in components: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend, and Seasonal components. This model deserves separate consideration not only because it can take several seasonal components into account, but also because it can apply the Box-Cox transformation to the time series data to bring their distribution closer to normal [link], which was mentioned at the end of the first chapter of our study (the normal distribution of measured characteristics is important in statistics not only for simplicity of analysis but also for ease of interpreting test results). However, it also has its drawbacks, the main one being slow computation.
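For completeness, here is a sketch of how the two seasonalities found here could be handed to TBATS, assuming the third-party tbats package (pip install tbats); on a long hourly series the fit can be very slow, which is exactly the drawback mentioned above:

from tbats import TBATS  # third-party package, an assumption here

# Two seasonal periods: daily (24 hours) and annual (8760 hours)
estimator = TBATS(seasonal_periods=[24, 24 * 365])
model = estimator.fit(temperature)   # slow on ~70k hourly observations
forecast = model.forecast(steps=24)  # one day ahead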

Strictly speaking, to achieve the best results when a time series contains multiple seasonalities, it is good practice to use several strategies and approaches. A straightforward solution is to split the data by year and model each seasonality separately, then sum the results. Our data allow us to do this. But we will leave that task to the next generation and take the difficult and frustrating path of trying to implement the SARIMA model for the temperature forecasting problem with two seasonalities taken into account.

But we will talk about this, dear friends, in the next, final part of the second chapter of our research.
