Who will win the battle for the forecast? Chapter One

After a break, I continue the series of articles about one of the most interesting areas in statistics and data science – time series forecasting. This work is not a translation with my comments but a full-fledged study of the effectiveness of forecasting models: we will build and compare two time series forecasting models – a traditional statistical one, SARIMAX (an implementation of ARIMA with a seasonal component and exogenous variables), and a recurrent deep learning model based on the LSTM layer. Let's find out which of them copes better with the climate data that François Chollet prepared for his book “Deep Learning with Python”, the second edition of which was published in 2023. The second edition has been significantly revised to keep up with the times, and I highly recommend this book both to novice data analysts and to seasoned data scientists with a background in time series.

Along the way, I will answer the questions that have accumulated from community members, both about preparing data for recurrent neural networks and about the details of how trained models are used afterwards.

The code provided in the article is backed by my experience and tested in practice – I actively use it in machine learning projects, and I am sharing it with you. Before diving in, I recommend refreshing your knowledge of univariate and multivariate time series, as well as point (single-step) and interval (multi-step) forecasting and their implementation (link to the article).

Finally, we will consider the approach to temperature forecasting based on the LSTM layer that François describes in his book, and we will try to figure out what is wrong with it: the forecasting goals set by the author are not properly achieved.

Anyway, everyone fasten your seat belts: we're about to begin.

To make the examples easy to reproduce, a link to my notebook on kaggle.com will be provided at the end of the article. It contains the code with minimal commentary, without the explanations that I provide in this paper, and I will update it as parts of the study are published.

All code presented in this article was developed in the Spyder IDE and is written in a functional style. It should help you better understand the various tasks involved in such an interesting area of econometrics as time series forecasting.

Python environment parameters:

Python 3.10.9
matplotlib 3.8.4
numpy 1.26.4
pandas 2.2.2
statsmodels 0.14.2
keras/tensorflow 2.11.0

To reduce computation time, we will use a reduced data set, so most of the examples in this study can be run on a CPU. However, for those who want to do additional research on the original climate data set, I recommend using Google Colab.

The study consists of the following chapters:

  1. Description of the climate dataset and its preliminary preparation.

  2. Statistical analysis of the data, including estimation of the seasonal component, for developing models from the ARIMA family for temperature forecasting.

  3. Developing Keras deep learning models with the LSTM layer for the same task.

  4. Comparison and evaluation of the effectiveness of forecasting results.

  5. Examination of the approach of predicting the shifted (offset) temperature with a Keras deep learning model using LSTM layers.

The second chapter, due to the large amount of information, will be divided into two parts.

All examples in this article use time-series weather data recorded at the hydrometeorological station at the Max Planck Institute for Biogeochemistry (link to the site).

This dataset includes measurements of 14 different meteorological parameters (such as air temperature, atmospheric pressure, and humidity) taken every 10 minutes since 2003. To save time, we will analyze the data covering the period from 2009 to 2016 that François Chollet prepared for both editions of his book “Deep Learning with Python”.

There are several ways to download the data archive and unpack it into the current working directory: either with the get_file utility from the tensorflow library, as in the first edition of the book:

import tensorflow as tf
import os
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname="jena_climate_2009_2016.csv.zip",
    extract=True)
csv_path, _ = os.path.splitext(zip_path)

Or with the shell commands from the second edition of François's book, which is what we will use:

!wget https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
!unzip jena_climate_2009_2016.csv.zip

This dataset is well suited for practicing time series forecasting skills and techniques, so I recommend saving it to your local computer for easy access in the future.

First, let's import the necessary modules and attributes that we will use in the first chapter of the study:

from matplotlib import pyplot
from numpy import nan, where
from pandas import read_csv, DateOffset

Also, for better presentation of the graphs, we will change the following pyplot parameters of the matplotlib library to custom values. I set the figure size to (14, 7) based on the resolution of the laptop I used for this study.

pyplot.rcParams["figure.autolayout"]=True
pyplot.rcParams["figure.figsize"]=(14, 7)

Once we have saved and unzipped the dataset archive to our local computer, it is time to import it for further analysis. The code below first imports the file using the read_csv function of the pandas library, which takes the file name (with its extension) as the first argument; as the index (the set of identifying labels attached to a pandas data structure) we will use the Date Time column, and, so that pandas does not import the index values as strings, we will have it parse them as date/time stamps (the parse_dates flag together with the date_format argument).

Please note that, unlike the temperature forecasting examples from François's book, we will use the data not at a ten-minute resolution but at an hourly one: firstly, this is more interesting than copying the analysis from the book, and secondly, it will save a lot of time in further calculations. For this we use the resample() method with the 'h' (hour) parameter: the data are grouped into bins of a given duration (in our case one hour, which covers six ten-minute rows – from 00:00 to 00:50 inclusive), and the mean() method returns the average value for each group (each hour).

def open_csv_and_data_resample(filename):
    """Open the CSV file and do the initial preparation of the dataset"""
    dataframe = read_csv(filename, index_col="Date Time", parse_dates=True,
                         date_format="%d.%m.%Y %H:%M:%S")
    dataframe = dataframe.resample('h').mean()
    return dataframe


df = open_csv_and_data_resample(filename="jena_climate_2009_2016.csv")

So we have imported the CSV file with the given parameters and converted the original ten-minute measurements to hourly ones, taking the average of every numerical column.

The resulting data set has the following form:

print(df.shape)

(70129, 14)

Let's display the names of the time series:

print(*df.columns, sep='\n')

p (mbar)
T (degC)
Tpot (K)
Tdew (degC)
rh (%)
VPmax (mbar)
VPact (mbar)
VPdef (mbar)
sh (g/kg)
H2OC (mmol/mol)
rho (g/m**3)
wv (m/s)
max. wv (m/s)
wd (deg)

Below are the corresponding explanations:

"Pressure",
"Temperature",
"Temperature in Kelvin",
"Dew point temperature",
"Relative humidity",
"Saturation vapor pressure",
"Vapor pressure",
"Vapor pressure deficit",
"Specific humidity",
"Water vapor concentration",
"Air density",
"Wind speed",
"Maximum wind speed",
"Wind direction in degrees"

Let's look at the first and last five rows of the data set:

print(df.head())


                       p (mbar)  T (degC)  ...  max. wv (m/s)    wd (deg)
Date Time                                  ...                           
2009-01-01 00:00:00  996.528000 -8.304000  ...       1.002000  174.460000
2009-01-01 01:00:00  996.525000 -8.065000  ...       0.711667  172.416667
2009-01-01 02:00:00  996.745000 -8.763333  ...       0.606667  196.816667
2009-01-01 03:00:00  996.986667 -8.896667  ...       0.606667  157.083333
2009-01-01 04:00:00  997.158333 -9.348333  ...       0.670000  150.093333

[5 rows x 14 columns]
print(df.tail())

                        p (mbar)  T (degC)  ...  max. wv (m/s)    wd (deg)
Date Time                                   ...                           
2016-12-31 20:00:00  1001.410000 -2.503333  ...       1.526667  203.533333
2016-12-31 21:00:00  1001.063333 -2.653333  ...       1.250000   98.366667
2016-12-31 22:00:00  1000.511667 -3.553333  ...       1.410000  167.958333
2016-12-31 23:00:00   999.991667 -3.746667  ...       1.650000  223.600000
2017-01-01 00:00:00   999.820000 -4.820000  ...       1.960000  184.900000

[5 rows x 14 columns]

From the above tables, we can see that the observation recording period is one hour. Thus, during one day we will have 24 observations, and during one normal year (not a leap year) 8760 (24×365) observations should accumulate.

From the last rows we see that a single measurement from 2017 has been appended to the set; I suggest getting rid of it so that the data are displayed correctly in the subsequent graphs. We will do this with pandas filtering:

df = df[df.index.year < 2017]

Let's output information about the dataset along with metadata:

print(df.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 70128 entries, 2009-01-01 00:00:00 to 2016-12-31 23:00:00
Freq: h
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   p (mbar)         70040 non-null  float64
 1   T (degC)         70040 non-null  float64
 2   Tpot (K)         70040 non-null  float64
 3   Tdew (degC)      70040 non-null  float64
 4   rh (%)           70040 non-null  float64
 5   VPmax (mbar)     70040 non-null  float64
 6   VPact (mbar)     70040 non-null  float64
 7   VPdef (mbar)     70040 non-null  float64
 8   sh (g/kg)        70040 non-null  float64
 9   H2OC (mmol/mol)  70040 non-null  float64
 10  rho (g/m**3)     70040 non-null  float64
 11  wv (m/s)         70040 non-null  float64
 12  max. wv (m/s)    70040 non-null  float64
 13  wd (deg)         70040 non-null  float64
dtypes: float64(14)
memory usage: 8.0 MB
None

Currently, the number of observations in each column of the data set is 70128, of which 70040 are non-null; the difference between them indicates that there are missing (empty) values in the data. Also note the index frequency (Freq), which is defined as hourly (h).
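If you prefer to check these properties programmatically rather than reading them off info(), a small side check (not part of the main pipeline) could look like this:

# Sanity check: the index frequency (should be hourly after resampling)
# and the number of rows containing at least one gap
print(df.index.freq)
print(df.isna().any(axis=1).sum())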

Let's count the missing values for each column, as well as the number of duplicate rows:

def detect_mis_and_dub_values(dataframe, print_dubs=True):
    """Find missing and duplicated values"""
    msg = "Number of missing values:"
    print('\n' + msg)
    print('-' * len(msg))
    print(dataframe.isna().sum().sort_values(ascending=False))

    if print_dubs:
        msg = "Number of duplicates:"
        print('\n' + msg)
        print('-' * len(msg))
        print(dataframe.duplicated().sum())


detect_mis_and_dub_values(dataframe=df)
Number of missing values:
-------------------------
p (mbar)           88
T (degC)           88
Tpot (K)           88
Tdew (degC)        88
rh (%)             88
VPmax (mbar)       88
VPact (mbar)       88
VPdef (mbar)       88
sh (g/kg)          88
H2OC (mmol/mol)    88
rho (g/m**3)       88
wv (m/s)           88
max. wv (m/s)      88
wd (deg)           88
dtype: int64

Number of duplicates:
---------------------
87

So, each time series has 88 empty values. The 87 duplicates are most likely connected with the rows that consist entirely of missing values. Let's find all the rows with missing values and check this: first we create a new DataFrame variable nan_rows and, using pandas filtering, assign to it all the rows containing NaN values; then we print its first and last five rows.

nan_rows = df[df.isnull().any(axis=1)]

print(nan_rows.head())
print(nan_rows.tail())

                     p (mbar)  T (degC)  ...  max. wv (m/s)  wd (deg)
Date Time                                ...                         
2014-09-24 18:00:00       NaN       NaN  ...            NaN       NaN
2014-09-24 19:00:00       NaN       NaN  ...            NaN       NaN
2014-09-24 20:00:00       NaN       NaN  ...            NaN       NaN
2014-09-24 21:00:00       NaN       NaN  ...            NaN       NaN
2014-09-24 22:00:00       NaN       NaN  ...            NaN       NaN

[5 rows x 14 columns]


                     p (mbar)  T (degC)  ...  max. wv (m/s)  wd (deg)
Date Time                                ...                         
2016-10-28 07:00:00       NaN       NaN  ...            NaN       NaN
2016-10-28 08:00:00       NaN       NaN  ...            NaN       NaN
2016-10-28 09:00:00       NaN       NaN  ...            NaN       NaN
2016-10-28 10:00:00       NaN       NaN  ...            NaN       NaN
2016-10-28 11:00:00       NaN       NaN  ...            NaN       NaN

[5 rows x 14 columns]

Our hypothesis about the connection between rows with missing data and duplicates was confirmed. Note that empty values ​​are present in observations recorded in 2014 and 2016; no gaps were found in the preceding years (2013 and 2015, respectively).
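Since the plan is to fill these gaps with the values recorded exactly one year earlier, it is worth verifying that the corresponding previous-year timestamps contain no gaps themselves. A minimal check, reusing the nan_rows variable and the already imported DateOffset, might look like this:

# Timestamps exactly one year before each row with gaps
prev_year_index = nan_rows.index - DateOffset(years=1)
# Should print False if the previous-year rows are complete
print(df.loc[prev_year_index].isna().any().any())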

Before we handle the missing values, let's look at how each time series is distributed over time (this plot will be tall, because we changed the default plotting settings to custom ones for better presentation of the information):

df.plot(subplots=True, layout=(7,2))
pyplot.show()

Looking at the graph, one can see that most of the available time series exhibit recurring annual fluctuations, which are usually associated with seasonal changes.

So, since the dataset has strictly spaced date/time stamps with a frequency of one hour and the data shows annual periodicity, I propose handling the missing values with the following logic. Simply deleting rows would break the frequency and later cause errors when training statistical forecasting models: statsmodels would methodically issue warnings about the broken order of the data. Therefore, with the function below we will replace every NaN value in each column with the value recorded one year earlier.

As in the example of finding missing values, we use pandas filtering to create a new DataFrame variable containing the rows with NaN values (nan_rows). Then, with a for loop, we walk through each index (i.e. timestamp) in nan_rows and, using pandas DateOffset(years=1), compute for each found index nan_index the timestamp one year earlier (prev_year_timestamp). After that, all that remains is to replace the NaN values in the entire data set (or only in the columns explicitly listed in the colls function argument) with the corresponding values from the same time one year earlier.

def replace_nan(dataframe, colls=None):
    """Поиск и замена пропущенных значений на значения предыдущего года"""
    nan_rows = dataframe[dataframe.isnull().any(axis=1)]
    for nan_index in nan_rows.index:
        prev_year_timestamp = nan_index - DateOffset(years=1)
        if colls:
            dataframe.loc[nan_index, colls] = dataframe.loc[prev_year_timestamp, colls]
        else:
            dataframe.loc[nan_index] = dataframe.loc[prev_year_timestamp]
    return dataframe


df = replace_nan(dataframe=df)

Let's run the command and recheck the data set for missing values:

detect_mis_and_dub_values(dataframe=df, print_dubs=False)

Number of missing values:
-------------------------
p (mbar)           0
T (degC)           0
Tpot (K)           0
Tdew (degC)        0
rh (%)             0
VPmax (mbar)       0
VPact (mbar)       0
VPdef (mbar)       0
sh (g/kg)          0
H2OC (mmol/mol)    0
rho (g/m**3)       0
wv (m/s)           0
max. wv (m/s)      0
wd (deg)           0
dtype: int64

There are no missing values. Let's move on.

Let's create a separate variable with descriptive statistics and take a look at it:

description = df.describe()

Both the descriptive statistics and the earlier plot of the time series show that two of them – “Wind speed” and “Maximum wind speed” – contain outliers in the form of absurdly negative values. Let's plot these two series separately and take a closer look:

def show_plot_wind_colls():
    """Plot of 'wv (m/s)' and 'max. wv (m/s)'"""
    fig, axs = pyplot.subplots(1, 2)

    axs[0].plot(df['wv (m/s)'], color="orange")
    axs[0].set_title('Wind speed')
    axs[0].set_ylabel('m/s')
    axs[0].set_xlabel('Observation date')

    axs[1].plot(df['max. wv (m/s)'], color="green")
    axs[1].set_title('Maximum wind speed')
    axs[1].set_xlabel('Observation date')

    pyplot.suptitle('"Wind speed" and "Maximum wind speed"')
    pyplot.show()


show_plot_wind_colls()
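The outliers are also easy to spot numerically, using the description variable we created above (a small side check):

# The erroneous negative readings drag the minimum of the wind columns far below zero
print(description.loc['min', ['wv (m/s)', 'max. wv (m/s)']])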

Since wind speed cannot be negative (at least in my mind) and, as mentioned earlier, the data set has strictly spaced date/time stamps, we will process these values with the same logic we used for the missing values: we replace all negative values in the “Wind speed” and “Maximum wind speed” series with NaN and then assign them the values recorded one year earlier.

wind_colls = ['wv (m/s)', 'max. wv (m/s)']
df[wind_colls] = df[wind_colls].apply(lambda x: where(x < 0, nan, x))

Let's see how many empty values ​​we got after this:

detect_mis_and_dub_values(dataframe=df, print_dubs=False)


Number of missing values:
-------------------------
wv (m/s)           4
max. wv (m/s)      4
p (mbar)           0
T (degC)           0
Tpot (K)           0
Tdew (degC)        0
rh (%)             0
VPmax (mbar)       0
VPact (mbar)       0
VPdef (mbar)       0
sh (g/kg)          0
H2OC (mmol/mol)    0
rho (g/m**3)       0
wd (deg)           0
dtype: int64

The “Wind speed” and “Maximum wind speed” time series each have four missing values. Let's replace them with the previous year's values:

df = replace_nan(dataframe=df, colls=wind_colls)

So, not only have we handled all the missing values by replacing them with the previous year's values of the corresponding time series, we have also dealt with the negative wind speed values. However, the duplicate rows detected earlier are still there: their values were simply updated when the NaNs were replaced with the values from a year earlier. I suggest checking this by displaying the first and last five rows of the dubles variable, which we create using the pandas filtering we have already come to love:

dubles = df[df.duplicated()]

print(dubles.head())
print(dubles.tail())

                       p (mbar)   T (degC)  ...  max. wv (m/s)    wd (deg)
Date Time                                   ...                           
2014-09-24 18:00:00  988.471667  15.901667  ...       1.316667  252.283333
2014-09-24 19:00:00  988.553333  15.338333  ...       1.783333  246.783333
2014-09-24 20:00:00  988.718333  15.023333  ...       1.850000  252.083333
2014-09-24 21:00:00  988.735000  14.650000  ...       2.003333  248.000000
2014-09-24 22:00:00  988.798333  13.810000  ...       1.550000  120.810000

[5 rows x 14 columns]

                       p (mbar)  T (degC)  ...  max. wv (m/s)    wd (deg)
Date Time                                  ...                           
2016-10-28 07:00:00  988.013333  0.986667  ...       1.433333  194.750000
2016-10-28 08:00:00  988.035000  1.721667  ...       1.156667  203.750000
2016-10-28 09:00:00  988.080000  3.701667  ...       1.686667  220.550000
2016-10-28 10:00:00  987.733333  5.751667  ...       1.656667  204.766667
2016-10-28 11:00:00  987.243333  8.681667  ...       2.713333  160.683333

[5 rows x 14 columns]

As expected. Of course, we will not get rid of these repeated rows – we'll leave them alone.

Given the above data preparation, let's re-plot the time series distribution over time.

df.plot(subplots=True, layout=(7,2))
pyplot.show()

That's a different matter!

Let us note once again that most of the available time series exhibit a pronounced annual periodicity, which is closely related to the concept of seasonality, the analysis of which we will undertake in the second chapter of the study.
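As a purely illustrative aside (the proper seasonality analysis comes in the next chapter), the annual cycle can also be seen in numbers by averaging, for example, the temperature over calendar months:

# Mean temperature for each calendar month, averaged over all years
monthly_mean_temp = df['T (degC)'].groupby(df.index.month).mean()
print(monthly_mean_temp.round(2))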

Let's update the descriptive statistics:

description = df.describe()

Let's output the final shape of the data set and information about it:

print(df.shape)

(70128, 14)


print(df.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 70128 entries, 2009-01-01 00:00:00 to 2016-12-31 23:00:00
Freq: h
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   p (mbar)         70128 non-null  float64
 1   T (degC)         70128 non-null  float64
 2   Tpot (K)         70128 non-null  float64
 3   Tdew (degC)      70128 non-null  float64
 4   rh (%)           70128 non-null  float64
 5   VPmax (mbar)     70128 non-null  float64
 6   VPact (mbar)     70128 non-null  float64
 7   VPdef (mbar)     70128 non-null  float64
 8   sh (g/kg)        70128 non-null  float64
 9   H2OC (mmol/mol)  70128 non-null  float64
 10  rho (g/m**3)     70128 non-null  float64
 11  wv (m/s)         70128 non-null  float64
 12  max. wv (m/s)    70128 non-null  float64
 13  wd (deg)         70128 non-null  float64
dtypes: float64(14)
memory usage: 10.0 MB
None

Thus, the data set is presented in the form of 14 time series, comprising 70128 hourly observations, the registration of which covers the period from 2009-01-01 00:00:00 to 2016-12-31 23:00:00 inclusive.

As you remember, it was noted at the beginning that a normal year, given the hourly frequency of the data, contains 8760 (24×365) observations. However, our climate data set includes two leap years – 2012 and 2016 – which, due to the extra day, contain 8784 (24×366) observations each. The leap years explain the resulting difference of 48 observations: 24×365×8 = 70080, and 70080 + 2×24 = 70128.
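This arithmetic is easy to verify by counting the observations per calendar year (a small check: the normal years should come out at 8760 rows, while 2012 and 2016 should come out at 8784):

# Number of hourly observations recorded in each year
print(df.groupby(df.index.year).size())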

Identifying periodicity in the data is an important step, because it is directly reflected in the seasonality parameter m of statistical models with a seasonal component such as SARIMA. We will leave the study of the influence of leap years outside the scope of this work; by and large, separate models would need to be built for them. At the same time, we will not remove the leap years or their extra days from the data set, since later we will try to run calculations with the seasonal parameter m=8760. As a result, the extra hours of the leap years are effectively ignored: the model will only capture the seasonal patterns common to all years.

m=8760 … This is not a typo! What could be simpler than computing a seasonal period of 8760 observations!? It is a great way to test how long your computer lasts before it is overwhelmed by the mathematical and algorithmic complexity of a statistical model like SARIMA. Jokes aside, this is one of the main drawbacks of statistical models with a seasonal component when applied to “big” data, due to the lack of computational optimization. We will discuss seasonality, the statistical methods for determining it, and the attempts to compute it in more detail in the second chapter of the study.

Let's save the prepared version of the data set, together with its date/time stamps, to a separate file in the Parquet format named “jena_climate_2009_2016_hour_grouped_data”:

df.to_parquet('jena_climate_2009_2016_hour_grouped_data.parquet')

The Parquet format is preferable to CSV because by default it preserves all the metadata of the dataset (index, column types, etc.) and is more efficient in terms of performance and storage. When a file is saved as CSV, much of the metadata is lost (in particular, the data frequency – freq), so when importing it later you have to explicitly specify the index column and its type/format. The Parquet format, I repeat, does this by default.
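If you want to convince yourself of this, a quick round-trip check can be run (assuming a Parquet engine such as pyarrow is installed; the index type and, per the claim above, its frequency should survive the round trip):

from pandas import read_parquet

# Read the file back and inspect the restored index metadata
df_check = read_parquet('jena_climate_2009_2016_hour_grouped_data.parquet')
print(df_check.index.dtype, df_check.index.freq)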

Now let's look separately at the target time series “Temperature”, the values ​​of which we will forecast in the following research examples:

temperature = df['T (degC)']

def show_plot_temperature():
    """Plot of the temperature and its distribution"""
    fig, axs = pyplot.subplots(2, 1)

    axs[0].plot(temperature)
    axs[0].set_title('Temperature')
    axs[0].set_xlabel('Observation date')
    axs[0].set_ylabel('Degrees Celsius')
    axs[0].grid()

    axs[1].hist(temperature.to_numpy(), bins=50, edgecolor="black")
    axs[1].set_title('Histogram of the temperature distribution')
    axs[1].set_xlabel('Degrees Celsius')
    axs[1].set_ylabel('Frequency')

    pyplot.show()


show_plot_temperature()

It should be noted that the temperature shows a clear annual periodicity, with most values lying in the range from -5 to 25 °C.

Looking at the histogram, one might assume that the data are normally distributed, but this is not the case. In this study we will not formally test the time series for normality. However, regarding the target temperature series, I will note that according to the quantile–quantile plot (QQ plot) and the Shapiro–Wilk test, the temperature data are not normally distributed. This can make the results of statistical tests and confidence interval calculations unreliable. There are methods for transforming time series data towards a normal distribution, but they deserve separate treatment in a separate paper. In the end, normality of the input data is not a mandatory condition for the ARIMA family of models: their development relies mainly on the concept of stationarity of a time series (from the Latin stationarius – standing still, unchanging), which I propose to tackle in the next chapter of the study. That chapter will also cover splitting the data into training, validation and test parts, as well as the forecasting goals.
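For readers who want to reproduce that normality check themselves, here is a minimal sketch of how it could be done with statsmodels' qqplot and the Shapiro–Wilk test from scipy (scipy is assumed to be installed; note that for a sample this large the Shapiro–Wilk p-value should be treated with caution, and this check is not part of the article's main pipeline):

from scipy.stats import shapiro
from statsmodels.graphics.gofplots import qqplot

# QQ plot of the temperature against theoretical normal quantiles
qqplot(temperature.to_numpy(), line='s')
pyplot.show()

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
stat, p_value = shapiro(temperature.to_numpy())
print(f"W-statistic: {stat:.4f}, p-value: {p_value:.4f}")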

A version of the data preparation code without plotting, suitable for running from the terminal, is presented below:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from numpy import nan, where
from pandas import read_csv, DateOffset

def open_csv_and_data_resample(filename):
    """Open the CSV file and do the initial preparation of the dataset."""
    dataframe = read_csv(filename, index_col="Date Time", parse_dates=True,
                         date_format="%d.%m.%Y %H:%M:%S")
    dataframe = dataframe.resample('h').mean()
    return dataframe

def replace_nan(dataframe, colls=None):
    """Поиск и замена пропущенных значений на значения предыдущего года"""
    nan_rows = dataframe[dataframe.isnull().any(axis=1)]
    for nan_index in nan_rows.index:
        prev_year_timestamp = nan_index - DateOffset(years=1)
        if colls:
            dataframe.loc[nan_index, colls] = dataframe.loc[prev_year_timestamp, colls]
        else:
            dataframe.loc[nan_index] = dataframe.loc[prev_year_timestamp]
    return dataframe

def run():
    print('\n'"Выполняется открытие и преобразование файла ..."'\n')
    df = open_csv_and_data_resample(filename="jena_climate_2009_2016.csv")
    print("Выполняется фильтрация данных ..."'\n')
    df = df[df.index.year < 2017]
    print("Выполняется поиск и замена пропущенных значений ..."'\n')
    df = replace_nan(dataframe=df)
    wind_colls = ['wv (m/s)', 'max. wv (m/s)']
    df[wind_colls] = df[wind_colls].apply(lambda x: where(x < 0, nan, x))
    df = replace_nan(dataframe=df, colls=wind_colls)
    print("Сохранение подготовленного набора данных в отдельный файл ..."'\n')
    df.to_parquet('jena_climate_2009_2016_hour_grouped_data.parquet')
    print("... завершено"'\n')

    

if __name__ == "__main__":
    run()
