Forecasting is an important economic tool. It enables rational purchasing, long-term action plans or, as in the case of an audit, forecasting future costs. Forecasting is also one of the fields of Data Science.
Let’s look at how to create a simple predictive model based on a linear trend using econometric methods.
Let’s take some dataset (can be found in the Github repository, link at the end of the article). Let’s assume that genesis doesn’t matter (author’s note – origin), but let’s take into account that the data have a normal distribution:
import pandas as pd excel_data_df = pd.read_excel('analysis.xlsx') excel_data_df.head()
The first step in our work is anti-aliasing. The simplest method for smoothing series is moving average. Its meaning is to select odd sequences (3, 5, 7) and convert the center point to the arithmetic mean of neighboring values. For example:
(8 + 12 + 11 + 9 + 10) / 5
There is also a weighted moving average formula. Its difference is that the weight of the center point doubles when recalculated:
(8 + 12 + (2 * 11) + 9 + 10) / 6
This method is simple and easy to use. However, it also has its drawbacks: so, we either lose the extreme values of the series, or leave them in their original form.
But, we will apply it while maintaining the original extreme points, stopping at a sequence of three elements:
yts =  yts.append(excel_data_df['yt']) i=0 for element in excel_data_df['yt']: i=i+1 if(i>excel_data_df['yt'].size-2): pass else: yts_per = (excel_data_df['yt'][i-1]+2*excel_data_df['yt'][i]+excel_data_df['yt'][i+1])/4 yts.append(yts_per) yts.append(excel_data_df['yt'][i-1])
The output will be a new column:
excel_data_df['yts'] = yts excel_data_df.head()
Now let’s build graphs by comparing the initial data and the data, after applying anti-aliasing (blue – original data, orange – smoothed):
import matplotlib.pyplot as plt plt.plot(excel_data_df['t'], excel_data_df['yt']) plt.plot(excel_data_df['t'], excel_data_df['yts'])
It is clearly seen that the bursts of the chart have become more similar to the real situation. Now we can start finding the coefficients of the system of linear algebraic equations (SLAE), the solution of which will allow us to obtain the weights of the trend formula for our time series.
t_list=excel_data_df['t'].values.tolist() yts_list=excel_data_df['yts'].values.tolist() i=0 t_sum=0 t2_sum=0 yts_sum=0 ytst_sum=0 while i<len(t_list): t_sum=t_list[i]+t_sum t2_sum=t_list[i]*t_list[i]+t2_sum yts_sum=yts_list[i]+yts_sum ytst_sum=yts_list[i]*t_list[i]+ytst_sum i=i+1 print("Сумма по t:", t_sum, "| Сумма по t^2:", t2_sum, "| Сумма по yt:", yts_sum, "| Сумма по yt*t:", ytst_sum)
The first coefficient (taken as k1) is the number of values of the series
Matrix = np.array([[len(t_list), t_sum], [t_sum, t2_sum]])
Vektor = np.array([yts_sum, ytst_sum])
The Matrix array is the left side of the SLAE, the Vektor array is the free members, or the right side of the SLAE.
Having received the roots, we can construct a function of the form:
which is a linear trend.
for ease of perception.
b1Is a non-random function that describes the relationship between value and time, and b0 Is a random deviation. Let’s check the resulting formula with Excel calculations:
Let’s check the resulting formula with Excel calculations:
On the chart, we can see the trend line itself, make sure that our formula is correct, and also see the value of R2. R2 is Pearson’s coefficient of determination, which calculates how much a given trend can accurately describe a time series.
Roughly speaking, this is the accuracy of the forecast. A value greater than 0.8 indicates good predictability. Our model has a predictability of 0.96. If the value is less than 0.8, it would be worth paying attention to other models, let’s say exponential and polynomial.
Thus, we were able to quickly and accurately create the simplest predictive model. And finally, it remains only to look at the resulting forecast:
The code and source data can be found at: https://github.com/ikarteeva/econometrictrend