Mathematical predictions:

We often notice patterns in the processes happening around us. It seems obvious that demand for slippers / swimming trunks / ice cream goes up in the summer. Such predictions do not require any sophisticated mental work. However, human genius, in its drive to formalize and automate everything, has come up with a pile of methods for studying these patterns (and, of course, just as many ways to make a mess out of nothing). This is what we will talk about today.

For example, suppose we want to understand what determines the price of apartments in a city. The key parameters that explain most of the price would be the number of rooms, the quality of the finish, the age and type of the building, the area, and the availability of various public amenities such as the metro, bus stops, schools, and so on.

To predict the price, we can build a model that takes all of the listed parameters into account and, for each particular combination of their values, gives a specific estimate of the apartment's price. For the average person, the model is a black box: we feed in data and get a result. Like a calculator: we enter a complex expression and get an answer. What is inside is unimportant to the average person (in principle, tarot is also a kind of model). However, we will take a small look behind the scenes of the calculations to understand at what point things stop going according to plan.

The type of model is chosen based on the available data. Most often it is simply a weighted sum of the various numerical characteristics of the object being studied. For example, this is what a typical regression model might look like:

rent-cost-of-apartment = A * number-of-rooms + B * time-to-metro-in-minutes-on-foot + C * age-of-house

(For those who remember the godforsaken mathematics)

explained-parameters = function(explaining-parameters)

Here the numbers A, B and C are determined from the data we are looking at. There is a rigorous mathematical theory describing how these numbers should be chosen, but we will not go into it now.
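
To make this a bit more concrete, here is a minimal sketch of how such numbers could be estimated with ordinary least squares in Python. The apartments, rents and column order are made up purely for illustration; only the general idea (fitting a weighted sum to observed prices) comes from the text above.

import numpy as np

# Hypothetical observations: each row is one apartment,
# columns are [number of rooms, minutes to the metro on foot, age of the house].
X = np.array([
    [1,  5, 40],
    [2, 15, 10],
    [3, 10, 25],
    [2, 30,  5],
    [1, 20, 60],
], dtype=float)

# Observed monthly rents for those apartments (made-up numbers).
y = np.array([700, 950, 1100, 800, 550], dtype=float)

# Least squares picks A, B, C so that
# A*rooms + B*metro + C*age is as close as possible to the observed rents.
A, B, C = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"A (rooms) = {A:.1f}, B (metro) = {B:.1f}, C (age) = {C:.1f}")

# Predict the rent of a new apartment: 2 rooms, 12 minutes to the metro, 30 years old.
print("predicted rent:", np.array([2, 12, 30]) @ np.array([A, B, C]))

In real libraries this job is done by ready-made fitting functions (and usually with an extra constant term), but the underlying idea is exactly this weighted sum.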

However, the data may depend not only on external parameters, but also on each other, and in particular on the order in which we observe them. Let's look at another example. Say we want to predict the demand for ice cream (I want ice cream!). It is logical to assume that demand depends not only on the outside temperature and the weather, but also on the season in general. If we take such nuances into account, our model can become noticeably more accurate.

Okay, but why not just introduce the observation time as a separate parameter? In short, such time dependencies are usually very complex. In the ice cream example, you can expect both an obvious jump in the summer and less obvious fluctuations in demand around holidays throughout the year (how wonderful it is to lie on your belly in front of the TV with an ice cream during the New Year holidays) or people simply getting tired of ice cream (it gets boring by August). Moreover, such trends may well not be tied strictly to any particular date.

But how else can we capture such dependencies? We can group observations by separate time intervals and look at how the parameter being explained (demand in the ice cream example) behaved in previous moments of time!

For example, we can measure how much ice cream is sold per day (I still want ice cream!) along with other parameters such as the weather and the air temperature, and take all of this into account when building the dependence. Depending on how we group the observations, we can capture different kinds of dependencies. For example, you can try to build a model like this:

number-of-those-who-bought-ice-cream = A * air-temperature + B * amount-of-precipitation + T1 * number-of-those-who-bought-yesterday + T2 * number-of-those-who-bought-the-day-before-yesterday + …

Parameters A, B, T1, T2, … are again determined from the available data using the same terrible mathematics.
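
As a rough illustration of the same idea, here is a sketch in which yesterday's and the day-before-yesterday's sales are simply added as extra columns and fitted with the same least-squares machinery. All numbers and the choice of exactly two lags are assumptions for the example, not something from the original post.

import numpy as np

# Hypothetical daily observations (all numbers are made up).
temperature   = np.array([18, 21, 25, 27, 30, 29, 24, 22, 26, 28], dtype=float)
precipitation = np.array([ 5,  0,  0,  2,  0,  1,  8,  3,  0,  0], dtype=float)
sales         = np.array([30, 40, 55, 60, 75, 70, 45, 40, 58, 68], dtype=float)

LAGS = 2  # yesterday and the day before yesterday

# For each day t (starting from day LAGS) the features are
# [temperature_t, precipitation_t, sales_{t-1}, sales_{t-2}].
rows, targets = [], []
for t in range(LAGS, len(sales)):
    rows.append([temperature[t], precipitation[t], sales[t - 1], sales[t - 2]])
    targets.append(sales[t])

X = np.array(rows)
y = np.array(targets)

# A, B, T1, T2 are estimated by least squares, just like in the apartment example.
A, B, T1, T2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"A={A:.2f}  B={B:.2f}  T1={T1:.2f}  T2={T2:.2f}")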

However, if there are many observations, there is no point in feeding all of them into the model: it would become too large. Depending on the task, you can instead try to identify long-term trends, for example, that the demand for ice cream depends on the season. Such dependencies are usually determined separately for each task by domain experts: they are either obvious or too complex. Mathematics is unlikely to help much here, since there is too much uncertainty in the data over long periods of time.


Trends and the moving average method:

But mathematicians do have something to offer when we look at small time intervals. There are several approaches. One of them is to use values at nearby moments in time as parameters for prediction; it was demonstrated above in the ice cream example.

The other is a bit more involved. Its idea is to use not the raw values of the explained parameter on previous days, but their averages. This reduces the influence of random noise in the data and helps look for patterns over different time intervals. Graphically, this can be pictured as "smoothing" the curve.

The method in which we obtain new values by averaging a certain number of known values of the explained parameter is called the moving average method. Thanks to its simplicity and efficiency, it has become very popular. It is as if we slide a small window over the data and average everything that falls into it.
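
Here is a minimal sketch of that sliding window on a made-up daily sales series; the window length of 7 days is an arbitrary choice for the example.

import numpy as np

# Made-up daily ice cream sales: a slow upward trend plus random noise.
rng = np.random.default_rng(0)
days = np.arange(60)
sales = 50 + 0.5 * days + rng.normal(0, 10, size=days.size)

WINDOW = 7  # average over 7 consecutive days (an arbitrary choice here)

# Slide the window over the series and average everything that falls into it:
# each output point is the mean of WINDOW consecutive observations.
smoothed = np.convolve(sales, np.ones(WINDOW) / WINDOW, mode="valid")

print("raw, first 5 days:        ", np.round(sales[:5], 1))
print("smoothed, first 5 windows:", np.round(smoothed[:5], 1))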


Slutsky-Yule effect:

However, all this convenience comes at a price. In particular, one of the main flaws of the moving average method is that it can artificially create false periodic patterns in the data.

This is called the Slutsky-Yule effect (named after the scientists who studied it).

False cycles arise when the averaging windows overlap, that is, when neighboring averages share some of the same data points. As a result, even a trend-free series can show oscillations caused purely by this effect. The "peaks" and "troughs" of this artificial pattern do not necessarily appear at equal time intervals, as they do in elementary periodic functions like sine or cosine, and the amplitude of the oscillations can vary widely. That is why detecting such spurious trends is a hard problem, to which a great deal of advanced research in statistics and mathematical physics has been devoted.
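
To see the effect with your own eyes, here is a small sketch (made-up noise, arbitrary window length): we smooth a series of pure independent noise, which by construction contains no trends or cycles at all, and the overlapping windows turn it into a smooth, wavy curve whose neighboring points are strongly correlated.

import numpy as np

rng = np.random.default_rng(42)

# Pure independent noise: by construction there are no trends or cycles here.
noise = rng.normal(0, 1, size=300)

WINDOW = 20  # neighboring averages share 19 out of 20 points, i.e. they overlap heavily
smoothed = np.convolve(noise, np.ones(WINDOW) / WINDOW, mode="valid")

def lag1_correlation(x):
    # Correlation between the series and itself shifted by one step.
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# The raw noise has (almost) no correlation between neighboring points;
# the smoothed series has a strong one, which is what makes it look like smooth "waves".
print("lag-1 correlation, raw noise:", round(lag1_correlation(noise), 2))
print("lag-1 correlation, smoothed: ", round(lag1_correlation(smoothed), 2))

Those waves are exactly the false periodic patterns the effect is named for: nothing in the original data put them there.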

An obvious problem that follows directly from this is the increased likelihood of an outcome where the model works well on the training data, but not so well in practice. (The data-fitting memes are not just a joke.)

Therefore, be very careful when explaining any serious trends with the moving average method. Averaging over overlapping data produces artificial trends, that is, it makes us see patterns where there are none.


How pleasant it is sometimes to run a statistical study and find something where there is no correlation at all. But please, don't abuse it: there are already plenty of shitty studies and few good ones. Well, I wish you a great and mathematical time!

Author: Liza Ivanova

