Let's talk to each other about reality

As many of us suspect, a human is a social monkey that watches other monkeys in order to pick up something useful from them. A significant part of learning in general, and in IT in particular, is built on exactly this: watching others. Watching people post their code or simply talk about something. However, they don’t tell you everything.

Very often, this code or text is not meant to give other monkeys a complete education. Very often, readers are handed something stripped of context and missing something important. For example, stories about working with data forget to mention where that data came from. That is, reality.

And this is best demonstrated with two examples.

Example No. 1: Analyst's work

(The article itself is here – https://habr.com/ru/articles/787098/)

A simple set of truisms that gives a beginner an understanding of data analysis methods. For simplicity, it is wrapped in an example of sales from a website, and even the right disclaimers are made:

For a whole month before the New Year, there was mass advertising with the distribution of promo codes with discounts on red scooters. As a result, the company's overall profit fell. An experienced analyst will immediately identify that profits were affected not only by promotional codes, but also by other factors, such as seasonality, the introduction of a new website, price changes and several other influencing factors that coincided in time with promotional codes and could be no less significant. Therefore, a more relevant metric is needed, for example, the percentage of users who used promotional codes and the percentage of red scooters in the share of sales.

It would seem that the reader is being told not to hunt too hard for the desired relationship and to consider the influence of other factors.

But look at the example graph.

According to the legend, this is “the percentage of users who used promotional codes.” That is, the graph combines data on promo code entries with data on website traffic. What question should a seasoned analyst ask themselves when looking at the late January jump? That's right: “Isn't this a data glitch?”

Because in fact we do not work with data that reflects reality with 100% accuracy, but with data collected by specific technical means, which inevitably introduces risks and distortions.

If the percentage of users who used promotional codes fell, then either the number of promo code entries fell, or the number of users increased, or both.
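The arithmetic behind this is worth seeing once. Below is a minimal sketch with invented numbers (the function name and all figures are my assumptions, not data from the article): the same drop in the share appears whether half of the promo code entries stop being recorded or the counted traffic doubles.

```python
def promo_share(promo_users: int, total_users: int) -> float:
    """Percentage of users who entered a promo code."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return 100.0 * promo_users / total_users


# All numbers are made up for illustration.
baseline = promo_share(promo_users=200, total_users=4000)
# Half of the entries stop being recorded, traffic is unchanged.
lost_entries = promo_share(promo_users=100, total_users=4000)
# Entries are unchanged, but the traffic counter now reports twice as many users.
more_traffic = promo_share(promo_users=200, total_users=8000)

print(baseline, lost_entries, more_traffic)  # 5.0 2.5 2.5
```

The chart alone cannot tell these two cases apart, which is why both the numerator and the denominator need to be checked at the source.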

The number of promo code entries is recorded internally by the site. We have complete control over this data and can rely on it 100%. What could have happened?

For example:

  • part of the promo code database went offline, and some promo code entries were no longer recorded as successful

  • some browser, antivirus or ad blocker was updated, and the JavaScript binding on the form stopped working for some users

  • the server script that processes promotional codes was updated, the trimming of leading and trailing spaces was removed from it, and because of this some promotional codes pasted from emails via the clipboard were no longer accepted

  • much more
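The whitespace case from the list above is easy to reproduce. This is a minimal sketch, not the site's real code: the code list, the promo code itself and both function names are hypothetical.

```python
VALID_CODES = {"RED-SCOOTER-10"}  # hypothetical list of issued promo codes


def normalize_promo_code(raw: str) -> str:
    """Strip leading and trailing whitespace that copy-paste tends to add."""
    return raw.strip()


def is_valid(raw: str) -> bool:
    return normalize_promo_code(raw) in VALID_CODES


# A code copied from an email together with a space and a line break.
pasted = " RED-SCOOTER-10\n"

print(pasted in VALID_CODES)  # False: what happens once trimming is removed
print(is_valid(pasted))       # True: what happened before the update
```

From the analyst's side, such a change looks like users suddenly losing interest in promo codes, although nothing changed in their behaviour.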

And, of course, none of this counts as “the introduction of a new website,” because (a) introducing a new website is primarily a usability matter, while (b) these are minor changes within the existing website.

We usually track the number of site users with external tools, simply because collecting service logs and cleaning them of bots and the like is a pain, while external counters come with a lot of extras. What could have happened?

For example:

  • the inevitable transition from GA3 to GA4

  • the service changed its calculation methods

  • colleagues changed settings or filters

  • someone noticed that the counter was missing from the template that renders the 404 page and fixed it

  • much more

No, I’m not saying that @maratyv should have listed every possible cause of a change in how the statistics were collected, but one could say: “Yes, this jump looks very much like a failure or a change in the methodology for collecting statistics, but we will not consider that here, because this is just a synthetic example for learning data analysis methods.”

And when it comes to real analysis, the monkey that learned from the example in that article may say to itself: “Let me check where this data came from.”

Maybe it won’t, but let’s give the monkey a chance to remember that the data it uses is only a distorted and incomplete reflection of reality, and that this distortion and incompleteness must be taken into account.

Example No. 2: Loading data

(The article itself is here – https://habr.com/ru/companies/rshb/articles/797435/)

I have no idea how completely and correctly the process is described from a technical point of view, but I am delighted that @VasilPRM put a SWOT analysis into the text. This is exactly what I am writing about: we need to show monkeys the right approaches, and a SWOT analysis (or any other weighing of pros and cons) is exactly the right one.

But the question of how all this connects to reality is again left unanswered.

And here reality shows up as data in Excel.

Firstly, there is the question of format.

Is “09.06.2023” June or September? What information will end up in the database?

You will say that dots as a separator hint at the European date format rather than the American one. Yes, but there is no guarantee, and besides, one of the screenshots shows a date in the format “09/06/2023”.
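To see that this is not nitpicking, here is a minimal sketch using Python's standard `datetime` module. The string is the one from the screenshot; the helper `is_ambiguous` is my own hypothetical check, not something from the article.

```python
from datetime import datetime

raw = "09/06/2023"

# Both formats parse without any error, and they disagree by three months.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(day_first)    # 2023-06-09
print(month_first)  # 2023-09-06


def is_ambiguous(raw: str, sep: str = "/") -> bool:
    """True if day and month can be swapped and still form a valid date."""
    first, second, _year = raw.split(sep)
    return int(first) <= 12 and int(second) <= 12 and first != second


print(is_ambiguous("09/06/2023"))  # True
print(is_ambiguous("25/06/2023"))  # False
```

No parser can resolve this on its own. The format has to be agreed with whoever produces the file and then passed to the loader explicitly.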

Also, one of the screenshots shows a weight in the format “0.00”, but I don’t see an instruction to “make sure you use the correct decimal separator”.
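One way to keep the separator from being guessed is to make it an explicit parameter of the loader. A minimal sketch under that assumption follows; `parse_weight` and the sample values are hypothetical.

```python
from decimal import Decimal


def parse_weight(raw: str, decimal_sep: str) -> Decimal:
    """Parse a weight using an explicitly agreed decimal separator."""
    if decimal_sep not in (".", ","):
        raise ValueError(f"unsupported decimal separator: {decimal_sep!r}")
    other_sep = "," if decimal_sep == "." else "."
    text = raw.strip()
    if other_sep in text:
        # Refuse to guess: "1,250" may mean 1.25 or 1250 depending on the locale.
        raise ValueError(f"unexpected {other_sep!r} in {raw!r}")
    return Decimal(text.replace(decimal_sep, "."))


print(parse_weight("12,50", decimal_sep=","))  # 12.50

try:
    parse_weight("12.50", decimal_sep=",")
except ValueError as err:
    print("rejected:", err)
```

A loud failure on the first wrong cell is cheaper than a column of weights that is silently off by a factor of a hundred or a thousand.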

No matter how trivial it may sound, we are not dealing with reality itself, but with data recorded in a format that allows for different interpretations.

Secondly, we have an even more significant problem – where did this data come from?

Because this is an even scarier manifestation of reality than the accuracy of calculations and the recording format. This data was typed in by hand, by people who are not above a little mental laziness and whose hands do not always grow from the right place.

In the “EO Type” column, the letters “M” and “O” are entered by hand. Are they Latin or Cyrillic? Right now people read them by eye and the difference doesn’t matter, but once the data is loaded into a large database, it may affect something.

Here it would be very appropriate to tell the monkey that other people are responsible for entering the data, but it is the monkey who is responsible for loading it. This means you need to check all the data formats and the correctness of the entries, and make sure that, not only in the “EO Type” column but also in the “Security System” field, the first “C” is Cyrillic and not Latin.
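Such a check does not have to be done by eye. Here is a minimal sketch with Python's standard `unicodedata` module. The function names are hypothetical, and taking the first word of the Unicode character name as the script is a simplification that is good enough for telling Latin from Cyrillic.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Return the scripts of the letters in text, e.g. {"LATIN", "CYRILLIC"}."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script: "LATIN CAPITAL LETTER M",
            # "CYRILLIC CAPITAL LETTER EM", and so on.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def has_mixed_scripts(text: str) -> bool:
    return len(scripts_used(text)) > 1


latin_m = "M"          # U+004D
cyrillic_m = "\u041c"  # looks identical on screen, different code point

print(latin_m == cyrillic_m)     # False
print(scripts_used(latin_m))     # {'LATIN'}
print(scripts_used(cyrillic_m))  # {'CYRILLIC'}

# Cyrillic "С" followed by Latin "O": one value, two alphabets.
print(has_mixed_scripts("\u0421O"))  # True
```

Running `scripts_used` over every value of a column before loading shows at once whether the same “M” has been typed in two alphabets.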


And these are just two examples out of many texts in which people cover their narrow topics without touching on adjacent issues.

And I understand these people, because they talk about analyzing and exporting data, and teaching monkeys about hygiene is not their task.

Unfortunately, the reality is that the monkeys will simply read your article and act on it. And in this scheme there is no one to fix their hands, give them brains and tell them that reality exists and must be taken into account.

Therefore, colleagues, start putting the right clarifications and caveats into your texts.

After all, we all have to live in a world created by monkeys who have mastered the profession based on our texts and examples.
