# Top 3 Statistical Paradoxes in Data Science

Translation prepared as part of the course "Machine Learning. Professional". We also invite everyone to take part in the two-day online intensive "Deploy ML model: from dirty code in a laptop to a working service".

## Observation errors and subgroup differences cause statistical paradoxes

Observation errors and subgroup differences can easily lead to statistical paradoxes in any data science application. Ignoring these elements can completely discredit the conclusions of our analysis.

Indeed, it is not uncommon to see such astonishing phenomena as subgroup trends that reverse completely in the aggregated data. In this article, we’ll take a look at the top 3 most common statistical paradoxes found in Data Science.

## 1. Berkson's paradox

The first striking example is the inverse correlation between COVID-19 severity and cigarette smoking (see, for example, the European Commission review __Wenzel 2020__). Cigarette smoking is a well-known risk factor for respiratory disease, so how can we explain this contradiction?

The work __Griffith 2020__, recently published in Nature, suggests that this may be a case of collider bias, also called __the Berkson paradox__. To understand this paradox, let us consider the following graphical model, in which we have included a third random variable: hospitalization.

Berkson's paradox: "hospitalization" is a collider variable for "cigarette smoking" and "COVID-19 severity". (Image by author)

The third variable, "hospitalization", is a **collider** of the first two. This means that both cigarette smoking and severe COVID-19 increase the chances of being hospitalized. The Berkson paradox arises at the moment we condition on the collider, that is, when we observe only the data of hospitalized people rather than the entire population.

Let’s take a look at the following example dataset. In the left figure, we have data for the entire population, and in the right figure, we are considering only a subset of hospitalized people (that is, we are using a collider variable).

Berkson's paradox: if we condition on the "hospitalization" collider, we see an inverse relationship between smoking and COVID-19 severity! (Image by author)

In the left figure, we observe the positive correlation between COVID-19 severity and cigarette smoking that we expected, since we know that smoking is a risk factor for respiratory diseases.

But in the right figure – where we only look at hospital patients – we see the opposite trend! To understand this, pay attention to the following points.

1. Severe COVID-19 increases the chances of hospitalization: in this example, if disease severity is higher than 1, hospitalization is required.

2. Smoking several cigarettes a day is a major risk factor for various diseases (cardiovascular disease, cancer, diabetes), each of which, in one way or another, increases the likelihood of hospitalization.

3. Thus, if a hospitalized **patient has a mild form of COVID-19**, he is **more likely to be a smoker**! Indeed, in that case the reason for his hospitalization is probably not COVID-19 but some other disease that can be caused by smoking (for example, cardiovascular disease, cancer, or diabetes).

This example is very similar to the original work __Berkson 1946__, where the author noticed a negative correlation between **cholecystitis** and **diabetes** among hospital patients, even though diabetes is a risk factor for cholecystitis.
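The collider mechanism is easy to reproduce in a few lines of code. The following is a minimal simulation sketch, not the article's actual data: all probabilities and effect sizes are invented for illustration. Smoking raises COVID-19 severity in the full population, yet once we keep only hospitalized individuals, the sign of the correlation flips.

```python
# Minimal sketch of Berkson's paradox (collider bias).
# All numbers below are illustrative assumptions, not real data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

smoker = rng.random(n) < 0.3                        # assume 30% of people smoke
# True causal effect: smoking increases COVID-19 severity on average
severity = rng.normal(0, 1, n) + 0.5 * smoker
# Smoking-related illnesses also send people to hospital, regardless of COVID-19
other_illness = rng.random(n) < np.where(smoker, 0.4, 0.05)
# The collider: hospitalized if COVID-19 is severe OR another illness strikes
hospitalized = (severity > 1) | other_illness

def corr(x, y):
    return np.corrcoef(x.astype(float), y.astype(float))[0, 1]

print("whole population :", corr(smoker, severity))                              # positive
print("hospitalized only:", corr(smoker[hospitalized], severity[hospitalized]))  # negative
```

Conditioning on the collider (indexing by `hospitalized`) is exactly the selection step that produces the spurious negative correlation in the right-hand figure.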

## 2. Hidden (latent) variables

The presence of a __hidden (latent) variable__ can also produce an apparent inverse correlation between two variables. While the Berkson paradox arises from conditioning on a collider (which should therefore be avoided), this type of paradox can be corrected by **conditioning on the hidden variable**.

Consider, for example, the relationship between the number of firefighters involved in extinguishing a fire and the number of people injured as a result. We expect that more firefighters will improve the outcome (up to a point; see __Brooks's law__), yet the aggregated data show a positive correlation: **the more firefighters involved, the higher the number of injured**!

To understand this paradox, consider the following graphical model. The key point is to re-examine the third random variable: “fire severity”.

Latent variable paradox: "fire severity" is a latent variable for "number of firefighters involved" and "number of injured". (Image by author)

The third, hidden variable is positively correlated with the other two. Indeed, **more severe fires tend to result in more injuries**, and at the same time **require a larger number of firefighters to extinguish**.

Let's take a look at the following example dataset. In the left figure, we show the data for all types of fires, and in the right figure, we consider only the observations corresponding to three fixed degrees of fire severity (that is, we condition our observational data on the latent variable).

Hidden variables: if we condition on the hidden variable "fire severity", we see an inverse correlation between the number of firefighters involved and the number of injured! (Image by author)

In the right figure, where we **condition the observations on fire severity**, we see the inverse correlation we expected: for a given fire severity, **the more firefighters involved, the fewer people are injured**.

If we look at **fires of higher severity**, we see the **same trend**, even though both the number of firefighters involved and the number of injuries are higher.
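This situation can also be simulated in a few lines. The sketch below uses invented coefficients (they are assumptions, not the article's data): fire severity drives both the number of firefighters deployed and the number of injuries, while each extra firefighter actually reduces injuries. Aggregated, the correlation is positive; within each severity level, it is negative.

```python
# Sketch of the latent-variable paradox with invented numbers.
import numpy as np

rng = np.random.default_rng(1)
n = 3000
severities = rng.integers(1, 4, size=n)          # latent variable: 1 = small fire, 3 = large

# More severe fires call out more firefighters...
firefighters = 10 * severities + rng.integers(0, 10, size=n)
# ...and cause more injuries, but each extra firefighter reduces injuries
injuries = 20 * severities - 0.5 * firefighters + rng.normal(0, 2, n)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print("aggregated:", corr(firefighters, injuries))            # positive
for s in (1, 2, 3):
    m = severities == s                                       # condition on severity
    print(f"severity {s}:", corr(firefighters[m], injuries[m]))  # negative
```

Here the fix is the opposite of the Berkson case: splitting the data by the latent variable (the boolean mask `m`) recovers the true negative relationship.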

## 3. The Simpson paradox

__Simpson's paradox__ is an astonishing phenomenon in which a trend that consistently appears in subgroups reverses when the subgroups are combined. It is often caused by a **class imbalance in the data subgroups**.

A sensational case of this paradox occurred in 1975, when __Bickel__ analyzed University of California admission rates in search of **evidence of gender discrimination** and uncovered two apparently contradictory facts.

On the one hand, **in each faculty, the acceptance rate of female applicants was higher than that of male applicants**.

On the other hand, **the overall acceptance rate among female applicants was lower than among male applicants**.

To understand how this could be, let’s consider the following dataset with two faculties: Faculty A and Faculty B.

Out of 100 male applicants, 80 applied to Faculty A, of whom 68 were accepted (85%), and 20 applied to Faculty B, of whom 12 were accepted (60%).

Out of 100 female applicants, 30 applied to Faculty A, of whom 28 were accepted (93%), while 70 applied to Faculty B, of whom 46 were accepted (66%).

Simpson’s Paradox: Female applicants are more likely to be accepted in every faculty, but overall admission rates for females are lower compared to males! (Image by author)

The paradox is expressed by the following inequalities.

Simpson’s Paradox: Inequality at the heart of an apparent contradiction. (Image by author)

We can now understand the origin of our seemingly conflicting observations. The point is that there is a tangible **gender imbalance among the applicants** to each of the two faculties (Faculty A: 80 vs. 30, Faculty B: 20 vs. 70). Indeed, **most female students applied to the more competitive Faculty B** (which has a low acceptance rate), while **most male students applied to the less competitive Faculty A** (which has a higher acceptance rate). This produces the apparent contradiction we observed.
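The arithmetic behind the paradox can be checked directly from the numbers in the text, with no statistics library needed:

```python
# Recomputing the admission rates given in the text.
applied  = {"male":   {"A": 80, "B": 20},
            "female": {"A": 30, "B": 70}}
accepted = {"male":   {"A": 68, "B": 12},
            "female": {"A": 28, "B": 46}}

# Per-faculty rates: female applicants win in BOTH faculties
for fac in ("A", "B"):
    for sex in ("male", "female"):
        rate = accepted[sex][fac] / applied[sex][fac]
        print(f"Faculty {fac}, {sex}: {rate:.0%}")

# Overall rates: yet male applicants win in aggregate
for sex in ("male", "female"):
    total = sum(accepted[sex].values()) / sum(applied[sex].values())
    print(f"Overall, {sex}: {total:.0%}")
# Per faculty: female 93% > male 85% and female 66% > male 60%,
# but overall: female 74% < male 80%.
```

The reversal happens because the overall rate is a weighted average of the per-faculty rates, and the weights (where each group applied) differ sharply between the sexes.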

## Conclusion

**Hidden variables, collider variables**, and **class imbalance** can easily lead to **statistical paradoxes** in many practical data science applications. These key points therefore deserve special attention if we want to identify trends correctly and analyze results properly.
