Paradoxes in data and why visualization is necessary

In this post, I want to look at several data “paradoxes” that are worth knowing about, both for the novice data analyst and for anyone who does not want to be misled by incorrect statistical conclusions.

There is no complex mathematics behind these examples beyond the basic properties of a sample (such as the arithmetic mean and the variance), but such cases can come up both in an interview and in real life.

“New Zealanders emigrating to Australia raises both countries' IQs”

This quote is attributed to Sir Robert Muldoon, Prime Minister of New Zealand. Could this be possible from a mathematical point of view?

The Will Rogers phenomenon is the name given to an apparent paradox: moving a (numerical) element from one set to another can increase the average of both sets. Let's start with an obvious example: consider the sets $A = \{0, 1, 2\}$ and $B = \{100, 1000\}$. The arithmetic mean of the elements of $A$ is 1, and the arithmetic mean of the elements of $B$ is 550. If you take the number 100 and move it from the second set to the first, you get the sets $A' = \{0, 1, 2, 100\}$ with arithmetic mean 25.75 > 1 and $B' = \{1000\}$ with arithmetic mean 1000 > 550.

However, it is not necessary for the sets to be that far apart on the number line, and it is not necessary for the element being moved to be the minimum in the second set and/or the maximum in the first.

For example, let $A = \{1, 3, 5, 7, 9\}$ and $B = \{4, 6, 8, 10\}$. The average of the elements of $A$ is 5, and the average of the elements of $B$ is 7. Now we move 6 from the second set to the first and get the sets $A' = \{1, 3, 5, 6, 7, 9\}$ with an average of $31/6 \approx 5.17$ and $B' = \{4, 8, 10\}$ with an average of $22/3 \approx 7.33$.

In fact, such an increase in both averages occurs under the following conditions:

the moved element is strictly less than the average value of the elements of the second set before its removal;
the element being moved is strictly greater than the average value of the first set before it was added;
as a consequence, initially the average value of the elements of the first set (where they are moved) must be strictly less than that of the second (from where they are moved).
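The conditions above are easy to check numerically. Below is a minimal sketch; the function names `raises_both_means` and `satisfies_conditions` are mine, introduced only for illustration. The first function moves the element and compares the means directly, the second only tests the two inequalities from the list.

```python
from statistics import mean


def raises_both_means(source, target, element):
    """Move `element` from `source` to `target` and report whether both means grew."""
    if element not in source or len(source) < 2:
        raise ValueError("element must be in source, and source must stay non-empty")
    new_source = list(source)
    new_source.remove(element)           # remove one occurrence
    new_target = list(target) + [element]
    return mean(new_source) > mean(source) and mean(new_target) > mean(target)


def satisfies_conditions(source, target, element):
    """The conditions from the text: mean(target) < element < mean(source)."""
    return mean(target) < element < mean(source)


# The two examples from the text, plus a move that does not work.
A, B = [1, 3, 5, 7, 9], [4, 6, 8, 10]
print(raises_both_means([100, 1000], [0, 1, 2], 100))  # True
print(raises_both_means(B, A, 6))                      # True
print(raises_both_means(B, A, 8))                      # False: 8 exceeds the mean of B (7)
```

In all three cases `satisfies_conditions` gives the same answer as the direct comparison.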

Conclusion: a similar situation can occur in various areas. For example, when early diagnosis of a disease improves, life expectancy may increase both among the healthy and among the sick in the same sample: some of the “healthy” (in fact, poorly examined) people move into the “sick” category, and many of them are then treated successfully thanks to early detection.

And yes, the attentive reader will say that there is no paradox here, and he will be absolutely right. This phenomenon sounds a little counterintuitive in words, but the examples above make it obvious.

Simpson's paradox

Imagine that you work in a company that sells two types of products, say, sepulki and spillikins. (To keep the model simple, let's assume that sepulki and spillikins always appear on separate checks.) In the morning, a joyful intern analyst runs into your office and reports that over the last month the average check in the sepulki category has increased by 5%, and the average check in the spillikins category by 7%. He did not look at the overall average check, but it seems logical to assume that it also increased by somewhere between 5 and 7 percent. What could go wrong?

	February	March
Sepulki, average check	200 rubles	210 rubles (+5%)
Spillikins, average check	100 rubles	107 rubles (+7%)

You open the analytics system, look in more detail, and find that the company's overall average check has decreased, although prices did not change (that is, the average check is proportional to the number of items in the check) and there were no discounts. Let's add to our optimistic table the data without which the overall average check cannot be calculated:

	February	March
Sepulki, average check	200 rubles	210 rubles (+5%)
Spillikins, average check	100 rubles	107 rubles (+7%)
Share of sepulki purchases	50%	35%
Share of spillikins purchases	50%	65%
*Overall average check*	150 rubles	143.05 rubles (-4.63%)

The overall average check in March is calculated as follows: $0.35\times 210 + 0.65\times 107 = 73.5 + 69.55 = 143.05 < 150$.

Conclusion: do not judge a key metric from individual indicators when they are not the only things that determine it. Here the intern did not take into account that, in the average check over all purchases, the categories are weighted in proportion to their share of purchases, and the share of the cheaper category grew. Understanding Simpson's paradox can protect you from incorrect conclusions, including in A/B testing.
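The weighting is easy to reproduce. Here is a small sketch with the numbers from the table; the helper `overall_average` and the dictionary layout are my own choices for illustration, not part of any analytics system.

```python
def overall_average(avg_check, share):
    """Weighted mean of per-category average checks; shares must sum to 1."""
    if abs(sum(share.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(avg_check[cat] * share[cat] for cat in avg_check)


february = overall_average(
    {"sepulki": 200, "spillikins": 100},
    {"sepulki": 0.50, "spillikins": 0.50},
)
march = overall_average(
    {"sepulki": 210, "spillikins": 107},
    {"sepulki": 0.35, "spillikins": 0.65},
)
change_pct = (march - february) / february * 100

# Both category averages grew, yet the overall average fell.
print(f"February: {february:.2f}, March: {march:.2f}, change: {change_pct:+.2f}%")
```

With the March shares left at 50/50, the same function would give 158.5, which does lie between the +5% and +7% bounds the intern expected; the drop comes entirely from the shift in shares.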

Anscombe's quartet

And now a story about why data visualization really is necessary. Imagine being told about four sets of points for which the following is known: the mean of $x$, the variance of $x$, the mean of $y$, the variance of $y$, and the correlation between $x$ and $y$ are the same* for each of the sets. The coefficients of the linear regression line are the same as well.

* accurate to two or three decimal places

It would seem that the samples should be very similar to each other. The catch is that by default many people imagine something like a normal distribution (or another common one), although nothing of the kind was stated. Let's take the dataset from the seaborn library and visualize it:

import seaborn as sns
sns.set_theme(style="ticks")
# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
sns.lmplot(
    data=df, x="x", y="y", col="dataset", hue="dataset",
    col_wrap=2, palette="muted", ci=None,
    height=4, scatter_kws={"s": 50, "alpha": 1}
)

Very different data sets, but the linear regression line is (almost!) the same for all

Let's calculate the characteristics of these sets of points:

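# numeric_only=True skips the string column "dataset";
# pandas 2.0+ raises an error if it is left in
mean_1 = df[df["dataset"] == "I"].mean(numeric_only=True)
mean_2 = df[df["dataset"] == "II"].mean(numeric_only=True)
mean_3 = df[df["dataset"] == "III"].mean(numeric_only=True)
mean_4 = df[df["dataset"] == "IV"].mean(numeric_only=True)
mean_1, mean_2, mean_3, mean_4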

This code computes the means of the $x$ and $y$ coordinates and produces the following result:

(x    9.000000
 y    7.500909
 dtype: float64,
 x    9.000000
 y    7.500909
 dtype: float64,
 x    9.0
 y    7.5
 dtype: float64,
 x    9.000000
 y    7.500909
 dtype: float64)

I've put the code for the remaining characteristics in a collapsible section so as not to make the article too long:


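Standard deviation (the square root of the variance)

# numeric_only=True skips the string column "dataset"
std_1 = df[df["dataset"] == "I"].std(numeric_only=True)
std_2 = df[df["dataset"] == "II"].std(numeric_only=True)
std_3 = df[df["dataset"] == "III"].std(numeric_only=True)
std_4 = df[df["dataset"] == "IV"].std(numeric_only=True)
std_1, std_2, std_3, std_4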

(x    3.316625
 y    2.031568
 dtype: float64,
 x    3.316625
 y    2.031657
 dtype: float64,
 x    3.316625
 y    2.030424
 dtype: float64,
 x    3.316625
 y    2.030579
 dtype: float64)

Correlation coefficient between $x$ and $y$

import numpy as np

corr_1 = np.corrcoef(df[df["dataset"] == "I"]["x"], df[df["dataset"] == "I"]["y"])[0, 1]
corr_2 = np.corrcoef(df[df["dataset"] == "II"]["x"], df[df["dataset"] == "II"]["y"])[0, 1]
corr_3 = np.corrcoef(df[df["dataset"] == "III"]["x"], df[df["dataset"] == "III"]["y"])[0, 1]
corr_4 = np.corrcoef(df[df["dataset"] == "IV"]["x"], df[df["dataset"] == "IV"]["y"])[0, 1]
corr_1, corr_2, corr_3, corr_4

(0.81642051634484, 0.8162365060002428, 0.8162867394895984, 0.8165214368885028)

Linear regression line

k1, b1 = np.polyfit(df[df["dataset"] == "I"]["x"], df[df["dataset"] == "I"]["y"], 1)
k2, b2 = np.polyfit(df[df["dataset"] == "II"]["x"], df[df["dataset"] == "II"]["y"], 1)
k3, b3 = np.polyfit(df[df["dataset"] == "III"]["x"], df[df["dataset"] == "III"]["y"], 1)
k4, b4 = np.polyfit(df[df["dataset"] == "IV"]["x"], df[df["dataset"] == "IV"]["y"], 1)

k1, k2, k3, k4, b1, b2, b3, b4

(0.5000909090909095,
 0.5000000000000002,
 0.499727272727273,
 0.4999090909090908,
 3.0000909090909076,
 3.0009090909090905,
 3.0024545454545457,
 3.0017272727272712)

Conclusion: Sometimes just looking at all the basic statistics will not be enough, and then visualization is necessary. In this case, often even the simplest scatter plot (graph with points on a plane) is enough to notice differences in samples.
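For readers who want to check these numbers without downloading anything, here is a dependency-free sketch that computes all the summary statistics in one loop. The data are typed in by hand from Anscombe's published values, which I assume match the seaborn dataset; the helper `summarize` is my own.

```python
import math

# Anscombe's quartet, hard-coded (assumed identical to seaborn's "anscombe" dataset).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Means, sample variances, Pearson correlation and least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "var_x": sxx / (n - 1), "var_y": syy / (n - 1),
        "corr": sxy / math.sqrt(sxx * syy),
        "slope": slope, "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows differ only in the third decimal place or so, even though the scatter plots look nothing alike.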

The Datasaurus Dozen

Anscombe's quartet clearly demonstrates why data visualization is important, but picking 4 sets of 11 points is nothing special, is it? A more interesting example is the “Datasaurus Dozen”, which consists of 13 sets of points, each of which forms a different shape.

The authors' approach was to start from a “dinosaur” made of points and then iteratively change the data slightly, keeping the means, variances, and correlation coefficient the same to two decimal places, until another figure (an oval, a star, etc.) emerged. Each of the resulting figures required about 200 thousand iterations of the algorithm.

Pseudocode of the algorithm for generating such sets of points has the following form:

current_ds ← initial_ds
for x iterations, do:
    test_ds ← perturb(current_ds, temp)
    if similar_enough(test_ds, initial_ds):
        current_ds ← test_ds

function perturb(ds, temp):
    loop:
        test ← move_random_points(ds)
        if fit(test) > fit(ds) or temp > random():
            return test

initial_ds: the initial set of points
current_ds: the set of points at the current step
temp: the “temperature”; the higher it is, the more often a move is accepted even if it does not improve the fit
fit(): a function that measures how closely a set of points resembles the desired figure
similar_enough(): a function that checks that the values of the statistics are close enough
move_random_points(): a function that randomly shifts points
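To make the pseudocode concrete, here is a rough Python sketch that follows its structure. It is not the authors' implementation: the target shape (a circle), the step size, the linear cooling schedule, the rounding-based `similar_enough`, and the wrapper name `morph` are all my assumptions, and a random cloud stands in for the dinosaur.

```python
import math
import random


def summary(points):
    """Return (mean x, mean y, std x, std y, Pearson r) for a list of (x, y)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return (mx, my, math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1)),
            sxy / math.sqrt(sxx * syy))


def similar_enough(test_ds, initial_ds, decimals=2):
    """True if every statistic matches after rounding to `decimals` places."""
    return all(round(a, decimals) == round(b, decimals)
               for a, b in zip(summary(test_ds), summary(initial_ds)))


def fit(points, target):
    """Higher is better: minus the mean distance from the points to a circle."""
    cx, cy, radius = target
    return -sum(abs(math.hypot(x - cx, y - cy) - radius)
                for x, y in points) / len(points)


def perturb(ds, temp, target, rng, step=0.1):
    """Nudge one random point until the move fits better or temp lets it pass."""
    while True:
        test = list(ds)
        i = rng.randrange(len(test))
        x, y = test[i]
        test[i] = (x + rng.gauss(0, step), y + rng.gauss(0, step))
        if fit(test, target) > fit(ds, target) or temp > rng.random():
            return test


def morph(initial_ds, target, iterations=5000, seed=0):
    rng = random.Random(seed)
    current_ds = list(initial_ds)
    for i in range(iterations):
        temp = 0.4 * (1 - i / iterations)  # assumed linear cooling schedule
        test_ds = perturb(current_ds, temp, target, rng)
        if similar_enough(test_ds, initial_ds):
            current_ds = test_ds
    return current_ds


# Usage: push a random cloud toward a circle centred on its mean.
rng = random.Random(1)
start = [(rng.uniform(20, 80), rng.uniform(20, 80)) for _ in range(40)]
mx, my, sx, sy, _ = summary(start)
circle = (mx, my, math.hypot(sx, sy))  # radius chosen to be compatible with the variances
result = morph(start, circle)
print("fit before:", fit(start, circle), "fit after:", fit(result, circle))
print("statistics preserved:", similar_enough(result, start))
```

With only a few thousand iterations the cloud merely starts drifting toward the circle; a recognizable shape needs far more steps, in line with the figure of about 200 thousand iterations mentioned above.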

Conclusion

All these examples show the importance of exploratory data analysis (EDA): an approach to working with data through the analysis of all key indicators and (almost always) their visualization. A critical look at conclusions drawn from just a couple of indicators is an important trait both for a data analyst and for anyone who does not want to be deceived.

Thank you for reading! I'd welcome additions and questions in the comments.