Let’s look at two hypothetical variables X and Y. Plotting them on a scatter plot, we see a cloud clearly stretched from the lower left to the upper right, as in the figure above. A linear regression fits such a picture well and, with relatively low error, lets us predict the values: the greater X, the greater Y. Task completed. At first glance.
A more experienced colleague will recommend adding a cohort breakdown to the chart: for example, by country. Following this advice, we see that a relationship really does exist, but it points in the opposite direction: within any single country, the greater X, the smaller Y.
This is Simpson’s paradox: a phenomenon in which combining several groups of data, each showing a dependence in the same direction, reverses the direction of that dependence.
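The reversal is easy to reproduce on synthetic data. Here is a minimal sketch with two invented cohorts: within each cohort the slope of Y on X is negative, but because one cohort sits higher on both axes, the pooled regression slope comes out positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical cohorts (e.g. countries): within each, Y falls as X grows,
# but cohort B sits higher than cohort A on both X and Y.
x_a = rng.uniform(0, 5, 200)
y_a = 10 - x_a + rng.normal(0, 1, 200)
x_b = rng.uniform(5, 10, 200)
y_b = 20 - x_b + rng.normal(0, 1, 200)

x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])

# Within each cohort the fitted slope is negative...
print("slope A:", np.polyfit(x_a, y_a, 1)[0])
print("slope B:", np.polyfit(x_b, y_b, 1)[0])
# ...but on the pooled data it flips to positive.
print("pooled slope:", np.polyfit(x, y, 1)[0])
```

The between-cohort shift dominates the pooled covariance, which is exactly the aggregation effect described above.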
Example 1: Sex Discrimination at Berkeley
The best-known real-world example of Simpson’s paradox is the gender-discrimination controversy over graduate admissions at the University of California, Berkeley, in 1973. A tale circulates among researchers that the university was even sued, but there is no convincing evidence of a lawsuit to be found online.
Here is what the university’s admission statistics for 1973 looked like:
The difference is significant. Too big to be random.
However, if we break the data down by department, the picture changes. The researchers found that the difference arose because women tended to apply to departments with tougher competition. Moreover, 6 of the 85 departments showed discrimination in favor of women, and only 4 against them.
The difference arises solely from differences in sample sizes and in how competitive each department is. Let me illustrate with two departments.
Both departments admit the same share of women as of men. But because far more men applied to the department with the higher admission rate, the pooled data show a higher overall admission rate for men.
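The mechanism can be sketched with hypothetical numbers in the spirit of the Berkeley case (the department names and counts below are invented for illustration): within each department men and women are admitted at identical rates, yet the pooled rates diverge sharply.

```python
# Hypothetical counts: men apply mostly to the "easy" department,
# women mostly to the "hard" one; within each department the
# admission rate is identical for both sexes.
applicants = {
    #                 (men applied, men admitted, women applied, women admitted)
    "Dept A (easy)": (800, 480, 100, 60),    # 60% admitted for both sexes
    "Dept B (hard)": (200, 40, 900, 180),    # 20% admitted for both sexes
}

men_total = sum(v[0] for v in applicants.values())   # 1000
men_adm = sum(v[1] for v in applicants.values())     # 520
wom_total = sum(v[2] for v in applicants.values())   # 1000
wom_adm = sum(v[3] for v in applicants.values())     # 240

print(f"men:   {men_adm / men_total:.0%}")    # 52%
print(f"women: {wom_adm / wom_total:.0%}")    # 24%
```

The pooled rates (52% vs 24%) differ only because each group's overall rate is a weighted average, and the weights (where people applied) differ between the sexes.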
Example 2: an unbalanced A/B experiment
Imagine you are running an A/B experiment to increase the conversion rate of your landing page. The experiment ran for two days, but on the first day the traffic splitter broke and variant B received more visitors. On the second day the problem was fixed. The result is the following numbers:
| Day | A: visitors | A: conversions | B: visitors | B: conversions |
|---|---|---|---|---|
| Day 1 | 400 | 30 (7.5%) | 2000 | 140 (7.0%) |
| Day 2 | 1000 | 60 (6.0%) | 1000 | 55 (5.5%) |
| Total | 1400 | 90 (6.4%) | 3000 | 195 (6.5%) |
On each day variant A had the higher conversion rate, but variant B won overall. This happened because variant B received more of its traffic on the day with the higher conversion rate. Here an inexperienced researcher would roll out variant B to all traffic, when in fact conversion would increase with variant A.
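The table above can be recomputed directly to see both readings of the same data:

```python
# Visitors and conversions from the A/B table above.
data = {
    "Day 1": {"A": (400, 30), "B": (2000, 140)},
    "Day 2": {"A": (1000, 60), "B": (1000, 55)},
}

def rate(visitors, conversions):
    return conversions / visitors

# Per day, A beats B...
for day, variants in data.items():
    print(day, f"A={rate(*variants['A']):.1%}", f"B={rate(*variants['B']):.1%}")

# ...but pooled over both days, B comes out ahead.
tot_a = tuple(map(sum, zip(*(v["A"] for v in data.values()))))
tot_b = tuple(map(sum, zip(*(v["B"] for v in data.values()))))
print("Total", f"A={rate(*tot_a):.1%}", f"B={rate(*tot_b):.1%}")
```

Because the per-day comparison holds the broken splitter constant, it is the per-day numbers, not the total, that answer the question the experiment asked.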
Example 3: the impact of page visits on conversion
Every site has a page that motivates visitors to buy more than the others do. Suppose we are building a visitor scoring system and selecting factors for it. We have an “About Product” page and assume that visiting it increases the likelihood of conversion. Let’s look at the data.
At first glance everything is obvious: conversion among those who visit the page is a full 3 percentage points lower, so the page must reduce the likelihood of conversion. But if we split the data into the two most important cohorts in internet marketing, desktop and mobile users, we see that within each of them the probability of conversion actually increases with a page visit.
We assumed that visiting the page affects conversion. In practice a third variable intervened: the user’s platform. Because it affects not only conversion but also the probability of visiting the page, in the aggregated data it distorted the picture and led us to conclusions opposite to users’ actual behavior.
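The same structure is easy to check with a cohort breakdown in pandas. The counts below are hypothetical, chosen so that mobile users both visit the page more often and convert less, reproducing the reversal described above:

```python
import pandas as pd

# Hypothetical traffic: mobile users visit the "About Product" page
# more often but convert less than desktop users.
df = pd.DataFrame({
    "platform": ["desktop", "desktop", "mobile", "mobile"],
    "visited": [False, True, False, True],
    "users": [1000, 200, 200, 1000],
    "conversions": [100, 26, 4, 40],
})

# Aggregated: page visitors look worse (~5.5% vs ~8.7%)...
overall = df.groupby("visited")[["users", "conversions"]].sum()
overall["cr"] = overall["conversions"] / overall["users"]
print(overall)

# ...but within each platform, visiting the page raises the rate.
by_platform = df.groupby(["platform", "visited"])[["users", "conversions"]].sum()
by_platform["cr"] = by_platform["conversions"] / by_platform["users"]
print(by_platform)
```

Grouping by the confounder (platform) before computing the rate is what separates the page's real effect from the platform's.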
What to do
When analyzing data, you need to understand the story behind it: what is happening in the real world, and how it was measured and turned into data. That is why a data scientist in a marketing department needs to know the basics of marketing, and one in the oil and gas industry needs to know something about extraction. This helps avoid a large number of potential errors, not least the aggregation error caused by Simpson’s paradox.
The following data characteristics typically result in the Simpson paradox:
- The presence of significant cohorts that can affect the values of the dependent (Y) and independent (X) variables;
- Unbalanced cohorts.
Each case calls for an individual approach. Assuming the data should always be split into cohorts is also wrong, because it is often the aggregated data that yields the most accurate model. Moreover, any dataset can be sliced so as to produce whatever relationship we would like to see, but that has no practical value: cohorts must be justified.
For internet marketing, one of the most important takeaways is the need to verify that the splitter in A/B experiments works correctly. The user groups in each test variant should be approximately the same, not only in total size but also in structure. If you suspect a problem, first check the cohorts along the following characteristics:
- Demographic characteristics;
- Geographic distribution;
- Traffic source;
- Type of device;
- Visiting time.
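One simple way to run such a check is to compare the distribution of a categorical characteristic between variants with a chi-square test. The sketch below uses SciPy's `chi2_contingency` on hypothetical device-type counts:

```python
from scipy.stats import chi2_contingency

# Hypothetical device-type counts per variant; a healthy splitter
# should produce roughly the same mix in both.
counts = {
    #             desktop, mobile, tablet
    "variant A": [700, 650, 50],
    "variant B": [710, 640, 50],
}

table = [counts["variant A"], counts["variant B"]]
chi2, p, dof, expected = chi2_contingency(table)

print(f"p-value = {p:.3f}")
if p < 0.05:
    print("Device-type mix differs between variants: check the splitter.")
else:
    print("No evidence of imbalance in device type.")
```

The same test can be repeated for each characteristic in the list above (geography, traffic source, and so on); a low p-value on any of them is a reason to distrust the experiment's aggregated results.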
In the next article I will show how to detect and handle Simpson’s paradox when analyzing data in Python.
Original article describing the Berkeley case: Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). “Sex Bias in Graduate Admissions: Data from Berkeley.” Science, 187(4175), 398–404.