Histograms are not free from bias. In fact, they are rather arbitrary and can lead to incorrect conclusions about the data. If you want to visualize a variable, you are better off choosing a different plot.
Whether you’re in a meeting with senior management or with data scientists, one thing you can be sure of is that a histogram will appear at some point.
And it’s not hard to guess why. Histograms are very intuitive: anyone can understand them at a glance. Moreover, they represent reality objectively, right? Not quite.
A histogram can be misleading and lead to erroneous conclusions – even on the simplest dataset!
In this article, we’ll look at 6 reasons why histograms are definitely not the best choice when it comes to data visualization:
They depend too much on the number of intervals.
They depend too much on the maximum and minimum of the variable.
They make it impossible to notice the meaningful values of the variable.
They do not distinguish continuous from discrete variables.
They make it difficult to compare distributions.
Their construction is difficult if not all data is in memory.
“Okay, I get it: histograms are not perfect. But do I have a choice?” Of course you do!
At the end of the article, I will recommend another plot, called the Cumulative Distribution Plot (CDP), that overcomes these shortcomings.
So what’s wrong with the histogram?
1. It depends too much on the number of intervals.
To plot a histogram, you must first choose the number of intervals, also called bins. There are many rules of thumb for this (you can see an overview of them on this page). But how critical is this choice? Let’s take real data and see how the histogram changes depending on the number of bins.
The variable is the maximum heart rate (beats per minute) reached by 303 people during some physical activity (data taken from the UCI heart disease dataset: source).
Looking at the top left plot (which we get by default in Python and R), we get the impression of a well-behaved distribution with a single peak (mode). However, if we look at other variants of the histogram, we get a completely different picture. Different histograms of the same data can lead to conflicting conclusions.
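To make the effect concrete, here is a minimal sketch in plain Python. The values are made-up toy data (not the heart-rate dataset), and `histogram_counts` and `count_peaks` are hypothetical helper names introduced only for this illustration. The same values show one peak with 2 bins and two peaks with 3 or 6 bins.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum sits on the last edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def count_peaks(counts):
    """Count bins that are strictly taller than both neighbours."""
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )


# Made-up toy values, not the heart-rate data.
values = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6, 8, 9, 9, 10, 10, 10, 11, 11, 12]

for n_bins in (2, 3, 6):
    counts = histogram_counts(values, n_bins)
    print(n_bins, counts, "peaks:", count_peaks(counts))
```

Nothing about the data changes between the three lines of output; only the number of bins does.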
2. It depends too much on the maximum and minimum of the variable.
Even after the number of intervals is set, their boundaries depend on the position of the minimum and maximum of the variable. It is enough for one of them to change slightly, and all the intervals change as well. In other words, histograms are not robust.
For example, let’s try changing the maximum of a variable without changing the number of bins.
Only one value differs, but the whole graph is different. This is an undesirable property because we are interested in the overall distribution: one value should not affect the graph so much!
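A minimal sketch of this effect, again on made-up toy values: the two lists below differ only in their maximum, yet every interior bin edge moves and the counts change. The helper name `equal_width_histogram` is hypothetical and assumes equal-width bins over the observed range.

```python
def equal_width_histogram(values, n_bins):
    """Return (edges, counts) for n_bins equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # Clamp the maximum into the last bin.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return edges, counts


# Made-up toy values; the second list differs only in its maximum.
original = [70, 85, 90, 100, 110, 120, 130]
modified = [70, 85, 90, 100, 110, 120, 160]

print(equal_width_histogram(original, 3))
print(equal_width_histogram(modified, 3))
```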
3. It makes it impossible to notice meaningful values of the variable.
In general, when a variable contains some frequently repeated values, we of course need to be aware of this. However, histograms prevent this because they are interval-based, and intervals hide individual values.
The classic example is when missing values have been replaced with 0 in bulk. As an example, consider a variable with 10,000 values, 26% of which are zeros.
The graph on the left is what you get by default in Python. Looking at it, you will not notice the accumulation of zeros, and you might even think that this variable behaves “smoothly”.
The graph on the right is obtained by narrowing the intervals and gives a clearer picture of reality. But the fact is that no matter how you narrow the intervals, you will never be sure if the first interval contains only 0 or some other value.
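The following sketch illustrates the point on made-up toy data (100 values instead of 10,000, with 26 zeros standing in for imputed missing values). The first bin of a 10-bin histogram mixes the zeros with other small values, while counting individual values reveals the spike directly.

```python
from collections import Counter

# Made-up toy data: 26 zeros (standing in for imputed missing values)
# plus 74 distinct positive values, so zeros are 26% of the sample.
values = [0.0] * 26 + [float(v) for v in range(1, 75)]

# What a 10-bin equal-width histogram reports for its first bin.
lo, hi = min(values), max(values)
width = (hi - lo) / 10
first_bin = sum(1 for v in values if (v - lo) / width < 1)

# What counting individual values reports.
top_value, top_count = Counter(values).most_common(1)[0]

print("first bin count:", first_bin)
print("most frequent value:", top_value, "share:", top_count / len(values))
```

The first bin holds 33 observations, and from the histogram alone there is no way to tell that 26 of them are exactly 0.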
4. It does not distinguish continuous variables from discrete ones.
Often, we would like to know whether a numeric variable is continuous or discrete. It’s almost impossible to tell from a histogram.
Let’s take the variable “Age”. You can have Age = 49 years (when the age is truncated to whole years) or Age = 49.828884325804246 years (when age is calculated as the number of days since birth divided by 365.25). The first is a discrete variable, the second is a continuous one.
The one on the left is continuous, and the one on the right is discrete. However, in the top plots (default in Python), you won’t see any difference between the two: they look exactly the same.
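Here is a small sketch of why the two versions can look the same. The ages are made up (only the first value comes from the text), and the bin edges are assumed to fall on whole years, which makes the match exact: truncating an age never moves it across an integer edge.

```python
import math
from bisect import bisect_right


def counts_for_edges(values, edges):
    """Count values per bin [edges[i], edges[i + 1]); values outside are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = bisect_right(edges, v) - 1
        if 0 <= idx < len(counts):
            counts[idx] += 1
    return counts


# Made-up ages; only the first value comes from the text.
age_continuous = [49.828884325804246, 23.4, 23.9, 35.1, 35.2, 35.7, 48.3, 61.0, 61.5]
age_discrete = [math.floor(a) for a in age_continuous]
edges = [20, 30, 40, 50, 60, 70]  # assumed bin edges on whole years

print(counts_for_edges(age_continuous, edges))
print(counts_for_edges(age_discrete, edges))
print(len(set(age_continuous)), "vs", len(set(age_discrete)), "distinct values")
```

The bin counts are identical, even though one variable takes 9 distinct values and the other only 5.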
5. It makes it difficult to compare distributions.
It is often necessary to compare the same variable in different clusters. For example, with respect to the UCI heart disease data above, we can compare:
whole population (for reference)
people under 50 with heart disease
people under 50 who do NOT have heart disease
people over 60 with heart disease
people over 60 and NOT suffering from heart disease.
Here’s what we would end up with:
Histograms are area-based, and when we try to make comparisons, the areas end up overlapping, making the task nearly impossible.
6. It is difficult to build if not all data is in memory.
If all of your data is in Excel, R, or Python, building a histogram is easy: in Excel, you just click on the histogram icon; in R, you run hist(x); and in Python, plt.hist(x).
But let’s say your data is stored in a database. You don’t want to download all the data just to build a histogram, right? Basically, all you need is a table containing the bounds of each interval and the number of observations that fall into it. Something like this:
| INTERVAL_LEFT | INTERVAL_RIGHT | COUNT |
| --- | --- | --- |
| 75.0 | 87.0 | 31 |
| 87.0 | 99.0 | 52 |
| 99.0 | 111.0 | 76 |
| … | … | … |
But getting it with a SQL query is not as easy as it seems. For example, in Google BigQuery, the code looks like this:
WITH STATS AS (
  SELECT
    COUNT(*) AS N,
    APPROX_QUANTILES(VARIABLE_NAME, 4) AS QUARTILES
  FROM TABLE_NAME
),
BIN_WIDTH AS (
  SELECT
    -- Freedman-Diaconis formula for calculating the bin width
    (QUARTILES[OFFSET(4)] - QUARTILES[OFFSET(0)]) / ROUND((QUARTILES[OFFSET(4)] - QUARTILES[OFFSET(0)]) / (2 * (QUARTILES[OFFSET(3)] - QUARTILES[OFFSET(1)]) / POW(N, 1/3)) + .5) AS FD
  FROM STATS
),
HIST AS (
  SELECT
    FLOOR((TABLE_NAME.VARIABLE_NAME - STATS.QUARTILES[OFFSET(0)]) / BIN_WIDTH.FD) AS INTERVAL_ID,
    COUNT(*) AS COUNT
  FROM TABLE_NAME, STATS, BIN_WIDTH
  GROUP BY 1
)
SELECT
  STATS.QUARTILES[OFFSET(0)] + BIN_WIDTH.FD * HIST.INTERVAL_ID AS INTERVAL_LEFT,
  STATS.QUARTILES[OFFSET(0)] + BIN_WIDTH.FD * (HIST.INTERVAL_ID + 1) AS INTERVAL_RIGHT,
  HIST.COUNT
FROM HIST, STATS, BIN_WIDTH
A bit cumbersome, isn’t it?
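For readers who want to check what the query computes, here is a rough Python sketch of the same steps on a small made-up list. It is an approximation under stated assumptions: `fd_bin_width` and `histogram_table` are hypothetical helper names, quartiles come from `statistics.quantiles` with the "inclusive" method (which will not match BigQuery's approximate quantiles exactly), and a zero interquartile range is not handled.

```python
import math
import statistics
from collections import Counter


def fd_bin_width(values):
    """Bin width from the Freedman-Diaconis rule, snapped to a whole number of bins."""
    n = len(values)
    lo, hi = min(values), max(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    raw_width = 2 * (q3 - q1) / n ** (1 / 3)
    # For positive x, SQL ROUND(x + .5) equals floor(x + 1).
    n_bins = math.floor((hi - lo) / raw_width + 1)
    return (hi - lo) / n_bins


def histogram_table(values):
    """Rows of (interval_left, interval_right, count), like the final SELECT."""
    lo = min(values)
    width = fd_bin_width(values)
    ids = Counter(math.floor((v - lo) / width) for v in values)
    return [(lo + width * i, lo + width * (i + 1), ids[i]) for i in sorted(ids)]


# Made-up toy values.
values = [1, 2, 3, 4, 5, 6, 7, 10]

for row in histogram_table(values):
    print(row)
```

Note that in this sketch the maximum value lands in a bin of its own, because floor((max - min) / width) equals the number of bins.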
Alternative: Cumulative Distribution Plot.
Having seen 6 reasons why the histogram is not the ideal choice, the natural question is: “Do I have an alternative?” The good news: there is a better alternative called the Cumulative Distribution Plot (CDP). I know the name is not so catchy, but I guarantee it will be worth it.
A cumulative distribution plot is a plot of the quantiles of a variable. In other words, each CDP shows:
on the x-axis: the original value of the variable (as in the histogram);
on the y-axis: how many observations have that value or a smaller one.
Let’s look at an example using the maximum heart rate variable.
Take a point with coordinates x = 140 and y = 90 (30%). On the horizontal axis, you see the value of the variable: 140 beats per minute. On the vertical axis, you see the number of observations with a heart rate equal to or below 140 (in this case, 90 people, which means 30% of the sample). Consequently, 30% of our sample has a maximum heart rate of 140 beats per minute or less.
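The construction described above can be sketched in a few lines of plain Python. The heart rates below are made up (10 values, chosen so that 3 of them are at or below 140), and `ecdf_points` is a hypothetical helper name.

```python
def ecdf_points(values):
    """Return (x, cumulative count, cumulative share) for each sorted value."""
    xs = sorted(values)
    n = len(xs)
    return [(x, i + 1, (i + 1) / n) for i, x in enumerate(xs)]


# Made-up toy heart rates, not the UCI data.
heart_rates = [110, 125, 140, 150, 155, 160, 165, 170, 180, 195]
points = ecdf_points(heart_rates)

for x, count, share in points:
    print(x, count, f"{share:.0%}")
```

Each row is one point of the curve: the value on the x-axis, and the number (or share) of observations at or below it on the y-axis. When a value repeats, several points share the same x, and the highest one is the value of the curve there.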
What’s the point of a graph showing how many observations are “equal to or below” a given level? Why not just “equal to”? Because then the result would depend on the individual values of the variable. That would not work, because each value has very few observations (usually only one if the variable is continuous). In contrast, CDPs rely on quantiles, which are more stable, more expressive, and easier to read.
In addition, the CDP is much more useful. When you think about it, you often have to answer questions like “How many of them are between 140 and 160?” or “How many of them are above 180?” With the CDP in front of you, you can answer immediately. This would be impossible with a histogram.
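Reading such answers off the curve amounts to subtracting cumulative counts. A minimal sketch on made-up heart rates, assuming "between 140 and 160" means above 140 and up to and including 160 (`count_at_or_below` is a hypothetical helper):

```python
from bisect import bisect_right


def count_at_or_below(sorted_values, x):
    """Number of observations with a value <= x (the height of the CDP at x)."""
    return bisect_right(sorted_values, x)


# Made-up toy heart rates, not the UCI data.
heart_rates = sorted([110, 125, 140, 150, 155, 160, 165, 170, 180, 195])
n = len(heart_rates)

between_140_and_160 = count_at_or_below(heart_rates, 160) - count_at_or_below(heart_rates, 140)
above_180 = n - count_at_or_below(heart_rates, 180)

print("between 140 and 160:", between_140_and_160, f"({between_140_and_160 / n:.0%})")
print("above 180:", above_180, f"({above_180 / n:.0%})")
```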
CDP solves all the problems we saw above. In fact, compared to the histogram:
1. It requires no choices from the user. For a given dataset, there is only one possible CDP.
2. It does not suffer from outliers. Extreme values do not affect the CDP because the quantiles do not change.
3. It lets you spot meaningful values. If there is a concentration of data points at a particular value, this is immediately visible, since there will be a vertical segment at that value.
4. It lets you recognize a discrete variable at a glance. If there is only a specific set of possible values (that is, the variable is discrete), this is immediately visible, since the curve takes the shape of a staircase.
5. It simplifies the comparison of distributions. It is easy to compare two or more distributions on the same graph, since they are just curves, not areas. In addition, the y-axis always runs from 0 to 100%, which makes comparison even easier. For comparison, this is the example we saw above:
6. It is easy to build even if you don’t have all the data in memory. All you need are quantiles, which can be easily retrieved using SQL:
SELECT COUNT(*) AS N, APPROX_QUANTILES(VARIABLE_NAME, 100) AS PERCENTILES FROM TABLE_NAME
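Once the 101 percentiles are available (`APPROX_QUANTILES(x, 100)` returns the minimum, the 99 inner percentiles, and the maximum), turning them into a curve is just a matter of pairing each one with its percentage. The sketch below imitates that result locally with `statistics.quantiles`; this is an assumption for illustration, and the exact numbers will differ from BigQuery's approximate quantiles. `percentile_curve` is a hypothetical helper name.

```python
import statistics


def percentile_curve(values):
    """Return 101 (value, share) points: the minimum, 99 inner percentiles, the maximum."""
    lo, hi = min(values), max(values)
    inner = statistics.quantiles(values, n=100, method="inclusive")
    percentiles = [lo] + inner + [hi]
    return [(p, i / 100) for i, p in enumerate(percentiles)]


# Made-up toy values: the integers 0..200.
values = list(range(0, 201))
curve = percentile_curve(values)

print(curve[0], curve[50], curve[100])
```

Plotting the first element of each pair on the x-axis and the second on the y-axis gives the CDP.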
How to draw a cumulative distribution plot in Excel, R, and Python
In Excel, you need to build two columns. The first contains 101 numbers evenly spaced from 0 to 1. The second contains the percentiles, which can be obtained with the formula =PERCENTILE(DATA, FRAC), where DATA is the range containing the data and FRAC is the value from the first column: 0.00, 0.01, 0.02, 0.03, …, 0.98, 0.99, 1. Then you just plot these two columns against each other, placing the variable values on the x-axis.
In R, this is done in one line:
plot(ecdf(data))
In Python:
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt

ecdf = ECDF(data)
plt.plot(ecdf.x, ecdf.y)
Thanks for your attention! I hope you found this article helpful.
I appreciate feedback and constructive criticism. If you would like to talk about this article or other related topics, you can write to me on LinkedIn.
The translation of this material was prepared as part of the online course “Machine Learning. Basic”.