Hi all. This post is about the intuitive understanding of the Normal distribution.
A typical course in probability theory goes like this. At first, you are given clear and relatively simple concepts. Everything is explained with hands-on examples: tossing coins, red and white balls in urns, rabbits in cages, and so on.
But in the next topic, you are thrown into a pit with this monster:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Suddenly there are no more coins, no more urns, no more balls. You are only told to memorize this Normal distribution probability density function, that it is very important, and that its graph looks like a bell. For the rest, you are on your own.
But what is it? Why is there an exponential? Why the minus? Why divide by two sigma squared? Where did pi come from? Where did the coins, balls, urns and rabbits go? Why did we go from intuitive explanations to rote memorization?
Each formula carries some idea. In this post, we will dissect the normal distribution until we understand what it carries. At the end, we will derive the probability density function to understand where it comes from.
I will show that, despite its intimidating appearance, the Normal Distribution is not an abstraction detached from reality. It’s still about coin tosses, urns and other real-world stuff.
This article is a translation of my Substack article “Understanding the Normal Distribution for Real”. Go there if you are more comfortable reading in English or would like to receive such content by email.
I also have a Telegram channel: @boris_again
Dissecting the Monster
Let’s start from the inside. Let’s take a look at the idea behind this piece:

$$(x - \mu)^2$$

We cut the monster into pieces.
Here $\mu$ is the mean, one of the distribution’s parameters.
Let’s look at the graph of this function:
We see a parabola. It is similar to the shape of a bell, but upside down. Also note that the y axis is arbitrary: the values are not confined to the range [0, 1], so it’s not yet a distribution.
First, note that the further we move from the mean, the larger the value of the function. Second, the square allows us to treat negative and positive deviations equally. It makes the shape of the bell symmetrical.
Idea: $(x - \mu)^2$ locates the top of the bell and makes the distribution symmetrical.
Now add the minus sign, $-(x - \mu)^2$, which flips the parabola. Finally, a bell shape! But the y values are negative. Naturally, we cannot have negative probabilities.
What happens if you change $\mu$?
Conclusion: changing $\mu$ moves the peak of the bell to another location.
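To make this concrete, here is a minimal sketch that evaluates the piece $-(x - \mu)^2$ at a few points for two values of $\mu$. The function name `neg_squared_distance` and the sample points are illustrative assumptions, not taken from the original chart code.

```python
def neg_squared_distance(x, mu):
    """Innermost piece with the minus sign applied: -(x - mu)^2."""
    return -((x - mu) ** 2)


xs = [-3, -2, -1, 0, 1, 2, 3]  # illustrative sample points

for mu in (0, 1):
    values = [neg_squared_distance(x, mu) for x in xs]
    # The largest value (zero) sits exactly at x = mu.
    peak_x = xs[values.index(max(values))]
    print(f"mu={mu}: peak at x={peak_x}, values={values}")
```

The values are symmetric around the peak, and shifting `mu` shifts the peak with it.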
Add the next piece, division by two sigma squared:

$$-\frac{(x - \mu)^2}{2\sigma^2}$$
Here, sigma is the second parameter of the distribution: the standard deviation. The square of sigma is the variance. What does this give our distribution?
Let’s try to change it:
Idea: the sigma denominator specifies the rate at which the value of the function changes as it moves away from the mean. Smaller sigmas create narrower bell shapes.
We can think of sigma as a measure of uncertainty. Small sigmas indicate that values close to the mean are more likely. Large sigmas distribute the probability over a wider range.
Sigma is squared to match the squared distance in the numerator: the ratio then measures how far $x$ is from the mean in units of standard deviations, squared. In other words, the same distance from the mean counts for less when sigma is large and for more when sigma is small.
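A small sketch of the effect of sigma, assuming the piece has the form $-(x - \mu)^2 / (2\sigma^2)$ as in the final density. The helper name `scaled_piece` and the chosen sigmas are illustrative.

```python
def scaled_piece(x, mu, sigma):
    """-(x - mu)^2 / (2 * sigma^2): squared distance scaled by the variance."""
    return -((x - mu) ** 2) / (2 * sigma ** 2)


x, mu = 1.0, 0.0
for sigma in (0.5, 1.0, 2.0):
    # Same distance from the mean, different sigmas.
    print(f"sigma={sigma}: value at x={x} is {scaled_piece(x, mu, sigma)}")
```

At the same distance from the mean, a small sigma gives a strongly negative value (a narrow bell), while a large sigma gives a value close to zero (a wide bell).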
Great, we have a bell curve. But it doesn’t look like a probability distribution. For this to be a distribution, the outputs must be non-negative and add up to 1. This is where the exponential comes in.
Let’s plot $e^x$ on a graph around zero:
Note that the exponential maps any negative input to a value between 0 and 1, and maps 0 to exactly 1. In our case, the argument is never positive.
Great! Now all values are between 0 and 1, and we have the bell-shaped curve we need. We’re done.
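Here is a minimal check of that claim, assuming the unnormalized bell is $e^{-(x-\mu)^2/(2\sigma^2)}$. The name `bell` and the parameter values are illustrative.

```python
import math


def bell(x, mu, sigma):
    """Unnormalized bell curve: exp(-(x - mu)^2 / (2 * sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


mu, sigma = 0.0, 1.0
for x in (-3, -1, 0, 1, 3):
    # Every value lies in (0, 1]; the maximum is reached at x = mu.
    print(f"x={x}: {bell(x, mu, sigma):.4f}")
```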
Not really. It’s a nice bell curve, but the values don’t add up to 1. The peak alone is already 1.
How can we make the sum equal to 1? Normalize!
How would you normalize a sequence of numbers? Easy: divide each number by their sum.
In our case, the function is not just a sequence of numbers. It is continuous. However, the idea is the same. Let’s integrate to get the sum:

$$\int_{-\infty}^{\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \sigma\sqrt{2\pi}$$

Now that we have the sum, let’s divide the function by it:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Finally, we have collected all the parts to get a normal distribution.
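The normalization step can be checked numerically. The sketch below integrates the unnormalized bell $e^{-(x-\mu)^2/(2\sigma^2)}$ with a simple midpoint Riemann sum and compares the area to $\sigma\sqrt{2\pi}$. The helper names, the integration range of ten standard deviations on each side, and the parameter values are assumptions chosen for illustration.

```python
import math


def bell(x, mu, sigma):
    """Unnormalized bell curve: exp(-(x - mu)^2 / (2 * sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


def integrate(f, a, b, steps=20_000):
    """Midpoint Riemann sum of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def normal_pdf(x, mu, sigma):
    """The bell divided by its total area, sigma * sqrt(2 * pi)."""
    return bell(x, mu, sigma) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 1.0, 2.0
lo, hi = mu - 10 * sigma, mu + 10 * sigma  # tails beyond this are negligible

area = integrate(lambda x: bell(x, mu, sigma), lo, hi)
total = integrate(lambda x: normal_pdf(x, mu, sigma), lo, hi)

print(f"area under the bell:   {area:.6f}")
print(f"sigma * sqrt(2 * pi):  {sigma * math.sqrt(2 * math.pi):.6f}")
print(f"area after dividing:   {total:.6f}")
```

This is only a numerical sanity check, not a derivation of the integral.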
If you need the details of calculating the integral, then I recommend this video:
Relation to the Binomial Distribution
The normal distribution is closely related to the binomial distribution. Let’s digress and take a closer look at the binomial distribution. This will help us get a complete picture of the normal distribution.
Imagine that you have a Substack mailing list and you are going to launch an advertising campaign aimed at 100 people. You know your conversion to subscription: 10%. What is the probability that exactly 5 people subscribe?
Let’s imagine each user as a coin toss. The user either registers, with probability $p$, or does not. This can be described by the Bernoulli($p$) distribution, which has the following probability mass function:

$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$

Idea: Bernoulli($p$) is a coin toss. There are two possible outcomes, and heads comes up with probability $p$.
This distribution allows you to answer questions like “What is the probability of this coin coming up heads?” or, more practically, “What is the probability that the user will subscribe?”.
We can add several Bernoulli random variables and get the Binomial distribution, Binomial($n$, $p$). It tells us the probability of getting $k$ successes from $n$ independent Bernoulli trials, each with success probability $p$.
The binomial distribution combines all independent trials to answer new questions: “What is the probability of getting 3 heads out of 3 tosses?” or “How many users can we expect to register?”
Here is the PMF of the Binomial Distribution:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

where $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$ is the binomial coefficient.
The binomial coefficient accounts for the many ways in which $k$ out of $n$ people can register. For example, if there are four visitors, there are six ways for two of them to register.
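The count of six can be verified by listing the pairs explicitly. This sketch uses Python’s standard `itertools.combinations` and `math.comb`; the visitor labels are hypothetical.

```python
import math
from itertools import combinations

visitors = ["A", "B", "C", "D"]  # four hypothetical visitors

# Every way to pick the two visitors who register.
pairs = list(combinations(visitors, 2))
for pair in pairs:
    print(pair)

# The explicit count agrees with the binomial coefficient C(4, 2).
print(len(pairs), math.comb(4, 2))
```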
Let’s plug in our values to find the probability that exactly five people sign up:

$$P(X = 5) = \binom{100}{5} \cdot 0.1^5 \cdot 0.9^{95} \approx 0.03387$$
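The probability that exactly 5 of the 100 users subscribe comes out to about 0.03387, i.e. 3.387%. The expected number of subscribers is simply $n$ multiplied by $p$, i.e. 10. We can also get the probability of more than 5 subscribers by summing the PMF over $k = 6, \dots, 100$, which gives 0.94242.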
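These numbers can be reproduced with a few lines of standard-library Python. This is a sketch for the campaign example with $n = 100$ and $p = 0.1$; the function name `binomial_pmf` is an illustrative choice.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 100, 0.1

p_exactly_5 = binomial_pmf(5, n, p)
expected = n * p
# "More than 5" means k = 6, 7, ..., n.
p_more_than_5 = sum(binomial_pmf(k, n, p) for k in range(6, n + 1))

print(f"P(X = 5) = {p_exactly_5:.5f}")
print(f"E[X]     = {expected}")
print(f"P(X > 5) = {p_more_than_5:.5f}")
```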
Now let’s see what happens as the number of trials grows.
Note: as $n$ grows, the resulting PMF approaches the familiar bell shape of the Normal Distribution.
It turns out that the normal distribution is a limiting case of the binomial distribution. The binomial distribution answers the question, “How likely is it to get k heads out of n coin tosses?” The normal distribution has the same idea but gives an approximate result.
We are interested in this approximation because calculating the binomial coefficients for large $n$ is computationally expensive. Factorials in the formulas are the biggest problem. For example, for $n = 100$ and $k = 5$ the binomial coefficient is 75287520. This is an expensive calculation, especially if you need to sum over many values of $k$.
Instead of calculating the binomial PMF, we can approximate it by calculating the PDF of the normal distribution. This is much faster: you only need to substitute a few numbers into the formula. This approach is often used in surveys.
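To get a feel for the quality of the approximation, the sketch below compares the binomial PMF with a normal PDF. It assumes the usual matching of parameters, mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1-p)}$; the helper names and the values of $n$ are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """Exact P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))


def max_abs_error(n, p):
    """Largest gap between the exact PMF and the normal PDF over k = 0..n."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return max(abs(binomial_pmf(k, n, p) - normal_pdf(k, mu, sigma)) for k in range(n + 1))


n, p = 100, 0.1
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
for k in (5, 10, 15):
    print(f"k={k}: binomial={binomial_pmf(k, n, p):.5f}, normal={normal_pdf(k, mu, sigma):.5f}")

# The gap shrinks as the number of trials grows.
print(f"max error, n=100:  {max_abs_error(100, p):.5f}")
print(f"max error, n=1000: {max_abs_error(1000, p):.5f}")
```

The match is not exact, since the binomial with $p = 0.1$ is slightly skewed, but it improves as $n$ grows.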
Basic idea of the normal distribution: the number of successes in a large number of independent “yes or no” trials is distributed symmetrically around the mean, and the shape of the distribution is described by the Gaussian function.
Hopefully the normal distribution PDF is no longer just a scary formula in a vacuum. It is still associated with coin tossing and real life, just like the Binomial Distribution.
Deriving the Normal Distribution
Why does exactly this function allow us to approximate the Binomial distribution? To answer this question, we will need to derive the PDF of the normal distribution. There are several ways to do this, but we will use our knowledge that the normal distribution is a limiting case of the binomial distribution. I will only describe the basic steps, as the details of the derivation are quite long, but you can find the full derivation in this article.
The Normal distribution is a limiting case of the Binomial if $p$ is not very small, and the following condition is satisfied:
If it isn’t, we end up with a Poisson distribution, which is also cool, but beyond the scope of this post.
Remember the binomial PMF? Suppose we have a sequence of Bernoulli trials, each with probability of success $p$, and we repeat the trial $n$ times. Let $X$ be the number of successes in $n$ trials. Then $X$ has a binomial distribution with parameters $n$ and $p$. Its probability mass function is defined as:

$$P(X = k) = \frac{n!}{k!\,(n - k)!} p^k (1 - p)^{n - k}$$

The hardest part is the factorial. Let’s take advantage of Stirling’s approximation to replace the factorials with something easier to work with:

$$n! \approx \sqrt{2\pi n} \left(\frac{n}{e}\right)^n$$
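A quick way to see how good Stirling’s approximation $n! \approx \sqrt{2\pi n}\,(n/e)^n$ is: compare it with the exact factorial. The function name `stirling` and the tested values of $n$ are illustrative.

```python
import math


def stirling(n):
    """Stirling's approximation: n! ~ sqrt(2 * pi * n) * (n / e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n


for n in (5, 10, 50, 100):
    ratio = stirling(n) / math.factorial(n)
    # The ratio approaches 1 from below as n grows.
    print(f"n={n}: approximation / exact = {ratio:.6f}")
```

The approximation slightly underestimates the factorial, and the relative error shrinks as $n$ grows.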
Substituting this into the binomial coefficient we get:

$$\binom{n}{k} \approx \sqrt{\frac{n}{2\pi k (n - k)}} \cdot \frac{n^n}{k^k (n - k)^{n - k}}$$
It may look intimidating, but in fact it is just a replacement and some rearrangement of terms.
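The rearranged expression can be sanity-checked against the exact binomial coefficient. The sketch below assumes the form $\sqrt{n / (2\pi k (n-k))} \cdot n^n / (k^k (n-k)^{n-k})$, obtained by replacing all three factorials with Stirling’s formula. The name `comb_stirling` is hypothetical, and the power term is computed through logarithms only to avoid overflow for larger $n$.

```python
import math


def comb_stirling(n, k):
    """Binomial coefficient with each factorial replaced by Stirling's formula.

    Valid for 0 < k < n.
    """
    prefactor = math.sqrt(n / (2 * math.pi * k * (n - k)))
    # log of n^n / (k^k * (n - k)^(n - k))
    log_power_term = n * math.log(n) - k * math.log(k) - (n - k) * math.log(n - k)
    return prefactor * math.exp(log_power_term)


for n, k in ((100, 5), (100, 50)):
    exact = math.comb(n, k)
    approx = comb_stirling(n, k)
    print(f"C({n}, {k}): exact={exact}, approx={approx:.6g}, ratio={approx / exact:.4f}")
```

The approximation is weakest when $k$ or $n - k$ is small, because Stirling’s formula is least accurate for small arguments.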
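Substituting this approximation into the PMF of the Binomial Distribution and simplifying for large $n$, we get:

$$P(X = k) \approx \frac{1}{\sqrt{2\pi n p (1 - p)}} \exp\left(-\frac{(k - np)^2}{2 n p (1 - p)}\right)$$

This is the probability density function of the normal distribution with mean $\mu = np$ and variance $\sigma^2 = np(1 - p)$:

$$f(k) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(k - \mu)^2}{2\sigma^2}}$$

Conclusion: the Gaussian appears when we replace the factorials in the binomial PMF with their approximation.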
We disassembled the normal distribution into its components, explored the relationship between the binomial and normal distributions, and finally derived the normal PDF. I hope it now looks less like a mysterious monster and more like a wonderful way to describe real-world phenomena.
You can find the chart code in this Colab notebook.