Understanding the Normal Distribution

Hi all. This post is about building an intuitive understanding of the Normal distribution.

A typical course in probability theory goes like this. At first, clear and relatively simple concepts are given to you. Everything is easily explained with hands-on examples: tossing coins, red and white balls in urns, rabbits in cages, and so on.

But in the next topic, you are thrown into a pit with this monster:

\begin{aligned} f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \end{aligned}

Suddenly there are no more coins, no more urns, no more balls. You are only told to memorize this probability density function of the Normal distribution, that it is very important, and that its graph looks like a bell. For the rest, you are on your own.

Good old Gaussian bell

But what is it? Why is there an exponential? Why the minus sign? Why divide by two sigma squared? Where did pi come from? Where did the coins, balls, urns and rabbits go? Why did we go from intuitive explanations to rote memorization?

Every formula carries an idea. In this post, we will dissect the normal distribution until we understand the ideas it carries. At the end, we will derive its probability density function to see where it comes from.

I will show that, despite its intimidating appearance, the Normal distribution is not a spherical horse in a vacuum. It’s still about coin tosses, urns and other real-world stuff.

Self-promotion

This article is a translation of my Substack article Understanding the Normal Distribution for Real. Go there if you are more comfortable reading in English or would like to receive such content by email.

I also have a Telegram channel: @boris_again


Dissecting the Monster

Let’s start from the inside. Let’s take a look at the ideas behind this piece:

y = -\frac{(x-\mu)^2}{2 \sigma^2}

We’ll cut this piece into even smaller ones.

y = (x - \mu)^2

where \mu is the mean, one of the two parameters of the distribution.

Let’s look at the graph of the function y = (x - \mu)^2 at \mu=0:

We see a parabola. It resembles the shape of a bell, but upside down. Also note that the y values are arbitrary and not confined to [0, 1], so this is not yet a probability distribution.

Note that the further x is from the mean, the larger the value of the function. Also, the square treats negative and positive deviations equally, which makes the bell shape symmetrical.

Idea: \mu locates the peak of the bell, and the square makes the distribution symmetrical around it.
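
The post’s charts live in a Colab notebook (linked at the end). Here is a minimal matplotlib sketch of my own, not the notebook’s code, that reproduces this graph:

```python
import numpy as np
import matplotlib.pyplot as plt

mu = 0  # the mean; try other values to shift the curve
x = np.linspace(-3, 3, 200)
y = (x - mu) ** 2  # grows as x moves away from mu, symmetric around it

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("y = (x - mu)^2")
plt.show()
```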

y = -(x-\mu)^2

Finally, a bell shape! But the y values are negative, and naturally we cannot have negative probabilities.

What happens if you change \mu?

Changing the mean

Conclusion: changing \mu moves the peak of the bell to another location.
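
A similar sketch (again my stand-in, not the original chart code) shows how changing \mu slides the peak around:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 300)
for mu in [-2, 0, 2]:
    # The peak of -(x - mu)^2 sits exactly at x = mu
    plt.plot(x, -(x - mu) ** 2, label=f"mu = {mu}")

plt.legend()
plt.show()
```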

Now add the next piece, division by two sigma squared:

y = -\frac{(x-\mu)^2}{2 \sigma^2}

Here, sigma is the second parameter of the distribution: the standard deviation. The square of sigma is the variance. What does this give our distribution?

Let’s try to change it:

Changing sigma

Idea: the sigma in the denominator controls how quickly the value of the function changes as x moves away from the mean. Smaller sigmas create narrower bells.

We can think of sigma as a measure of uncertainty. A small sigma concentrates probability near the mean; a large sigma spreads it over a wider range.

Sigma is squared to show that uncertainty increases quadratically (fast) rather than linearly (slowly). In other words, a small variation in the data greatly changes the bell curve.
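
One more sketch of the same kind for sigma (my own code, not from the post), to watch the bell widen and narrow:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 300)
for sigma in [0.5, 1, 2]:
    # Smaller sigma -> the curve falls off faster around the mean
    plt.plot(x, -((x - 0) ** 2) / (2 * sigma**2), label=f"sigma = {sigma}")

plt.legend()
plt.show()
```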


Great, we have a bell curve. But it is not yet a probability distribution: the values must at least be non-negative, and the total probability must equal 1. This is where the exponential comes in.

Let’s plot the exponential function e^x around zero:

e^x

Note that the exponential maps any negative input to a value between 0 and 1. In our case, the argument of the exponential is always negative (or zero at x = \mu).

y = \exp{(-\frac{(x-\mu)^2}{2 \sigma^2})}

Great! Now all values lie within (0, 1], and we have the bell-shaped curve we need. We’re done.
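
A quick numeric check (my sketch) that the exponential really squeezes everything into (0, 1], with the peak exactly at the mean:

```python
import numpy as np

mu, sigma = 0.0, 1.0
x = np.linspace(-5, 5, 1001)
y = np.exp(-((x - mu) ** 2) / (2 * sigma**2))

print(y.min(), y.max())  # min is tiny but positive; max is exactly 1.0 at x = mu
```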


Not really. It’s a nice bell curve, but the total probability doesn’t add up to 1. The peak alone already equals 1.

How can we make the total equal 1? Normalize!

How would you normalize a sequence of numbers like [0, 1, 2, 3, 4]? Easy: divide each element by the sum.

In our case, we have not a sequence of numbers but a continuous function. However, the idea is the same. Let’s integrate to get the “sum”:

\begin{aligned} \int_{-\infty}^{\infty} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)dx = \sqrt{2\pi}\sigma \end{aligned}

Now that we have the sum, let’s divide the function by it:

y = \frac{1}{\sqrt{2 \pi}\sigma} \exp{(-\frac{(x-\mu)^2}{2 \sigma^2})}

Finally, we have collected all the parts to get a normal distribution.
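
If you want to sanity-check the normalizing constant without doing the integral by hand, here is a sketch using scipy’s numerical integration (scipy is my choice of tool; the post itself doesn’t prescribe one):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.5, 0.7  # arbitrary parameters for the check

area, _ = quad(lambda x: np.exp(-((x - mu) ** 2) / (2 * sigma**2)),
               -np.inf, np.inf)

print(area)                        # ~1.7544
print(np.sqrt(2 * np.pi) * sigma)  # same value: sqrt(2*pi) * sigma
```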

If you want the details of computing this integral, I recommend this video:


Relation to the Binomial Distribution

The normal distribution is closely related to the binomial distribution. Let’s digress and take a closer look at the binomial distribution. This will help us get a complete picture of the normal distribution.

Imagine that you have a Substack mailing list and you are going to launch an advertising campaign aimed at 100 people. You know your conversion rate to subscription: 10%. What is the probability that exactly 5 people subscribe?

Let’s model each user as a coin toss. They either subscribe with probability p = 0.1 or they don’t. This can be described by the Bernoulli(p) distribution, which has the following probability mass function:

P(X=k) = \begin{cases} p & \text{if } k=1 \\ 1-p & \text{if } k=0 \\ \end{cases}

Idea: it is a coin toss with two possible outcomes, where heads comes up with probability p.

This distribution allows you to answer questions like “What is the probability of this coin coming up heads?” or, more practically, “What is the probability that the user will subscribe?”.

We can add up several Bernoulli random variables and get the Binomial distribution Binom(n, p). It tells us the probability of getting k successes in n independent Bernoulli trials, each with success probability p.

The binomial distribution combines all the independent trials to answer new questions: “What is the probability of getting 3 heads out of 3 tosses?” or “How many subscribers can we expect?”
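
Here is a small simulation sketch (assuming numpy; the number of runs is my arbitrary choice) of the “sum of Bernoulli trials” idea applied to our campaign:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 0.1

# Each row is one campaign: 100 Bernoulli trials; the row sum is Binomial(n, p)
trials = rng.random((100_000, n)) < p
successes = trials.sum(axis=1)

print(successes.mean())         # ~10, i.e. n * p
print((successes == 5).mean())  # ~0.034, the chance of exactly 5 subscribers
```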

Here is the PMF of the Binomial Distribution:

\begin{aligned} f(k, n) = {n\choose k} p^k (1-p)^{n-k} \end{aligned}

Where {n\choose k} = \frac{n!}{k!(n-k)!} is the binomial coefficient.

The binomial coefficient accounts for the many ways in which k people out of n can subscribe. For example, if there are four visitors, there are six ways for two of them to subscribe.
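
You can verify the “four visitors, six ways” count directly with the standard library (a throwaway sketch):

```python
from itertools import combinations
from math import comb

visitors = ["A", "B", "C", "D"]
ways = list(combinations(visitors, 2))

print(len(ways), comb(4, 2))  # 6 6
print(ways)  # all six pairs of visitors who could be the two subscribers
```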

Let’s plug in our values to find the probability that exactly five people sign up: n=100, k=5, p=0.1.

\begin{aligned} f(k, n) = {100\choose 5} 0.1^5 (1-0.1)^{100-5} = 0.03387 \end{aligned}

In other words, if we ran this campaign 100 times, we would expect to see exactly five subscribers in about 100 × 0.03387 ≈ 3.4 of the runs. We can also get the probability of more than five subscribers by summing the PMF over k \geq 6, which gives 0.94242.
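
These numbers are easy to reproduce with scipy.stats (a sketch; the figures above come from the post):

```python
from scipy.stats import binom

n, p = 100, 0.1

print(binom.pmf(5, n, p))  # 0.03387..., exactly 5 subscribers
print(binom.sf(5, n, p))   # 0.94241..., P(X > 5)
print(binom.mean(n, p))    # 10.0, the expected number of subscribers, n * p
```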

Now let’s see what happens as the number of trials grows.

Note: as n grows, the resulting PMF approaches the familiar bell shape of the Normal distribution.

It turns out that the normal distribution is a limiting case of the binomial distribution. The binomial distribution answers the question “How likely is it to get k heads out of n coin tosses?”; the normal distribution answers the same question, but approximately.

We care about this approximation because computing binomial coefficients for large values is computationally expensive; the factorials in the formulas are the biggest problem. For example, for n=100, k=5 the binomial coefficient is 75,287,520. This gets very expensive, especially if you need to sum over many values of k.

Instead of calculating the binomial PMF, we can approximate it by calculating the PDF of the normal distribution. This is much faster: you only need to substitute a few numbers into the formula. This approach is often used in surveys.
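
Here is a sketch of that approximation in action, comparing the exact binomial PMF with the normal PDF at \mu = np and \sigma^2 = np(1-p) (the parameters n = 1000, p = 0.1 are my arbitrary choice):

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.1
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

k = np.arange(70, 131)
exact = binom.pmf(k, n, p)
approx = norm.pdf(k, loc=mu, scale=sigma)

print(np.abs(exact - approx).max())  # small: the curves nearly coincide
```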

The basic idea of the normal distribution: the number of successes in a large number of independent “yes or no” trials is distributed symmetrically around the mean, and the shape of the distribution is described by the Gaussian function.

Hopefully the normal distribution PDF is no longer just a scary formula in a vacuum. It is still about coin tosses and real life, just like the Binomial distribution.


Deriving the Normal Distribution

Why does exactly this function approximate the Binomial distribution? To answer that, we need to derive the PDF of the normal distribution. There are several ways to do this, but we will use the fact that the normal distribution is a limiting case of the binomial distribution. I will only describe the main steps, as the full derivation is quite long, but you can find it in this article.

The Normal distribution is a limiting case of the Binomial if p is not very small and the following condition holds:

np \to \infty \ \text{as} \ n \to \infty

If it doesn’t hold, we end up with a Poisson distribution, which is also cool but beyond the scope of this post.

Remember the binomial PMF? Suppose we have a Bernoulli trial with probability of success p, and we repeat this experiment n times. Let X be the number of successes in the n trials. Then X has a binomial distribution with parameters n and p, and the probability mass function of X is defined as:

\begin{equation} P(X=k) = \binom{n}{k}p^k(1-p)^{n-k} \end{equation}

The hardest part is the factorials. Let’s use Stirling’s approximation to compute them faster:

\begin{equation} n! \approx n^n e^{-n} \sqrt{2\pi n} = \sqrt{2\pi n}\left(\frac{n}{e}\right)^n \end{equation}

Substituting this into the binomial coefficient we get:

\begin{align} \frac{n!}{k!(n-k)!} \approx \frac{\sqrt{2\pi n}\left(\frac{n}{e}\right)^n}{\sqrt{2\pi k}\left(\frac{k}{e}\right)^k \sqrt{2\pi (n-k)}\left(\frac{n-k}{e}\right)^{n-k}} \end{align}

It may look intimidating, but it is just a substitution and some rearrangement of terms.
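
A quick numeric sanity check of Stirling’s approximation (my sketch, not part of the original derivation):

```python
from math import factorial, sqrt, pi, e

for n in [5, 20, 100]:
    stirling = sqrt(2 * pi * n) * (n / e) ** n
    exact = factorial(n)
    print(n, stirling / exact)  # ratio approaches 1 as n grows
```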

Substituting this approximation into the PMF of the Binomial Distribution, we get:

\begin{align} P(X=k) = \binom{n}{k}p^k(1-p)^{n-k} \approx \frac{1}{\sqrt{2\pi npq}} \cdot \exp\left(-\frac{(k-np)^2}{2npq}\right) \end{align}

This is the probability density function of the normal distribution with mean \mu = np and variance \sigma^2 = npq, where q = 1-p:

\begin{equation} f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \end{equation}

Conclusion: the Gaussian appears when we replace the factorials in the binomial PMF with Stirling’s approximation.


Conclusion

We took the normal distribution apart into its components, explored its relationship with the binomial distribution, and finally derived the normal PDF. I hope it is now less of a mysterious monster and more of a wonderful way to describe real-world phenomena.

You can find the chart code in this Colab notebook.

Thank you for your attention! If you like this content, subscribe to my Telegram channel @boris_again and my Substack.
