# Do You Need a Beautiful Resume? Evaluating the Results of the Experiment Using Statistics

Recently YouTube recommended me a __video__ about evaluating the effectiveness of a resume. The author of the video created five versions of a resume to study the influence of four factors: the applicant's name, the company name, a gap in employment, and the design. Each version was sent to 100 relevant vacancies.

My name is Olga Matushevich, and I am a mentor on the __“Data Analyst”__ course at Yandex Practicum. In this article, I will walk through the results of that experiment and try to find out whether they are statistically significant.

## How employers responded to the resume in the video

**№1.** A boring standard resume of a white (judging by the name) man, with employment at Meta (Meta Platforms Inc.'s activities are prohibited in the Russian Federation) mentioned in the most visible place. This version received 18 interview invitations.

**№2.** The same resume, but the mention of Meta is hidden in the middle of the document, and the most visible place features a no-name company. 10 interview invitations.

**№3.** Resume #1, made with a beautiful Canva template. Result: 8 (!) interview invitations.

**№4.** Resume #1, but with a woman's name in the title, and the name clearly suggests ethnic roots. 10 interview invitations.

**№5.** Resume #1, but with a three-year gap in employment. 8 interview invitations.

**Attention, question: Are these results statistically significant?**

In this case, we have an A/B/n test. This is an A/B test in which, with one control group A (in our case, resume #1), we create several test groups (resumes #2, #3, #4, and #5).

To evaluate the test results, we will use methods from the free course __“Basics of Statistics and A/B Testing”__ from Yandex Practicum.

## Formulation of hypotheses

In this study we use a **two-sided alternative hypothesis**. This lets us check whether there are significant differences in the number of interview invitations between resume versions in either direction, toward better results as well as toward worse ones, even though all test resumes performed worse than the control.

**Null hypothesis (H0):** the differences in interview invitation rates between resume #1 and resume #2 are not statistically significant.

**Alternative hypothesis (H1):** the differences in interview invitation rates between resume #1 and resume #2 are statistically significant.

We will compare not only the results of sending resume #2 with the results of sending resume #1. We will also compare the results of sending resumes #3, #4 and #5 with the results of the control group – resume #1. In this case, we will replace number 2 in the formulations of the hypotheses with the number of the corresponding resume.

## Selecting a criterion

We will use **z-test for proportions**. It is used in statistics to compare two proportions measured in two independent samples. This method is especially useful when you want to assess whether there are statistically significant differences between the success rates of two different groups – which is exactly the case here.

The z-score is based on the standard normal curve (z-distribution) and is calculated as follows:

$$z = \frac{p_1 - p_2}{\sqrt{p\,(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where `p1` and `p2` are the success rates in the first and second groups, respectively, `n1` and `n2` are the sizes of these groups, and `p` is the combined proportion of successes, calculated by the formula:

$$p = \frac{x_1 + x_2}{n_1 + n_2}$$

where `x1` and `x2` are the numbers of successes in each group.

If the calculated z-statistic value falls within the critical range of the standard normal distribution (usually ±1.96 for α = 0.05), the differences are considered statistically significant. This indicates that the observed differences in proportions between the two groups are not due to chance.

## Level α

Let's take the standard level **α = 0.05**. This means that we allow a 5% chance of making a Type I error, that is, incorrectly rejecting the null hypothesis when it is actually true.

## Calculations

First, let's calculate the success rates for each resume:

Resume #1 (control): 18 invitations, p1 = 0.18.

Resume #2: 10 invitations, p2 = 0.10.

Resume #3: 8 invitations, p3 = 0.08.

Resume #4: 10 invitations, p4 = 0.10.

Resume #5: 8 invitations, p5 = 0.08.

Now, using the formulas given above, we calculate the z-values and compare them with the critical value of ±1.96:

Resume #1 vs resume #2: z ≈ 1.63, p-value greater than 0.05. The difference is not statistically significant.

Resume #1 vs resume #3: z ≈ 2.10, p-value less than 0.05. The difference is statistically significant.

Resume #1 vs resume #4: z ≈ 1.63, p-value greater than 0.05. The difference is not statistically significant.

Resume #1 vs resume #5: z ≈ 2.10, p-value less than 0.05. The difference is statistically significant.
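As a sanity check, these z-scores are easy to reproduce in a few lines of Python. This is a minimal sketch of the pooled-proportion formula above; the function name is mine, not from any particular library:

```python
from math import sqrt

def z_proportions(x1, n1, x2, n2):
    """Two-sample z-test for proportions with a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # combined success proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # pooled standard error
    return (p1 - p2) / se

# Resume #1 (18/100) against each of the four test variants
for x2 in (10, 8, 10, 8):
    print(round(z_proportions(18, 100, x2, 100), 2))  # 1.63, 2.1, 1.63, 2.1
```

In practice you could get the same numbers from `statsmodels.stats.proportion.proportions_ztest`, but the formula is short enough to compute by hand.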

Wow! We found statistically significant differences between the results of resume mailings #1 and #3, and between the results of resume mailings #1 and #5. **So, we can confidently say that a beautiful resume template and a break in work of about three years reduce your chances of getting a job.**

**Or not?**

## Multiple hypothesis testing

Here the methods from the course __“Mathematics for data analysis”__ from Yandex Practicum will help us. The necessary formulas are in the “Statistical Methods” module.

When several independent tests are performed, the probability that at least one of them will yield a false positive increases. Assuming that the tests are independent, the probability of not getting a false positive in a single test is 1 − α. If we perform m such tests, the probability of not making any type I errors across all of them is (1 − α)^m. Thus, the probability that at least one of the tests will produce a type I error is 1 − (1 − α)^m.

In our case, we conducted four hypothesis tests, i.e. m = 4. Using this formula, we obtain the probability of making at least one type I error: 1 − (1 − 0.05)^4 ≈ 0.185, i.e. about 18.5%.
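This family-wise error rate is one line of arithmetic, shown here in Python for concreteness:

```python
# Probability of at least one false positive across m independent tests
alpha, m = 0.05, 4
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 3))  # 0.185
```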

**To control this risk and keep the overall significance level at the target 0.05, corrections for multiple comparisons, such as the Holm or Bonferroni corrections, are applied.** These methods adjust the decision criteria to reduce the likelihood of false positives and provide more reliable conclusions.

The Bonferroni correction is very simple to explain, but very conservative: with it, the null hypothesis is rejected extremely rarely. **We will use the Holm correction.** To do this:

1. Order the p-values from smallest to largest.

2. Apply the correction: for the i-th smallest p-value, the adjusted significance level is αᵢ = α / (m − i + 1), where m is the total number of tests and α is the overall significance level.

Let's move from z-values to p-values (we'll do this calculation outside the article, for example using standard normal __tables__).

p12 ≈ 0.103 (comparison of resumes 1 and 2)

p13 ≈ 0.036 (comparison of resumes 1 and 3)

p14 ≈ 0.103 (comparison of resumes 1 and 4)

p15 ≈ 0.036 (comparison of resumes 1 and 5)
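If you prefer code to tables, the two-sided p-value can be sketched with the standard normal CDF via `math.erf` (the helper name is mine):

```python
from math import erf, sqrt

def p_value_two_sided(z):
    """Two-sided p-value for a z-score, via the standard normal CDF."""
    phi = 0.5 * (1 + erf(z / sqrt(2)))   # Phi(z), the standard normal CDF
    return 2 * (1 - phi)

print(round(p_value_two_sided(1.63), 3))  # 0.103
print(round(p_value_two_sided(2.10), 3))  # 0.036
```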

Now let's sort them in ascending order:

p13 ≈ 0.036

p15 ≈ 0.036

p12 ≈ 0.103

p14 ≈ 0.103

Let's calculate the new α adjusted by the Holm correction:

for the first (smallest) p-value: α1 = 0.05 / 4 = 0.0125

for the second p-value: α2 = 0.05 / 3 ≈ 0.0167

for the third p-value: α3 = 0.05 / 2 = 0.025

for the fourth p-value: α4 = 0.05 / 1 = 0.05
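The whole Holm step-down procedure fits in a short Python function. This is a minimal illustration under the article's setup, not a library implementation; `statsmodels.stats.multitest.multipletests(method="holm")` does the same job in practice:

```python
def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha / (m - i + 1)."""
    m = len(p_values)
    rejected = [False] * m
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (m - step):
            rejected[i] = True
        else:
            break  # stop at the first non-rejection; all larger p-values fail too
    return rejected

# p12, p13, p14, p15 from the article
print(holm([0.103, 0.036, 0.103, 0.036]))  # [False, False, False, False]
```

The smallest p-value, 0.036, already exceeds its threshold of 0.0125, so the procedure stops immediately and nothing is rejected.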

Unfortunately, all p-values are now above the adjusted α level. **This means that we can NOT say for sure that any of the deviations from the standard resume that we studied will reduce your chances of getting a job.**

## So how should you format your resume?

The results from the video may invite eloquent conclusions, but in reality everything is less dramatic: the author's experiment does not actually prove any influence of big-company experience, a career break, gender, or ethnicity on employment.

Let's return to the question from the title – do you need a beautiful resume? On the one hand, we have not proven the harm of beautiful templates. On the other hand, to put it mildly, the usefulness of such improvements has not been proven either. And since the presence of a statistically significant positive effect has not been proven, why waste time?