Analysis of changes in age and anthropometric data of National Hockey League players

Lately, experts and players of the National Hockey League (NHL) have increasingly been saying that the league is getting younger and shifting towards shorter, lighter, more agile players. Hockey dominated by huge players is becoming a thing of the past, and the size of “giants” such as New York Rangers forward Matt Rempe, at 200 cm and 109 kg, gets discussed more than his actual play.

I took data from the NHL website covering the last 10 seasons, for players who played more than 10 games in a season.

Let's analyze this data and see whether the league is actually getting younger and the players are getting smaller and lighter.

Data

We have one dataset file per season, each containing the players who appeared in more than 10 games that season, taken from the official National Hockey League website. We keep only the columns of interest: position (Pos), date of birth (DOB), height in inches (Ht), weight in pounds (Wt), games played (G.P.), season, and the player's first NHL season (1st Season).

Let's look at 5 random rows:

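| | Pos | DOB | Ht | Wt | G.P. | season | 1st Season |
|---|---|---|---|---|---|---|---|
| 279 | D | 1983-06-08 | 71 | 180 | 43 | 2014-2015 | 20062007 |
| 853 | C | 1987-08-07 | 71 | 200 | 80 | 2015-2016 | 20052006 |
| 4411 | C | 1988-04-29 | 74 | 201 | 70 | 2019-2020 | 20072008 |
| 1935 | C | 1991-02-22 | 75 | 200 | 19 | 2016-2017 | 20142015 |
| 5379 | D | 1994-07-25 | 72 | 181 | 57 | 2021-2022 | 20132014 |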
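The code later in this article relies on derived columns (`age`, `Ht_cm`, `Wt_kg`, `season_numeric`) whose construction is not shown. Below is a minimal sketch of one way they might be built. The helper name `add_derived_columns`, the choice of October 1 as the reference date for age, the use of the season's start year as `season_numeric`, and the rounding are all assumptions, not necessarily what the original project does.

```python
import pandas as pd

# Conversion factors (exact by definition)
CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237

def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with age, metric units and a numeric season added."""
    out = df.copy()
    # ASSUMPTION: a season "2014-2015" is represented by its start year, 2014
    out["season_numeric"] = out["season"].str[:4].astype(int)
    # ASSUMPTION: age is counted in full years on October 1 of the start year
    dob = pd.to_datetime(out["DOB"])
    had_birthday = (dob.dt.month < 10) | ((dob.dt.month == 10) & (dob.dt.day <= 1))
    out["age"] = out["season_numeric"] - dob.dt.year - (~had_birthday).astype(int)
    # ASSUMPTION: centimetres rounded to whole numbers, kilograms to 2 decimals
    out["Ht_cm"] = (out["Ht"] * CM_PER_INCH).round()
    out["Wt_kg"] = (out["Wt"] * KG_PER_POUND).round(2)
    return out

# Two rows taken from the sample table above
sample = pd.DataFrame({
    "DOB": ["1983-06-08", "1994-07-25"],
    "Ht": [71, 72],
    "Wt": [180, 181],
    "season": ["2014-2015", "2021-2022"],
})
print(add_derived_columns(sample))
```

With these assumptions the two sample players come out at 31 and 27 years old, which is in the range of the seasonal averages reported below.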

Descriptive Statistics

Let's calculate the main statistical indicators (mean, median, standard deviation) for the age, height and weight of players for each season.

```python
import pandas as pd

# Compute mean, median and standard deviation of `column` per group in `season`
def calculate_statistics(data, season, column):
    stats = data.groupby(season)[column].agg(['mean', 'median', 'std'])
    return stats

# Summary statistics for age, height and weight
age_stats = calculate_statistics(data, 'season', 'age')
height_stats = calculate_statistics(data, 'season', 'Ht')
weight_stats = calculate_statistics(data, 'season', 'Wt')
height_stats_cm = calculate_statistics(data, 'season', 'Ht_cm')
weight_stats_kg = calculate_statistics(data, 'season', 'Wt_kg')

# Print the results (height and weight in metric units)
print("Statistics for age:")
print(age_stats)
print("\nStatistics for height (cm):")
print(height_stats_cm)
print("\nStatistics for weight (kg):")
print(weight_stats_kg)
```

Statistics for age (years):

| season | mean | median | std |
|---|---|---|---|
| 2014-2015 | 27.865753 | 27.0 | 4.548903 |
| 2015-2016 | 27.573370 | 27.0 | 4.436711 |
| 2016-2017 | 27.427989 | 27.0 | 4.430596 |
| 2017-2018 | 27.362319 | 27.0 | 4.341212 |
| 2018-2019 | 27.212516 | 27.0 | 4.090101 |
| 2019-2020 | 27.463315 | 27.0 | 4.105912 |
| 2020-2021 | 27.575549 | 27.0 | 4.110550 |
| 2021-2022 | 27.683168 | 27.0 | 4.212581 |
| 2022-2023 | 28.000000 | 28.0 | 4.081955 |
| 2023-2024 | 28.239637 | 28.0 | 4.213909 |

Statistics for height (cm):

| season | mean | median | std |
|---|---|---|---|
| 2014-2015 | 185.649315 | 185.0 | 5.279594 |
| 2015-2016 | 185.884511 | 185.0 | 5.330337 |
| 2016-2017 | 185.705163 | 185.0 | 5.273089 |
| 2017-2018 | 185.566535 | 185.0 | 5.272569 |
| 2018-2019 | 185.674055 | 185.0 | 5.215526 |
| 2019-2020 | 185.710598 | 185.0 | 5.318094 |
| 2020-2021 | 185.901099 | 185.0 | 5.359289 |
| 2021-2022 | 185.923267 | 185.0 | 5.485504 |
| 2022-2023 | 186.042636 | 185.0 | 5.443297 |
| 2023-2024 | 185.879534 | 185.0 | 5.565527 |

Statistics for weight (kg):

| season | mean | median | std |
|---|---|---|---|
| 2014-2015 | 92.153096 | 92.08 | 6.704483 |
| 2015-2016 | 92.150014 | 92.08 | 6.875161 |
| 2016-2017 | 91.576182 | 91.17 | 6.844911 |
| 2017-2018 | 91.259802 | 90.72 | 6.806483 |
| 2018-2019 | 91.076063 | 90.72 | 6.912537 |
| 2019-2020 | 90.994918 | 90.72 | 7.004026 |
| 2020-2021 | 90.877074 | 90.72 | 6.756815 |
| 2021-2022 | 90.740681 | 90.72 | 6.973008 |
| 2022-2023 | 90.836021 | 90.72 | 6.884751 |
| 2023-2024 | 90.535557 | 90.72 | 7.058457 |

At a glance, there are no large differences between seasons. Let's look at the histograms and density curves for each indicator.

Data visualization

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Unique seasons in the dataset
seasons = data['season'].unique()

# Subplot grid: one panel per season (10 seasons)
nrows = 5
ncols = 2

# Plot per-season histograms, then overlaid density curves, for one column
def plot_histograms_with_density(column, title, xlabel):
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 30))
    fig.suptitle(title, fontsize=20)
    axes = axes.flatten()
    for i, season in enumerate(seasons):
        season_data = data[data['season'] == season][column].to_numpy()
        lo, hi = int(np.min(season_data)), int(np.max(season_data))
        if column == 'Wt':
            # Weight in pounds: bins roughly 4 lb wide
            bins = max((hi - lo) // 4, 1)
        else:
            # Age and height: one bin per integer value
            bins = hi - lo + 1
        axes[i].hist(season_data, bins=bins, density=False, range=[lo, hi + 1])
        axes[i].set_title(season)
        axes[i].set_xlabel(xlabel)
    plt.tight_layout(rect=[0.03, 0.03, 1, 0.95])
    plt.show()

    # Overlay the density curve of every season on a single figure
    sns.set(style="whitegrid")
    plt.figure(figsize=(12, 8))
    for season in seasons:
        season_data = data[data['season'] == season][column]
        season_data.plot.density(ind=np.arange(min(season_data), max(season_data) + 1))

    # Plot formatting
    plt.legend(seasons)
    plt.title('Density Plot, ' + column)
    plt.xlabel(xlabel)
    plt.ylabel('Density')
    plt.show()

# Histograms and density curves for age, height and weight
plot_histograms_with_density('age', 'Player age distribution by season', 'Age')
plot_histograms_with_density('Ht', 'Player height distribution by season', 'Height (in)')
plot_histograms_with_density('Wt', 'Player weight distribution by season', 'Weight (lb)')
```

It is also difficult to draw any unambiguous conclusions from the graphs, so let's move on to regression analysis.

Time series regression

Using linear regression, we will plot the time series of seasonal means and the regression trend line for each feature. Linear regression does not require the data itself (for example, the age of the players) to be normally distributed.

What matters is that the residuals (the model's prediction errors) are normally distributed. We will check the distribution of the residuals with a Q-Q plot and the Shapiro-Wilk test.

```python
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import shapiro
from sklearn.linear_model import LinearRegression

# Fit a linear trend to the per-season means
def regresion_model(stats):
    # Predictor: numeric season (the index); response: seasonal mean
    X = stats.index.values.reshape(-1, 1)
    y_mean = stats['mean'].values
    model = LinearRegression()
    model.fit(X, y_mean)
    trend_line = model.predict(X)
    # Returns the fitted values, the intercept and the slope
    return trend_line, model.intercept_, model.coef_[0]

# Plot the seasonal means with the regression line, then check the residuals
def plot_trend_with_regression(stats, title, ylabel):
    y_mean = stats['mean'].values
    std_dev = stats['std'].values
    trend_line, _, _ = regresion_model(stats)
    residuals = y_mean - trend_line

    # Time series, trend line and a band of one standard deviation
    sns.set(style="whitegrid")
    plt.figure(figsize=(12, 8))
    x = stats.index.to_numpy()
    plt.plot(x, y_mean, marker="o", label="Mean")
    plt.plot(x, trend_line, color="red", label="Trend line")
    plt.fill_between(x, y_mean - std_dev, y_mean + std_dev,
                     color="gray", alpha=0.2, label="Standard deviation")
    for i in range(len(stats)):
        plt.annotate(f'{y_mean[i]:.2f}', (x[i], y_mean[i]),
                     textcoords="offset points", xytext=(0, 10), ha="center")
        plt.annotate(f'±{std_dev[i]:.2f}', (x[i], y_mean[i] - std_dev[i]),
                     textcoords="offset points", xytext=(0, -10), ha="center", color="blue")
    plt.title(title, fontsize=20)
    plt.xlabel('Season', fontsize=15)
    plt.ylabel(ylabel, fontsize=15)
    plt.legend()
    plt.show()

    # Residual normality check. line="s" fits the reference line to the sample;
    # a 45-degree line would only be valid for residuals with unit variance.
    sm.qqplot(residuals, line="s")
    plt.title('Q-Q plot of residuals')
    plt.show()

    stat, p_value = shapiro(residuals)
    print(f'Shapiro-Wilk test: statistic={stat}, p-value={p_value}')
    if p_value > 0.05:
        print('Residuals are normally distributed')
    else:
        print('Residuals are not normally distributed')

age_stats = calculate_statistics(data, 'season_numeric', 'age')
height_stats = calculate_statistics(data, 'season_numeric', 'Ht')
weight_stats = calculate_statistics(data, 'season_numeric', 'Wt')
# Trend plots for age, height and weight
plot_trend_with_regression(age_stats, 'Trend in mean player age by season', 'Age (years)')
plot_trend_with_regression(height_stats, 'Trend in mean player height by season', 'Height (in)')
plot_trend_with_regression(weight_stats, 'Trend in mean player weight by season', 'Weight (lb)')
```

Age: Shapiro-Wilk test: statistic=0.9729418158531189, p-value=0.9166999459266663. Residuals are normally distributed.

Height: Shapiro-Wilk test: statistic=0.933466374874115, p-value=0.48284029960632324. Residuals are normally distributed.

Weight: Shapiro-Wilk test: statistic=0.9445896148681641, p-value=0.6051148176193237. Residuals are normally distributed.

We see a slight downward trend in player weight, while height remains virtually unchanged and the age trend even goes up slightly.

Regression analysis. Mann-Kendall test

To test for a trend we will use the Mann-Kendall test. It is a nonparametric test for monotonic trends in time series, so it remains usable when the data are not normally distributed.

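```python
import pymannkendall as mk

features = ['age', 'Ht', 'Wt']
result = []
for column in features:
    # Mann-Kendall test on the per-season means
    MK = mk.original_test(data.groupby('season')[column].agg('mean'))
    # Linear regression coefficients on the same means
    stats = calculate_statistics(data, 'season_numeric', column)
    _, x0, x = regresion_model(stats)
    # p-value, trend label, intercept, slope, 10-season effect, Kendall's Tau
    result.append([MK.p, MK.trend, x0, x, 10 * x, MK.Tau])
```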

Let's look at the test and regression results for age, height and weight over the seasons 2014-2015 through 2023-2024.

| | p-value | regression | intercept | slope | effect | Tau |
|---|---|---|---|---|---|---|
| Age (years) | 0.152406 | no trend | -76.611793 | 0.051623 | 0.516228 | 0.377778 |
| Height (in) | 0.107405 | no trend | 49.991625 | 0.011496 | 0.114960 | 0.422222 |
| Weight (lb) | 0.000172 | decreasing | 987.935300 | -0.389616 | -3.896159 | -0.955556 |

Where:

• `p-value`: p-value of the Mann-Kendall test

• `regression`: whether a trend is present, and its direction

• `slope` and `intercept`: linear regression coefficients (slope per season and constant term)

• `effect`: estimated change over the 10-season study period, computed as 10 × slope

• `Tau`: Kendall correlation coefficient, a value between -1 and 1. A value of 1 indicates a perfect positive correlation (every later value is higher), and a value of -1 indicates a perfect negative correlation (every later value is lower).
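To make `Tau` and the p-value less of a black box, here is a small from-scratch sketch of the Mann-Kendall statistic. It is an illustration, not the `pymannkendall` implementation: the function name `mann_kendall` is made up for this example, and the variance formula assumes there are no tied values. The input is the list of mean weights in kg from the descriptive statistics above.

```python
import math

def mann_kendall(values):
    """Return (S, tau, p_value) for a series without tied values."""
    n = len(values)
    # S = number of increasing pairs minus number of decreasing pairs (i < j)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    # Kendall's tau: S divided by the total number of pairs
    tau = s / (n * (n - 1) / 2)
    # Variance of S under the no-trend hypothesis (no correction for ties)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected z-score
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, tau, p_value

# Mean weight in kg per season, 2014-2015 through 2023-2024
weights_kg = [92.153096, 92.150014, 91.576182, 91.259802, 91.076063,
              90.994918, 90.877074, 90.740681, 90.836021, 90.535557]
s, tau, p = mann_kendall(weights_kg)
print(s, round(tau, 6), round(p, 6))  # prints: -43 -0.955556 0.000172
```

Of the 45 season pairs, only one (2021-2022 to 2022-2023) shows an increase, which is why `Tau` is so close to -1.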

We see a clear downward trend only in player weight; for age and height the test finds no statistically significant trend. Over 10 years, the estimated decrease in weight is about 3.9 lb, or roughly 1.77 kg. On the one hand, this is not a big difference; on the other hand, the Kendall correlation coefficient `Tau` of -0.96 points to a consistent negative trend.

Regression for each position

Let's see what the Mann-Kendall test shows for the weight of players at each position.

```python
import pymannkendall as mk

result = []
# Each frame holds per-season weight statistics indexed by numeric season
for position in [defenders_weight_stats, forwards_weight_stats]:
    MK = mk.original_test(position['mean'])
    _, x0, x = regresion_model(position)
    # p-value, trend label, intercept, slope, 10-season effect, Kendall's Tau
    result.append([MK.p, MK.trend, x0, x, 10 * x, MK.Tau])
```

| | p-value | regression | intercept | slope | effect | Tau |
|---|---|---|---|---|---|---|
| Defenders | 0.000677 | decreasing | 1274.143826 | -0.529516 | -5.295163 | -0.866667 |
| Forwards | 0.012266 | decreasing | 836.922875 | -0.315804 | -3.158039 | -0.644444 |

We see that the downward weight trend is present for both defenders and forwards, but it is more pronounced for defenders. Over 10 years, the average weight of defenders has decreased by about 5.3 lb, or roughly 2.4 kg.
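The frames `defenders_weight_stats` and `forwards_weight_stats` are used above without being defined in the text. A possible way to build them is sketched below. The helper `weight_stats_by_group` is hypothetical, and so is the assumption that `'D'` marks defenders while every other position code is a forward. The toy data is invented purely to make the example runnable.

```python
import pandas as pd

def weight_stats_by_group(df: pd.DataFrame, is_defender: bool) -> pd.DataFrame:
    """Per-season mean/median/std of weight for defenders or for forwards."""
    # ASSUMPTION: 'D' marks defenders; every other position code is a forward
    mask = df["Pos"] == "D"
    group = df[mask] if is_defender else df[~mask]
    return group.groupby("season_numeric")["Wt"].agg(["mean", "median", "std"])

# Invented toy data, not taken from the NHL dataset
toy = pd.DataFrame({
    "Pos": ["D", "D", "C", "L", "D", "C"],
    "Wt": [210, 200, 190, 194, 204, 188],
    "season_numeric": [2014, 2014, 2014, 2014, 2015, 2015],
})
defenders_weight_stats = weight_stats_by_group(toy, is_defender=True)
forwards_weight_stats = weight_stats_by_group(toy, is_defender=False)
print(defenders_weight_stats["mean"])
print(forwards_weight_stats["mean"])
```

The resulting frames have the same shape as the output of `calculate_statistics`, so they can be passed to `regresion_model` unchanged.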

Regression analysis for rookies

Let's analyze how height and weight have changed for players in their debut season in the league. This will show whether the height and weight requirements are changing for newcomers who are just starting out in the league.
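Selecting rookies means matching the `season` column (for example `2016-2017`) against `1st Season` (for example `20162017`), which use different formats. A minimal sketch of that filter follows; the function name `select_rookies` and the toy rows are illustrative assumptions, not code from the project.

```python
import pandas as pd

def select_rookies(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows where the season is the player's first NHL season."""
    # 'season' looks like "2014-2015", '1st Season' looks like 20142015
    season_key = df["season"].str.replace("-", "", regex=False)
    first_key = df["1st Season"].astype(str)
    return df[season_key == first_key]

# Invented toy rows in the same format as the dataset
toy = pd.DataFrame({
    "season": ["2014-2015", "2016-2017", "2016-2017"],
    "1st Season": [20062007, 20162017, 20142015],
    "Ht": [71, 75, 73],
    "Wt": [180, 200, 195],
})
rookies = select_rookies(toy)
print(rookies)
```

Only the second row survives the filter, since it is the only one where the season equals the debut season.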

Mann-Kendall test results for rookies:

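| | p-value | regression | intercept | slope | effect, lb | effect, kg | Tau |
|---|---|---|---|---|---|---|---|
| Height | 0.474274 | no trend | -8.207392 | 0.040279 | 0.402788 | 0.182701 | 0.2 |
| Weight | 0.474274 | no trend | 996.715891 | -0.395773 | -3.957726 | -1.795193 | -0.2 |

For the Height row, the first effect value is in inches, not pounds. The value in the kg column is that number multiplied by the pound-to-kilogram factor, so it has no physical meaning for height.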

The graphs lean slightly towards increasing height and decreasing weight for players in their first season in the league. However, the Mann-Kendall test indicates no trend in either. This suggests that clubs are not deliberately selecting lighter or shorter players, but look at other qualities (the players' skills). The negative weight trend across all NHL players may instead indicate that players are simply brought to optimal condition for their physique.

Body mass index regression

Let's calculate the body mass index (BMI) of the players to see whether it changes from season to season.
The formula is the standard one: BMI = weight (kg) / height (m)².
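A minimal sketch of that calculation, assuming the standard definition of BMI as weight in kilograms divided by the square of height in metres. The function name `body_mass_index` is chosen for this example; the inputs are in inches and pounds because that is how the dataset stores `Ht` and `Wt`.

```python
CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237

def body_mass_index(height_in, weight_lb):
    """BMI in kg/m^2 from height in inches and weight in pounds."""
    height_m = height_in * CM_PER_INCH / 100
    weight_kg = weight_lb * KG_PER_POUND
    return weight_kg / height_m ** 2

# A 73 in, 200 lb player, close to the league averages reported above
typical_bmi = body_mass_index(73, 200)
print(round(typical_bmi, 2))  # about 26.39

# Matt Rempe's size from the introduction is already metric: 200 cm, 109 kg
print(109 / 2.00 ** 2)  # 27.25
```

Because the function only uses arithmetic, it should also work element-wise on pandas columns, for example `body_mass_index(data['Ht'], data['Wt'])`.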

Visually, a downward trend in BMI is apparent. Let's see what the Mann-Kendall test shows.

| | p-value | regression | intercept | slope | effect | Tau |
|---|---|---|---|---|---|---|
| BMI | 0.000083 | decreasing | 146.917965 | -0.059696 | -0.596957 | -1.0 |

All indicators point to a steady decrease in BMI: a `Tau` of -1.0 means the seasonal mean fell in every single season. The effect over 10 years is small, about 0.6 BMI points. By the WHO classification, a BMI between 25 and 30 counts as overweight, so hockey players fall into that zone, which is perhaps quite normal for this sport given their muscle mass.

Conclusion

The aim of the study was to examine how the main physical characteristics of National Hockey League players (height, weight, age) have changed over the past 10 years. The analysis produced the following results:

• Over the past 10 years, there has been a trend toward lower weight among NHL players.

• There is no trend towards a decrease in age or height. That is, the statement “the league is getting younger” is not supported by the data.

We also analyzed players in their first year in the league and did not identify any trends. This means that players enter the league with roughly the same average height and weight each season, which suggests that the weight loss happens later, during training at the club. We also checked how the body mass index of hockey players changed and found a stable downward trend. In absolute terms the changes in weight and BMI are small, but the trend is present: NHL players are gradually becoming lighter.

The source code of the project is available on my GitHub: https://github.com/permyakov-andrew/Hockey/tree/main/NHL_players_analyst

P.S. This is my first piece of work, so please don't judge too harshly. All comments and constructive remarks are very welcome 🙂