Predicting the group stage results and the winner of Euro 2024 using machine learning and GPT 4.0 chat

Disclaimer

Do not place bets based on the forecasts below: they do not take into account bookmaker margins, team form, and many other factors. Sports betting is a very specific activity that largely exploits human psychology and hidden weaknesses, so in general you should not bet based on any forecasts found on the Internet.

The following dataset was used for the study: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017?resource=download

Tools: the R programming language and GPT 4.0 chat.

The objectives of the study:

1) Check the accuracy of a machine learning forecast trained on a database of match results since 1994.

2) Find out the size of the win or loss at the bookmaker's office when using this approach.

Limitations of the study: important factors such as the quality and market value of players, the current form of teams, home advantage for the German national team as tournament host, and much more are not taken into account here.

Bookmakers take all of this into account when setting match odds, and they also build a margin of 10-15 percent into those odds, so it is impossible to win simply by backing favorites.
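To make the margin concrete, here is a minimal sketch in R of how it can be read off a set of decimal odds. The odds are made-up illustrative numbers, not real quotes, and `bookmaker_margin` is a hypothetical helper name introduced only for this example.

```r
# Made-up decimal odds for one match (illustrative only, not real quotes)
odds <- c(home = 1.75, draw = 3.30, away = 4.00)

# Implied probability of each outcome is the reciprocal of its decimal odds
implied <- 1 / odds

# Hypothetical helper: the margin (overround) is how far the implied
# probabilities sum above 1
bookmaker_margin <- function(odds) sum(1 / odds) - 1
margin <- bookmaker_margin(odds)

# Margin-free probabilities via simple proportional normalisation
fair <- implied / sum(implied)

print(round(implied, 3))
print(round(margin, 3))
print(round(fair, 3))
```

With these made-up odds the margin comes out at roughly 12 percent, so a forecast has to be noticeably better than the odds imply just to break even.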

Personally, I am more interested in whether the machine can detect non-obvious patterns and beat the bookmaker than in simply identifying the favorite.

Methodology

First, the dataset was processed. It includes the results of more than 47,000 matches over 152 years, including various African qualifiers, which are not of interest here and would slow down data processing. The dataset was therefore reduced to the results of the Euro, its qualifiers, and the Nations League.

*I did not include the World Cup qualifiers: although the teams are the same, it is a different tournament with a slightly different format.*

The starting point was Euro 1996 and, consequently, its qualifying campaign, which began in 1994. This choice reflects the change in the tournament format and the breakup of the socialist bloc countries, which increased the number of participating countries.

Thus, we get approximately the same composition of participants and results over roughly the past 30 years. The final dataset consisted of 2,758 matches.
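The filtering step described above can be sketched roughly as follows. This is an illustration, not the exact code used for the study: the column names, the tournament labels, and the 1994-01-01 cutoff are assumptions about the Kaggle file, and the toy rows use placeholder teams and scores.

```r
# Toy stand-in for the Kaggle results file. Column names and tournament
# labels are assumptions about the dataset; teams and scores are placeholders.
results <- data.frame(
  date = c("1993-06-02", "1994-09-07", "2008-06-29", "2010-06-11", "2022-09-23"),
  home_team = c("Team A", "Team B", "Team C", "Team D", "Team E"),
  away_team = c("Team F", "Team G", "Team H", "Team I", "Team J"),
  home_score = c(1, 0, 2, 1, 3),
  away_score = c(1, 2, 0, 1, 1),
  tournament = c("UEFA Euro qualification", "UEFA Euro qualification",
                 "UEFA Euro", "FIFA World Cup", "UEFA Nations League"),
  stringsAsFactors = FALSE
)
results$date <- as.Date(results$date, format = "%Y-%m-%d")

# Keep only the Euro, its qualifiers, and the Nations League (assumed labels)
keep_tournaments <- c("UEFA Euro", "UEFA Euro qualification", "UEFA Nations League")
start_date <- as.Date("1994-01-01")  # assumed cutoff for "starting in 1994"

filtered <- results[results$tournament %in% keep_tournaments &
                      results$date >= start_date, ]

# On the real data the result would be saved for the modelling step, e.g.:
# write.csv(filtered, "filtered_results.csv", row.names = FALSE)
print(filtered[, c("date", "tournament")])
```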

Next, using the GPT chat, I went through several machine learning options in Python (using pandas, numpy, train_test_split, GridSearchCV, RandomForestClassifier, accuracy_score).

The best result was a forecast accuracy of 53.51%.

The accuracy was then improved by switching to R, where the best result was a forecast accuracy of 57.65%.

A very good percentage, considering that each match has three possible outcomes (home win, draw, away win), so random guessing would be right only about a third of the time. Since R gave the higher accuracy, we will use it for forecasting.
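When judging an accuracy figure for a three-outcome problem, two simple baselines are useful: uniform guessing and always predicting the most common outcome. The sketch below shows how they could be computed. The `outcomes` vector is a made-up placeholder, not the study's data, so the numbers it produces say nothing about the real dataset.

```r
# Made-up outcome labels (1 = home win, 0 = draw, -1 = away win); placeholder data
outcomes <- factor(c(1, 1, 0, -1, 1, -1, 1, 0, 1, -1), levels = c(-1, 0, 1))

# Share of each outcome in the sample
class_share <- prop.table(table(outcomes))

# Baseline 1: guess uniformly at random among the three outcomes
uniform_baseline <- 1 / nlevels(outcomes)

# Baseline 2: always predict the most frequent outcome
majority_class <- names(class_share)[which.max(class_share)]
majority_baseline <- max(class_share)

print(class_share)
print(uniform_baseline)
print(majority_class)
print(majority_baseline)
```

On real data, a model is only interesting if its test accuracy clearly exceeds the majority baseline, which is usually the stricter of the two.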

> library(randomForest)
> library(dplyr)
> 
> # Load the data
> data <- read.csv("filtered_results.csv")
> 
> # Convert the date column to Date format
> data$date <- as.Date(data$date, format="%Y-%m-%d")
> 
> # Create the target variable (1 = home win, 0 = draw, -1 = away win)
> data$result <- ifelse(data$home_score > data$away_score, 1, 
+                       ifelse(data$home_score < data$away_score, -1, 0))
> 
> # Reshape the data into a single per-team format
> home_games <- data %>%
+     select(team = home_team, opponent = away_team, score = home_score, opponent_score = away_score, result)
> 
> away_games <- data %>%
+     select(team = away_team, opponent = home_team, score = away_score, opponent_score = home_score, result) %>%
+     mutate(result = ifelse(result == 1, -1, ifelse(result == -1, 1, 0)))
> 
> all_games <- bind_rows(home_games, away_games)
> 
> # Build per-team features
> team_stats <- all_games %>%
+     group_by(team) %>%
+     summarise(total_games = n(),
+               total_win_rate = mean(result == 1),
+               total_avg_score = mean(score))
> 
> # Attach the team statistics to each match
> data <- data %>%
+     left_join(team_stats, by = c("home_team" = "team")) %>%
+     rename(home_team_total_games = total_games,
+            home_team_total_win_rate = total_win_rate,
+            home_team_total_avg_score = total_avg_score) %>%
+     left_join(team_stats, by = c("away_team" = "team")) %>%
+     rename(away_team_total_games = total_games,
+            away_team_total_win_rate = total_win_rate,
+            away_team_total_avg_score = total_avg_score)
> 
> # Replace NA values with 0
> data[is.na(data)] <- 0
> 
> # Select the features and the target for the model
> features <- c("home_team_total_win_rate", "away_team_total_win_rate", 
+               "home_team_total_games", "away_team_total_games", 
+               "home_team_total_avg_score", "away_team_total_avg_score")
> X <- data[features]
> y <- factor(data$result)
> 
> # Split the data into training and test sets
> set.seed(42)
> train_indices <- sample(seq_len(nrow(data)), size = 0.8 * nrow(data))
> X_train <- X[train_indices, ]
> y_train <- y[train_indices]
> X_test <- X[-train_indices, ]
> y_test <- y[-train_indices]
> 
> # Train the Random Forest model
> rf_model <- randomForest(X_train, y_train, ntree=200, mtry=3, importance=TRUE)
> 
> # Predict on the test set
> y_pred <- predict(rf_model, X_test)
> accuracy <- sum(y_pred == y_test) / length(y_test)
> print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.576576576576577"
> 
> # New matches to predict (Euro 2024 group stage fixtures)
> new_matches <- data.frame(
+     home_team = c("Germany", "Hungary", "Spain", "Italy", "Poland", "Slovenia", "Serbia", "Romania", "Belgium", "Austria", 
+                   "Turkey", "Portugal", "Croatia", "Germany", "Scotland", "Slovenia", "Denmark", "Spain", "Slovakia", 
+                   "Poland", "Netherlands", "Georgia", "Turkey", "Belgium", "Switzerland", "Scotland", "Albania", "Croatia", 
+                   "Netherlands", "France", "England", "Denmark", "Slovakia", "Ukraine", "Georgia", "Czech Republic"),
+     away_team = c("Scotland", "Switzerland", "Croatia", "Albania", "Netherlands", "Denmark", "England", "Ukraine", "Slovakia", 
+                   "France", "Georgia", "Czech Republic", "Albania", "Hungary", "Switzerland", "Serbia", "England", "Italy", 
+                   "Ukraine", "Austria", "France", "Czech Republic", "Portugal", "Romania", "Germany", "Hungary", "Spain", 
+                   "Italy", "Austria", "Poland", "Slovenia", "Serbia", "Romania", "Belgium", "Portugal", "Turkey")
+ )
> 
> # Compute the features for the new matches
> new_matches <- new_matches %>%
+     left_join(team_stats, by = c("home_team" = "team")) %>%
+     rename(home_team_total_win_rate = total_win_rate,
+            home_team_total_games = total_games,
+            home_team_total_avg_score = total_avg_score) %>%
+     left_join(team_stats, by = c("away_team" = "team")) %>%
+     rename(away_team_total_win_rate = total_win_rate,
+            away_team_total_games = total_games,
+            away_team_total_avg_score = total_avg_score)
> 
> # Replace NA values with 0
> new_matches[is.na(new_matches)] <- 0
> 
> # Predict the results of the new matches
> predictions <- predict(rf_model, new_matches[features])
> results <- ifelse(predictions == 1, "Home Win", ifelse(predictions == 0, "Draw", "Away Win"))
> 
> # Print the results
> for (i in seq_len(nrow(new_matches))) {
+     print(paste(new_matches$home_team[i], "vs", new_matches$away_team[i], "-> Prediction:", results[i]))
+ }

Group stage results:

1. Germany vs Scotland -> Prediction: Home Win

2. Hungary vs Switzerland -> Prediction: Home Win

3. Spain vs Croatia -> Prediction: Home Win

4. Italy vs Albania -> Prediction: Home Win

5. Poland vs Netherlands -> Prediction: Away Win

6. Slovenia vs Denmark -> Prediction: Draw

7. Serbia vs England -> Prediction: Draw

8. Romania vs Ukraine -> Prediction: Home Win

9. Belgium vs Slovakia -> Prediction: Home Win

10. Austria vs France -> Prediction: Away Win

11. Turkey vs Georgia -> Prediction: Home Win

12. Portugal vs Czech Republic -> Prediction: Home Win

13. Croatia vs Albania -> Prediction: Home Win

14. Germany vs Hungary -> Prediction: Home Win

15. Scotland vs Switzerland -> Prediction: Home Win

16. Slovenia vs Serbia -> Prediction: Home Win

17. Denmark vs England -> Prediction: Draw

18. Spain vs Italy -> Prediction: Home Win

19. Slovakia vs Ukraine -> Prediction: Home Win

20. Poland vs Austria -> Prediction: Home Win

21. Netherlands vs France -> Prediction: Away Win

22. Georgia vs Czech Republic -> Prediction: Away Win

23. Turkey vs Portugal -> Prediction: Away Win

24. Belgium vs Romania -> Prediction: Draw

25. Switzerland vs Germany -> Prediction: Away Win

26. Scotland vs Hungary -> Prediction: Home Win

27. Albania vs Spain -> Prediction: Away Win

28. Croatia vs Italy -> Prediction: Draw

29. Netherlands vs Austria -> Prediction: Home Win

30. France vs Poland -> Prediction: Home Win

31. England vs Slovenia -> Prediction: Home Win

32. Denmark vs Serbia -> Prediction: Home Win

33. Slovakia vs Romania -> Prediction: Away Win

34. Ukraine vs Belgium -> Prediction: Home Win

35. Georgia vs Portugal -> Prediction: Away Win

36. Czech Republic vs Turkey -> Prediction: Home Win

Let's see how the 1/8 final playoff bracket is formed based on the predicted match results.

> library(randomForest)
> library(dplyr)
> 
> # Load the data
> data <- read.csv("filtered_results.csv")
> 
> # Convert the date column to Date format
> data$date <- as.Date(data$date, format="%Y-%m-%d")
> 
> # Create the target variable (1 = home win, 0 = draw, -1 = away win)
> data$result <- ifelse(data$home_score > data$away_score, 1, 
+                       ifelse(data$home_score < data$away_score, -1, 0))
> 
> # Reshape the data into a single per-team format
> home_games <- data %>%
+     select(team = home_team, opponent = away_team, score = home_score, opponent_score = away_score, result)
> 
> away_games <- data %>%
+     select(team = away_team, opponent = home_team, score = away_score, opponent_score = home_score, result) %>%
+     mutate(result = ifelse(result == 1, -1, ifelse(result == -1, 1, 0)))
> 
> all_games <- bind_rows(home_games, away_games)
> 
> # Build per-team features
> team_stats <- all_games %>%
+     group_by(team) %>%
+     summarise(total_games = n(),
+               total_win_rate = mean(result == 1),
+               total_avg_score = mean(score))
> 
> # Attach the team statistics to each match
> data <- data %>%
+     left_join(team_stats, by = c("home_team" = "team")) %>%
+     rename(home_team_total_games = total_games,
+            home_team_total_win_rate = total_win_rate,
+            home_team_total_avg_score = total_avg_score) %>%
+     left_join(team_stats, by = c("away_team" = "team")) %>%
+     rename(away_team_total_games = total_games,
+            away_team_total_win_rate = total_win_rate,
+            away_team_total_avg_score = total_avg_score)
> 
> # Replace NA values with 0
> data[is.na(data)] <- 0
> 
> # Select the features and the target for the model
> features <- c("home_team_total_win_rate", "away_team_total_win_rate", 
+               "home_team_total_games", "away_team_total_games", 
+               "home_team_total_avg_score", "away_team_total_avg_score")
> X <- data[features]
> y <- factor(data$result)
> 
> # Split the data into training and test sets
> set.seed(42)
> train_indices <- sample(seq_len(nrow(data)), size = 0.8 * nrow(data))
> X_train <- X[train_indices, ]
> y_train <- y[train_indices]
> X_test <- X[-train_indices, ]
> y_test <- y[-train_indices]
> 
> # Train the Random Forest model
> rf_model <- randomForest(X_train, y_train, ntree=200, mtry=3, importance=TRUE)
> 
> # Predict on the test set
> y_pred <- predict(rf_model, X_test)
> accuracy <- sum(y_pred == y_test) / length(y_test)
> print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.576576576576577"
> 
> # Group stage
> group_stage_matches <- data.frame(
+     home_team = c("Germany", "Hungary", "Spain", "Italy", "Poland", "Slovenia", "Serbia", "Romania", "Belgium", "Austria", 
+                   "Turkey", "Portugal", "Croatia", "Germany", "Scotland", "Slovenia", "Denmark", "Spain", "Slovakia", 
+                   "Poland", "Netherlands", "Georgia", "Turkey", "Belgium", "Switzerland", "Scotland", "Albania", "Croatia", 
+                   "Netherlands", "France", "England", "Denmark", "Slovakia", "Ukraine", "Georgia", "Czech Republic"),
+     away_team = c("Scotland", "Switzerland", "Croatia", "Albania", "Netherlands", "Denmark", "England", "Ukraine", "Slovakia", 
+                   "France", "Georgia", "Czech Republic", "Albania", "Hungary", "Switzerland", "Serbia", "England", "Italy", 
+                   "Ukraine", "Austria", "France", "Czech Republic", "Portugal", "Romania", "Germany", "Hungary", "Spain", 
+                   "Italy", "Austria", "Poland", "Slovenia", "Serbia", "Romania", "Belgium", "Portugal", "Turkey")
+ )
> 
> # Compute the features for the group stage
> group_stage_matches <- group_stage_matches %>%
+     left_join(team_stats, by = c("home_team" = "team")) %>%
+     rename(home_team_total_win_rate = total_win_rate,
+            home_team_total_games = total_games,
+            home_team_total_avg_score = total_avg_score) %>%
+     left_join(team_stats, by = c("away_team" = "team")) %>%
+     rename(away_team_total_win_rate = total_win_rate,
+            away_team_total_games = total_games,
+            away_team_total_avg_score = total_avg_score)
> 
> # Replace NA values with 0
> group_stage_matches[is.na(group_stage_matches)] <- 0
> 
> # Predict the group stage results
> predictions <- predict(rf_model, group_stage_matches[features])
> results <- ifelse(predictions == 1, "Home Win", ifelse(predictions == 0, "Draw", "Away Win"))
> 
> # Store the predictions and award points (3 for a win, 1 for a draw)
> group_stage_matches <- group_stage_matches %>%
+     mutate(result = results,
+            home_points = ifelse(result == "Home Win", 3, ifelse(result == "Draw", 1, 0)),
+            away_points = ifelse(result == "Away Win", 3, ifelse(result == "Draw", 1, 0)))
> 
> # Build the points table
> group_points <- group_stage_matches %>%
+     select(home_team, home_points) %>%
+     rename(team = home_team, points = home_points) %>%
+     bind_rows(group_stage_matches %>%
+                   select(away_team, away_points) %>%
+                   rename(team = away_team, points = away_points)) %>%
+     group_by(team) %>%
+     summarise(total_points = sum(points)) %>%
+     arrange(desc(total_points))
> 
> # Print the team points
> print(group_points)
# A tibble: 24 × 2
   team           total_points
   <chr>                 <dbl>
 1 France                    9
 2 Germany                   9
 3 Portugal                  9
 4 Spain                     9
 5 Romania                   7
 6 Czech Republic            6
 7 Netherlands               6
 8 Scotland                  6
 9 Denmark                   5
10 England                   5
# ℹ 14 more rows
# ℹ Use `print(n = ...)` to see more rows
> 
> # Determine the teams that advance to the playoffs
> groups <- list(
+     A = c("Germany", "Scotland", "Hungary", "Switzerland"),
+     B = c("Spain", "Croatia", "Italy", "Albania"),
+     C = c("Slovenia", "Denmark", "Serbia", "England"),
+     D = c("Poland", "Netherlands", "Austria", "France"),
+     E = c("Belgium", "Slovakia", "Romania", "Ukraine"),
+     F = c("Turkey", "Georgia", "Portugal", "Czech Republic")
+ )
> 
> playoff_teams <- list()
> third_place_teams <- list()
> 
> for (group in names(groups)) {
+     group_teams <- groups[[group]]
+     group_points_filtered <- group_points %>% filter(team %in% group_teams)
+     playoff_teams[[group]] <- group_points_filtered$team[1:2]
+     third_place_teams[[group]] <- group_points_filtered$team[3]
+ }
> 
> # Determine the best third-placed teams
> third_place_teams_points <- group_points %>% filter(team %in% unlist(third_place_teams))
> best_third_place_teams <- third_place_teams_points %>% arrange(desc(total_points)) %>% head(4) %>% pull(team)
> 
> # Fill in the playoff match schedule
> playoff_schedule <- data.frame(
+     match = c("Match No. 38", "Match No. 37", "Match No. 40", "Match No. 39", "Match No. 42", "Match No. 41", "Match No. 43", "Match No. 44"),
+     home_team = c(playoff_teams$A[2], playoff_teams$A[1], playoff_teams$C[1], playoff_teams$B[1], playoff_teams$D[2], playoff_teams$F[1], playoff_teams$E[1], playoff_teams$D[1]),
+     away_team = c(playoff_teams$B[2], playoff_teams$C[2], best_third_place_teams[1], best_third_place_teams[2], playoff_teams$E[2], best_third_place_teams[3], best_third_place_teams[4], playoff_teams$F[2])
+ )
> 
> print(playoff_schedule)

1/8 playoffs:

Match No. 38: Scotland vs Croatia

Match No. 37: Germany vs England

Match No. 40: Denmark vs Italy

Match No. 39: Spain vs Slovenia

Match No. 42: Netherlands vs Belgium

Match No. 41: Portugal vs Hungary

Match No. 43: Romania vs Poland

Match No. 44: France vs Czech Republic

This completes the first stage of the study.

At the second stage, I will summarize the interim results and give a forecast for the playoffs, taking into account the actual pairings formed in the 1/8 finals.

At the third stage, I will summarize the overall results.

Evaluation of the research results:

1) We will count how many results were predicted correctly and compare that percentage with 57.65%. This checks how well the accuracy the model measured on its test set holds up in the real tournament.

2) We will look at the virtual bank after the tournament and check whether the machine managed to beat the bookmaker.
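For the first check, a small sample such as 36 group-stage matches can land quite far from the test-set accuracy purely by chance. One hedged way to account for that is a binomial confidence interval around the observed hit rate. In the sketch below, `n_correct` is a hypothetical placeholder, not an actual tournament result.

```r
# Number of group-stage matches predicted above
n_matches <- 36

# Hypothetical placeholder for the number of correct predictions (not a real result)
n_correct <- 18

# Accuracy the model reported on its own test set
test_accuracy <- 0.5765

# Observed hit rate and an exact binomial 95% confidence interval (stats::binom.test)
observed <- n_correct / n_matches
check <- binom.test(n_correct, n_matches, p = test_accuracy)
ci <- as.numeric(check$conf.int)

print(observed)
print(round(ci, 3))

# TRUE if the test-set accuracy is compatible with the observed hit rate
print(test_accuracy >= ci[1] && test_accuracy <= ci[2])
```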

Virtual bank

To find out whether such a strategy brings a profit or a loss at the bookmaker's office, we will create a virtual bank of $5,300. The tournament has 51 matches, and each gets a notional bet of $100 on the machine's prediction. In addition, we will place two $100 bets on the champion: one before the start of the tournament and one after the end of the group stage. That makes 53 bets of $100 in total.
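The bank arithmetic and the way flat-stake bets would be settled can be sketched as follows. `settle_bets` is a hypothetical helper, and the odds and outcomes in the example are made up for illustration only.

```r
stake <- 100
n_match_bets <- 51     # one bet per tournament match
n_champion_bets <- 2   # before the tournament and after the group stage

# Total bank needed to cover every bet at a flat stake
bank <- stake * (n_match_bets + n_champion_bets)

# Hypothetical helper: net profit of flat-stake bets given decimal odds
# and a logical vector saying which bets won
settle_bets <- function(odds, won, stake = 100) {
  sum(ifelse(won, stake * (odds - 1), -stake))
}

# Made-up example: three bets, the first and third won
example_odds <- c(1.80, 3.40, 2.10)
example_won <- c(TRUE, FALSE, TRUE)
profit <- settle_bets(example_odds, example_won, stake)

print(bank)
print(profit)
```

A winning bet returns the stake times the odds minus one as profit, and a losing bet costs the full stake, so the final bank is simply the starting bank plus the sum over all 53 bets.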

I will take the average odds from https://www.flashscore.com.ua/ so as not to advertise any specific bookmaker.

And the champion of Euro 2024, according to the machine, will be Spain.
