Classification of events on the football field

Hi all! The DFL – Bundesliga Data Shootout competition ended at the end of 2022. Since I am interested in football and sports analytics in general, I decided to take part. The purpose of this article is to describe my approach, and I am confident that many of the techniques applied to this problem can be adapted to other computer vision tasks. Read on for the details!

Purpose of the competition

From video recordings of a football match, you have to determine what kind of event is happening on the pitch. The expected result is a csv file with the following fields: video id, the moment in time when the event occurred, the type of event, and a confidence from 0 to 1 that the event actually happened. Events are divided into three groups:

  • A play is a player's attempt to transfer control of the ball to a teammate. It can take the form of a short pass, a cross (a strong low or high ball delivered from the flank into the penalty area), a corner or a free kick.

  • A Throw-In is the restart of play after the ball has crossed the sideline. The ball must be thrown in by hand.

  • A challenge is a situation where two players from opposing teams try to gain control of the ball at the same time. This event occurs when players try to take the ball from each other.

Data

The training dataset contained 12 videos, each 60 minutes long. Attached was a csv file with the following information:

  • video_id

  • time: the exact time to the millisecond when the event occurred

  • event: what kind of event is in the frame. In addition to Play, Challenge and Throw-In themselves, there were classes start and end, that is, in the csv document there was the following order: start – play – end; start – challenge – end and so on. The average time between start and end was approximately 0.7 seconds

  • event_attributes: Additional event attributes that I didn't use at all. For example ball_action_forced, opponent_dispossessed, pass, openplay, etc.

Class distribution:

More details about the training data can be found in this notebook.

The test dataset contained 32 videos of 30 seconds each. They were not labelled, and I did not use them at all, because these videos were recorded from a single stationary camera. In reality a match is recorded from several cameras, so I was concerned that the organizers' hidden test set would include recordings from real matches, which is a somewhat different setting. So I labelled part of a 35-minute football match that I downloaded from the Internet.

Metrics

Accuracy was measured with the average precision metric, which computes the average precision of detecting each event type within the permissible time tolerances for that category, and then averages it over all categories. The permissible time tolerances in seconds for each class are:

Challenge: [ 0.30, 0.40, 0.50, 0.60, 0.70 ]
Play: [ 0.15, 0.20, 0.25, 0.30, 0.35 ]
Throw-In: [ 0.15, 0.20, 0.25, 0.30, 0.35 ]

To make it clearer how the metric is calculated, the organizers provided a notebook that computes it. Let's look at a couple of examples:

import pandas as pd

# event_detection_ap and tolerances come from the organizers' metric notebook
df_true = pd.DataFrame({
    "video_id": ["video_1"] * 9,
    "time": [15, 16, 17] + [21, 22, 23] + [27, 28, 29],
    "event": ["start", "play", "end"] +
             ["start", "challenge", "end"] +
             ["start", "throwin", "end"]
})

df_pred = pd.DataFrame({
    "video_id": ["video_1"] * 3,
    "time": [16, 22, 28],
    "event": ["play", "challenge", "throwin"],
    "score": [1.0, 1.0, 1.0]
})

mean_ap = event_detection_ap(solution=df_true,
                             submission=df_pred,
                             tolerances=tolerances)

In this case, mean_ap = 1. I would like to clarify that tolerances are permissible time errors, which I wrote about just above. In the following examples, df_true will remain unchanged, only df_pred will change. Consider the following example:

df_pred = pd.DataFrame({
    "video_id": ["video_1"] * 3,
    "time": [16, 22.2, 28],
    "event": ["play", "challenge", "throwin"],
    "score": [1.0, 1.0, 1.0]
})

Here mean_ap remains equal to 1, since the tolerances for challenge are [0.30, 0.40, 0.50, 0.60, 0.70], but if the time is changed from 22.2 to 22.3, mean_ap drops to 0.933. And a last example, which shows how score affects the final accuracy:

df_pred = pd.DataFrame({
    "video_id": ["video_1"] * 4,
    "time": [16, 22.3, 22, 28],
    "event": ["play", "challenge", "challenge", "throwin"],
    "score": [1.0, 0.7, 1.0, 1.0]
})

In this situation, mean_ap is equal to 1, but if you swap the scores of the two challenge predictions, like this:

df_pred = pd.DataFrame({
    "video_id": ["video_1"] * 4,
    "time": [16, 22.3, 22, 28],
    "event": ["play", "challenge", "challenge", "throwin"],
    "score": [1.0, 1.0, 0.7, 1.0]
})

then the mean_ap value will change to 0.966. This is due to the following statement on the competition page:

Detections are matched to ground-truth events by class-specific error tolerances, with ambiguities resolved in order of decreasing confidence

In effect, this means that if two or more predictions correspond to the same true event, the prediction with the highest score is matched first. More examples can be found in this notebook.

Pipeline

My solution looks like this:

Since the input is video but I work with individual images, I first need to extract frames from the recordings. I considered two options: cv2.VideoCapture or ffmpeg. With ffmpeg I achieved significantly higher speed, so I went with it. Here is a notebook showing how I did it.
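
For reference, here is a minimal sketch of how frames can be dumped with ffmpeg through subprocess; the paths and the frame rate are placeholders, not the exact settings from my notebook.

import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: int = 25) -> None:
    """Dump frames from a video as JPEG files using ffmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",            # sample at the desired frame rate
        "-qscale:v", "2",               # high JPEG quality
        f"{out_dir}/frame_%06d.jpg",
    ], check=True)

# extract_frames("train/video_1.mp4", "frames/video_1", fps=25)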

The extracted frames were fed to a classification network that outputs two classes: event or no_event, i.e. whether some action (Play, Challenge or Throw-In) is happening on the field or not. Next, I decided not to rely on the absolute classification scores, but to look for relative peaks in them. There are 4 main options for finding these peaks:

Looking ahead, I'll say that in my case the option with a minimum distance between peaks worked best. The main goal of this stage is to identify candidate frames in which events might occur, so I do not expect the accuracy of this model to be very high. At the last stage I used a single network for both classification and event detection, and took the score from that network.
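
As an illustration, here is roughly what peak search with a minimum distance between peaks could look like using scipy.signal.find_peaks; the score file, the frame rate and the thresholds are placeholder assumptions.

import numpy as np
from scipy.signal import find_peaks

# per-frame "event" probabilities from the binary classifier (placeholder file)
event_scores = np.load("scores_video_1.npy")

fps = 25
# distance: require at least ~0.5 s between neighbouring peaks,
# height: ignore peaks whose score is too low
peaks, _ = find_peaks(event_scores, distance=int(0.5 * fps), height=0.3)

# convert peak frame indices back to timestamps in seconds
peak_times = peaks / fps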

Augmentations

In general, here is how I approached labelling the data for detection: I split the data into a folder per event and created n subfolders inside each folder. First I labelled the data for each event in the first subfolders, then trained a neural network on this data and ran the resulting network on the data from the second subfolders. I looked through the samples that had a low score or where the true class did not match the predicted class and corrected errors where necessary. I labelled the data iteratively in this way; this approach is known as Noisy Student. I also want to note that for labelling I used a network with more parameters than for inference, since the competition had a 9-hour inference limit. A rough sketch of one such labelling iteration is shown below.
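
A minimal sketch of one iteration, assuming a hypothetical predict() helper that returns a class and a score for an image:

from pathlib import Path

def select_for_review(model, subfolder: str, expected_class: str, threshold: float = 0.7):
    """Collect images from the next subfolder that need manual review:
    low-confidence predictions or predictions that disagree with the folder's class."""
    to_review = []
    for img_path in sorted(Path(subfolder).glob("*.jpg")):
        pred_class, score = model.predict(img_path)   # hypothetical inference helper
        if score < threshold or pred_class != expected_class:
            to_review.append(img_path)
    return to_review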

At some point I got tired of labelling and decided to move on to augmentations. In addition to standard augmentations such as blurring, rotating the image by a small angle, flipping it horizontally, etc., I came up with custom ones.
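
For the standard part, a minimal sketch with albumentations (the library choice is my assumption; any augmentation library would do):

import albumentations as A

# blur, small random rotation and horizontal flip, as mentioned above
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=10, p=0.5),
    A.Blur(blur_limit=3, p=0.3),
])

# augmented = transform(image=frame)["image"]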

The essence of the first augmentation is that we take images in which no events occur and seamlessly insert a crop containing an event from another picture. The algorithm is as follows:

The first step is to obtain images in which no event occurs. Finding such frames is not difficult, because we know the moments when events take place on the field. In order to blend one image into another smoothly, we need a mask of the object we are inserting (in our case the player(s) plus the ball). I did not plan to label more data for segmentation, so I looked towards something pre-trained and tried the YOLO and SAM models. On a few test photos YOLO did a better job, although I had initially bet on SAM. The results of YOLO and SAM on my examples can be seen in this and this notebook respectively.
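
For completeness, here is a rough sketch of how a person mask can be pulled out of a pre-trained YOLO segmentation model via the ultralytics API; the weights file name is an assumption and not necessarily the model I used.

import cv2
import numpy as np
from ultralytics import YOLO

seg_model = YOLO("yolov8x-seg.pt")   # pre-trained segmentation weights (assumption)

def person_mask(image):
    """Return a binary mask covering all detected persons in the image."""
    result = seg_model(image)[0]
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    if result.masks is None:
        return mask
    for box, m in zip(result.boxes, result.masks.data):
        if int(box.cls) == 0:                         # COCO class 0 = person
            m = m.cpu().numpy().astype(np.uint8)
            # masks come at the model's inference resolution; resize to the frame
            m = cv2.resize(m, (image.shape[1], image.shape[0]))
            mask |= m
    return mask * 255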

I inserted new players next to existing players, scaled to approximately the same height, in order to avoid Gullivers (players towering over everyone else):

To do this you need to detect the players on the field. Here I again used a pre-trained YOLO model, this time for detection. While segmentation occasionally struggled on some photos, the pre-trained detection model gave me no trouble at all 🙂 At the next stage we simply insert the player into the image. For this you can use this algorithm. Honestly, I was pleasantly surprised by how well it worked, but sometimes it was still clearly visible that the player had been pasted in, so I kept searching for similar algorithms and came across this solution, which works great! I decided to use the two solutions in conjunction, and this is the result I managed to achieve:
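
To make the insertion step concrete, here is a rough sketch of the height matching and blending, assuming we already have a player crop, its mask, and the bounding box of a nearby existing player; cv2.seamlessClone is used here as a stand-in for the blending algorithms linked above.

import cv2

def paste_player(frame, player_crop, crop_mask, anchor_box):
    """Insert a player crop next to an existing player at roughly the same height."""
    x1, y1, x2, y2 = anchor_box                      # box of an existing player from the detector
    target_h = y2 - y1                               # match the neighbour's height to avoid Gullivers
    scale = target_h / player_crop.shape[0]
    new_w = max(1, int(player_crop.shape[1] * scale))
    crop = cv2.resize(player_crop, (new_w, target_h))
    mask = cv2.resize(crop_mask, (new_w, target_h))

    # place the new player to the right of the existing one, feet on the same line
    center_x = min(frame.shape[1] - new_w // 2 - 1, x2 + new_w)
    center_y = y1 + target_h // 2
    return cv2.seamlessClone(crop, frame, mask, (center_x, center_y), cv2.NORMAL_CLONE)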

A fun side effect: sometimes two identical players end up next to each other in one image:

I would like to note that this approach is not quite right for throw-ins, since the throw is taken from the sideline, so you first need to find the sideline (or part of it) and insert the player next to it. Here I didn't overthink it, and the following algorithm was enough for me:

import cv2
import numpy as np

def get_side_line(image):
    # binarize the frame so that the bright pitch markings stand out
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, thresh_img = cv2.threshold(gray, 130, 200, cv2.THRESH_BINARY)

    # find line segments on the binarized image
    lines = cv2.HoughLinesP(thresh_img, 1, np.pi / 180, 10,
                            np.array([]), 300, 10)

    # keep only roughly horizontal segments (longer along x than along y)
    target_lines = [(x0, y0, x1, y1) for line in lines
                    for x0, y0, x1, y1 in [list(line[0])]
                    if x1 - x0 > abs(y1 - y0) and abs(y1 - y0) < 150]

    # the sideline is assumed to be the longest of the remaining segments
    max_line = max(target_lines, key=lambda x: x[2] - x[0])
    return max_line
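
And a quick usage example for debugging (the frame path is a placeholder): draw the detected line on the frame and save it.

image = cv2.imread("frames/video_1/frame_000123.jpg")   # placeholder path
x0, y0, x1, y1 = get_side_line(image)
cv2.line(image, (x0, y0), (x1, y1), (0, 0, 255), 3)      # draw the sideline in red
cv2.imwrite("side_line_debug.jpg", image)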

Here is the detected line:

And the final result:

The essence of the second augmentation is that you can take a small offset forward or backward in time (I took 0.2 seconds) from the main frame with the event:

On the left is the frame with the action; on the right, +0.2 seconds from the left frame.
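
A minimal sketch of grabbing such shifted frames with OpenCV (in practice I extracted frames with ffmpeg as described above, so treat this purely as an illustration):

import cv2

def frame_at(video_path: str, t_seconds: float):
    """Return the frame closest to the given timestamp, or None if it cannot be read."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.set(cv2.CAP_PROP_POS_FRAMES, round(t_seconds * fps))
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None

# extra training frames shifted by +/- 0.2 s around the labelled event time t
# extra = [frame_at("video_1.mp4", t + dt) for dt in (-0.2, 0.2)]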

Result

While participating in the competition, you can submit multiple notebooks, which are run on the organizers' test data. In the table below I briefly describe the chronology of what I did and what results it led to. To better understand which model had problems and where, I used my own test video.

The goal of the first model, from a metrics point of view, is to make recall as high as possible, i.e. to minimize the cases where a frame containing an event gets discarded. I evaluated the second model by TP, FP and FN counts and reduced them to precision and recall (see the short snippet after the list below). Let me briefly recall what these mean for this task:

  • true positive – when the action was classified correctly at the right moment

  • false negative – did not find the action when it actually happened

  • false positive – when an action was recognized where there was none or when, for example, challenge was confused with play
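
The reduction to precision and recall itself is a one-liner each (a sketch, with the counts as plain integers):

def precision_recall(tp: int, fp: int, fn: int):
    """Compute precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall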

Here, by the way, is a notebook showing how I calculated these metrics, and below is a table with the results:

Of course, I ran many more experiments, but these are the most significant ones. Unfortunately, I did not have the chance to take part while the competition was still running, but Late Submission was available. As you can see, my best score was 0.661, which would have placed me between 11th and 12th on the leaderboard. The datasets for classification (event or no_event) and for event detection + classification are available on Kaggle, in case they come in handy for someone 🙂

Problems

Of course, my solution itself is not perfect, but here are a few reasons why my final score could have been better:

  • There are shots where the team is warming up and the players are passing the ball, but this does not count as play:

  • during the labelling process I noticed that some events were labelled incorrectly, and sometimes they were not labelled at all when they should have been

It is possible that such situations were also present in the organizers' test data. In addition, there were situations where context was needed to determine what kind of action was happening on the field, i.e. a single frame was not enough to identify the event. For this reason the play and challenge classes were often confused.

In the second image below, it may seem that the player in the white shirt is making a pass/kick, although in fact it was a pass from one player in a yellow shirt to another, as frames 1 and 3 show, yet the network classified the second frame as play:

Conclusion

Most likely, by continuing to label data it would have been possible to squeeze out a couple more percent of accuracy, but I had no desire to keep labelling, and the organizers recently disabled late submissions. Here are a few other ideas you could try:

  • instead of splitting the data for the first network into event and no_event, split it into 4 classes: Play, Challenge, Throw-In and no_event;

  • use dedicated networks for action classification (search for "human action recognition").
