Part 3. Recognizing the time on videos of Dota 2 matches using transformers

In this series of articles, we are building a system that automatically finds highlights in Dota 2 matches. To create it, we need a labeled dataset with timestamps. Many YouTube channels post clips of interesting moments from professional Dota 2 matches. These videos often show the small clock from the game interface, and we will recognize the time on it.

The clock in the interface of the Dota 2 game is framed by a red frame

In previous parts:

  1. In the first part, we parsed a replay of one Dota 2 match and found highlights using clustering.

  2. In the second part, we wrote a service that parses replays in parallel using Celery and Flask.

In this part:

  1. Downloading videos from YouTube with yt-dlp.

  2. Sampling frames from the video with ffmpeg.

  3. Cropping the image with OpenCV.

  4. A brief overview of Tesseract, EasyOCR and TrOCR.

  5. Recognizing the time.

  6. Conclusion.

  7. Links to the code and the materials used, at the end of the article.

Downloading videos from YouTube

I used yt-dlp, a direct successor of youtube-dl that works around the slow download problem.

Let’s take the DotA Digest channel as an example and download links to all videos.

yt-dlp \
    --print-to-file "%(url)s" youtube/urls.txt \
    <channel_url>

The --download-sections option lets you download only part of a video.

yt-dlp \
    --format mp4 \
    --download-sections "*00:30-00:45" \
    <video_url>
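When several moments have to be cut, it can be handy to generate the --download-sections value from plain numbers. The sketch below is only an illustration: format_hms, section_spec and build_command are hypothetical helper names, not part of yt-dlp. It produces the same "*00:30-00:45" style value as above and assembles an argument list without running anything.

```python
from typing import List


def format_hms(seconds: int) -> str:
    """Format a non-negative number of seconds as MM:SS or HH:MM:SS."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def section_spec(start: int, end: int) -> str:
    """Build a time-range value for --download-sections."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return f"*{format_hms(start)}-{format_hms(end)}"


def build_command(url: str, start: int, end: int) -> List[str]:
    """Assemble the yt-dlp argument list (nothing is executed here)."""
    return [
        "yt-dlp",
        "--format", "mp4",
        "--download-sections", section_spec(start, end),
        url,
    ]
```

The resulting list can be passed to subprocess.run as is, which avoids shell quoting issues with the asterisk.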

You can also call the utility directly from Python. Let’s write a function that fetches the metadata, downloads the video itself, and then saves the metadata to a JSON file.

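import json
from pathlib import Path

import yt_dlp

VIDEO_DIR = Path('youtube/videos')  # assumed location, adjust to your layout

def youtube_download(url):
    # The original option values were lost; format and outtmpl are assumptions.
    options = dict(
        format='mp4',
        outtmpl=str(VIDEO_DIR / '%(id)s.%(ext)s'),
    )
    with yt_dlp.YoutubeDL(options) as ydl:
        # sanitize_info makes the metadata JSON-serializable
        info = ydl.sanitize_info(ydl.extract_info(url, download=False))
    video_id = info.get('id')
    info_file = f'{video_id}.json'
    video_info_path = VIDEO_DIR / info_file
    video_file = f'{video_id}.mp4'
    video_path = VIDEO_DIR / video_file
    # Skip videos that have already been downloaded
    if video_path.exists():
        print(f'Video already exists: {video_path}')
        return video_id
    print(f'Downloading: {url}')
    with yt_dlp.YoutubeDL(options) as ydl:
        ydl.download([url])

    with open(video_info_path, 'w') as fout:
        json.dump(info, fout, indent=4)
    return video_id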
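To feed the collected links into youtube_download, the youtube/urls.txt file has to be read back. Below is a minimal sketch, assuming one URL per line; unique_urls and read_urls are hypothetical helpers, not part of yt-dlp. Duplicates are dropped because --print-to-file appends to the file, so repeated runs can write the same link more than once.

```python
from pathlib import Path
from typing import Iterable, List


def unique_urls(lines: Iterable[str]) -> List[str]:
    """Return non-empty, de-duplicated URLs in their original order."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def read_urls(path) -> List[str]:
    """Read a text file with one URL per line."""
    text = Path(path).read_text(encoding="utf-8")
    return unique_urls(text.splitlines())
```

With these helpers, the download loop is simply: for url in read_urls('youtube/urls.txt'): youtube_download(url).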

Sampling frames from a video

Video usually runs at 24 to 60 FPS (frames per second), while the in-game clock updates once per second. To avoid processing redundant images, we will take one frame per second of video. The ffmpeg utility can do this:

ffmpeg -i <video_path> -r 1/1 frames/$filename%03d.bmp

There is a Python wrapper for ffmpeg, but I preferred to write my own quick version.

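import os
import subprocess
from pathlib import Path

VIDEO_DIR = Path('youtube/videos')  # assumed location, adjust to your layout
FRAMES_DIR = Path('youtube/frames')

def sample_frames(video_id):
    video_path = VIDEO_DIR / f'{video_id}.mp4'
    output_prefix = FRAMES_DIR / video_id
    # Skip videos whose frames have already been extracted
    for file in os.listdir(FRAMES_DIR):
        if file.startswith(video_id):
            print(f'Frames already exist: {video_id}')
            return
    print(f'Sampling frames from: {video_id}')
    # Output pattern reconstructed from the frame names used later in the article
    cmd = f'''ffmpeg \
        -i {str(video_path)} \
        -r 1/1 \
        {output_prefix}__%03d.bmp
    '''
    subprocess.run(cmd, shell=True, check=True)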
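Since one frame is saved per second, the number in a frame's file name tells roughly where in the video that frame came from. The sketch below assumes file names of the form <video_id>__<index>.bmp (as in youtube/frames/ukbICbM4RR0__033.bmp, used later in the article) and that ffmpeg numbers frames starting from 1. Both helper names are hypothetical, and the mapping is approximate rather than frame-exact.

```python
from pathlib import Path
from typing import Tuple


def parse_frame_name(frame_path) -> Tuple[str, int]:
    """Split '<video_id>__<index>.bmp' into (video_id, index)."""
    stem = Path(frame_path).stem
    # Split on the LAST double underscore: YouTube ids may contain '_' and '-'
    video_id, _, index = stem.rpartition("__")
    if not video_id or not index.isdigit():
        raise ValueError(f"Unexpected frame name: {frame_path}")
    return video_id, int(index)


def frame_to_video_second(index: int, fps: float = 1.0) -> float:
    """Approximate video position (in seconds) of a 1-based frame index."""
    return (index - 1) / fps
```

Pairing this video position with the clock value recognized on the same frame gives the time codes needed for the dataset.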

Cropping an image with OpenCV

In optical character recognition (OCR) tasks, you often have to detect the region of the image that contains text first, and only then recognize it. We are lucky: the clock in the game interface is a static object. Its position may shift slightly depending on the screen resolution, but we will ignore that here.

First, let’s load one frame with OpenCV:

import cv2
import matplotlib.pyplot as plt

image = cv2.imread(str(frame_path))
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

Then display the result:

fig = plt.figure(figsize=(10, 10))
plt.imshow(image)
plt.show()
Frame from video

The image object is a NumPy array, which lets you crop and scale the image in just three lines.

bbox = (16, 23, 619, 653)
crop = image[bbox[0]:bbox[1], bbox[2]:bbox[3]]
crop = cv2.resize(crop, dsize=None, fx=16, fy=16, interpolation=cv2.INTER_CUBIC)
Enlarged clock area

The bounding box coordinates (top, bottom, left, right) were chosen by trial and error. A small margin is left on the left side because a match sometimes runs past the 100-minute mark, in which case the clock shows 100:00.
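The article ignores resolution differences, but if they become a problem, one simple option is to rescale the box proportionally. This is only a sketch under a strong assumption, namely that the interface scales linearly with the frame size; scale_bbox is a hypothetical helper, and the reference size in the usage note is a guess, not something stated in the article.

```python
from typing import Tuple

BBox = Tuple[int, int, int, int]  # (top, bottom, left, right) in pixels
Size = Tuple[int, int]            # (height, width) in pixels


def scale_bbox(bbox: BBox, ref_size: Size, new_size: Size) -> BBox:
    """Rescale a box defined on ref_size frames to new_size frames."""
    top, bottom, left, right = bbox
    ref_h, ref_w = ref_size
    new_h, new_w = new_size
    sy = new_h / ref_h  # vertical scale factor
    sx = new_w / ref_w  # horizontal scale factor
    return (
        round(top * sy),
        round(bottom * sy),
        round(left * sx),
        round(right * sx),
    )
```

For example, if the box (16, 23, 619, 653) was picked on 720x1280 frames (an assumption), then scale_bbox(bbox, (720, 1280), (1440, 2560)) gives the matching box for frames twice as large.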

Brief overview of Tesseract, EasyOCR and TrOCR

A reader can make out the time in the previous picture with the naked eye: 3:14. Optical character recognition is also a popular task that comes up when working with documents or, for example, car license plates. This suggests that there must be an open-source model that works well out of the box.


Tesseract

The first thing a Google search turned up was Google's own project, which, according to Wikipedia, has a rather fascinating history.

Tesseract is a free OCR computer program developed by Hewlett-Packard from the mid-1980s to the mid-1990s and then shelved for 10 years. In August 2006, Google bought it and opened the source code under the Apache 2.0 license for further development.

Let’s execute the command.

tesseract \
	<file_path> \
	stdout \
	--oem 1 \
	--psm 7 \
	-c tessedit_char_whitelist=0123456789
  > 214

The result was 214. Not bad for a first attempt, but not good enough: on some similar images the text is not recognized at all. Simple OpenCV transformations did not help, and even if they had, I would not have been comfortable relying on them. The problem is most likely that Tesseract uses an LSTM model trained back in 2019.
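The same call can be made from Python when many frames need to be checked. This is a minimal sketch that only wraps the command shown above; tesseract_command and run_tesseract are hypothetical helper names, and run_tesseract assumes the tesseract binary is available on PATH. The comments note what the two mode flags select.

```python
import subprocess
from typing import List


def tesseract_command(image_path: str, whitelist: str = "0123456789") -> List[str]:
    """Build the argument list for the tesseract CLI."""
    return [
        "tesseract", str(image_path), "stdout",
        "--oem", "1",  # OCR engine mode 1: LSTM engine only
        "--psm", "7",  # page segmentation mode 7: a single text line
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_tesseract(image_path: str) -> str:
    """Run tesseract and return the recognized text without whitespace."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Because the whitelist contains digits only, the separator is dropped from the output, which is why the reading comes back as a bare number.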


EasyOCR

The next thing I tried was EasyOCR, which consists of a Detector and a Recognizer. The first is responsible for finding text in the picture, the second for recognizing it. To begin with, I boldly fed it the entire frame, because out of the box the library returns not only the recognized characters but also their bounding boxes.

import easyocr

image = cv2.imread('youtube/frames/ukbICbM4RR0__033.bmp')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

reader = easyocr.Reader(['en'])
result = reader.readtext(image)

for (bbox, text, prob) in result: 
    (tl, tr, br, bl) = bbox
    tl = (int(tl[0]), int(tl[1]))
    tr = (int(tr[0]), int(tr[1]))
    br = (int(br[0]), int(br[1]))
    bl = (int(bl[0]), int(bl[1]))
    cv2.rectangle(image, tl, br, (0, 255, 0), 2)
    cv2.putText(image, text, (tl[0], tl[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 0, 0), 2)
plt.rcParams['figure.figsize'] = (16, 16)
plt.imshow(image)
plt.show()
Demonstration of EasyOCR

Indeed, many strings were detected and recognized. But the model missed the clock we were interested in. Moreover, at the bottom of the picture you can see that the character name DAWNBREAKER is not fully highlighted and is recognized as AWNBRIAKER.

I then gave the model a cropped image as input, but at first it did not recognize anything. You can get around the problem by leaving some space around the text.

The catch is that the colon was recognized as an 8, so the model output 3814. Another miss. On other frames, the model completely ignored the characters to the left of the colon.
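The "leave some space around the text" workaround mentioned above can be sketched as a small helper that widens the crop box by a fixed margin without stepping outside the frame. expand_bbox is a hypothetical name, and the margin value is something to tune by eye; the article does not state which margin was used.

```python
from typing import Tuple

BBox = Tuple[int, int, int, int]  # (top, bottom, left, right) in pixels


def expand_bbox(bbox: BBox, margin: int, image_height: int, image_width: int) -> BBox:
    """Grow a box by `margin` pixels on every side, clamped to the image."""
    top, bottom, left, right = bbox
    return (
        max(0, top - margin),
        min(image_height, bottom + margin),
        max(0, left - margin),
        min(image_width, right + margin),
    )
```

The result can be used for slicing in the same way as the original box: image[top:bottom, left:right].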

The repository has a diagram of the algorithm.

The key components are ResNet and LSTM, the same kind of architecture from the “pre-transformer” era.

The repository also has a script for training the model, but I am not strong in computer vision and was too lazy to dig into it.


TrOCR

Finally, I stumbled upon TrOCR. This model cannot detect text in a picture; it can only recognize it. But that is enough for us. The architecture looks like this:

  1. Vision Transformer (ViT) as the image encoder.

  2. RoBERTa as a text decoder.

Below is the diagram from the paper. Read it starting from the bottom-right corner.

TrOCR scheme

The paper explains how the decoder is initialized:

When loading the RoBERTa models to the decoders, the structures do not exactly match. For example, the encoder-decoder attention layers are absent in the RoBERTa models. To address this, we initialize the decoders with the RoBERTa models and the absent layers are randomly initialized.

Let’s execute the code.

from transformers import TrOCRProcessor, VisionEncoderDecoderModel

model_version = "microsoft/trocr-base-printed"
processor = TrOCRProcessor.from_pretrained(model_version)
model = VisionEncoderDecoderModel.from_pretrained(model_version)

pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

> '3.14'

Hurrah! The model recognized 3.14. In practice, the separator is recognized as a dot rather than a colon in every frame, and in some cases it is not recognized at all. Fortunately, the Dota 2 clock always displays time in the mm:ss format, so the problem can be solved by converting the text to a timestamp in seconds.

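def convert_to_timestamp(text):
    # A leading minus means the clock shows negative time
    sign = -1 if text.startswith('-') else 1
    # Keep digits only, so '3.14', '3:14' and '314' are treated the same
    digits = "".join([c for c in text if c.isdigit()])
    # The last two digits are seconds, everything before them is minutes
    seconds = int(digits[-2:])
    # Missing minutes (fewer than three digits) count as zero
    minutes = int(digits[:-2] or 0)
    timestamp = sign * (minutes * 60 + seconds)
    return timestamp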
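Because the format is always mm:ss, a reading can also be sanity-checked before it is trusted. The stricter variant below is an addition of mine, not part of the original code: parse_clock is a hypothetical name, and it returns None instead of a number when the text cannot be a valid clock value (fewer than three digits, or seconds above 59).

```python
from typing import Optional


def parse_clock(text: str) -> Optional[int]:
    """Return the clock reading in seconds, or None if it is implausible."""
    digits = "".join(c for c in text if c in "0123456789")
    if len(digits) < 3:
        # m:ss needs at least one minute digit and two second digits
        return None
    minutes = int(digits[:-2])
    seconds = int(digits[-2:])
    if seconds > 59:
        # Something like '3:74' cannot appear on a real clock
        return None
    sign = -1 if text.strip().startswith("-") else 1
    return sign * (minutes * 60 + seconds)
```

Frames for which parse_clock returns None can simply be skipped, since neighboring frames are only one second apart.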


Conclusion

We learned how to download highlight videos from YouTube, sample frames from them, and recognize the time shown on those frames. We also looked at several libraries for recognizing characters in images. In our example, transformers came out on top: they work out of the box, without additional training or complex transformations of the input image.

In the next part, I plan to use a sledgehammer to crack a nut: BERT will match video titles with records in the database, so that frames can be linked to events from the Dota 2 replays.

Everyone interested is welcome to comment.


Links

  1. My channel in Telegram

  2. Source code (youtube branch)

  3. yt-dlp

  4. Dota Digest

  5. ffmpeg

  6. OpenCV

  7. Tesseract

  8. EasyOCR

  9. TrOCR, article
