Video player with transcript support

  • while watching, a line was unclear and you want to re-read the subtitle, but it has already disappeared, and you don’t want to rewind and hunt for the missing piece

  • subtitles are split into fragments in such a way that a single fragment is often not a complete phrase, so to understand the meaning you need two or three subtitle lines in front of you at once

This functionality is available, for example, on YouTube and Coursera.

However, after some searching, I was surprised to find that no Windows player offers this feature.

I had been itching to try out the Cursor+o1 combo. So the decision was spontaneously born to use such a player as a test subject for experimenting with these tools. I’ll say right away that I have practically no development skills (apart from minimal, fragmentary notions of web development).

At first I tried to figure out whether this problem could be solved as a web application, but after a while I decided to look at what could be done with Python scripts, and in the end I settled on that option.

In the first prompt, I started with a general statement of the problem: I needed to implement a video player in Python. My mistake was that I immediately and explicitly specified the tkvideoplayer library as the basis on which everything was supposed to be built. It seemed to me that maximum detail in the initial problem statement should lead to the most effective solution. As a result, for some time I was unable to even simply play video with sound. After many attempts to fix this, the AI itself switched to the python-vlc library, after which the “development” process noticeably accelerated.

In the next step, we relatively quickly made the application “two-window”: the video plays in one window, while the other holds the playback controls and the transcript display. At first I tried to change the layout of the controls window via prompts, but eventually I started doing it myself, directly in the code.
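The core of a transcript panel like the one described — keeping the current subtitle line visible together with its neighbors — can be sketched as a small pure function. The cue format and names here are my own assumptions, not the author’s implementation:

```python
from bisect import bisect_right

def transcript_context(cues, position_ms, before=1, after=1):
    """Return the cue text at the current playback position plus its
    neighbors, so a phrase split across fragments stays readable.

    cues: list of (start_ms, end_ms, text) tuples, sorted by start_ms.
    """
    starts = [start for start, _, _ in cues]
    # Index of the last cue starting at or before the current position.
    i = bisect_right(starts, position_ms) - 1
    if i < 0:
        return []
    lo = max(0, i - before)
    hi = min(len(cues), i + after + 1)
    return [text for _, _, text in cues[lo:hi]]

cues = [(0, 2000, "Hello,"), (2000, 4000, "and welcome"), (4000, 6000, "to the show.")]
print(transcript_context(cues, 2500))  # ['Hello,', 'and welcome', 'to the show.']
```

The GUI side would simply call this on a timer with the player’s current position and render the returned lines, highlighting the middle one.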

In the end, it turned out something like this.

Most of the time went into futile attempts to generate a “subtitle-based transcript” in real time, by extracting subtitles from the video directly during playback. Only after a while (and after exhausting the o1 limits) did it dawn on me that this is probably, in principle, a practically unsolvable problem, because extracting subtitles from a video takes resources and time. As a result, I had to resort to a workaround: first extract the subtitles into separate files, then load them into the player from the srt files.
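Loading pre-extracted srt files requires an SRT parser. A minimal stdlib sketch is below; the function names and cue format are my own, and the author’s script may differ:

```python
import re

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def srt_to_ms(ts: str) -> int:
    """Convert an SRT timestamp like '00:01:02,345' to milliseconds."""
    h, m, s, ms = map(int, SRT_TIME.match(ts).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def parse_srt(text: str):
    """Parse SRT text into a list of (start_ms, end_ms, text) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        # lines[0] is the cue index, lines[1] the timing, the rest the text.
        start, _, end = lines[1].partition(" --> ")
        cues.append((srt_to_ms(start.strip()), srt_to_ms(end.strip()),
                     " ".join(lines[2:])))
    return cues
```

The resulting list, sorted by start time, is exactly what the transcript window needs for fast position lookups during playback.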

The script to extract subtitles I managed to write almost on the first try, but even here it was not without pitfalls. It worked well on the first film, but stalled on the second. The AI couldn’t answer my question about the freeze and offered to add diagnostic logging to the implementation. I was too lazy to bother finishing the diagnostics, so I switched my brain on and guessed that the likely reason for the freeze was that the film contained several subtitle tracks in the same language (for example, two versions of Russian subtitles). I asked the AI to modify the script to account for this, after which everything worked fine on the second film as well.
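The author doesn’t name the extraction tool; assuming ffmpeg/ffprobe (a common choice for this), the fix for duplicate-language tracks is to address each subtitle stream by its absolute index rather than by language. A sketch, with helper names of my own invention:

```python
def subtitle_streams(probe: dict):
    """From ffprobe's JSON output (ffprobe -print_format json -show_streams),
    return (index, language) for every subtitle stream. A film may carry
    several streams with the same language tag."""
    return [(s["index"], s.get("tags", {}).get("language", "und"))
            for s in probe.get("streams", [])
            if s.get("codec_type") == "subtitle"]

def extract_cmd(video: str, stream_index: int, out_srt: str):
    """Build an ffmpeg command extracting one specific subtitle stream by
    absolute index (-map 0:N), so duplicate languages don't collide."""
    return ["ffmpeg", "-y", "-i", video, "-map", f"0:{stream_index}", out_srt]
```

Running `extract_cmd(...)` via `subprocess.run` once per stream, with the stream index in the output file name, yields one srt file per track even when two tracks share a language.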

At the last step, I added various small conveniences (hotkeys, remembering a movie’s parameters across closing and reopening) – this required minimal effort.
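“Remembering movie parameters” can be done with a small JSON state file keyed by the video path. A sketch under my own assumed names and file location:

```python
import json
from pathlib import Path

STATE_FILE = Path("player_state.json")  # hypothetical location

def _load_all() -> dict:
    """Read the whole state file, or an empty dict if it doesn't exist yet."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}

def save_state(video_path: str, position_ms: int, subtitle_file: str) -> None:
    """Remember where a movie was stopped and which srt file it used."""
    state = _load_all()
    state[video_path] = {"position_ms": position_ms,
                         "subtitle_file": subtitle_file}
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_state(video_path: str) -> dict:
    """Restore a movie's saved parameters; empty dict if never seen."""
    return _load_all().get(video_path, {})
```

On startup the player would call `load_state` for the opened file and seek to the saved position; on close, `save_state`.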

In the end, the craft turned out rather crude and quite buggy, but it suits me – I still use it every day. I’m glad I bothered. I hope someday to find the strength to start cleaning it up.

During the process I used only Cursor’s Chat; accordingly, I used neither Composer nor autocomplete. I avoided manual code edits as much as possible (with the exception of the aforementioned small layout changes in the Controls window). I learned about Composer’s existence and features much later, during my next session with the project – it turned out to be an interesting thing, but apparently irrelevant for this task, because everything lives in a single file.

Conclusion

In general, given all the constraints, my first experience with Cursor+o1 and the resulting “product” rather satisfied me. I don’t know how this problem would go with VSCode+Copilot. It would probably work too, but I have no desire to check yet.

As for the application itself, it would probably be interesting to see it in the following configuration: the video plays on a TV, while the interface with the transcript and playback controls runs on a tablet in the user’s hands. But lacking both a TV and a tablet, I personally don’t plan any practical steps in this direction yet. I can only cautiously suggest that this would best be done as a web application rather than a Python script.
