Converting audio messages to text in Telegram with Wit

I am absolutely sure that Telegram will soon convert audio messages to text by default, but for now I would like to show a simple example of how to implement this in a Telegram bot. There are already hundreds of such bots, but why not see how one works by example?

This is not a joke, this is a real dialogue from my working correspondence.

I should say up front that Wit, the service used in this example, is not really intended for converting audio messages to text. It has a different, more interesting purpose, which I may write about later. But since it offers this functionality and it is free, why not?

First, you need to register with Wit and create a project. There is nothing complicated about this, so I will not go into detail; all we need is a token to work with the API.

Bot registration

Bots in Telegram are created by the bot dad, @BotFather; we will play the role of the bot mom.

The bot itself will be written in the language most “respected” by the Habr community: PHP. The advantage of this choice is that the script can be uploaded to practically any hosting, even the cheapest.

Creating a bot is not difficult: we just answer a few questions, and at the end we get a token that is needed to work with the Telegram API.

It's better not to show the token to anyone ;-)

Now we need to register a handler for our bot. To do this, open the following link: https://api.telegram.org/bot{token}/setWebhook?url={url}

Here {token} is the token of our bot and {url} is the path to the handler. Note that the path to the handler must start with https.

In response, you should see something like this:

{
    "ok":true,
    "result":true,
    "description":"Webhook is already set"
}
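The same link can also be assembled in code, which avoids mistakes when the handler address contains characters that need URL encoding. A minimal sketch; the helper name buildSetWebhookUrl and the sample token and address are made up for illustration:

```php
<?php
// Hypothetical helper: builds the setWebhook link for a bot token and handler URL.
function buildSetWebhookUrl(string $token, string $handlerUrl): string
{
    // Telegram only delivers updates to HTTPS handlers
    if (!str_starts_with($handlerUrl, 'https://')) {
        throw new InvalidArgumentException('The handler URL must start with https://');
    }
    // http_build_query() percent-encodes the handler URL for use in a query string
    return "https://api.telegram.org/bot{$token}/setWebhook?" . http_build_query(['url' => $handlerUrl]);
}

// Placeholder values, not real credentials
echo buildSetWebhookUrl('123456:ABC-DEF', 'https://example.com/bot.php'), PHP_EOL;
```

The resulting link can be opened in a browser or requested with file_get_contents().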

Now, every time the bot receives a message, a POST request with JSON in the body will be sent to our handler. It contains many interesting things, but since our bot performs only one task, we only care whether it contains an audio message.
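To make the check concrete, here is a small sketch that pulls the voice file_id out of the request body. The function name extractVoiceFileId and the sample JSON are invented for illustration; a real update from Telegram carries many more fields:

```php
<?php
// Hypothetical helper: returns the voice file_id from a raw update, or null if there is none.
function extractVoiceFileId(string $rawUpdate): ?string
{
    $update = json_decode($rawUpdate, true);
    if (!is_array($update)) {
        return null; // the body was not a JSON object
    }
    // The ?? operator avoids warnings when any of the nested keys is missing
    $fileId = $update['message']['voice']['file_id'] ?? null;
    return is_string($fileId) ? $fileId : null;
}

// A trimmed-down sample update with a made-up file_id
$sample = '{"update_id":1,"message":{"chat":{"id":42},"voice":{"file_id":"AwACAgIAAx","duration":3}}}';
var_dump(extractVoiceFileId($sample));                      // a voice message
var_dump(extractVoiceFileId('{"message":{"text":"hi"}}')); // a plain text message
```

In the real handler the raw body would come from file_get_contents("php://input").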

By the way, any bot in your group chat has access to all messages, attachments, and photos (I hope this is not a secret to anyone). That is exactly why I cannot use third-party bots in my work correspondence.

Writing the script

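So, let's define how the bot works (the steps follow the code in this section):

  • Telegram sends a POST request with the message to our handler

  • We check whether the message contains an audio message

  • We request the file information from Telegram and download the file

  • We save the file to disk and convert it with ffmpeg

  • We send the converted file to Wit and get the text back

  • We send the text to the chat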

Everything seems simple. Of course, we could send the downloaded audio message directly to Wit without first saving it to disk, but the file downloaded from Telegram is OGG encoded with the Opus codec (48000 Hz, mono, fltp, 26 kb/s). Unfortunately, Wit does not accept this encoding, so we need to convert the file to one of the formats it supports:

  • audio/wav

  • audio/mpeg3

  • audio/ogg

  • audio/ulaw

  • audio/raw

I will convert from OGG to OGG, re-encoding the audio with ffmpeg to the Vorbis codec, which suits Wit just fine.
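As an illustration, the conversion command could be built like this. It is only a sketch: the helper name buildConvertCommand is made up, and it assumes an ffmpeg build that includes the libvorbis encoder, which it selects explicitly with -c:a instead of relying on the default codec of the ogg container:

```php
<?php
// Hypothetical helper: builds an ffmpeg command that re-encodes $src to OGG with the Vorbis codec.
function buildConvertCommand(string $src, string $dst): string
{
    // -y overwrites $dst if it already exists; -c:a libvorbis selects the Vorbis encoder.
    // escapeshellarg() keeps unusual file names from breaking the command line.
    return sprintf(
        'ffmpeg -y -i %s -c:a libvorbis -f ogg %s',
        escapeshellarg($src),
        escapeshellarg($dst)
    );
}

echo buildConvertCommand('./voice', './voice.ogg'), PHP_EOL;

// To actually run it and check the exit code (0 means success):
// exec(buildConvertCommand('./voice', './voice.ogg'), $output, $exitCode);
```

Using exec() instead of shell_exec() makes the exit code available, so a failed conversion can be detected before the file is sent to Wit.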

Now let’s get down to programming. As I said earlier, I will write in PHP version 8.0.

<?php

class VoioverBot {

	private string $url = "https://api.telegram.org/bot";

	// Decoded update received from Telegram (null if the body is not valid JSON)
	private mixed $msg;

	function __construct(private string $wit, private string $tg)
	{
		$this->msg = json_decode(file_get_contents("php://input"), true);
		$this->url .= $tg;
	}

	// Look for an audio message in the update
	public function getAudio() : bool | string
	{
		if (!isset($this->msg["message"]["voice"]["file_id"])) return false;

		// Get information about the audio message file
		$info = json_decode(@file_get_contents("{$this->url}/getFile?file_id={$this->msg["message"]["voice"]["file_id"]}"), true);
		if (!$info || !isset($info["result"]["file_path"])) return false;

		// Download the audio message
		$file = @file_get_contents("https://api.telegram.org/file/bot{$this->tg}/{$info["result"]["file_path"]}");
		if (!$file) return false;

		// Save the audio message to a temporary file
		if (!file_put_contents("./{$this->msg["message"]["voice"]["file_id"]}", $file)) return false;

		// Convert the file
		$this->convertAudio();

		// Convert the audio to text
		return $this->getTranscription();
	}
	
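	// Convert the audio into a suitable format
	private function convertAudio() : void
	{
		// Quote the paths so they are always passed to the shell as single arguments
		$src = escapeshellarg("./{$this->msg["message"]["voice"]["file_id"]}");
		$dst = escapeshellarg("./{$this->msg["message"]["voice"]["file_id"]}.ogg");
		shell_exec("ffmpeg -i {$src} -f ogg {$dst}");
	}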
	
	// Convert the voice to text using the Wit API
	private function getTranscription() : bool | string
	{
		$context = stream_context_create([
			'http' => [
				'method' => 'POST',
				'header' => "Authorization: Bearer {$this->wit}\r\n" .
							"Content-Type: audio/ogg",
				'content' => file_get_contents("./{$this->msg["message"]["voice"]["file_id"]}.ogg"),
				'timeout' => 20
			],
		]);
		$answer = json_decode(file_get_contents("https://api.wit.ai/speech?v=20200422", false, $context), true);
		// The temporary files can now be deleted
		unlink("./{$this->msg["message"]["voice"]["file_id"]}");
		unlink("./{$this->msg["message"]["voice"]["file_id"]}.ogg");
		// empty() also covers the case where the key is missing
		return !empty($answer['_text']) ? $answer['_text'] : false;

	}

	// Send the text to the chat
	public function sendMessage($text) : bool
	{
		$from = $this->msg['message']['from'];
		// last_name is optional in Telegram, so fall back to an empty string
		$name = trim($from['first_name'] . ' ' . ($from['last_name'] ?? ''));
		$context = stream_context_create([
			'http' => [
				'method' => 'POST',
				'header' => 'Content-Type: application/json' . PHP_EOL,
				'content' => json_encode([
					'chat_id' => $this->msg["message"]["chat"]["id"],
					// Escape <, > and & because the message is sent with parse_mode HTML
					'text' => "✍ <b>" . htmlspecialchars($name) . "</b>\r\n" . htmlspecialchars($text),
					'parse_mode' => "HTML"
				])
			]
		]);
		$result = file_get_contents("{$this->url}/sendMessage", false, $context);
		return $result !== false;
	}
}

$vbot = new VoioverBot("Wit TOKEN", "Telegram TOKEN");
$voice = $vbot->getAudio();
if ($voice) $vbot->sendMessage($voice);

This is basic code and has many flaws:

  • The script does not check who the requests come from.

  • If the message was not sent for some reason, then it will not be sent again.

  • The maximum length of audio messages for Wit is only 20 seconds.
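The first flaw has a fairly cheap fix. Assuming the webhook is registered with the optional secret_token parameter of setWebhook, Telegram sends that value back in the X-Telegram-Bot-Api-Secret-Token header of every request, and the handler can compare it. A minimal sketch; the function name isFromTelegram and the secret value are placeholders:

```php
<?php
// Hypothetical helper: checks that the request carries the secret registered with setWebhook.
function isFromTelegram(array $server, string $expectedSecret): bool
{
    // PHP exposes the X-Telegram-Bot-Api-Secret-Token header under this $_SERVER key
    $received = $server['HTTP_X_TELEGRAM_BOT_API_SECRET_TOKEN'] ?? '';
    // hash_equals() compares the strings in constant time
    return is_string($received) && hash_equals($expectedSecret, $received);
}

// At the top of the handler ('my-secret' is a placeholder value):
// if (!isFromTelegram($_SERVER, 'my-secret')) { http_response_code(403); exit; }
```

Requests that do not carry the expected header are then rejected before any file is downloaded or sent to Wit.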

P.S. Before publishing, I discovered that literally the day before, a similar article, Speech Recognition in Telegram “on the fly”, had appeared on Habr, but written in Go.
