On-the-fly speech recognition in Telegram

The task of recognizing voice messages in Telegram is far from new. Many articles have been written on the topic and many Telegram bots have been built. While adding voice reminder recognition to the @RemindMegaBot bot, I got acquainted with some of those solutions and noticed that they rely on an approach that is not always justified:

For speech recognition, the audio file is first saved to disk.

A fair question arises: is it really impossible to do without writing the file to disk? After all, that would spare the operating system unnecessary operations and reduce processing time.

Why do developers use this approach?

The fact is that Telegram voice messages are stored as .ogg files encoded with the Opus codec. The most popular speech recognition services do not support this format (or codec), so the file has to be converted to .wav, .mp3, or even the same .ogg but with the Vorbis codec. For this, the authors of those solutions recommend ffmpeg, which in turn means saving the audio files to disk.

But ffmpeg is not mandatory. There are alternatives that can decode the Opus data directly in memory. Below is one such solution, implemented in Go.

In our example we will connect to the wit.ai speech recognition service, which supports the audio/raw content type. To decode the Telegram voice message we will use the opus library, and to work with the wit.ai API we will use the official wit-go library.

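For completeness, here is roughly what the import block for the snippets below might look like. The import paths are my assumption (an Opus binding such as hraban/opus, which relies on cgo and the system libopus/libopusfile libraries, plus the wit-go client); adjust them to the library versions you actually use.

package speech

import (
    "bytes"
    "encoding/binary"
    "io"
    "net/http"

    // Assumed import paths -- check the documentation of the versions you use.
    witai "github.com/wit-ai/wit-go"
    "gopkg.in/hraban/opus.v2"
)
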
The overall algorithm of the speech recognition function is transparent:

func Recognize(fileDirectURL string) (string, error) {
    // 1. Download the audio file contents from the direct URL
    fileBody, err := getFileBody(fileDirectURL)
    if err != nil {
        return "", err
    }
    // 2. Convert the data to the audio/raw format
    audioRawBuffer, err := getAudioRawBuffer(fileBody)
    if err != nil {
        return "", err
    }
    // 3. Recognize the voice message
    client := witai.NewClient("YOUR_WIT_AI_TOKEN")
    msg, err := client.Speech(&witai.MessageRequest{
        Speech: &witai.Speech{
            File:        audioRawBuffer,
            ContentType: "audio/raw;encoding=signed-integer;bits=16;rate=48000;endian=little",
        },
    })
    if err != nil {
        return "", err
    }

    return msg.Text, nil
}

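For context, here is a minimal sketch of how Recognize might be wired into a bot handler, assuming the go-telegram-bot-api package (imported as tgbotapi) and the standard log package. The handler itself is hypothetical, but the direct file URL is obtained through the Bot API's getFile method in any case.

// Hypothetical handler: reply to a voice message with the recognized text.
func handleVoice(bot *tgbotapi.BotAPI, update tgbotapi.Update) {
    if update.Message == nil || update.Message.Voice == nil {
        return
    }
    // getFile is called under the hood and a direct download URL is returned.
    fileDirectURL, err := bot.GetFileDirectURL(update.Message.Voice.FileID)
    if err != nil {
        log.Println(err)
        return
    }
    text, err := Recognize(fileDirectURL)
    if err != nil {
        log.Println(err)
        return
    }
    bot.Send(tgbotapi.NewMessage(update.Message.Chat.ID, text))
}
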
We get the contents of the audio file using only the standard library:

func getFileBody(fileDirectURL string) ([]byte, error) {
    // Download the voice message by its direct URL...
    resp, err := http.Get(fileDirectURL)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    // ...and read the whole response body into memory, bypassing the disk.
    fileBody, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    return fileBody, nil
}

The Opus data is decoded following the instructions published by the library's developers:

func getAudioRawBuffer(fileBody []byte) (*bytes.Buffer, error) {
    // Telegram voice messages are recorded in mono.
    channels := 1
    s, err := opus.NewStream(bytes.NewReader(fileBody))
    if err != nil {
        return nil, err
    }
    defer s.Close()

    audioRawBuffer := new(bytes.Buffer)
    pcmbuf := make([]int16, 16384)
    for {
        // Read returns the number of decoded samples per channel.
        n, err := s.Read(pcmbuf)
        if err == io.EOF {
            break
        } else if err != nil {
            return nil, err
        }
        pcm := pcmbuf[:n*channels]
        // Append the 16-bit samples to the buffer in little-endian order,
        // matching the content type we pass to wit.ai.
        err = binary.Write(audioRawBuffer, binary.LittleEndian, pcm)
        if err != nil {
            return nil, err
        }
    }
    return audioRawBuffer, nil
}

After that, the contents of the buffer are sent to wit.ai and we receive the recognized text. Note that the parameters in the content type (signed 16-bit integers, little-endian) describe exactly the PCM stream that the Opus decoder produces.

That's all for now. Thank you for your attention!
