Vosk vs Whisper – a comparison on the Raspberry Pi 4B

This article looks at how the junior (smallest) speech-to-text models perform on an edge device, the Raspberry Pi 4B. The test phrase will be short but not easy: it mixes Russian and English speech. The lineup includes representatives of the whisper family – whisper, whisper-cpp, whisper-jax – and vosk. Speed and accuracy of recognition will be assessed. As a bonus, an attempt will be made to recognize a Tajik phrase with vosk.

Introduction.

Readers who already know what vosk and whisper are can save time by referring to the earlier article, Comparison of Vosk and Whisper.

That article, however, compared only one package (perhaps even a framework), whisper, with another, vosk. Here it is interesting to see how vosk and whisper behave in a constrained environment: a single-board Raspberry Pi 4B running Raspbian Bullseye (aarch64). We will also look at the results of the other members of the whisper family.

Speech synthesized by another package, piper, will serve as the test wav file.

The audio clip is short and contains the phrase: "Добро пожаловать в синтез речи. Welcome to our speech synthesis." (the first sentence is Russian for "Welcome to speech synthesis").

To level the conditions, this audio fragment was "standardized" with the command below. Note that the output file must differ from the input (ffmpeg cannot safely overwrite its own input), hence welcome_.wav, the name used in the code that follows:

ffmpeg -i welcome.wav -ar 16000 -ac 1 -c:a pcm_s16le welcome_.wav
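If several clips need the same treatment, the command can be wrapped in a small Python helper. This is a sketch of mine, not from the article; it assumes ffmpeg is on PATH and derives the output name by appending an underscore, matching the welcome_.wav used in the code below:

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(src: str, dst: str) -> list[str]:
    # 16 kHz, mono, signed 16-bit PCM: the format the models below expect
    return ["ffmpeg", "-y", "-i", src,
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]

def standardize(src: str) -> str:
    # write to a new name (welcome.wav -> welcome_.wav); ffmpeg cannot
    # read and write the same file
    p = Path(src)
    dst = str(p.with_name(p.stem + "_" + p.suffix))
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
    return dst
```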

All the whispers and vosk install on the Raspberry Pi without problems.

Whisper.

import whisper
from time import time

# Expected: Добро пожаловать в синтез речи. Welcome to our speech synthesis.
model = whisper.load_model("tiny")

ts = time()
result = model.transcribe("welcome_.wav")
print(result["text"])
print(time() - ts)

# second pass over the same file
ts = time()
result = model.transcribe("welcome_.wav")
print(result["text"])
print(time() - ts)

* The code transcribes the same fragment twice; why this is needed will become clear later.

Output:
"Welcome to synthizerechi. Verkomtodovy uldov speech synthesis."
Time: 20.5 sec.

Whisper did a good job with the Russian part of the phrase and even translated English speech in the style of one politician.

Can audio be passed to whisper as a pickle object?
The object was serialized from the dataset
huggingface.co/datasets/hf-internal-testing/librispeech_asr_dummy with this code:

import pickle
from datasets import load_dataset

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[4]["audio"]  # a dict holding the waveform and sampling rate
with open('audio_dump.obj', 'wb') as fp:
    pickle.dump(sample, fp)

* This could not be done on the Raspberry Pi even with 8 GB of RAM: the dataset did not fit into memory. The object was therefore serialized on a more powerful system.

The phrase in this audio is given in the code below; it is entirely in English and quite long.
The code tries to feed whisper the audio as a pickle object:

import pickle
from time import time

import whisper

# Expected transcript:
"""
LINNELL'S PICTURES ARE A SORT OF UP GUARDS AND AT EM PAINTINGS AND \
MASON'S EXQUISITE IDYLLS ARE AS NATIONAL AS A JINGO POEM MISTER BIRKET \
FOSTER'S LANDSCAPES SMILE AT ONE MUCH IN THE SAME WAY THAT MISTER CARKER \
USED TO FLASH HIS TEETH AND MISTER JOHN COLLIER GIVES HIS SITTER A CHEERFUL \
SLAP ON THE BACK BEFORE HE SAYS LIKE A SHAMPOOER IN A TURKISH BATH NEXT MAN
"""

with open('audio_dump.obj', 'rb') as fp:
    sample = pickle.load(fp)  # a dict, not a file path or an array

model = whisper.load_model("tiny")
ts = time()
result = model.transcribe(sample)
print(result["text"])
print(time() - ts)

ts = time()
result = model.transcribe(sample)
print(result["text"])
print(time() - ts)

Unfortunately, this option does not work with whisper: transcribe() does not accept such a dict. But the serialized object will be useful later.
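The dict itself is rejected, but whisper's transcribe() also accepts a raw waveform (a float32 NumPy array at 16 kHz) in place of a file path, so the pickled object can still be used after unpacking it. A sketch, assuming the dict carries the usual Hugging Face 'array' and 'sampling_rate' fields:

```python
import pickle
import numpy as np

def audio_from_pickle(path: str) -> np.ndarray:
    """Unpack the serialized HF audio dict into a float32 waveform."""
    with open(path, "rb") as fp:
        sample = pickle.load(fp)
    assert sample["sampling_rate"] == 16000, "whisper expects 16 kHz audio"
    return np.asarray(sample["array"], dtype=np.float32)

# usage (requires the whisper package and the dump created above):
# import whisper
# model = whisper.load_model("tiny")
# print(model.transcribe(audio_from_pickle("audio_dump.obj"))["text"])
```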

Whisper-cpp

It also installs on the Raspberry Pi without problems and processes the wav file:

cd whisper-cpp
./main -m models/ggml-tiny.en.bin -f samples/welcome.wav

Output:
"The group has always seen this region. Welcome to the world of speech synthesis."
Whisper-cpp (with the English-only tiny model) could not cope with the Russian part of the phrase and made minor inaccuracies in the English part.
Time: 11 sec.

whisper-cpp can shave off some time through quantization. Other acceleration options, such as OpenVINO, will not be considered here: only what is available locally.
So, let's quantize and run:

./quantize models/ggml-tiny.en.bin models/ggml-tiny.en-q4_0.bin q4_0
./main -m models/ggml-tiny.en-q4_0.bin -f samples/welcome.wav 

Unfortunately, the model cannot be "shrunk" to q2 or q1; the minimum is q4.

The output is of much the same quality: "The purpose of the scene is raging. Welcome to the world of speech synthesis."
But the time dropped to 6.3 sec.

whisper-cpp has no small Russian models (not yet?). The base model also garbles Russian speech, while the medium and large variants do a good job; but those models are not for the Raspberry Pi, even quantized.

Whisper-jax

This package is built on a fairly recent technology, jax, sometimes called "NumPy on steroids". For more detail, see the book "Deep Learning with JAX" by Grigory Sapunov.

Our code for the Raspberry Pi is as follows:

from time import time

from whisper_jax import FlaxWhisperPipline

# Expected: Добро пожаловать в синтез речи. Welcome to our speech synthesis.

# instantiate the pipeline
pipeline = FlaxWhisperPipline("openai/whisper-tiny")

ts = time()
# JIT-compile the forward call: slow, but done only once
text = pipeline("welcome_.wav")
print(text)
print(time() - ts)

ts = time()
# the cached function is used thereafter: much faster
text = pipeline("welcome_.wav")
print(text)
print(time() - ts)

Result: "Welcome from synthizerechi. Verkomtod ultok with synthesis furnace."
The execution time is where it gets interesting: the first run takes 25 seconds, subsequent ones 8.8 sec.
As the comments in the code note, the delay at the start comes from JIT compilation.
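The cold-call-versus-warm-call timing pattern used above can be isolated in a toy example with no jax involved. Here a cached function stands in, very loosely, for the one-time trace-and-compile cost (a sketch of mine, not from the article):

```python
import time
from functools import cache

@cache
def compile_like(n: int) -> int:
    time.sleep(0.2)  # stand-in for one-time work such as JIT compilation
    return n * n

t0 = time.perf_counter()
compile_like(10)                 # first call pays the one-time cost
cold = time.perf_counter() - t0

t0 = time.perf_counter()
compile_like(10)                 # cache hit: the cost is not paid again
warm = time.perf_counter() - t0

print(cold > warm)  # → True
```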

Testing jax-whisper does not end there.
Consider the following code:


import pickle
from time import time

from whisper_jax import FlaxWhisperPipline

# Expected transcript:
"""
LINNELL'S PICTURES ARE A SORT OF UP GUARDS AND AT EM PAINTINGS AND \
MASON'S EXQUISITE IDYLLS ARE AS NATIONAL AS A JINGO POEM MISTER BIRKET \
FOSTER'S LANDSCAPES SMILE AT ONE MUCH IN THE SAME WAY THAT MISTER CARKER \
USED TO FLASH HIS TEETH AND MISTER JOHN COLLIER GIVES HIS SITTER A CHEERFUL \
SLAP ON THE BACK BEFORE HE SAYS LIKE A SHAMPOOER IN A TURKISH BATH NEXT MAN
"""

with open('audio_dump.obj', 'rb') as fp:
    sample = pickle.load(fp)

# instantiate the pipeline
pipeline = FlaxWhisperPipline("openai/whisper-tiny")

ts = time()
# JIT-compile the forward call: slow, but done only once
text = pipeline(sample)
print(text)
print(time() - ts)

ts = time()
# the cached function is used thereafter: much faster
text = pipeline(sample)
print(text)
print(time() - ts)
In this example we take the previously serialized audio fragment and feed it to jax-whisper.

The result and time are as follows (remember that this is another audio fragment taken from a dataset with English speech):

{'text': "Lennils, pictures are a sort of upguards and atom paintings, and Mason's exquisite Idols are as national as a jingo poem. Mr. Birkut Foster's landscapes smile at one much in the same way that Mr. Karker used to flash his teeth. And Mr. John Colier gives his sitter a cheerful slap on the back before he says like a shampoo or a Turkish bath. Next man."}
56.28657627105713
{'text': "Lennils, pictures are a sort of upguards and atom paintings, and Mason's exquisite Idols are as national as a jingo poem. Mr. Birkut Foster's landscapes smile at one much in the same way that Mr. Karker used to flash his teeth. And Mr. John Colier gives his sitter a cheerful slap on the back before he says like a shampoo or a Turkish bath. Next man."}
25.147851943969727

Considering that the audio fragment itself is 29 seconds long, the result is not bad, including the accuracy of the transcribed text.
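A convenient way to compare such numbers is the real-time factor (RTF): processing time divided by audio duration, where values below 1.0 mean faster than real time. For the cached jax-whisper run above:

```python
def rtf(processing_s: float, audio_s: float) -> float:
    """Real-time factor: processing time over audio duration."""
    return processing_s / audio_s

# cached jax-whisper run on the 29-second clip
print(round(rtf(25.147851943969727, 29.0), 2))  # → 0.87
```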

Vosk

Let's run it from the CLI:

vosk-transcriber --model-name vosk-model-small-ru-0.22 -i welcome.wav

Result:

"welcome speech synthesis then you rest with oven syntax"

Time: 4.369 sec.

As you can see, the content is a mess, but the execution time is amazing.
Unfortunately, the next model up in the vosk family (https://alphacephei.com/vosk/models) is 1.8 GB versus the current 45 MB: a significant difference.

The small English model (vosk-model-small-en-us-0.15) copes with the English speech, as one would expect, but not with the Russian:
"that up i shall have it seems is it hm welcome to the world of speech synthesis."
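The vosk-transcriber CLI is a thin wrapper over vosk's Python API, which can also be used directly. A minimal sketch, assuming the vosk package is installed (when referenced by name, the model is downloaded on first use):

```python
import json
import wave

def final_text(result_json: str) -> str:
    """Pull the 'text' field out of KaldiRecognizer's FinalResult() JSON."""
    return json.loads(result_json).get("text", "")

def transcribe_wav(path: str, model_name: str = "vosk-model-small-ru-0.22") -> str:
    """Transcribe a 16 kHz mono s16le wav with vosk and return the text."""
    from vosk import Model, KaldiRecognizer  # deferred import: vosk is heavy
    wf = wave.open(path, "rb")
    rec = KaldiRecognizer(Model(model_name=model_name), wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    return final_text(rec.FinalResult())

# usage:
# print(transcribe_wav("welcome.wav"))
```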

In conclusion, a small experiment with vosk and the Tajik language: an attempt to recognize proverbs found on the Internet along with their ready-made translations:

vosk-transcriber --model-name vosk-model-tg-0.22 -i tadjik.wav

The result is shown on video, also available on rutube.

Thus, to summarize, we can conclude:
— vosk is ahead of everyone else in execution time when converting short audio to text; whisper-cpp, tuned with quantization, is breathing down its neck, followed by jax-whisper, with whisper trailing behind;
— in quality of Russian speech recognition, the palm goes (in the author's humble opinion) to whisper and vosk, with whisper slightly ahead of its opponent.

P.S.

Results for a 43-second audio fragment, an excerpt from M. Yu. Lermontov's "The Demon" (the Russian transcripts below are the models' raw output):
whisper

печальный демон. Дух изгнания. Летал над грешной землей и лучших дней в воспоминании приеднемтесь нельзя толпой. 
Тех дней, когда в жили еще света блесталом, чистый хирувим. Когда бегущая комета, улыбкой ласковой привета, любила поменяться с ним.
Когда сквозь вечная туманы, познания жадный, он следил качующая караваны в пространстве брошенных светил. 
Когда он верил, и любил, счастливый первенец творения. Не знал ни злубы, ни сомнения, и не грозил моего веков бесплодных ряд он и лэм много, много всего при помнит не имел он сила.
77.7968852519989 sec

jax-whisper

{'text': ' печальный демон. Дух изгнания. Летал над грешной землей и лучших дней в воспоминании при днем теснились от толпой. 
Тех дней, когда вжили еще света блисталым, чистый хирувим. Когда бигущая комета, улыбкой ласковой привета любила поменяться с ним. 
Когда сквозь вечная туманы, познания жадный, он следил качющая караваны в пространстве брошенных светил. 
Когда он верил, и любил, счастливый первенец творения. Не знал ни злубы, ни сомнения, и не грозил моего веков бесплодных ряд он и лямного. 
Много всего при помнит не имело населы.'}
49.801395416259766 sec

vosk

печальный демон дух изгнанье летал над грешной земле и лучших дней воспоминанья пред ним теснились а толпой
тех дней когда в жилища света блистал он чистые херувим когда бегущая комета улыбкой ласковое привета любила поменяться с ним 
когда сквозь вечная туманы познания жадный он следил кочующие караваны в пространстве брошенных светил когда он ведь
верил и любил счастливой первенец творенья не знал ни злобы ни сомнения
и не грозило моего веков бесплодных ряд унылая много многое всего припомнит не имел он силы

13.022 sec

Attachments:
welcome.wav
audio_dump.obj
tadjik.wav
demon.wav
— video demonstrating the code from the article: youtube, rutube.
