Voice control


Alice, Siri, Marusya – and this is far from a complete list of voice-assistant projects. New ones appear every day, their functionality keeps growing, and it seems the moment has come when you can seriously think about switching your computer to voice control.

In this series of articles, I will walk through building a voice assistant that runs locally on your computer and offers a wide range of functionality, from “start music” to “create a new project in PyCharm”.

Speech recognition

Such a popular topic could not be left without a huge number of articles, but with the advent of the Yandex and Google APIs, a large share of those articles begin and end like this:

import speech_recognition

That does work, but I have an inquisitive nature and some experience in machine learning, so why not build the recognition myself? Because it is a huge mountain: only after spending a lot of time climbing it do you realize that the summit is still very far away.

“What’s wrong with import speech_recognition?” I was asked when I submitted the first version of this article for human judgment.

  1. Privacy – Yandex and Google may insist that our data will not leak and will not be used anywhere, but are you ready to stake your career on that promise? Neither is the security department of any large company, so when working on government contracts or with classified material, using such a solution will simply be prohibited.

  2. Languages – How long have you spoken Kerek? I suspect you have never even heard what this language sounds like, because only two native speakers of it remain in Russia. Now imagine that one of them wants a Jarvis. Of course, this is an extreme case, but the public APIs do not always cope even with the languages they claim to support, let alone the rest.

  3. Internet – I recently visited a beautiful place near Ryazan: birds, endless fields, so inspiring! But Alice did not appreciate the lack of internet. Her love of city life is understandable: she can recognize the voice of anyone who speaks Russian, but deploying such a colossus (Sberbank recently announced a neural network with 23 billion parameters) on a desktop, let alone a smartphone, is not a feasible task.

Having settled the question of significance, let’s proceed in order.

Sound is a wave

The computer is not friendly with waves, but adores numbers.

Let’s pick some time step t (the sampling step), for example 1 second, and every t seconds record the noise level from the microphone (the points on the graph below). Then let’s take the number A = 256. This number determines how many levels (here, 8 bits) we use to store each point.

Maximum noise level (MaxLvl) – the largest value the microphone can give
Silence level (SilLvl) – the value the microphone gives in silence
After recording, MaxLvl should then map to (A − 1), that is 255, and SilLvl to 0

Hence the quantization step QS = (MaxLvl − SilLvl) / A


Now, every t seconds, we take the value from the microphone, subtract SilLvl, divide by QS, and write the resulting number to a file. Let’s name the recorded file “Record 1.wav” and try to listen to it. We won’t hear anything intelligible, because we took far too large a sampling step.
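A minimal sketch of this digitization, assuming a simulated microphone (a 440 Hz sine) and my own variable names rather than any real audio API:

```python
import math

A = 256                      # number of quantization levels (8 bits per sample)
max_level = 1.0              # maximum value the "microphone" can give (MaxLvl)
silence_level = -1.0         # value the "microphone" gives in silence (SilLvl)
qs = (max_level - silence_level) / A  # quantization step

t = 0.001                    # sampling step in seconds (still too coarse for speech)
n_samples = 10               # record 10 ms

samples = []
for i in range(n_samples):
    value = math.sin(2 * math.pi * 440 * i * t)                    # "microphone" reading
    samples.append(min(A - 1, int((value - silence_level) / qs)))  # quantize to 0..255

print(samples)  # first sample is 128: silence sits in the middle of the range
```

With t = 1 second, as in the text, you would keep only one point per second and the signal would be unrecognizable; real speech recording uses steps on the order of 1/16000 of a second.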
import os

import pandas as pd
import soundfile as sf

def convert_opus_to_wav(data):
    for index in data.index:  # Walk through the dataset's built-in manifest
        file = "Data/" + data.loc[index, "Файлы"]  # Remember the path to the opus file
        if os.path.exists(file):  # If the file exists, convert it
            audio, sample_rate = sf.read(file, dtype="int16")  # Read opus
            sf.write(file.replace(".opus", ".wav"), audio, sample_rate)  # Save wav
            os.remove(file)  # Cover our tracks (delete the converted file)

manifest = pd.read_csv("Data/public_series_1.csv", header=None)  # Read the manifest
manifest.columns = ["Файлы", "Текст", "Длительность"]  # Give the columns proper names
del manifest["Длительность"]  # Drop everything I am not going to use

convert_opus_to_wav(manifest)

At the moment, training has been performed on the following modules:

We also need to tweak the manifest a little more and save the corrections.

for i in manifest.index:
    # Strip the extension and prepend the right directory
    manifest.loc[i, "Файлы"] = "Data/" + manifest.loc[i, "Файлы"].replace(".wav", "").replace(".opus", "")
    # Replace the path to the text file with the text itself
    with open("Data/" + manifest.loc[i, "Текст"], "r") as file:
        manifest.loc[i, "Текст"] = file.read().replace("\n", "")

Now our manifest looks like this:

If you go through this table carefully, you can find flaws like “aaa” or “yaya”, but they are so rare that, lazy as I am, I could not even quickly find one for a screenshot.

Creating your own dataset is also not very difficult, unless, of course, you need the volumes of Open STT. A little later I will publish an article on how I handled this task quickly using Telegram and 150 lines of code.
In general terms: take a text, break it into phrases, then record yourself voicing those phrases into about 1000 WAV files (that gave me roughly 1.5 hours of data). For my experiments I took “Crime and Punishment”, but while recording I realized it contains words that never occur in everyday life (thanks, Captain Obvious), which somewhat devalues the knowledge of context we were aiming for when choosing an LSTM. So I think the third stage of training will use prepared commands, like the ones this series is about (“start music”, “create a new project in PyCharm”).
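The “break the text into phrases” step can be a simple split on sentence punctuation. This is only a sketch of the idea — the regex, the length filter, and the sample text are my own choices:

```python
import re

def split_into_phrases(text, min_words=2, max_words=12):
    # Split on sentence-ending punctuation, keep phrases of a sane length
    raw = re.split(r"[.!?;…]+", text)
    phrases = []
    for chunk in raw:
        words = chunk.replace("\n", " ").split()
        if min_words <= len(words) <= max_words:
            phrases.append(" ".join(words))
    return phrases

text = "Rodion stepped out. It was hot! Where to now? Hm."
for n, phrase in enumerate(split_into_phrases(text)):
    print(f"{n:04d}.wav -> {phrase}")  # one phrase = one future WAV recording
```

Each surviving phrase then becomes the prompt you read aloud into one numbered WAV file.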

CTC loss

Well, here we come to the most important questions:

  1. How do we train without complex alignment markup, i.e. without marking which sound corresponds to which letter?

  2. How do we know that “Orvlyarlov” is not the same as “Hello, how are you?”, and how do we measure the degree of similarity?

In 2006, Alex Graves published the paper “Connectionist Temporal Classification”, which explains how this can be done and proves it mathematically. Since mathematics is an exact science and does not tolerate approximate retellings, I will leave it outside the scope of this article.

The general idea of the approach is to compute the probability of each character in each “window”, convert the result to a string by taking the most probable character in every window (the blank “” counts as a character too, and repeats are collapsed), and then compute the Levenshtein distance to the target text, using it as the similarity metric.
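To make the collapse-and-compare idea concrete, here is a toy greedy CTC decode and a plain Levenshtein distance in pure Python; the three-symbol alphabet and the probability table are invented purely for illustration:

```python
def greedy_ctc_decode(probs, alphabet):
    # Take the most probable symbol in each window,
    # then collapse repeats and drop the blank ("")
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in probs]
    decoded, prev = [], None
    for ch in best:
        if ch != prev and ch != "":
            decoded.append(ch)
        prev = ch
    return "".join(decoded)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev_row, dp = dp, [i]
        for j, cb in enumerate(b, 1):
            dp.append(min(prev_row[j] + 1,                  # deletion
                          dp[j - 1] + 1,                    # insertion
                          prev_row[j - 1] + (ca != cb)))    # substitution
    return dp[-1]

alphabet = ["", "h", "i"]
probs = [[0.1, 0.8, 0.1],   # window 1 -> "h"
         [0.2, 0.7, 0.1],   # window 2 -> "h" again (collapsed)
         [0.8, 0.1, 0.1],   # window 3 -> blank
         [0.1, 0.1, 0.8]]   # window 4 -> "i"
text = greedy_ctc_decode(probs, alphabet)
print(text, levenshtein(text, "hi"))  # hi 0
```

The real loss works with the full probability lattice rather than one greedy path, but the collapsing rule is exactly this one.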


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

def CTCLoss(y_true, y_pred):
    # Standard CTC loss on top of keras.backend.ctc_batch_cost
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64") * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64") * tf.ones(shape=(batch_len, 1), dtype="int64")
    return keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)

def build_model(input_dim, output_dim, rnn_layers=2, rnn_units=32, load=False):
    model = Sequential()
    model.add(layers.Input((None, input_dim), name="input"))
    model.add(layers.Reshape((-1, input_dim), name="expand_dim"))
    model.add(LSTM(512, return_sequences=True))
    for i in range(rnn_layers):
        model.add(LSTM(rnn_units, return_sequences=True))
    model.add(Dense(output_dim + 1, activation="softmax"))  # +1 for the CTC blank
    if load:
        model.load_weights("model.h5")  # hypothetical checkpoint path
    opt = keras.optimizers.Adam(learning_rate=1e-4)
    model.compile(optimizer=opt, loss=CTCLoss)
    return model

# fft_length and char_to_num come from the feature-extraction step
model = build_model(input_dim=fft_length // 2 + 1, output_dim=char_to_num.vocabulary_size(), rnn_units=128, load=True)


Not everything is so simple here. On the one hand:

On the other hand…

This is the result I got after two days of training on my own computer.


Then I hit upon the idea of bolting a language model on top, which would clean up flaws such as missing spaces between words.
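The simplest piece of such post-processing — restoring missing spaces — can be sketched as greedy longest-match over a vocabulary; a real language model would weigh hypotheses by probability, and the tiny vocabulary here is of course my own invention:

```python
def restore_spaces(text, vocabulary):
    # Greedy longest-match segmentation: at each position take the
    # longest dictionary word; fall back to a single character
    words, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            if text[i:i + length] in vocabulary or length == 1:
                words.append(text[i:i + length])
                i += length
                break
    return " ".join(words)

vocab = {"hello", "how", "are", "you"}
print(restore_spaces("hellohowareyou", vocab))  # hello how are you
```

Greedy matching can pick a wrong split when a long word shadows a better short one, which is exactly where a probabilistic language model earns its keep.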

I will also finish the custom dataset soon and use it to polish out the remaining minor defects.

Then: select the files the network stumbles on and analyze them. There are two options:

  1. the file itself is defective – solution: delete it from the dataset, since Open STT is huge;

  2. the network simply could not cope with it – solution: add it to the custom dataset.
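This triage can be automated by scoring each file with a character error rate (CER) and applying two thresholds; the thresholds, the helper, and the sample data below are my own illustration, not code from the article:

```python
def levenshtein(a, b):
    # Edit distance, used to score recognition quality
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev_row, dp = dp, [i]
        for j, cb in enumerate(b, 1):
            dp.append(min(prev_row[j] + 1, dp[j - 1] + 1,
                          prev_row[j - 1] + (ca != cb)))
    return dp[-1]

def triage(results, broken_cer=0.8, hard_cer=0.3):
    # results: list of (file, reference_text, recognized_text)
    to_delete, to_custom = [], []
    for file, ref, hyp in results:
        cer = levenshtein(ref, hyp) / max(len(ref), 1)
        if cer > broken_cer:
            to_delete.append(file)   # probably a defective file
        elif cer > hard_cer:
            to_custom.append(file)   # the network struggled: add to the custom dataset
    return to_delete, to_custom

results = [("a.wav", "hello", "hello"),
           ("b.wav", "hello", "hxlxo"),
           ("c.wav", "hello", "zzzzzzzz")]
print(triage(results))  # (['c.wav'], ['b.wav'])
```

Files below the lower threshold need no action at all; only the two problem buckets are returned.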
