Creating a Voice Assistant in Python with User Classification Based on Neural Networks (an Analogue of Face ID)

We will divide the first part of our large project into several stages:

  1. Create the main voice assistant (VA) base: add functionality for communicating with the user and a few auxiliary features to check the VA's performance.

  2. Write and train a neural network for face recognition.

  3. Combine the first and second stages.

So, let's get started!


Stage 1 – Frosya

I won't post the entire code in the article, since few would read it all the way through; if you need it, you can find it on my GitHub page. Here I will describe the basic logic and occasionally highlight individual functions and lines of code.

# Import the required libraries
import speech_recognition as sr
import pyttsx3
import pyaudio
import sounddevice as sd
import os
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'  # disable oneDNN optimizations before TensorFlow is imported
import tensorflow as tf
import pyautogui
import requests
import cv2
import time
import transliterate  # Cyrillic-to-Latin transliteration

First, we need to come up with a name for our voice assistant. My VA is called Frosya, after my friends' wayward but very cute cat.

Frosya

Next we will teach our Frosya to speak:

# Frosya's voice function
def Frosia_speak(Frosia_audio):
    # Initialize Frosya's speech engine
    start = pyttsx3.init()
    start.say(Frosia_audio)
    start.runAndWait()
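
By the way, pyttsx3 also lets you tune the voice. A small optional sketch (the rate and volume values here are arbitrary, and whether a Russian voice is available depends on your OS):

# Optional voice tuning for pyttsx3
engine = pyttsx3.init()
engine.setProperty('rate', 160)    # speech rate, words per minute
engine.setProperty('volume', 0.9)  # volume, from 0.0 to 1.0
for voice in engine.getProperty('voices'):
    # pick a Russian voice if one is installed on the system
    if 'russian' in voice.name.lower():
        engine.setProperty('voice', voice.id)
        break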

And to listen for commands:

# User command function
def listen_command():
    # Initialize speech recognition
    r = sr.Recognizer()

    # Pause after which voice recording starts
    r.pause_threshold = 0.5
    print('Говорите')
    # Start the microphone
    with sr.Microphone() as mic:
        # Filter out background noise from the microphone source
        r.adjust_for_ambient_noise(source=mic, duration=0.5)
        try:
            # Listen on the microphone; timeout is how long Frosya waits for input
            audio = r.listen(source=mic, timeout=3)
            # Recognize speech to text via Google (requires internet)
            user_command = r.recognize_google(audio_data=audio, language="ru-RU").lower()
        except (sr.WaitTimeoutError, sr.UnknownValueError, sr.RequestError):
            message = "Не поняла вас. Повторите, пожалуйста"
            Frosia_speak(message)
            user_command = listen_command()
    return user_command

Everything here is simple (and described in the code comments), but in fact this is the basis of all interaction with Frosya. As you can see, an internet connection is required for speech recognition: this is Google's service, though any other will do. We follow the principle "the simpler and more accessible, the better" (at least for the current task).

Using the listen_command function, we speak a command and the VA responds to it. As an example, let's write a weather forecast function. The function itself:

# Weather function
def weather():
    Frosia_speak('Назовите город')
    city = listen_command()
    # Substitute your own OpenWeatherMap API key in appid
    url = "https://api.openweathermap.org/data/2.5/weather?q=" + city + '&units=metric&lang=ru&appid=79d1ca96933b0328e1c7e3e7a26cb347'
    weather_data = requests.get(url).json()
    temperature = round(weather_data['main']['temp'])
    temperature_feels = round(weather_data['main']['feels_like'])
    Frosia_speak(f'Сейчас в городе {city} температура {temperature}, ощущается как {temperature_feels}')
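
One small robustness sketch you might add inside weather() right after the request: OpenWeatherMap reports errors through the 'cod' field of the response, so it's worth checking it before reading the temperature (the retry by recursion mirrors the style of listen_command):

    weather_data = requests.get(url).json()
    if weather_data.get('cod') != 200:  # 200 means the city was found
        Frosia_speak('Не нашла такой город. Повторите, пожалуйста')
        return weather()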

And what Frosya will do if she receives the “Weather” command:

# Dispatch the recognized command
def Frosia_actions(comm):
    if comm == 'погода':
        weather()
        Frosia_speak('Что нибудь ещё?')
        comm = listen_command()
        Frosia_actions(comm)
    else:
        Frosia_speak('До свидания, кожаный ***!')

Using elif branches we can extend the functionality of our VA: write any functions for PC control and any other interaction, as in the sketch below.
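
For example, a sketch of two extra branches (the 'скриншот' and 'время' commands and their actions are illustrative assumptions, not part of the original project):

def Frosia_actions(comm):
    if comm == 'погода':
        weather()
    elif comm == 'скриншот':
        # take a screenshot and save it next to the script
        pyautogui.screenshot('screenshot.png')
        Frosia_speak('Скриншот сохранён')
    elif comm == 'время':
        # tell the current time
        Frosia_speak(time.strftime('%H:%M'))
    else:
        Frosia_speak('До свидания, кожаный ***!')
        return
    Frosia_speak('Что нибудь ещё?')
    Frosia_actions(listen_command())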

For many, this stage of development will be enough: from here you can build your own voice assistant with any set of functions, limited only by your imagination, time, and knowledge (and the internet will help with the last one; I am sure that with due diligence you will learn a lot!).

For those who want to add some neural networks and make our Frosya a little smarter, let's move on!


Stage 2 – Neural Network

Now let's look at a fairly simple, essentially educational task of multi-class image classification. I'll note that I ran into a similar task as a test assignment at an interview, so for those who want to dive into the world of computer vision, this article will be useful when preparing for a job search. Naturally, each company provides its own data and its own requirements, but once you understand the principles behind the main issues, the smaller details become much easier to handle. To repeat: this is not a universal solution, and not even a specific example, but simply a walkthrough of something similar, meant to build understanding as part of our larger project of creating a voice assistant.

The general task of stage 2 sounds like this: we need to build a multiclass classification neural network model. We will do everything in TensorFlow (some companies require a PyTorch solution in their test tasks, but our goal now is different, so we will take what is simpler).

Any neural network needs data first. Let our Voice Assistant take 25 photos of a new user (front, left, right, etc.).

# Photo-taking function
def make_photo(u_path):
    cap = cv2.VideoCapture(0)
    # "Warm up" the camera so the shot is not dark
    for i in range(5):
        cap.read()
    Frosia_speak('Пожалуйста, смотрите ровно в камеру 5 секунд, слегка наклоняя голову')
…

Frosya will tell you what needs to be done.
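
The full listing is on GitHub; here is a minimal sketch of how the capture loop inside make_photo could continue (the file names and the pause between frames are my assumptions):

    # ...continuation of make_photo: take 25 shots of the new user
    for i in range(25):
        ret, frame = cap.read()
        if ret:
            cv2.imwrite(os.path.join(u_path, f'user_{i}.jpg'), frame)
        time.sleep(0.2)  # a short pause so the user can turn their head
    cap.release()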

We will augment and preprocess the images to enlarge our dataset. Some test tasks require identifying small details in images (cracks or various types of material wear), so their conditions include a note not to use random crop (or any crop at all), to avoid cutting important information out of the photo. We will augment the images as we see fit (setting only the basic arguments).

# Imports for augmentation (preprocess_input is assumed here to be VGG-style preprocessing)
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.vgg16 import preprocess_input

data_image_gen = ImageDataGenerator(preprocessing_function=preprocess_input,
                                    rotation_range=40,     # rotation
                                    shear_range=0.2,       # shear
                                    zoom_range=0.2,        # zoom
                                    horizontal_flip=True,  # mirror flip
                                    fill_mode="nearest",   # fill in the gaps
                                    validation_split=0.2)
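
Feeding the generator into training then looks something like this (the dataset_dir layout, one subfolder per user, is an assumption):

train_gen = data_image_gen.flow_from_directory(dataset_dir,
                                               target_size=(224, 224),
                                               batch_size=8,
                                               class_mode='categorical',
                                               subset='training')   # 80% of the images
val_gen = data_image_gen.flow_from_directory(dataset_dir,
                                             target_size=(224, 224),
                                             batch_size=8,
                                             class_mode='categorical',
                                             subset='validation')  # the remaining 20%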

This is what will ultimately be fed to our neural network:

As you can see, during the photographing process Frosya asks the user to turn and tilt their head so that the dataset is more diverse.

Now we can start forming the neural network architecture. This will be a simple custom architecture. At this point we are trying to pin down the structure of the pipeline, so the exact architecture is not that important; we could run experiments comparing and selecting the best one, but here we will limit ourselves to the simplest version (no one forbids experimenting once everything is ready; quality can be improved endlessly). We will not get too fancy: just a few convolutions and pooling layers. Note the Dropout layer added to prevent overfitting of the model.

# The model
model_user_Face = tf.keras.Sequential([tf.keras.layers.ZeroPadding2D((1,1),input_shape=(224,224, 3)),
                                           tf.keras.layers.Convolution2D(64, (3, 3), activation='relu'),
                                           tf.keras.layers.ZeroPadding2D((1,1)),
                                           tf.keras.layers.Convolution2D(64, (3, 3), activation='relu'),
                                           tf.keras.layers.MaxPooling2D((2,2), strides=(2,2)),
...

In the end, I loaded the VGGFace weights into our neural network (they are easy to find online) for better training of the model:

# Load the VGGFace weights
model_user_Face.load_weights('vgg_face_weights.h5')
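
Since the loaded weights cover the whole VGGFace stack, the usual next step (a sketch, not the author's exact code; the cut point depends on the full architecture) is to drop the original top layer and attach a softmax head for our n_faces users; this also gives the faces_model_base_user_Face used below:

from tensorflow.keras.models import Model

base_output = model_user_Face.layers[-2].output  # cut off the original top layer
predictions = tf.keras.layers.Dense(n_faces, activation='softmax')(base_output)
faces_model_base_user_Face = Model(inputs=model_user_Face.input, outputs=predictions)
for layer in faces_model_base_user_Face.layers[:-1]:
    layer.trainable = False  # optionally freeze the pretrained layers to speed up training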

So, now that everything is ready, let's launch the training. We will try to keep the training time down, since training will run every time a new face appears in front of our Frosya and she needs to retrain on the new data.
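
A minimal launch sketch (the optimizer and loss are my assumptions; train_gen and val_gen are the generators from the sketch above):

faces_model_base_user_Face.compile(optimizer='adam',
                                   loss='categorical_crossentropy',
                                   metrics=['accuracy'])
history = faces_model_base_user_Face.fit(train_gen,
                                         validation_data=val_gen,
                                         epochs=35)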

After 35 epochs, training can be stopped.

To begin with, we had only two classes (two users: me and my wife). The training can hardly be called smooth; the model needs tuning and the dataset needs improving.

Ideally, of course, we would add Haar cascades or YOLO here to determine the bounding box of the face, but for now the quality of the model satisfies us (let's leave this for a future stage of development; for now we are laying down the base).
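
For reference, the Haar-cascade idea looks roughly like this (a sketch; the frontal-face cascade file ships with OpenCV, and frame here stands for a captured image):

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # the detector works on grayscale
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face_crop = frame[y:y + h, x:x + w]  # crop to the face bounding box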

Let's save our model, and then we can try testing it.

faces_model_base_user_Face.save(dir_path + f'checkpoint_best_user_count_{n_faces}.h5')

Keep track of TensorFlow versions: to load the model weights in the VA script, the TensorFlow version must be the same at save time and at load time.
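
Loading the model back in the assistant's script is then a single call (a sketch; dir_path and n_faces must match what was used when saving):

faces_model = tf.keras.models.load_model(dir_path + f'checkpoint_best_user_count_{n_faces}.h5')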


Stage 3 – Voice Assistant

So, what do we have:

A talking machine – check

Brains responsible for recognizing faces – check

Now we need to make them friends with each other. It will work like this: as soon as Frosya's script is launched (you can set it to start automatically when you turn on your computer), a photo of the current user is taken immediately:

A photo of me, if you haven't figured out what I look like yet

Next, Frosya feeds the photo to the loaded NN model and looks at the probability that the person sitting in front of her has previously taken part in her training (we'll set the recognition threshold at 75%). If this is a new person, Frosya will offer to get acquainted, take photos of the new user, and then train the neural network on the new data. That's it: the person is now in Frosya's memory, and she will recognize him next time. Here you could cap the number of users so as not to lose quality, but all of this is tuned as you go.
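
A sketch of that recognition step (the preprocessing details are my assumptions; frame is the freshly taken photo):

img = cv2.resize(frame, (224, 224))
img = preprocess_input(img.astype('float32'))
probs = faces_model.predict(img[None, ...])[0]  # one probability per known user
if probs.max() >= 0.75:
    user_id = probs.argmax()  # a face Frosya already knows
else:
    # a new face: get acquainted, take 25 photos, and retrain the model
    pass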

And now you can give specific people access to specific actions and folders on the computer. In essence, we have made an analogue of Face ID. From here everything is limited only by your imagination: having taught our Frosya to "see" and "understand" what is happening in front of the screen, you can teach her many interesting functions. I had ideas for controlling the PC with hand gestures, for example "raise an open palm to minimize all windows", "show a thumbs-up to turn off the computer", and so on. There is no particular practical benefit in this, but it would be interesting to implement for general development.

Now it's up to you; expand and deepen your voice assistant as you wish. Personally, I plan to add one more interesting feature to Frosya, which I will definitely tell you about in the next article. But more on that later…

All the best!
