Determining a person’s emotional state from video analysis of their face

A person’s emotional state is one of the most important aspects of human life and personality, playing a significant role in everyday routine and interaction with the outside world. It permeates every area of our lives, influencing our thoughts, actions, and interactions with the environment. Emotions are also an integral part of human nature, manifesting as reactions to external events and internal experiences.

This topic remains highly relevant given the growing need to develop and apply emotion classification technologies in various spheres of life. With the increasing availability of cameras and advances in computer vision algorithms, it is becoming possible to build systems that automatically recognize people's emotional states from their facial expressions. This has potential applications in psychology, medicine, marketing, education, and other fields where understanding emotions plays an important role. Such technologies can help improve customer experiences, personalize training, optimize advertising campaigns, and even detect early signs of mental disorders.

In this article, I would like to walk through building a system that classifies faces into seven basic emotions and produces a summary of a person’s emotional state from video material. To implement it, a convolutional neural network was created with the PyTorch library for the classification task; this model is then applied to video frames processed with the OpenCV library, and the final emotional-state report is generated with ChatGPT 4, the generative AI chatbot developed by OpenAI.

Convolutional neural networks (CNNs) are a class of deep neural networks that are especially effective in image analysis tasks. Their main difference from other types of neural networks is the use of the convolution operation instead of ordinary matrix multiplication in one or more layers. The essence of convolution is to apply a filter (or kernel) to the original image to detect certain characteristics, such as edges, corners, or other textures, which allows a CNN to process visual information efficiently. The input takes a value for each pixel of the image in the range of the color palette (as shown in Figure 1); the data then passes through convolution and pooling layers before being fed into the activation function and the fully connected part of the network.

Figure 1. CNN sequence
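As a minimal illustration of the convolution and pooling steps described above (a sketch for intuition, not part of the final model), a single grayscale image can be passed through one convolutional layer followed by max-pooling:

import torch
import torch.nn as nn

# A single 48x48 grayscale image: (batch, channels, height, width)
image = torch.randn(1, 1, 48, 48)

# One convolutional filter bank: 1 input channel -> 8 feature maps, 3x3 kernels
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

features = pool(torch.relu(conv(image)))
print(features.shape)  # torch.Size([1, 8, 24, 24]) - spatial size halved by pooling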

To train the model, several datasets available on the Kaggle platform were selected (https://www.kaggle.com/datasets/msambare/fer2013), which include images of faces classified into seven emotional categories: 'Angry', 'Disgust', 'Fear', 'Happiness', 'Neutral', 'Sadness', and 'Surprise'. These datasets contain a wide variety of facial expressions, providing comprehensive material for training and testing the accuracy of the CNN. In total, 34,059 images were used.

Figure 2. Classes in the dataset
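A minimal sketch of how such a dataset could be loaded, assuming the images are organized into one sub-folder per emotion class (as in the FER2013 archive on Kaggle); the folder paths and batch size below are placeholders:

import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Convert images to 48x48 single-channel tensors scaled to [0, 1]
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((48, 48)),
    transforms.ToTensor(),
])

# "data/train" and "data/test" are placeholder paths with one folder per class
train_dataset = ImageFolder("data/train", transform=transform)
test_dataset = ImageFolder("data/test", transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

print(train_dataset.classes)  # e.g. ['Angry', 'Disgust', 'Fear', ...]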

The neural network architecture was designed for 48 by 48 pixel grayscale images and is built with PyTorch, a popular deep learning library. The network starts with two convolutional layers (self.conv1 and self.conv2), each followed by a ReLU activation function and a max-pooling layer (self.pool).

import torch
import torch.nn as nn

num_classes = 7  # seven emotion classes

class EmotionCNN(nn.Module):
    def __init__(self):
        super(EmotionCNN, self).__init__()
        # Two convolutional blocks: 1 -> 32 -> 64 feature maps
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        # After two 2x2 poolings a 48x48 image becomes 12x12 with 64 channels
        self.fc1 = nn.Linear(64 * 12 * 12, 1024)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(1024, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 64 * 12 * 12)   # flatten the feature maps
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)                # raw class scores (logits)
        return x

The EmotionCNN class defines a convolutional neural network (CNN) architecture in PyTorch for emotion classification in images. It starts with two convolutional layers that extract visual features from the input images; each convolutional layer is followed by a ReLU activation function and a max-pooling layer that reduces the spatial dimensionality of the data. The flattened features are then fed to a fully connected layer, which converts them into a vector of 1024 elements, after which a dropout layer is applied to reduce the risk of overfitting. The architecture ends with a second fully connected layer, which maps the 1024 elements to a number of outputs equal to the number of emotion classes (num_classes). The network can learn complex visual features thanks to its depth and nonlinearities, while dropout and pooling help control overfitting, making it a suitable tool for emotion classification.
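To sanity-check the architecture, the model can be run on a dummy batch and the output shape inspected (a quick check, assuming the seven emotion classes defined above):

model = EmotionCNN()
dummy_batch = torch.randn(4, 1, 48, 48)  # 4 grayscale 48x48 images
logits = model(dummy_batch)
print(logits.shape)                      # torch.Size([4, 7]) - one score per class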

import torch.optim as optim

# Training setup: cross-entropy loss and the Adam optimizer with a learning rate of 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = EmotionCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 30  # example value; the exact number of epochs is not fixed here
train_losses, test_losses = [], []

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backpropagation and weight update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    train_loss = running_loss / len(train_loader)
    train_losses.append(train_loss)

    # Evaluation on the test set without gradient computation
    model.eval()
    running_loss = 0.0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item()

    test_loss = running_loss / len(test_loader)
    test_losses.append(test_loss)

    print(f'Epoch {epoch + 1}, Train Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}')

This algorithm describes the process of training and testing the convolutional neural network for the emotion classification task using PyTorch. The model is moved to a specific device (CPU or GPU) so that calculations are performed efficiently. Cross-entropy loss is used as the loss function, and the optimizer is Adam with a learning rate of 0.001. During training, each epoch iterates through the training data loader (train_loader), computes the loss on the training set, and updates the model weights. Then, in evaluation mode (eval), the model is tested on an independent data set without computing gradients (torch.no_grad()) to determine the loss on the test set. Training and testing losses are recorded for later analysis. Printing the loss after each epoch makes it possible to monitor the training process and adjust model parameters as necessary, minimizing the gap between training and test losses and thereby improving the model's ability to generalize to new data.
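Since the losses are stored in train_losses and test_losses, they can be plotted after training to inspect convergence and possible overfitting; a small matplotlib sketch:

import matplotlib.pyplot as plt

plt.plot(train_losses, label="Train loss")
plt.plot(test_losses, label="Test loss")
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.legend()
plt.show()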

model.eval()
dummy_input = torch.randn(1, 1, 48, 48, device=device)  # example input that defines the expected shape
onnx_path = "new_model_emotion_recognition_112.onnx"
torch.onnx.export(model,
                  dummy_input,
                  onnx_path,
                  export_params=True,        # store the trained weights in the file
                  opset_version=10,
                  do_constant_folding=True,  # fold constant expressions for optimization
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size'},    # allow a variable batch size
                                'output': {0: 'batch_size'}})
print(f"Model successfully exported to {onnx_path}")

This code exports the trained PyTorch model to the ONNX (Open Neural Network Exchange) format. ONNX is an open format for representing machine learning models that makes them portable and usable across a variety of frameworks, tools, platforms, and optimizers.
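Once exported, the model can be run outside of PyTorch. A minimal sketch using the onnxruntime package (assumed to be installed; the random array stands in for a real 48x48 grayscale face crop scaled to [0, 1]):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("new_model_emotion_recognition_112.onnx")

# One 48x48 grayscale image, scaled to [0, 1] as during training
dummy = np.random.rand(1, 1, 48, 48).astype(np.float32)
scores = session.run(None, {"input": dummy})[0]

emotions = ['Angry', 'Disgust', 'Fear', 'Happiness', 'Neutral', 'Sadness', 'Surprise']
print(emotions[int(np.argmax(scores))])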

Also, to produce a conclusion about a person’s emotional state, it was decided to use the ChatGPT 4 chatbot. To implement this, the following function was written using the g4f library (https://github.com/xtekky/gpt4free):

    def get_emotional_state_report(self, emotion_count):
        # Convert raw per-emotion counts into percentages of all detections
        total_emotions = sum(emotion_count.values())
        emotion_percentages = {emotion: (count / total_emotions) * 100 for emotion, count in emotion_count.items()}

        emotions_summary = "\n".join(
            [f"{emotion}: {percentage:.2f}%" for emotion, percentage in emotion_percentages.items()])
        prompt = f"You are a psychologist, a specialist in mental and emotional states with many years of experience. You must write a detailed conclusion and report on the person's overall emotional state and their dominant state" \
                 f" based on the video you were shown, without listing all the detected emotions, with a fairly extensive conclusion and without citing sources, based on the following distribution of emotions:\n{emotions_summary}\n\nThe emotional state is:"
        print(prompt)  # For debugging only

        self.client = Client()  # Client from the g4f library (from g4f.client import Client)

        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )

        return response.choices[0].message.content

This function sends a request whose text contains a summary of the emotions detected in the video. On top of it, a small graphical interface was created to visualize the process and display the conclusion about the person’s emotional state returned by the GPT-4 model.
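A simplified sketch of how per-frame predictions could be aggregated into the emotion_count dictionary that get_emotional_state_report() expects; the Haar cascade face detector, the video path, and the frame loop are assumptions about the OpenCV part of the pipeline, which the article does not list in full:

import cv2
import numpy as np
from collections import Counter

emotions = ['Angry', 'Disgust', 'Fear', 'Happiness', 'Neutral', 'Sadness', 'Surprise']
face_detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
net = cv2.dnn.readNetFromONNX("new_model_emotion_recognition_112.onnx")

emotion_count = Counter()
cap = cv2.VideoCapture("input_video.mp4")  # placeholder video path

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.3, 5):
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
        blob = cv2.dnn.blobFromImage(face, scalefactor=1.0 / 255.0, size=(48, 48))
        net.setInput(blob)
        emotion_count[emotions[int(np.argmax(net.forward()))]] += 1

cap.release()
# The resulting counts can then be passed to get_emotional_state_report(emotion_count)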

Completing this emotion recognition application with artificial intelligence and computer vision technologies highlights what modern tools can do for understanding human emotions: machine learning algorithms can analyze and classify emotional states directly from video footage.
