Measuring empathy in dialogues between people and embodied chatbots

If a voice assistant stuck in a traffic jam complains that it hates traffic jams too, the driver and passengers feel a little better: they are not the only ones suffering. In psychology this is called rapport, and it usually leads to trusting relationships between people. If chatbots are also capable of empathy, and, more importantly, people appreciate it, isn't that already a kind of strong emotional AI? Using dialogues between people and chatbots as our material, we will try to figure out how empathetic avatars are today and what features could be added to them.

The first part is here.

How we will study dialogue emotions this time

Last time we focused on analyzing and visualizing a variety of emotions in dialogues taken from transcripts of YouTube videos in which users interact with avatars – embodied artificial agents – in games and apps. As a reminder, all the data is available in our repositories on Kaggle.

This time we will take a single emotion and examine in detail how exactly the emotional interaction between users and avatars plays out in their utterances.

As before, we worked with 250-word dialogue chunks and asked the Llama 70B Instruct model to split each chunk by speaker and to label the fragments using the PAD (Pleasure-Arousal-Dominance) model. Here is the prompt we used:

Analyze the following text and extract a dialogue between speakers. Then, perform the following tasks:

1. Extract the speaker's speech and enclose it in [speaker_1_speech] [/speaker_1_speech] tags.
2. Extract another speaker's speech and enclose it in [speaker_2_speech] [/speaker_2_speech] tags.
3. Evaluate the speaker's speech using the Pleasure-Arousal-Dominance (PAD) model, with scores ranging from -10 to 10 for each dimension:
   - Pleasure: -10 (extreme displeasure) to 10 (extreme pleasure)
   - Arousal: -10 (completely calm) to 10 (extremely excited)
   - Dominance: -10 (totally submissive) to 10 (totally dominant)
   Enclose the PAD scores in [speaker_1_pad] [/speaker_1_pad] tags, separated by commas.
4. Evaluate another speaker's speech using the same PAD model and scoring system.
   Enclose the PAD scores in [speaker_2_pad] [/speaker_2_pad] tags, separated by commas.

Text to analyze:
{chunk}

Ensure that you only extract relevant dialogue and provide accurate PAD scores based on the emotional content of the speech. If no relevant dialogue is found, return "No relevant dialogue found."

Example output format:
[speaker_1_speech]speaker's extracted speech[/speaker_1_speech]
[speaker_1_pad]5,-2,7[/speaker_1_pad]
[speaker_2_speech]another speaker's extracted speech[/speaker_2_speech]
[speaker_2_pad]-1,3,-4[/speaker_2_pad]
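For reference, here is a minimal sketch of how the model's tagged output can be parsed; the helper functions and the sample response below are illustrative, not the exact code from our notebook.

```python
import re

def parse_tag(text: str, tag: str) -> str | None:
    """Extract the content of a [tag]...[/tag] block, if present."""
    match = re.search(rf"\[{tag}\](.*?)\[/{tag}\]", text, re.DOTALL)
    return match.group(1).strip() if match else None

def parse_llm_response(response: str) -> dict | None:
    """Turn the model's tagged output into per-speaker speech and PAD triples."""
    if "No relevant dialogue found" in response:
        return None
    parsed = {}
    for speaker in ("speaker_1", "speaker_2"):
        speech = parse_tag(response, f"{speaker}_speech")
        pad_raw = parse_tag(response, f"{speaker}_pad")
        if speech is None or pad_raw is None:
            return None  # malformed output: skip this chunk
        pleasure, arousal, dominance = (float(x) for x in pad_raw.split(","))
        parsed[speaker] = {"speech": speech, "pad": (pleasure, arousal, dominance)}
    return parsed

# Example: parsing a response that follows the format from the prompt
sample = (
    "[speaker_1_speech]I hate these traffic jams.[/speaker_1_speech]\n"
    "[speaker_1_pad]-6,4,-2[/speaker_1_pad]\n"
    "[speaker_2_speech]Me too, they are the worst part of my day.[/speaker_2_speech]\n"
    "[speaker_2_pad]-5,3,-1[/speaker_2_pad]"
)
print(parse_llm_response(sample))
```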

Remember that we have the full history of each dialogue, which means we can look at dialogue emotions in their dynamics: how the conversation develops, where it goes wrong, and where, on the contrary, it hits the mark and contact arises between user and avatar.

We visualized each dialogue as a graph: the nodes are individual utterances, connected by edges at each transition from the previous utterance to the next. The result is a directed dynamic graph, or rather two isolated subgraphs, since the user's and the avatar's utterances each form their own history. Here is an example of a visualization of such a graph.
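As an illustration, here is a minimal sketch of how one such subgraph can be assembled with networkx; the PAD sequences are made up for the example.

```python
import networkx as nx
import numpy as np

def build_pad_graph(pad_sequence: list[tuple[float, float, float]]) -> nx.DiGraph:
    """Node i is the i-th utterance of one speaker (with its PAD coordinates);
    each edge i -> i+1 is the transition to the next utterance, weighted by the
    Euclidean distance between the two utterances in PAD space."""
    graph = nx.DiGraph()
    points = [np.array(p, dtype=float) for p in pad_sequence]
    for i, pad in enumerate(points):
        graph.add_node(i, pad=pad)
    for i in range(len(points) - 1):
        graph.add_edge(i, i + 1, weight=float(np.linalg.norm(points[i + 1] - points[i])))
    return graph

# Two isolated subgraphs: one for the user's utterances, one for the avatar's
user_graph = build_pad_graph([(2, 1, 0), (4, 3, 1), (5, 2, 2)])
avatar_graph = build_pad_graph([(1, 0, 1), (3, 2, 1), (5, 3, 2)])
```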

At each step there is a transition from one utterance to the next, and we can see how close or far apart the user's and the chatbot's utterances are in PAD coordinate space.

Empathy in Dialogues: Adding Metrics and Comparing People and Bots

Since we can measure the distance between the agent's and the user's nodes at every transition, we can rank the dialogues by how close their coordinates are. Which is what we did.
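Roughly, the ranking step looks like this; the dialogue dictionary and PAD values below are placeholders, not our real data.

```python
import numpy as np
import pandas as pd

def mean_pad_distance(user_pads: np.ndarray, avatar_pads: np.ndarray) -> float:
    """Average Euclidean distance between the interlocutors' PAD vectors,
    taken turn by turn (sequences are truncated to the same length)."""
    n = min(len(user_pads), len(avatar_pads))
    return float(np.linalg.norm(user_pads[:n] - avatar_pads[:n], axis=1).mean())

# Hypothetical dialogues: dialogue_id -> (user PAD array, avatar PAD array)
dialogues = {
    "dialogue_a": (np.array([[2, 1, 0], [4, 3, 1]]), np.array([[1, 0, 1], [3, 2, 1]])),
    "dialogue_b": (np.array([[6, 5, 2], [7, 4, 3]]), np.array([[-4, -2, 0], [-5, -1, 1]])),
}

# Smaller average distance = emotionally closer dialogue
ranking = pd.Series(
    {d_id: mean_pad_distance(u, a) for d_id, (u, a) in dialogues.items()}
).sort_values()
print(ranking)
```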

Our hypothesis: the closer the user's and the chatbot's utterances are in PAD coordinates at each transition, the more similar their emotions, and the more empathy (in other words, compassion) is expressed in the dialogue. Why does this matter? In dialogues with an empathic chatbot, the user and the chatbot are more likely to understand each other.

In addition, we also have a dataset of dialogues between human interlocutors, so they can be compared by distance with the dialogues between people and avatars.

This is what we got when we ranked all the dialogues in both datasets by the average distance between the interlocutors' utterances.

Human interlocutors' utterances are on average closer to each other than those of humans and bots. If the hypothesis is correct, empathy is more evident in human-to-human conversations, while conversations between humans and bots remain comparatively distant. The result is fairly obvious – people get along better with other people than with chatbots – but it is good to have confirmed it with data on the chart.

It is important to understand that empathy can be expressed in a dialogue in many ways, and distance is only one of them. So, to get a more balanced assessment of the level of empathy between interlocutors, we added further variables. Here are some of them (a sketch of how they can be computed follows the list):

  • Synchronicity – how often a change of emotion in one interlocutor is accompanied by a similar change in the other. We can track whether the emotions of both interlocutors moved in the same direction, positive or negative; if so, we have captured emotional synchronicity at that point of the dialogue.

  • Rapprochement – if the emotional states of the interlocutors converge from the beginning to the end of the dialogue, this may be a sign of growing empathy.

  • Cross-correlation – even if the other interlocutor's emotions change not immediately but with a delay, this variable captures the similarity of the emotional shifts. We treat it as a complement to emotional synchronicity.
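Here is a minimal sketch of how these three variables can be computed from per-utterance PAD sequences; the exact formulas in our notebook may differ in detail.

```python
import numpy as np

def synchrony(user_pads: np.ndarray, avatar_pads: np.ndarray) -> float:
    """Share of turn-to-turn changes where both interlocutors' PAD values
    move in the same direction (positive or negative)."""
    user_delta = np.sign(np.diff(user_pads, axis=0))
    avatar_delta = np.sign(np.diff(avatar_pads, axis=0))
    return float((user_delta == avatar_delta).mean())

def convergence(user_pads: np.ndarray, avatar_pads: np.ndarray) -> float:
    """Positive if the interlocutors end up emotionally closer than they started."""
    start = np.linalg.norm(user_pads[0] - avatar_pads[0])
    end = np.linalg.norm(user_pads[-1] - avatar_pads[-1])
    return float(start - end)

def max_cross_correlation(user_series: np.ndarray,
                          avatar_series: np.ndarray,
                          max_lag: int = 3) -> float:
    """Largest Pearson correlation between two 1-D emotion series over small lags,
    so that delayed emotional 'echoes' are captured as well as immediate ones."""
    best = -1.0
    for lag in range(max_lag + 1):
        a = user_series[: len(user_series) - lag]
        b = avatar_series[lag:]
        n = min(len(a), len(b))
        if n > 1:
            best = max(best, float(np.corrcoef(a[:n], b[:n])[0, 1]))
    return best

# Toy PAD sequences for one dialogue
user = np.array([[2, 1, 0], [4, 3, 1], [5, 2, 2], [6, 3, 2]])
avatar = np.array([[1, 0, 1], [3, 2, 1], [5, 3, 2], [6, 4, 2]])
print(synchrony(user, avatar), convergence(user, avatar),
      max_cross_correlation(user[:, 0], avatar[:, 0]))
```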

Since the dialogue utterances in our case also form directed subgraphs, we added graph metrics as well. The distances between consecutive vertices differ, so we can use them as edge weights and also calculate centrality measures for the vertices. Here are some of them (a sketch follows the list):

  • Average edge weight – the arithmetic mean of all edge weights in the graph.

  • Total path length – the sum of all edge weights.

  • Eigenvector centrality – a measure of influence that ranks a vertex higher when it is connected to other influential vertices.
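A minimal sketch of these graph metrics with networkx; the toy graph and the decision to compute centrality on the undirected copy are our illustrative assumptions.

```python
import networkx as nx

def graph_metrics(graph: nx.DiGraph) -> dict:
    """Aggregate metrics of a weighted utterance graph: mean edge weight,
    total path length and the largest eigenvector centrality among vertices."""
    weights = [data["weight"] for _, _, data in graph.edges(data=True)]
    # Centrality is computed on the undirected copy so that a simple chain of
    # utterances does not degenerate into all-zero scores.
    centrality = nx.eigenvector_centrality_numpy(graph.to_undirected(), weight="weight")
    return {
        "mean_edge_weight": sum(weights) / len(weights) if weights else 0.0,
        "total_path_length": sum(weights),
        "max_eigenvector_centrality": max(centrality.values()),
    }

# Toy chain of three utterances with PAD-distance edge weights
toy = nx.DiGraph()
toy.add_weighted_edges_from([(0, 1, 2.4), (1, 2, 1.1)])
print(graph_metrics(toy))
```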

Since our data is based on conversations, we also added metrics specific to them, such as a measure of semantic similarity between utterances, to better understand how users and avatars interact.

Now we can get the correlations of all the indicators in the form of a square matrix and see what is connected with what.
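In pandas this is a one-liner over the feature table; the values below are placeholders, not our actual results.

```python
import pandas as pd

# One row per dialogue, one column per indicator (distances, synchrony,
# semantic similarity, graph metrics, ...); the numbers here are hypothetical.
features = pd.DataFrame({
    "mean_pad_distance": [3.1, 5.4, 2.2, 4.8],
    "synchrony": [0.62, 0.35, 0.71, 0.40],
    "semantic_similarity": [0.81, 0.44, 0.86, 0.52],
    "eigenvector_centrality": [0.77, 0.41, 0.83, 0.49],
})

correlations = features.corr()  # square Pearson correlation matrix
print(correlations.round(2))
```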

This is what we got:

It is noticeable that the variables form large clusters and are highly correlated with each other. Therefore, we will look at just one of the correlations, which points to the importance of analyzing what exactly people and avatars discuss in dialogues.

  • Semantic similarity and influence level: 0.8824. In effect, this means that utterances close in meaning have a stronger influence on the course of the conversation. Perhaps this is the key to avatar empathy: the chatbot's adjustment to its interlocutor, something that can be exploited when developing and launching bots in production.

This means it makes sense to quantify the dialogues themselves and add them to the other features. We did this by encoding the dialogues with a sentence-transformers model. The encoder produces embeddings, i.e. points in a vector space, in our case 746-dimensional. In this way, we encoded both the 250-word chunks and the individual user and agent utterances.
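A minimal sketch of the encoding step; the specific encoder name is only an example, and the embedding dimension depends on the model you choose.

```python
from sentence_transformers import SentenceTransformer

# The exact encoder is an assumption here; any sentence-transformers model works the same way.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

utterances = [
    "I hate these traffic jams.",
    "Me too, they are the worst part of my day.",
]

embeddings = model.encode(utterances)  # one dense vector per utterance
print(embeddings.shape)                # (2, embedding dimension of the chosen model)
```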

Adding non-verbal cues to the analysis of dialogue emotions and finding out how predictable the appearance of an avatar is

For each dialogue in our dataset, there are manually labeled features that are not contained in the dialogue itself.

The most important feature for us is whether the avatar the user communicates with looks simplified or realistic.

Last time we touched on Clifford Nass's social response theory. According to this theory, we instantly switch into the mode of communicating with an equal interlocutor as soon as it shows even minimal anthropomorphic features. For example, we see two circles with dots inside, placed next to each other on a horizontal line, and we read them as the interlocutor's eyes.

This theory has long been used by developers of anthropomorphic interlocutors. But there is also another approach, in which developers create hyper-realistic avatars. In this case the avatar, using rich non-verbal signals – facial expressions, gestures, body posture and so on – reliably conveys the subtlest play of emotions. This helps to express secondary or complex emotions, whereas simplified avatars mostly convey primary, i.e. simple, ones.

The Replika app was a revolution in its niche and paved the way for realistic avatars. Replika users enter into romantic relationships with their avatars, get married and even have children, that is, they engage in complex and lengthy forms of interaction with them. Of course, with enough imagination you could have children with Clippy too, but in the case of Replika this has literally become widespread, judging by the numerous posts in themed communities. Let's not forget the developers, for whom such vigorous activity means a noticeable increase in session length and session frequency, the main metrics for assessing a chatbot's effectiveness.


However, hyperrealism has a well-known problem – if the avatar is too similar to a person, the uncanny valley effect kicks in. People perceive such images as almost, but not quite, living beings, which provokes fear and disgust towards such avatars.


A typical example is animatronics of various kinds. Japanese Bunraku puppets can provoke unpleasant emotions in an unprepared viewer: although they look like dolls, their natural movements are more reminiscent of living people. And the overly realistic movements of Boston Dynamics robot dogs back in the 2000s initially looked frightening, so much did they resemble the movements of living creatures, mainly spiders. So developers of realistic embodied chatbots have to balance between similarity and excessive similarity of avatars to living creatures.

Another feature we took for classification is the type of interaction. It can be dyadic, where the avatar and the user communicate one-on-one, or complex, where there can be many avatars and users, so each interlocutor has to keep many other interlocutors in mind and coordinate their actions with this complexity.

The third feature is the environment in which the user and the avatar are placed. Recall that we worked with videos of YouTubers talking to avatars – embodied chatbots – and this communication took place in different environments. One example is chatting in Character.ai, where you can choose a bot character or, more interestingly, create your own. Another is a game environment, where communication with an avatar follows the logic of a preset gameplay; the sample included games such as Suck Up, Replica Smart NPCs and Yandere AI Girlfriend Simulator. In other words, chat and game are different frames that shape user behavior, which means they need to be taken into account when analyzing dialogues.

As a result, we have three binary features and eight groups, each corresponding to one combination of the three features.

Environment | Appearance | Type of interaction
chat / game | simplified / realistic | dyadic / complex

In our dataset these features are represented unevenly, so we will take the best-represented one: the avatar's appearance, realistic or simplified.

realistic: 0 | simplified: 1

Clearly, there is not much data, so during preparation we generated additional synthetic data, balanced the classes, performed feature selection and tuned hyperparameters for several methods of predicting these features. As always, everything is available in the notebook on Kaggle.
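For orientation, here is a sketch of what the balancing and feature-selection steps can look like; SMOTE and SelectKBest are stand-ins, and the exact tools and parameters are in the notebook.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Stand-in data: in the real notebook X holds dialogue embeddings plus emotional
# and graph metrics, and y encodes avatar appearance (0 = realistic, 1 = simplified).
X, y = make_classification(n_samples=200, n_features=100, weights=[0.7, 0.3], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Oversample the minority class and keep only the most informative features
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X_train, y_train)
selector = SelectKBest(f_classif, k=50).fit(X_balanced, y_balanced)
X_selected = selector.transform(X_balanced)
```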

As a result, the models were trained on the following set of features:

  • Dialogue embeddings

  • Emotional metrics

  • Graph metrics
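Here is a sketch of the training and evaluation loop for one of the methods; the stand-in data and feature names are illustrative, and the full pipeline lives in the Kaggle notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in feature matrix; the column names mirror the feature groups listed above.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
feature_names = [f"embedding_{i}" for i in range(17)] + ["cross_correlation", "synchrony", "mean_edge_weight"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)

test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
importance = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)

print(f"test ROC AUC: {test_auc:.4f}, CV ROC AUC: {cv_auc:.4f}")
print(importance.head())
```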

Ranking the features by importance yielded the following result:

The most useful feature for training turned out to be cross-correlation. Recall that it captures delayed, subtle changes of emotion in the dialogue. In the next iteration of the study we plan to examine it in more detail.

Here are the results of the avatar appearance type predictions we got:

Method | Test ROC AUC | Cross-validation ROC AUC
Random Forest | 0.9980 | 0.9925
XGBoost | 0.9947 | 0.9868
LightGBM | 0.9937 | 0.9856
CatBoost | 0.9537 | 0.9568
SVM | 0.9177 | 0.9287
Decision Tree | 0.8749 | 0.8854
Naive Bayes | 0.8213 | 0.8261

Training and evaluation on the test data gave fairly high scores, which means we can predict the avatar's appearance from dialogue and emotional features quite accurately. Of course, with a small dataset there is always a risk of overfitting, so in the future we plan to collect even more data from YouTube and recheck these results on it.

What does this mean for avatar developers, i.e. what opportunities open up if we can predict features such as avatar type? If the avatar's appearance can be predicted from the emotions and content of the dialogue, then during the conversation the avatar could adapt to its user and slightly change its appearance, becoming a little more realistic or, on the contrary, more cartoonish. If our hypothesis is correct, the user and the avatar will then understand each other better and enjoy the conversation even more.
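Purely as an illustration, and assuming a classifier like the one sketched above plus a hypothetical rendering switch, such adaptation could look like this:

```python
def choose_avatar_style(model, dialogue_features, threshold: float = 0.6) -> str:
    """Hypothetical adaptation step: pick a rendering style from the predicted
    probability that the ongoing dialogue resembles dialogues with a simplified avatar.
    The mapping from probability to style is a design choice, not a finding."""
    p_simplified = model.predict_proba([dialogue_features])[0, 1]
    if p_simplified >= threshold:
        return "cartoonish"   # lean towards a simpler, more stylised look
    if p_simplified <= 1 - threshold:
        return "realistic"    # lean towards a more realistic look
    return "keep_current"     # not confident enough to change anything
```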
