Is it possible to teach a neural network to detect irony?

Scientists at Saint Petersburg State University asked this question and studied the phonetic and paralinguistic characteristics of irony. They analyzed fragments of dialogue from films and TV series, recorded speakers, and studied their gestures and facial expressions on video. To determine the acoustic characteristics of irony, they used acoustic, perceptual, and statistical analysis. Ulyana Kochetkova, Associate Professor of the Department of Phonetics and Methods of Teaching Foreign Languages at Saint Petersburg State University, talks about the results.

Ulyana Kochetkova

Associate Professor of the Department of Phonetics and Methods of Teaching Foreign Languages at St. Petersburg State University

Developers of neural networks trained on large multimedia corpora and databases still face the problem of misinterpreting ironic statements, in which the intended meaning differs from the literal lexical one. Such errors reduce the effectiveness of the resulting software and prevent full-fledged communication between human and machine. To solve this problem, we first need to study the features that distinguish irony from ordinary speech, and then train a neural network or other model to use this information.

Irony is not as frivolous a research subject as it may seem at first glance. Imagine communicating with a foreigner or with a voice assistant. We can agree with the interlocutor (“Yes, of course!”) or praise him (“Wonderful!”). But the same words can be pronounced ironically, and the meaning then flips to the opposite. Neither the foreigner (unless he has mastered all the subtleties of the Russian intonation system) nor the voice assistant will notice the catch, since both decode the meaning from the words alone. It is this type of irony that we at the Department of Phonetics and Methods of Teaching Foreign Languages at St. Petersburg State University decided to study, since the problem of recognizing ironic utterances by their phonetic characteristics has not yet been fully solved for any language, and no one before us had carried out a detailed analysis of the acoustic characteristics of irony in Russian. We won the RFBR grant “Acoustic characteristics of irony in functional intonation models” and started working on the project in early 2020. Fascinating but challenging work lay ahead, especially since the project partially coincided with the COVID-19 pandemic.

Creation of a corpus of ironic speech

First, my students and I compiled a collection of ironic utterances from open sources. To do this, we had to watch and listen to over 200 hours of Russian films, TV series, television and radio plays, audiobooks, and performances by comedians and variety artists. Oddly enough, the latter made little use of the type of irony that interested us, namely negation irony: they joked a lot and there was plenty of sarcasm, but cases where the words said one thing and the intonation expressed another were rare. We also added fiction texts, looking for authorial remarks such as “he said ironically” or “she answered sarcastically.” In total we collected over 700 excerpts, which we analyzed in terms of their sound and context. Based on this analysis, the students then composed their own texts interspersed with ironic remarks: satirical ones, texts in the style of Paustovsky, fairy tales, essays, and others. The texts did not contain the word “irony” itself, since differences in how speakers interpret it could have affected the results of the study. The plot was constructed so that the ironic rendering of a given remark would arise spontaneously during reading and follow from the logic of the text.

When I started working on the texts, I wanted to include homonymous lines, i.e. lines identical in spelling and sound but different in meaning: with and without irony. But this required dialogues in which one interlocutor mimics the other, and such texts turned out to be “toxic.” Even when the lines with homonymous fragments were spaced far apart, the narrative became boring and artificial, which affected the reading style: the natural sound and the reader's interest disappeared. However, we still needed ironic and non-ironic utterances with identical wording, since only by comparing them could we draw conclusions about the features that distinguish an ironic rendering from an utterance without irony, where a native speaker speaks “seriously” and must be taken literally.

So we compiled a large number of short monologues and dialogues of 2-4 phrases that contained the same fragments in different contexts. For example, “He spoke so brilliantly! What a great guy!” and “What a great guy! He did nothing, didn't pass a single test, and now he asks not to be expelled!”

Each speaker read a set of 60-80 such mini-texts, in which the target fragments appeared in ironic and non-ironic contexts in random order. Speakers were asked to read the lines as they would say them in natural communication. Sometimes phrases that we considered absolutely neutral were read with irony or simply very emotionally. This was especially typical of professional actors and of those who had once studied in a theater studio. In such cases, we asked the speaker to reread the text and provide an additional version. A recording engineer and an experimenter were responsible for the audio recording. The experimenters were mainly students, who gained extensive experience in creating a speech corpus, both in preparing text material and in working with speakers.

The texts were recorded in the department's sound recording studio using professional equipment. With the speaker's consent, a video recording was made simultaneously with the audio recording on a Sony Handycam FDR-AX700 at 100 frames per second (frames from the video recording are shown in Fig. 1). A total of 60 speakers were recorded, and more than 15,000 target fragments were obtained. At the next stage, the material was annotated orthographically and phonetically.

Fig. 1. Recording the speakers on a video camera to capture facial expressions and gestures.

We also obtained additional material: French speech without irony, with irony, and with emotional meanings close to it, namely surprise and doubt. It turned out that Russian and French have the most in common in questions. Some similarity was also observed in the rendering of ironic exclamatory phrases such as “How nice!” (“Que c'est gentil!”). But the melodic rendering of declarative sentences differs so much that a native Russian speaker is unlikely to hear irony in a French phrase, and vice versa: a French speaker does not recognize irony in Russian.

Perceptual experiments

There is an opinion that irony cannot arise out of context and without an interlocutor, even an imaginary one, who is able to recognize it. In addition, we want so much to be understood that, as a rule, we try to convey the meaning to our interlocutor in every possible way: we add lexical and grammatical markers that indicate irony (like “oh, well, I need it too,” etc.) or change the word order. Compare two Russian sentences that both translate as “I need your bag” and differ only in word order: one will be read neutrally, while in the other we first suspect irony and only then imagine a reading with insistent persuasion. But such redundancy of information is not present always and everywhere. Ambiguous, contradictory remarks arise especially often in telephone conversations, and this sound-only mode of communication is also the most common one with voice assistants today.

Therefore, we were interested in the question: can listeners recognize irony in fragments that are “taken out” of context, when they can rely only on the sound of the fragment itself? To answer it, we cut fragments with identical wording out of recordings of ironic and non-ironic contexts. On the computer screen, participants saw both contexts (two mini-texts) but heard only one sound fragment, and had to indicate which text it belonged to. As with recording the speakers, we avoided the term “irony” itself so that differences in its interpretation would not influence participants’ responses.

Thanks to the acoustic analysis of those pairs of fragments in which both the fragment with irony and the homonymous fragment without irony were correctly assessed by the majority of listeners, we were able to identify the most striking perceptually relevant (i.e., significant for the listener) phonetic characteristics of irony.
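The selection step described above, keeping only pairs in which both the ironic and the non-ironic fragment were matched to the right context by most listeners, can be illustrated with a small sketch. The data layout (tuples of pair id, true label and chosen label), the label names and the 50% majority threshold are assumptions made for this example and are not taken from the study.

```python
from collections import defaultdict


def well_recognized_pairs(responses):
    """Return ids of pairs whose ironic AND neutral fragments were both
    matched to the correct context by a majority of listeners.

    responses: iterable of (pair_id, true_label, chosen_label) tuples,
    with labels "ironic" or "neutral" (hypothetical encoding).
    """
    # counts[(pair_id, true_label)] = [correct answers, total answers]
    counts = defaultdict(lambda: [0, 0])
    for pair_id, true_label, chosen_label in responses:
        tally = counts[(pair_id, true_label)]
        tally[0] += chosen_label == true_label
        tally[1] += 1

    selected = []
    for pair_id in sorted({pid for pid, _ in counts}):
        keep = True
        for label in ("ironic", "neutral"):
            correct, total = counts.get((pair_id, label), (0, 0))
            # A fragment with no answers, or with half or fewer correct
            # answers, disqualifies the whole pair.
            if total == 0 or correct / total <= 0.5:
                keep = False
        if keep:
            selected.append(pair_id)
    return selected
```

Only the pairs returned by such a filter would then be passed on to acoustic analysis, so that the measured differences reflect cues listeners actually relied on.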

Results of acoustic analysis

We found that there is no single intonation pattern that would give rise to ironic meaning in any utterance. When expressing irony, we try to create a contrast with the neutral rendering, and we can do so either by increasing certain characteristics (for example, speaking louder, drawing out words, widening the melodic range) or by decreasing them (we can “mutter,” as if “to ourselves,” mimicking the interlocutor's words). Whether a parameter is increased or decreased depends on the communicative type of the utterance (statement, question, exclamation), its lexical and grammatical composition, and the individual habits of the speaker. A similar situation occurs in French, although the values of specific parameters differ.

Another interesting feature of ironic utterances, which turned out to be common to Russian and French, is the appearance of a “broken” melodic contour (see Fig. 2, 3).
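One way to picture this contrast is to compare an ironic fragment with its neutral counterpart parameter by parameter and to keep the sign of the difference, since a departure in either direction may matter. The sketch below is only an illustration: the dictionary keys, the use of 0 for unvoiced frames and the choice of semitones for the melodic range are assumptions, not the measurement protocol of the study.

```python
import math


def semitone_range(f0_values_hz):
    """Melodic range of a contour in semitones (highest vs. lowest f0)."""
    voiced = [f for f in f0_values_hz if f > 0]  # 0 marks unvoiced frames
    if not voiced:
        raise ValueError("contour has no voiced frames")
    return 12 * math.log2(max(voiced) / min(voiced))


def contrast_profile(ironic, neutral):
    """Compare an ironic fragment with its neutral counterpart.

    Both arguments are dicts with hypothetical keys:
      duration_s   - fragment duration in seconds
      intensity_db - mean intensity in dB
      f0_hz        - list of f0 values in Hz, one per frame
    A ratio above 1 or a positive difference means an increase relative
    to the neutral version; a ratio below 1 or a negative difference
    means a decrease.
    """
    return {
        "duration_ratio": ironic["duration_s"] / neutral["duration_s"],
        "intensity_diff_db": ironic["intensity_db"] - neutral["intensity_db"],
        "range_diff_st": semitone_range(ironic["f0_hz"])
        - semitone_range(neutral["f0_hz"]),
    }
```

Keeping signed values rather than absolute ones makes it possible to see later whether a given speaker or utterance type tends to exaggerate or to reduce a parameter.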

Fig. 2. The intonation contour of the interrogative utterance “Is this my neighbor?”, pronounced without irony (left) and with irony (right), constructed using Prosogram.

Fig. 3. The intonation contour of the interrogative utterance “Elle a oublié?” (French: “Has she forgotten?”), pronounced without irony (left) and with irony (right), constructed using Prosogram.

Thus, in the course of our study we obtained data on the acoustic characteristics responsible for the perception of irony by native Russian speakers. These data can be used by developers of human-machine dialogue systems. It is also important that even speakers who did not express irony very vividly always rendered ironic and neutral utterances with the same lexical composition differently. Apparently, when expressing irony, the speaker necessarily departs from the model of a neutral utterance that exists in his or her mind. This observation can help in the development and improvement of user-oriented applications. In addition, a comparative study of Russian and French material led to conclusions about some universal characteristics of ironic speech.

Sound signal modifications, resynthesis

To understand which parameters can turn a neutral utterance into an ironic one and vice versa, we applied sound signal modification and resynthesis of the melodic contour. The characteristics of ironic utterances were transplanted onto neutral ones, and then the reverse was done with the original passages in which irony was present. Using Praat and Wave Assistant, we modified individual parameters (duration, intensity, melody) and also changed them jointly in various combinations. We presented the resulting resynthesized stimuli to listeners in perceptual experiments. The results showed that changing the duration or intensity of the signal is not enough to change its modality: the melodic characteristics must be transplanted as well. At the same time, it was more difficult to turn ironic utterances into neutral ones than vice versa. This is because expressing irony often produces a special timbre coloring, which has so far been difficult to modify and was therefore difficult to remove from the signal. However, the department is currently working on modifying prosodic timbre in the sound signal; this research is at the cutting edge of the field.
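To give a feel for what transplanting a melodic contour involves, the sketch below stretches or compresses a donor f0 contour so that it fits the number of frames of the target utterance. It covers only this time-normalization step and assumes a fully voiced contour sampled at regular intervals; the actual resynthesis of the audio, which the study carried out in Praat and Wave Assistant, is not shown, and the function name is hypothetical.

```python
def transplant_contour(donor_f0, target_len):
    """Resample a donor f0 contour (Hz, one value per frame) to
    target_len frames by linear interpolation over normalized time."""
    if target_len < 1 or len(donor_f0) == 0:
        raise ValueError("need a non-empty contour and target_len >= 1")
    if target_len == 1 or len(donor_f0) == 1:
        return [float(donor_f0[0])] * target_len

    last = len(donor_f0) - 1
    result = []
    for i in range(target_len):
        # Fractional position of target frame i inside the donor contour.
        pos = i * last / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        result.append(donor_f0[lo] * (1 - frac) + donor_f0[hi] * frac)
    return result
```

For example, a contour taken from an ironic reading could be resampled to the length of the neutral reading of the same words before being imposed on it; in practice, alignment would more likely be done segment by segment using the phonetic annotation.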

Analysis of irony in actor's speech (based on films and TV series)

Among the problems faced by developers of artificial intelligence systems today is a new one that arose with the first audiovisual synthesis systems: the synchronization of verbal and non-verbal information.

We wanted to find out whether the gesture (more precisely, the apex of the gesture) coincides with the most important part of the utterance: the stressed syllable of the word that is the informational focus of the utterance. This part is usually called the nucleus, or intonation center. Why is it the most important part of the utterance? Because the movement of the melodic curve in this part determines how we perceive the communicative type of the utterance: as a question, a statement, or an exclamation. Of course, when we (or a machine) hear a special (wh-) question, the question word itself determines the communicative type unambiguously. But with a general (yes/no) question with direct word order, everything becomes more complicated. Hence the errors it causes, both in the speech of foreigners learning Russian and in automatic recognition by a machine. In addition, the melodic pattern, not only in the intonation center but also beyond it, is responsible for expressing and perceiving emotional coloring or modality, for example the speaker’s confidence or uncertainty in his words.

Besides the coincidence or non-coincidence of the gesture with the boundaries of the phonetic unit, we wanted to see whether there is parallelism between the movement of the tone (the melodic curve) and the gesture in the intonation center of the utterance in actors' speech. Finally, it was important to find out which channel listeners and viewers rely on when perceiving ironic meaning: the video sequence or the acoustic characteristics. For this purpose, a series of three pilot perceptual experiments was carried out. In the first, only audio fragments were presented, cut out of context and containing no grammatical markers of irony, i.e. devoid of any “clue”. In the second, only the video sequence for the same fragments was presented, without sound. In the third, viewers were presented with the complete audiovisual signal for the same fragments, still without any context or clue. Participants were asked to match each audio, video or audiovisual stimulus with one of two contexts, ironic or neutral. As material, we selected excerpts with negation irony from modern films and TV series, which we had first subjected to expert semantic and auditory analysis. The choice of material was complicated by the fact that lines often overlapped, there was background noise such as musical accompaniment, the character saying the line was not in the frame, or, conversely, other characters were in the frame as well. So that the material would not consist only of ironic utterances, remarks without irony were also selected.

Results

Acoustic and paralinguistic analysis showed that all ironic and non-ironic utterances that were well recognized in the audio experiment were synchronized with some gesture. In 100% of cases the direction of the gesture corresponded to the direction of tone movement, in both ironic and non-ironic utterances. The apex of the gesture coincided with the beginning of the nucleus (intonation center). Most of the correctly assessed ironic utterances were accompanied by a head movement; in about a third of cases, additional lip rounding was observed: for example, when pronouncing the vowel “a” or “i” the lips were rounded as for “u”. Slightly less often, movements of the hands or eyes coincided with the intonation center. Interestingly, such coincidence was also typical of actors' speech without irony. The main distinguishing feature of ironic facial expressions and gestures was their complexity, i.e. several movements performed simultaneously. In neutral speech, such combinations were observed much less frequently.

Another interesting fact was that in most of the passages studied, both ironic and non-ironic, the direction in which the gesturing body part moved coincided with the direction of tone movement, i.e. of the melodic curve. For example, simultaneously with a fall in tone, an actor or actress lowered a hand, lowered their gaze, nodded, etc. (see Fig. 4 and 5).

Fig. 4. Downward head movement at the intonation center: the stressed syllable of the ironic exclamatory utterance “Poor thing!”

Fig. 5. Melodic design of the intonation center: the stressed syllable of the ironic exclamatory utterance “Poor thing!”

This is probably due to a psychological attitude typical of the acting profession: to convey a certain state as fully as possible, to express emotion vividly, to emphasize the meaning of a word. Hence the parallelism in the use of various verbal and non-verbal means. In addition, actors' speech is prepared speech, as opposed to spontaneous speech. There is an opinion that in spontaneous speech, synchronization with the intonation center and parallelism between the direction of the gesture and the melodic curve are less common.
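The parallelism between gesture and tone discussed above could be quantified roughly as follows: for each fragment, compare the direction of pitch movement in the nucleus with the direction of vertical movement of the gesturing body part, and count how often they agree. This is a simplified sketch; the field names, the reduction of each movement to a start and an end value, and the tolerance parameter are assumptions for illustration, not the annotation scheme used in the study.

```python
def direction(start, end, tolerance=0.0):
    """Return 1 for a rise, -1 for a fall, 0 if the change is within tolerance."""
    delta = end - start
    if delta > tolerance:
        return 1
    if delta < -tolerance:
        return -1
    return 0


def parallelism_rate(fragments):
    """Share of fragments in which gesture and tone move the same way.

    Each fragment is a dict with hypothetical keys:
      f0_start_hz, f0_end_hz - pitch at the start and end of the nucleus
      y_start, y_end         - vertical position of the tracked body part
                               (head, hand), larger meaning higher
    Fragments without a clear movement in either channel are skipped.
    Returns None if no fragment can be evaluated.
    """
    agree = 0
    total = 0
    for frag in fragments:
        tone = direction(frag["f0_start_hz"], frag["f0_end_hz"])
        gesture = direction(frag["y_start"], frag["y_end"])
        if tone == 0 or gesture == 0:
            continue
        total += 1
        agree += tone == gesture
    return agree / total if total else None
```

Computing this rate separately for acted and spontaneous recordings would be one way to test the expectation that parallelism is less common in spontaneous speech.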

A comparison of the three experiments, which presented different versions of the same passages (sound only, image only, sound and image together), showed that the video sequence is decisive in the perception of actors' speech. However, this simple conclusion hides some interesting facts. For example, in some passages listeners recognized irony well by ear, but when the same passages were presented as video without sound, recognition was erroneous. And when the same excerpt was shown in the familiar audiovisual format, participants did not fully agree, which somewhat surprised us as researchers. One such excerpt is given below: in the ironic remark “Well, thank you!” the actor uses a falling tone while simultaneously raising his hands (Fig. 6).

Fig. 6. Melodic and gestural design of the ironic exclamatory statement “Well, thank you!”

In the experiment without video, the majority of participants correctly identified the irony in this stimulus; in the experiment with video without sound, the majority rated the passage as non-ironic. Native speakers probably have an idea of how the visual channel can be combined with the sound. Inconsistencies between the two lead to confusion and inaccurate understanding in the case of natural speech, or to a perception of unnaturalness and even the “uncanny valley” effect in the case of a synthesized audiovisual signal. Therefore, studying how gestures, facial expressions and intonation are synchronized, and how parallelism between them manifests itself, seems an extremely important and urgent task in the era of voice assistants and audiovisual interfaces built with artificial intelligence systems.

So, is it possible to teach a neural network to detect irony?

Our answer is yes, it is. This article has described how we collected data and studied the acoustic and paralinguistic characteristics of irony in speech. We have also made our first attempts at training a neural network to recognize irony automatically, and they were successful. We will talk about this in more detail in the next article.

But the main thing is that our fundamental research yielded valuable information about the differences between ironic and non-ironic speech. During the expert analysis, we established the values of parameters that can be used both when retraining a neural network and directly, without resorting to neural network technologies, in expert systems and metrics that automatically determine the presence or absence of irony in a signal. After all, a neural network is usually needed when there is not enough knowledge about the exact parameters in a particular subject area, and the problem is instead solved by processing a large amount of data.