Interactive NPC on Unreal Engine

These days, there is a lot of talk about the use of neural networks in the gaming industry, including the generation of music, images, and 3D models.

One possible application of neural networks in the gaming industry is to create “interactive” NPCs that the player can interact with via voice.

A number of well-known companies are already experimenting in this area: NVIDIA introduced NVIDIA ACE, and Ubisoft introduced Neo NPC.

There are also solutions from lesser-known companies: one, two.

Frankly, I was not very impressed with the results, and I figured I could at least match them =]. Whether I succeeded is for you to decide.

I decided to do a little R&D and build a proof-of-concept application demonstrating how neural networks can be used to create “interactive” NPCs in games.

You can find a video with the results under the cut.
There is also a link to the demo application and instructions.
The application does not require a network connection to other services.

Result

Interactive_NPC_Demo

Instructions

The demo application was tested on a laptop with the following specifications:
AMD Ryzen 7 5800H
16 GB DDR4-3200
RTX 3080 Laptop

Concept

The player asks the NPC a question by voice, and the NPC formulates a response and answers, also by voice.

The NPC should have a backstory and should react to the player in a certain way.

The player's interaction with NPCs should resemble a dialogue with a real person as closely as possible. That is, there should be no pre-scripted lines, interaction should happen only by voice, and, on top of that, it would be nice to see the interlocutor’s facial expressions or at least lip movements.

In addition, I would like the NPC to remember the player, so that a new game session does not have to start the conversation from scratch.

Based on the above, our “interactive” NPC should perform the following functions:

  • Listen – convert the player’s speech into text;

  • Understand – analyze the player’s request and formulate an answer consistent with the NPC’s backstory;

  • Reply – convert the answer text into audio and synchronize the audio with the lip movements of the 3D model;

  • Remember – store the history of communication with the player.
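
Taken together, these four functions form a simple pipeline. For illustration, here is a rough C++ sketch of such a pipeline; the interface and its names are hypothetical and are not the actual API of the plugins described below.

#include "CoreMinimal.h"

// Rough sketch of the four-stage NPC pipeline; all names are illustrative.
class INpcPipeline
{
public:
    virtual ~INpcPipeline() = default;

    // Listen: microphone PCM audio -> recognized text.
    virtual FString Transcribe(const TArray<uint8>& AudioPcm) = 0;

    // Understand: player text + backstory + dialogue history -> reply text.
    virtual FString GenerateReply(const FString& PlayerText) = 0;

    // Reply: reply text -> synthesized audio (lip-synced during playback).
    virtual TArray<uint8> Synthesize(const FString& ReplyText) = 0;

    // Remember: persist the dialogue history between game sessions.
    virtual void SaveDialogueHistory() = 0;
};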

We will not use third-party services (ChatGPT, Google Speech, etc.) – a service that is available today may be gone tomorrow (tokens get invalidated, sanctions are imposed, and so on). Been there, done that.

Instead, we will select ready-made open source solutions and adapt them to our task.

Implementation

Unreal Engine 5.3 was chosen as the game engine.

One of the reasons for this choice is out-of-the-box support for highly detailed human models with a ready-made set of poses and animations – MetaHuman.

The functionality is divided into several plugins:

  • SpeechRecognitionSystem – speech recognition. The plugin is built on the vosk.api library.

  • UGpt – generating a response to the player’s text request. The LLM used is Mistral (7B); inference runs on llama.cpp (CPU).

  • SpeechGenerationSystem – text-to-speech conversion. The plugin is based on the piper library.

  • LipSyncSystem – synchronizing lip animation with audio. The plugin is a reworked version of the Oculus LipSync plugin; the original is not UE5-compatible and cannot run in (near) real time.
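
To give a sense of how these plugins fit together at runtime, here is a hypothetical game-thread fragment; the component and method names are illustrative only, the actual API is described in each plugin’s documentation.

// Hypothetical glue code: each Tick, poll one stage and feed its output
// to the next. Component and method names are illustrative, not real API.
void ANpcCharacter::Tick(float DeltaSeconds)
{
    Super::Tick(DeltaSeconds);

    FString PlayerText;
    if (SpeechRecognition->TryGetRecognizedText(PlayerText)) // Listen
    {
        Gpt->RequestCompletion(PlayerText);                  // Understand
    }

    FString ReplyText;
    if (Gpt->TryGetCompletion(ReplyText))
    {
        SpeechGeneration->RequestAudio(ReplyText);           // Reply
    }

    TArray<uint8> ReplyAudio;
    if (SpeechGeneration->TryGetAudio(ReplyAudio))
    {
        LipSync->PlayWithLipSync(ReplyAudio);                // animate lips and play
    }
}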

There were a lot of nuances when writing plugins, but their analysis deserves a separate article…

How it works (in general)

All plugins work according to the same principle:

– when a plugin component is initialized, a separate thread is created (an FRunnableThread running an FRunnable object) with an infinite loop in which the component waits for tasks;

– the component receives a request and puts it into the request queue;

– on the next loop iteration, the request is taken from the queue and processed;

– the result is placed into the results queue.
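
For illustration, here is a minimal sketch of this pattern in Unreal C++, assuming FString requests and results for simplicity; the real plugins use their own task types.

#include "HAL/Runnable.h"
#include "HAL/RunnableThread.h"
#include "HAL/ThreadSafeBool.h"
#include "HAL/PlatformProcess.h"
#include "Containers/Queue.h"

// Minimal sketch of the shared worker-thread pattern.
class FNpcWorker : public FRunnable
{
public:
    FNpcWorker() { Thread = FRunnableThread::Create(this, TEXT("NpcWorker")); }
    virtual ~FNpcWorker() override
    {
        Stop();
        Thread->WaitForCompletion();
        delete Thread;
    }

    // Game thread: put a request into the request queue.
    void EnqueueRequest(const FString& Request) { Requests.Enqueue(Request); }

    // Game thread: poll the results queue for a finished result.
    bool DequeueResult(FString& OutResult) { return Results.Dequeue(OutResult); }

    // Worker thread: loop waiting for tasks until stopped.
    virtual uint32 Run() override
    {
        while (!bStop)
        {
            FString Request;
            if (Requests.Dequeue(Request))
            {
                // The heavy lifting (ASR / LLM / TTS) would happen here.
                Results.Enqueue(Process(Request));
            }
            else
            {
                FPlatformProcess::Sleep(0.01f); // idle until a task arrives
            }
        }
        return 0;
    }

    virtual void Stop() override { bStop = true; }

private:
    FString Process(const FString& Request) { return Request; } // placeholder

    TQueue<FString, EQueueMode::Mpsc> Requests; // game thread -> worker
    TQueue<FString, EQueueMode::Spsc> Results;  // worker -> game thread
    FRunnableThread* Thread = nullptr;
    FThreadSafeBool bStop = false;
};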

Below the cut, for clarity, simplified blueprints are shown for each system component, along with links to the plugins’ documentation.

Blueprints

SpeechRecognitionSystem (Doc)

UGpt (Doc)

SpeechGenerationSystem (Doc)

LipSyncSystem

Separately, I would like to touch on setting up the backstory (behavior) of the “interactive” NPC.

In order to “write a character”, you have to resort to prompt engineering.

For example, I decided to simulate a dialogue between Dovahkiin and Jarl Ulfric from, you guessed it, Skyrim, and prepared this simple prompt.

Imagine that you are one of the jarls of Skyrim.
You are talking to the great warrior {USER}.
{AI} answers politely and is willing to answer any questions.
{AI} has a job for the hero.
{USER}:
-------------
USER: Dovakin
AI: Ulfric
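
Before the prompt goes to the model, the {USER} and {AI} placeholders are replaced with the character names. Below is a minimal sketch of how such a substitution might look; the helper is hypothetical and is not part of the actual UGpt API.

#include "CoreMinimal.h"

// Hypothetical helper: substitute the {USER}/{AI} placeholders in a prompt
// template before sending it to the LLM.
FString BuildPrompt(const FString& PromptTemplate, const FString& UserName, const FString& AiName)
{
    FString Prompt = PromptTemplate;
    Prompt.ReplaceInline(TEXT("{USER}"), *UserName);
    Prompt.ReplaceInline(TEXT("{AI}"), *AiName);
    return Prompt;
}

// Usage: BuildPrompt(JarlPromptTemplate, TEXT("Dovakin"), TEXT("Ulfric"));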

You can see the result under the cut.

Dialogue. Dovakin – Ulfric

Dovakin: Greetings respected Jarl.
Ulfric: Greetings mighty Dovakin, what brings you to my hall?

Dovakin: Is there any work for me? I'm a good fencer.
Ulfric: Indeed, your reputation precedes you.
Ulfric: But I have a task for someone with your skills.
Ulfric: The Stormcloaks have taken over Fort Sungard, and we need it back.
Ulfric: They've been using it as a base to launch attacks against our people.
Ulfric: Can you help us retake it?

Dovakin: What should I do?
Ulfric: Gather some allies and head to Fort Sungard.
Ulfric: The Stormcloaks are led by a high-ranking officer named Captain Arvel.
Ulfric: He's a formidable opponent, but with your skills and the help of your allies, you should be able to defeat him and retake the fort.

Dovakin: What will I get for completing the task?
Ulfric: Our gratitude and the knowledge that you have helped protect the people of Skyrim.
Ulfric: And if you prove yourself to be a valuable ally, perhaps there will be other opportunities for you in the future.

Dovakin: Where should I look for them?
Ulfric: You can find potential allies among the local militia or among the other jarls.
Ulfric: Some may join you out of loyalty or for rewards.
Ulfric: Be cautious, however, as not all will be trustworthy.
Ulfric: Use your judgment wisely.

Dovakin: Where can I get supplies for this mission?
Ulfric: You can acquire supplies from various merchants or by scavenging abandoned structures.
Ulfric: Remember, preparation is key to success on the battlefield.

Dovakin: Good. I am ready to complete this mission.
Ulfric: Excellent.
Ulfric: May the gods go with you, Dovakin.


It’s scary to imagine what would happen if you deliberately trained an LLM on all the Warhammer 40,000 lore and asked an ardent fan of the universe to do the prompt engineering…

What's next

  • learn how to (almost) automatically create a MetaHuman double from a photograph of a face using 3D face mesh reconstruction;

  • master voice cloning;

  • try to “liven up” our NPC even more by adding sentiment analysis for both player requests and NPC responses, so that the facial expressions and “behavior” of our “interactive” NPC change accordingly;

  • improve the lip-sync function – try to make the lip animation smoother and more “accurate”.

The ultimate goal is to create something like a digital double of a person that resembles the original in appearance, voice, and behavior.

Conclusion

Personally, I believe “interactive” NPCs have a future in the gaming industry. That future will not arrive today or tomorrow, but it will arrive. You can already experiment with them in an attempt to create new genres, gameplay mechanics, and so on.

It’s hard to imagine how this will affect the work of game designers, screenwriters, and the industry as a whole.

What do you think? Share your thoughts in the comments.
