How to make a voice interface for an LLM

While OpenAI is delaying the release of the audio modality for ChatGPT, I want to share how we built our LLM voice interaction app and integrated it into an interactive booth.

Talk to the AI in the jungle

At the end of February, the Lampu festival took place in Bali, organized according to the principles of the famous Burning Man. Following its tradition, participants create installations and art objects themselves.

My friends and I from camp 19:19, inspired by the idea of Catholic confessionals, came up with the idea of creating our own AI Confession Room, where anyone could talk to artificial intelligence. Here's how we imagined it:

  • The user enters the booth, and we determine that a new session needs to be started.

  • The user asks a question, and the AI listens and answers it. We wanted to create an atmosphere of trust and confidentiality, where everyone could speak openly about their thoughts and experiences.

  • When the user leaves the booth, the system ends the session and forgets all the details of the dialogue. This is necessary to maintain the privacy of all dialogues.

Demo version

To test the concept and start experimenting with the LLM prompt, I created a naive implementation in one evening:

  • Listen to the microphone

  • Recognize user speech using Speech-to-Text (STT) models.

  • Generate response via LLM

  • Synthesize a voice response using Text-to-Speech (TTS) models

  • Play the response back to the user

To implement this demo, I decided to rely entirely on cloud models from OpenAI: Whisper, GPT-4, and TTS. Thanks to the excellent speech_recognition library, such a script can be assembled in just a few dozen lines of code.

Python demo with speech_recognition
import os
import asyncio
from dotenv import load_dotenv
from io import BytesIO
from openai import AsyncOpenAI
from soundfile import SoundFile
import sounddevice as sd
import speech_recognition as sr


load_dotenv()

aiclient = AsyncOpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

SYSTEM_PROMPT = """
  You are a helpful assistant.
"""

async def listen_mic(recognizer: sr.Recognizer, microphone: sr.Microphone):
    audio_data = recognizer.listen(microphone)
    wav_data = BytesIO(audio_data.get_wav_data())
    wav_data.name = "SpeechRecognition_audio.wav"
    return wav_data


async def say(text: str):
    res = await aiclient.audio.speech.create(
        model="tts-1",
        voice="alloy",
        response_format="opus",
        input=text
    )
    buffer = BytesIO()
    for chunk in res.iter_bytes(chunk_size=4096):
        buffer.write(chunk)
    buffer.seek(0)
    with SoundFile(buffer, 'r') as sound_file:
        data = sound_file.read(dtype="int16")
        sd.play(data, sound_file.samplerate)
        sd.wait()


async def respond(text: str, history):
    history.append({"role": "user", "content": text})
    completion = await aiclient.chat.completions.create(
        model="gpt-4",
        temperature=0.5,
        messages=history,
    )
    response = completion.choices[0].message.content
    await say(response)
    history.append({"role": "assistant", "content": response})


async def main() -> None:
    m = sr.Microphone()
    r = sr.Recognizer()
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    with m as source:
        r.adjust_for_ambient_noise(source)
        while True:
            wav_data = await listen_mic(r, source)
            transcript = await aiclient.audio.transcriptions.create(
                model="whisper-1",
                temperature=0.5,
                file=wav_data,
                response_format="verbose_json",
            )
            if not transcript.text:
                continue
            await respond(transcript.text, messages)

if __name__ == '__main__':
    asyncio.run(main())

After the first tests of our demo, the problems we had to solve immediately became clear:

  • Response delay. In a naive implementation, the delay between the user's question and the answer is 7-8 seconds or more. This is not good, but there are obviously many ways to optimize the response time.

  • Ambient noise. We realized that in noisy environments we can't rely on the microphone to automatically detect when a user starts and stops speaking. Automatically detecting the beginning and end of a phrase (endpoint) is a non-trivial task in itself. Multiply it by the noisy environment of a music festival and it becomes clear that a conceptually different approach is needed here.

  • Simulation of live communication. We wanted to leave the user the ability to interrupt the AI. To do this, we would have to keep the microphone on all the time. But in this case, we would have to separate the user's voice not only from background sounds, but also from the AI's voice.

  • Feedback. Due to the delay in response, we sometimes felt like the system was hanging. We realized that we needed to somehow inform the user at what stage the response was being processed.

Each of these problems could either be solved technically or worked around on the product side.

Thinking through the UX of the booth

Before we started coding, we needed to decide how the user would interact with our booth:

  • How to determine that there is a new user in the booth to reset the history of the previous dialogue?

  • How to recognize the beginning and end of a user's speech, and what to do if they want to interrupt the AI?

  • How to implement feedback when there is a delay in response from AI?

To determine that there is a new user in the booth, we considered several options: door-opening sensors, a weight sensor on the floor, a distance sensor, and a camera with a YOLO model. The distance sensor behind the user's back seemed the most reliable to us: it excluded accidental triggers, for example when the door was not closed tightly enough, and, unlike the weight sensor, it did not require complex installation.

To make our lives easier and avoid the problem of recognizing the beginning and end of a dialogue, we decided to add a big red button to control the microphone. This solution also allowed the user to interrupt the AI at any time.

We had many different ideas on how to implement feedback on request processing. We settled on a version with a screen that shows what the system is doing: listening to the microphone, processing the question, or responding.

We also considered a pretty cool option with an old landline phone, where the session would start when the receiver was picked up, and the system would listen to the user for as long as it stayed off the hook. However, we decided that it would be much cooler if the user was “answered” by the booth itself, and not by a voice in the receiver.

During assembly and at the festival

In the end, the final interaction scenario came out like this:

  • The user enters the booth. The proximity sensor behind them is triggered, and we greet them.

  • To start a conversation, the user presses the red button. While the button is pressed, we listen to the microphone. When the user releases the button, we start processing the request and signal this on the screen.

  • If the user wants to ask a new question while the AI is answering, they can press the button again, and the AI will stop answering immediately.

  • When the user leaves the booth, the distance sensor is triggered again and the dialogue history is reset.

Architecture

The Arduino tracks the state of the proximity sensor and the red button. All changes are transmitted to our backend via an HTTP API, which lets the system determine whether the user has entered or left the booth, and whether to start listening to the microphone or to begin generating a response.

The web interface is a page opened in a browser that continuously receives the current state of the system from the backend and displays it to the user.

The backend controls the microphone, interacts with all the necessary models, and voices the AI's responses. The main logic is concentrated here.
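
The full implementation is in the backend sources linked at the end; below is only a minimal sketch of this event flow, assuming an Express server and hypothetical route names. The ESP32 POSTs sensor and button events, and the web interface polls the current state.

import express from "express";

const app = express();
app.use(express.json());

// Possible states shown on the screen: idle, waiting, listening, thinking, speaking
let state = "idle";
let history = [];           // dialogue history, reset whenever the visitor changes

// The ESP32 reports the distance sensor being triggered (hypothetical route)
app.post("/presence", (req, res) => {
  if (req.body.present) {
    state = "waiting";      // a new visitor: greet them and wait for the button
  } else {
    state = "idle";         // the visitor left: forget the whole dialogue
  }
  history = [];
  res.sendStatus(200);
});

// The ESP32 reports the red button being pressed or released (hypothetical route)
app.post("/button", (req, res) => {
  if (req.body.pressed) {
    state = "listening";    // start recording; also interrupts TTS playback if any
  } else {
    state = "thinking";     // stop recording and run STT -> LLM -> TTS
  }
  res.sendStatus(200);
});

// The web interface polls this endpoint and animates the current state
app.get("/state", (req, res) => res.json({ state }));

app.listen(3000);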

Hardware

How to write the Arduino sketch, correctly connect the distance sensor and the button, and assemble it all in the booth is a topic for a separate article. For now, let's briefly look at what we got without going into technical details.

We used an Arduino, or more precisely an ESP32 board with a built-in Wi-Fi module. The board was connected to the same Wi-Fi network as the laptop running the backend.

in the process of assembly

Here is a complete list of the hardware we used:

Backend

The main components of the pipeline are Speech-To-Text (STT), LLM, and Text-To-Speech (TTS). For each of these tasks, there are many different models available, both locally and via the cloud.

We didn't have a powerful graphics card at hand, so we decided to use cloud versions of the models. The weak point of this approach is the need for a good Internet connection. However, even with the mobile Internet we had at the festival, the interaction speed after all the optimizations turned out to be acceptable.

Now let's take a closer look at each component of the pipeline.

Speech Recognition

Speech recognition is a feature that many modern devices have supported for a long time. For example, iOS and macOS devices offer Apple's Speech API, and browsers offer the Web Speech API.

Unfortunately, they are far inferior in quality to Whisper or Deepgram and, moreover, cannot automatically detect the language.

To reduce processing time, the best option is to recognize speech in real time as the user speaks. Here are a couple of projects with examples of how this can be implemented: whisper_streaming and whisper.cpp.

On our weak laptop, the speech recognition speed using this approach turned out to be far from real-time. As a result, after several experiments, we settled on the cloud-based Whisper model from OpenAI.
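
In the backend, this boils down to a single call to the transcription endpoint once the button is released and the recording is saved. A rough sketch using the official openai Node SDK (the file path is made up):

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function transcribe(wavPath) {
  // wavPath is the recording saved after the red button was released
  const transcript = await openai.audio.transcriptions.create({
    model: "whisper-1",
    file: fs.createReadStream(wavPath),
  });
  return transcript.text; // may be empty if the user said nothing intelligible
}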

LLM and Prompt Engineering

The output of the Speech-To-Text model from the previous step is the text that we send to the LLM along with the dialogue history.

When choosing an LLM, we compared GPT-3.5, GPT-4, and Claude. It turned out that the key factor was not so much the specific model as its proper configuration. In the end, we settled on GPT-4, whose answers we liked more than the others.

Setting up a prompt for LLM models has become a separate art form. You can find many guides on the Internet that will help you set up the model as you need:

We had to experiment with the prompt and the temperature setting for a long time to make the model answer in a way that was interesting, concise, and with a bit of humor.
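
To give an idea of which knobs we were turning, here is roughly what that configuration looks like in code. The prompt text and the exact parameter values below are placeholders, not the ones we shipped:

const SYSTEM_PROMPT = `
You are the voice of an AI Confession Room at a festival.
Answer in a friendly, concise way, with a bit of humor.
Keep answers short enough to be listened to rather than read.
`;

async function complete(openai, history) {
  return openai.chat.completions.create({
    model: "gpt-4",
    temperature: 0.7,   // higher values make answers livelier, lower ones more predictable
    max_tokens: 300,    // keeps spoken answers short
    messages: [{ role: "system", content: SYSTEM_PROMPT }, ...history],
  });
}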

Text-to-Speech

The response received from the LLM is voiced using the Text-To-Speech model and played back to the user. This step was the main source of delays in our demo.

LLMs take a long time to respond. However, they can generate a response in streaming mode – token by token. We can use this feature to optimize the waiting time by speaking individual phrases as they are received, without waiting for the full response from the LLM.

Voicing individual sentences

  • We make a request to the LLM in streaming mode.

  • We accumulate the answer in the buffer token by token until we get a complete sentence of minimum length. By the way, the minimum length parameter is very important, as it affects both the intonation of the voiceover and the time of the initial delay.

  • We send the generated sentence to the TTS model and play the result to the user. At this step, we need to make sure there is no race condition and that the playback order is preserved.

  • We repeat the previous step until we receive the entire response from the LLM.

We use the time while the user is listening to the initial fragment to hide the delay in processing the remaining parts of the response from the LLM. Thanks to this approach, the response delay occurs only at the very beginning and is ~3 seconds.

  // Request the LLM response in streaming mode and voice it sentence by sentence
  async generateResponse(history) {
    const completion = await this.ai.completion(history);

    const chunks = new DialogChunks();
    for await (const chunk of completion) {
      const delta = chunk.choices[0]?.delta?.content;
      if (delta) {
        chunks.push(delta);
        // As soon as the buffer holds a complete sentence, send it to TTS right away
        if (chunks.hasCompleteSentence()) {
          const sentence = chunks.popSentence();
          this.voice.ttsAndPlay(sentence);
        }
      }
    }
    // Voice whatever is left in the buffer after the stream ends
    const sentence = chunks.popSentence();
    if (sentence) {
      this.voice.say(sentence);
    }
    return chunks.text;
  }
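
DialogChunks and voice in the snippet above come from our backend. A simplified sketch of what such helpers might look like is below: the sentence splitting is deliberately naive (end-of-sentence punctuation plus a minimum length), and the playback order is preserved by chaining promises so that synthesized fragments never overlap. In this sketch, say is just an alias used for the final fragment.

class DialogChunks {
  constructor(minSentenceLength = 30) {
    // Too small -> choppy intonation, too large -> longer initial delay
    this.minSentenceLength = minSentenceLength;
    this.buffer = "";
    this.text = "";
  }

  push(delta) {
    this.buffer += delta;
    this.text += delta;
  }

  // A sentence is "complete" when it ends with punctuation and is long enough
  hasCompleteSentence() {
    return this.buffer.length >= this.minSentenceLength && /[.!?…]\s*$/.test(this.buffer);
  }

  popSentence() {
    const sentence = this.buffer.trim();
    this.buffer = "";
    return sentence;
  }
}

class Voice {
  constructor(tts, player) {
    this.tts = tts;       // wrapper around the TTS model, e.g. openai.audio.speech.create
    this.player = player; // wrapper around local audio playback
    this.queue = Promise.resolve();
  }

  ttsAndPlay(sentence) {
    // Synthesis starts immediately; playback is chained to keep sentences in order
    const audio = this.tts.synthesize(sentence);
    this.queue = this.queue.then(async () => this.player.play(await audio));
    return this.queue;
  }

  say(sentence) {
    return this.ttsAndPlay(sentence);
  }
}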

The final touches

Even with all our optimizations, a delay of 3-4 seconds is still quite noticeable. To keep the user from feeling that the system has frozen, we decided to take care of feedback in the UI. We considered several approaches:

  • Indicators in the form of light bulbs. We needed to reflect as many as five different states: idle, waiting, listening, thinking, and speaking. But we couldn't figure out how to do it conveniently and clearly with light bulbs.

  • Voice fillers such as “Let me think about it,” “Hmm,” and so on, imitating live speech. We abandoned this option because such insertions often did not match the tone of the model's responses.

  • Place a screen in the booth and display the different states on it with animations.

We settled on the last option with a simple web page that polls the backend and shows animations according to the current state.

screen in the booth
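
The page itself is trivial. Here is a sketch of the polling loop, assuming the /state endpoint from the architecture sketch above and hypothetical CSS classes for the animations:

const STATES = ["idle", "waiting", "listening", "thinking", "speaking"];

async function poll() {
  try {
    const { state } = await (await fetch("/state")).json();
    if (STATES.includes(state)) {
      // CSS animations are keyed off this class name (hypothetical convention)
      document.body.className = `state-${state}`;
    }
  } catch (e) {
    // The backend may be restarting; keep showing the last animation
  }
  setTimeout(poll, 300); // a few polls per second is plenty for UI feedback
}

poll();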

Results

Our AI Confession Room ran for four days and attracted hundreds of participants. After spending about $50 on the OpenAI API, we received a huge amount of positive feedback and valuable impressions.

This small experiment showed that it is possible to add an intuitive and effective voice interface to an LLM even with limited resources and challenging external conditions.

By the way, the backend sources are available on GitHub.
