Voice AI assistant

This is the story of how I am trying to create a voice AI assistant for my 5-year-old son.

Creating an AI assistant is not a new idea, especially given the massive spread of AI over the last year and the emergence of OpenAI's voice assistant and its Realtime API, which lets developers build multimodal interfaces with low speech-to-speech latency.

Although the OpenAI API offers impressive capabilities, its high cost ($100 for 1 million input tokens and $200 for 1 million output tokens) pushes the search for more affordable solutions. That is why I paid attention to LiveKit, an open source project offering scalable, real-time communications. One of its interesting features is the ability to integrate AI agents, so I decided to try their solution to build such an assistant.

A typical voice agent pipeline in LiveKit looks like this:

Source: https://docs.livekit.io/agents/quickstarts/voice-agent/

Put simply: the user's voice stream is converted to text; the text is passed to a large language model (LLM), which, following the given instructions (the prompt), generates a text response; that text response is converted into an audio stream; and the audio stream is played back.

My project uses the following LiveKit components:

  1. VAD (Voice Activity Detection): Silero VAD.

  2. STT (Speech-to-Text): the multilingual nova-2-general model from Deepgram.

  3. LLM (Large Language Model): GPT-4o from OpenAI. LiveKit also supports other models compatible with the OpenAI API (Groq, Perplexity, TogetherAI, etc.).

  4. TTS (Text-to-Speech): the OpenAI service with the “alloy” voice. Other options are possible (Deepgram, etc.).
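
For orientation, here is a minimal sketch of how these components could be wired together with the LiveKit Agents framework in Python. It follows the voice-agent quickstart of that period; module paths and signatures may differ between livekit-agents versions, and the system prompt is only a placeholder.

```python
# Minimal voice-pipeline sketch based on the LiveKit Agents quickstart.
# Module paths and class names may differ between livekit-agents versions.
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    # Join the room and subscribe to audio only.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    initial_ctx = llm.ChatContext().append(
        role="system",
        text="You are a friendly teacher explaining tens to a five-year-old child.",
    )

    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),                     # voice activity detection
        stt=deepgram.STT(model="nova-2-general"),  # multilingual speech-to-text
        llm=openai.LLM(model="gpt-4o"),            # text response generation
        tts=openai.TTS(voice="alloy"),             # speech synthesis
        chat_ctx=initial_ctx,
    )
    agent.start(ctx.room)
    await agent.say("Hi! Ready to learn about tens?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

Each argument here corresponds to one block of the pipeline diagram above.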

Project goals

My goal is to create an interactive teacher who can not only explain the material, but also control interactive tools in real time, like a live teacher who simultaneously tells and shows. The key task is to implement a voice assistant with “streaming” control of interactive tools.

Math with an interactive table

To test the idea of controlling interactive tools with a voice assistant, I chose an exercise from the Oxford International Primary Maths textbook on the topic of tens. The assistant must explain the material and, at the same time, use an interactive table to highlight columns, rows or individual cells, illustrating its explanations.

Implementation stages:

1. Create an interactive table

Created an interactive 10×10 table in Vue.js that lets you highlight columns, rows, and individual cells.

2. Setting the task for the LLM

The LLM was instructed to explain the concept of tens and how to count with them to a five-year-old child using an interactive table.

Description of the API for interacting with the table:

I described the interaction with the interactive table using an OpenAPI specification. For example, to highlight a column, the following call should be used: `http POST /highlight-column?column=1`.

###API You may call interactive table use openAPI:
    openapi: 3.0.0
    info:
      title: Interactive Table API
      version: 1.0.0
      description: API for managing a 10x10 interactive number table
    paths:
      /highlight-column:
        post:
          summary: Highlight Column
          description: Highlights the specified column in the table
          operationId: highlightColumn
          parameters:
            - name: column
              in: query
              description: Column number to highlight (1-10)
              required: true
              schema:
                type: integer
                minimum: 1
                maximum: 10
          responses:
            '200':
              description: Successfully highlighted the column
      /highlight-row:
        post:
          summary: Highlight Row
          description: Highlights the specified row in the table
          operationId: highlightRow
          parameters:
            - name: row
              in: query
              description: Row number to highlight (1-10)
              required: true
              schema:
                type: integer
                minimum: 1
                maximum: 10
          responses:
            '200':
              description: Successfully highlighted the row
      /highlight-number:
        post:
          summary: Highlight Number
          description: Highlights the specified number in the table
          operationId: highlightNumber
          parameters:
            - name: number
              in: query
              description: Number to highlight (1-100)
              required: true
              schema:
                type: integer
                minimum: 1
                maximum: 100
          responses:
            '200':
              description: Successfully highlighted the number
              
###CALL API          
If you need to set an example for a child, always use an interactive table, for example:
```http POST /highlight-column?column=1```
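
The setup relies on a separate backend that actually performs the highlighting. A minimal server matching the specification above could look like the following sketch; it is a hypothetical FastAPI implementation, where the in-memory state dictionary, the /state endpoint and the CORS setup for the Vue.js table are my assumptions.

```python
# Hypothetical FastAPI backend matching the OpenAPI spec above.
# The in-memory highlight state, /state endpoint and CORS setup are assumptions;
# the real project may deliver highlight commands to the Vue.js table differently.
from fastapi import FastAPI, Query
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="Interactive Table API", version="1.0.0")
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"])

# Current highlight state, polled by (or pushed to) the Vue.js table.
state = {"column": None, "row": None, "number": None}


@app.post("/highlight-column")
def highlight_column(column: int = Query(..., ge=1, le=10)):
    state.update(column=column, row=None, number=None)
    return {"status": "ok", "highlighted": {"column": column}}


@app.post("/highlight-row")
def highlight_row(row: int = Query(..., ge=1, le=10)):
    state.update(row=row, column=None, number=None)
    return {"status": "ok", "highlighted": {"row": row}}


@app.post("/highlight-number")
def highlight_number(number: int = Query(..., ge=1, le=100)):
    state.update(number=number, column=None, row=None)
    return {"status": "ok", "highlighted": {"number": number}}


@app.get("/state")
def get_state():
    # The front end can poll this endpoint to know what to highlight.
    return state
```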

The LLM's response: the LLM generates the function call directly in the response text rather than through a separate tool-calling mechanism, so a single answer can contain both the explanation to be voiced and an inline call such as `http POST /highlight-column?column=1`.

Sample answer

3. Synchronization of function call and audio playback

The main difficulty was synchronizing the function call with audio playback: the synchronization has to happen during playback, not during LLM response generation or text-to-speech conversion.

Data Flow Architecture

The basic data flow in the system is as follows:

STT → LLM → TTS → PlayoutAudio

To implement the synchronization, I had to modify the LiveKit source code at the LLM – TTS – PlayoutAudio stages: namely, to find the “function call” in the generated text and forward it through all subsequent stages.

  1. Analyzing the LLM response: at the stage of receiving the response from the LLM (in streaming mode), a character-by-character check for a function-call mask was added, which determines whether the text contains a function call. The found call is assembled into a separate sentence. (A simplified sketch of this whole flow is shown after this list.)

  2. TTS processing: the text generated by the LLM (in streaming mode) is accumulated in a buffer and, once enough data has been collected (the minimum sentence length has been reached), it is sent for audio generation.

    If, while splitting the text into sentences, we encounter a “function call”, we attach it to the next sentence as a separate call_tools parameter, and the “function call” itself is not sent to the TTS (it does not need to be voiced).

    In this way, streaming is emulated by breaking the text into sentences and sending each sentence to the TTS separately. This allows long texts to be processed while the audio frames are delivered gradually.

  3. Playback and function call: at the playback stage (PlayoutAudio), each audio frame is checked for call_tools data. If call_tools contains data, the corresponding function is called at the moment the current audio frame starts playing. This keeps the call in sync with what is being spoken at that moment.
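
To make the mechanics more concrete, below is a simplified, self-contained sketch of the idea (not the actual LiveKit patch): the function-call mask is detected in the streamed LLM text, stripped from the sentence sent to the TTS, carried alongside the synthesized audio as a call_tools field, and executed only when the corresponding audio segment starts playing. The regular expression, data class and placeholder TTS/playback functions are my assumptions for illustration.

```python
# Simplified illustration of the synchronization idea, not the actual LiveKit patch.
# The call mask (regex), data class and placeholder functions are assumptions.
import asyncio
import re
from dataclasses import dataclass, field

# Matches inline calls such as: http POST /highlight-column?column=1
CALL_MASK = re.compile(r"http\s+POST\s+(/[\w-]+)\?(\w+)=(\d+)")


@dataclass
class SentenceChunk:
    text: str                                       # sentence to be voiced by the TTS
    call_tools: list = field(default_factory=list)  # calls to run when this audio plays


def split_stream(llm_text: str) -> list[SentenceChunk]:
    """Split LLM output into sentences, moving any inline call into call_tools
    of the following sentence so it is never sent to the TTS."""
    chunks, pending_calls = [], []
    # Naive sentence split; LiveKit uses a proper sentence tokenizer.
    for sentence in re.split(r"(?<=[.!?])\s+", llm_text.strip()):
        calls = CALL_MASK.findall(sentence)
        spoken = CALL_MASK.sub("", sentence).strip()
        if spoken:
            chunks.append(SentenceChunk(text=spoken, call_tools=pending_calls + calls))
            pending_calls = []
        else:
            pending_calls.extend(calls)  # call-only fragment: defer to the next sentence
    return chunks


async def call_api(path: str, param: str, value: str) -> None:
    # Placeholder: the real code would POST to the interactive table backend.
    print(f"POST {path}?{param}={value}")


async def synthesize(chunk: SentenceChunk) -> SentenceChunk:
    # Placeholder for TTS: only chunk.text is voiced, call_tools rides along.
    await asyncio.sleep(0.1)
    return chunk


async def playout(chunks: list[SentenceChunk]) -> None:
    for chunk in chunks:
        # Fire the attached calls exactly when this audio segment starts playing.
        for path, param, value in chunk.call_tools:
            await call_api(path, param, value)
        print(f"[playing] {chunk.text}")
        await asyncio.sleep(0.5)  # stands in for real audio playback


async def main() -> None:
    llm_text = (
        "Look at the first column. http POST /highlight-column?column=1 "
        "Every number in it ends in one."
    )
    chunks = [await synthesize(c) for c in split_stream(llm_text)]
    await playout(chunks)


asyncio.run(main())
```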

Schematic representation (generated with Claude 3.5 Sonnet – thanks to the AI for that)

Thus, this architecture achieves the required synchronization: breaking the text into sentences and processing each sentence as a separate unit is the key implementation element that links commands to the audio and ensures they are executed in sync.

P.S. Splitting the text into separate sentences is already implemented in LiveKit for working with non-streaming TTS.

Why did the standard function call not work?

As you may know, LLMs can call functions to control external services, and LiveKit actually supports function calling.

However, standard LLM function calling did not provide the necessary synchronization: problems arose with multiple calls, and it was impossible to determine exactly when they would be executed.
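
For contrast, the standard approach in the LiveKit agent framework looks roughly like the sketch below. It is written from memory of the 0.x livekit-agents API, so the decorator and type-annotation helpers may differ between versions, and the backend URL is hypothetical. The key point is that the framework executes the call as soon as the LLM emits it, which is exactly why it cannot be tied to the moment the matching audio is played.

```python
# Rough sketch of LiveKit's standard function-calling approach (0.x-era API;
# names such as ai_callable and TypeInfo may differ between versions).
# The call runs as soon as the LLM emits it, not when the related audio plays.
from typing import Annotated

import aiohttp
from livekit.agents import llm


class TableFunctions(llm.FunctionContext):
    @llm.ai_callable(description="Highlight a column in the interactive table")
    async def highlight_column(
        self,
        column: Annotated[int, llm.TypeInfo(description="Column number, 1-10")],
    ) -> str:
        # Hypothetical backend URL for the interactive table.
        async with aiohttp.ClientSession() as session:
            await session.post(
                "http://localhost:8000/highlight-column", params={"column": column}
            )
        return f"Column {column} highlighted"


# The function context would then be passed to the pipeline agent, e.g.:
# agent = VoicePipelineAgent(..., fnc_ctx=TableFunctions())
```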

A small test of how function calls work in “streaming” mode

Watch here – telegram

These are just the first experiments, and I plan to develop this project further. Any suggestions and criticism will help me improve the assistant faster. Thank you!
