DE-1. A DIY LLM-based assistant

Tools

It is worth briefly describing the environment up front:

  • To work more efficiently within a Linux environment, I use WSL2 on Windows. The current distribution is Ubuntu-22.04.

  • The main device that will crunch our tensors: a GPU with 8 GB of VRAM (a GTX 1080 or better, for example) should be enough. If it is unclear how to check the memory requirements of the LLM you have chosen, software such as LM Studio will show them; a rough back-of-the-envelope estimate is also sketched right after this list.

  • For all calculations to run on the video card, it is also worth taking care of the cuDNN drivers. Installation deserves a separate article, but fortunately such articles already exist: option 1 (everything by yourself) or option 2 (using conda).

  • Ollama is a framework for running large language models locally; it is what runs the assistant's core, the LLM. The installation process is described on the official website.
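
As a rough rule of thumb (my own back-of-the-envelope estimate, not an exact formula): the weights alone take roughly parameter count × bytes per weight, plus some overhead for the KV cache and activations.

# Rough VRAM estimate for an LLM; the 1.5 GB overhead figure is an assumption.
def estimate_vram_gb(n_params_billion: float, bits_per_weight: int, overhead_gb: float = 1.5) -> float:
    weights_gb = n_params_billion * bits_per_weight / 8  # billions of params * bytes per weight
    return weights_gb + overhead_gb                      # overhead: KV cache, activations, runtime

print(estimate_vram_gb(8, 4))   # ~5.5 GB: an 8B model quantized to 4 bits fits in 8 GB of VRAM
print(estimate_vram_gb(8, 16))  # ~17.5 GB: full-precision weights would not fit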

To implement the assistant, I chose three key neural networks:

  • STT – Whisper. This is a speech recognition model developed by OpenAI. It processes audio files and transcribes them into text, supports multiple languages, and works even in noisy environments.

  • LLM – Llama3. This is a relatively new LLM; compared to its predecessors, it offers improved performance and more capable model configurations. It can answer questions, provide information, and even hold conversations based on a given context.

  • TTS – Coqui AI. The text-to-speech system voices the text responses. Among open-source solutions it offers fairly natural-sounding speech and flexible voice and intonation settings in a variety of languages.

Speech recognition. Whisper

Let's get started. The very first module converts voice to text, and the Whisper model is a good fit for this task. It comes in several sizes: base, small, medium, and large. In my setup the base model works best, offering a good balance between performance and recognition quality.
The functionality of the following code is very simple: inside the WhisperService class the model is loaded via the Whisper library, and the transcribe method takes the path to a WAV audio file and uses the model to convert it to text.

from abc import ABC, abstractmethod

class BaseService(ABC):
    def __init__(self, model):
        self.s2t = model

    @abstractmethod
    def transcribe(self, path_to_wav_file: str):
        """
        Abstract method to process audio files (in wav format) to text
        """
        pass


class WhisperService(BaseService):
    _BASE_MODEL_TYPE = 'base'

    def __init__(self, model_type: str = _BASE_MODEL_TYPE) -> None:
        import whisper

        model = whisper.load_model(model_type)
        super().__init__(model)

    def use_model(self, path_to_wav_file: str, language=None):
        return self.s2t.transcribe(path_to_wav_file, language=language)

    def transcribe(self, path_to_wav_file: str, language=None) -> str:
        result = self.use_model(path_to_wav_file, language=language)
        return result['text']
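
For reference, a minimal usage sketch (the file path here is just a placeholder):

stt = WhisperService()                        # loads the 'base' checkpoint
text = stt.transcribe('question.wav', language='en')
print(text)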

Processing requests. Llama3

The next important link is the text generation module. It currently uses a stock LLM; in my configuration _BASE_MODEL = llama3.1:latest. The code below implements a module that talks to the language model through the langchain_ollama library. Its main purpose is to send questions to the model and receive answers. The ask_model method, which is responsible for querying the model, uses a regular expression to detect the end of a sentence. It receives a question, sends it to the model, and processes the streaming response: chunks accumulate in a buffer, and as soon as a complete sentence appears in the buffer it is extracted and yielded. This way the method handles long responses efficiently and can hand each finished sentence to the TTS module as early as possible.

import re
from langchain_ollama import ChatOllama

from config import LLM_MODEL


class LangChainService:
    _BASE_MODEL = LLM_MODEL

    def __init__(self, model_type: str = _BASE_MODEL):
        self.model = ChatOllama(model=model_type)
        self.context = ""

    def ask_model(self, question: str):
        buffer = ""
        sentence_end_pattern = re.compile(r'[.!?]')

        # Stream the answer and yield it sentence by sentence.
        for chunk in self.model.stream(f'{self.context}\n{question}'):
            buffer += str(chunk.content)
            while True:
                match = sentence_end_pattern.search(buffer)
                if not match:
                    break
                end_idx = match.end()
                sentence = buffer[:end_idx].strip()
                sentence = sentence[:-1]  # drop the trailing punctuation mark before TTS
                yield sentence
                buffer = buffer[end_idx:].strip()
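
To make the streaming behaviour concrete, here is how the generator can be consumed (the question text is arbitrary):

service = LangChainService()
for sentence in service.ask_model('Tell me a short fact about the Moon.'):
    print(sentence)  # each completed sentence can be handed to the TTS module right away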

Speech synthesis. Coqui AI

Well, the last step is to convert the bot's response into audio. This is handled by a text-to-speech module built on the XTTS library. XTTSService initializes the TTS model, loading it onto whichever device is available, GPU or CPU. The main function of this service is the processing method, which takes text and saves it as a WAV audio file. The method also lets you specify the language, speaker, and playback speed for more flexible control.

from abc import ABC, abstractmethod

import torch

from config import TTS_XTTS_MODEL, TTS_XTTS_SPEAKER, TTS_XTTS_LANGUAGE


class BaseService(ABC):
    def __init__(self, model):
        self.t2s = model

    @abstractmethod
    def processing(self, text: str):
        """
        Abstract method to process text to audio files (in wav format)
        """
        pass


class XTTSService(BaseService):
    _BASE_MODEL_TYPE = TTS_XTTS_MODEL
    _BASE_MODEL_SPEAKER = TTS_XTTS_SPEAKER
    _BASE_MODEL_LANGUAGE = TTS_XTTS_LANGUAGE

    def __init__(self, model_type: str = _BASE_MODEL_TYPE) -> None:
        from TTS.api import TTS

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f'Apply {device} device for XTTS calculations')

        model = TTS(model_type).to(device)

        super().__init__(model)

    def processing(
        self,
        path_to_output_wav: str,
        text: str,
        language: str = _BASE_MODEL_LANGUAGE,
        speaker: str = _BASE_MODEL_SPEAKER,
        speed: float = 2,
    ):
        # Synthesize the text and write it to a WAV file with the given voice settings.
        self.t2s.tts_to_file(text=text, file_path=path_to_output_wav, language=language, speaker=speaker, speed=speed)
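
And a short usage sketch (the output path and text are placeholders; language and speaker fall back to the values from config):

tts = XTTSService()
tts.processing('answer.wav', 'Hello! I am your personal assistant.')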

Main.py script. Telegram API

To assemble the modules described above quickly and launch the assistant, you can talk to it via the Telegram API. Pros: no need to implement a client for recording and playing audio. Cons: the UX is not great, since you constantly have to press the record button in the interface.

The Telegram bot is built with the python-telegram-bot library.

Brief operating logic:

  1. The /start command: the user starts interacting with the bot and receives a welcome message.

  2. Voice message processing: the bot receives a voice message from the user, checks that it is present, converts it to WAV, and saves it.

  3. Speech recognition: the audio file is converted to text by WhisperService.

  4. Generating a response: the text is processed and a text reply is generated by LangChainService.

  5. Text to speech: the reply is converted into voice messages by XTTSService.

  6. Sending the response: the generated voice messages are sent back to the user.

Below is the listing that implements the logic described above:

from telegram import Update
from telegram.ext import filters, Application, CommandHandler, CallbackContext, MessageHandler

from config import TELEGRAM_BOT_TOKEN

from src.generative_ai.services import LangChainService
from src.speech2text.services import WhisperService
from src.fs_manager.services import TelegramBotApiArtifactsIO
from src.audio_formatter.services import PydubService
from src.text2speech.services import XTTSService
from src.telegram_api.services import user_verification
from src.shared.hash import md5_hash


speech_to_text = WhisperService()
text_to_speech = XTTSService()
file_system = TelegramBotApiArtifactsIO()
formatter = PydubService()
langchain = LangChainService()


async def verify_user(update: Update) -> None:
    user_id: str = str(update.effective_user.id)  # type: ignore
    user_verification(user_id)


async def start(update: Update, _: CallbackContext) -> None:
    await verify_user(update)
    await update.message.reply_text("Hello! I am your personal assistant. Let's start!")  # type: ignore


async def handle_audio(update: Update, context: CallbackContext) -> None:
    await verify_user(update)

    artifact_paths = []

    user_id: str = str(update.effective_user.id)  # type: ignore
    chat_id = update.message.chat_id  # type: ignore
    voice_message = update.message.voice  # type: ignore

    if not voice_message:
        await update.message.reply_text('Please, send me audio file.')  # type: ignore
        return

    input_file_path = await file_system.write_user_audio_file(user_id, voice_message)
    artifact_paths.append(input_file_path)
    output_file_path = formatter.processing(input_file_path, '.wav')  # type: ignore
    artifact_paths.append(output_file_path)
    text_message = speech_to_text.transcribe(output_file_path)

    for text_sentence in langchain.ask_model(text_message):
        sentence_hash = md5_hash(text_sentence)
        wav_ai_answer_filepath = file_system.make_user_artifact_file_path(
            user_id=user_id, filename=f'{sentence_hash}.wav'
        )
        artifact_paths.append(wav_ai_answer_filepath)
        text_to_speech.processing(wav_ai_answer_filepath, text_sentence)
        ogg_ai_answer_filepath = formatter.processing(wav_ai_answer_filepath, '.ogg')
        artifact_paths.append(ogg_ai_answer_filepath)
        await send_voice_message(context=context, chat_id=chat_id, file_path=ogg_ai_answer_filepath)

    file_system.delete_artifacts(user_id=user_id, filename_array=artifact_paths)


async def send_voice_message(context: CallbackContext, chat_id, file_path: str):
    with open(file_path, 'rb') as voice_file:
        await context.bot.send_voice(chat_id=chat_id, voice=voice_file)


def main() -> None:
    application = Application.builder().token(TELEGRAM_BOT_TOKEN).build()

    application.add_handler(CommandHandler('start', start))
    application.add_handler(MessageHandler(filters.VOICE & ~filters.COMMAND, handle_audio))

    application.run_polling()


if __name__ == '__main__':
    main()
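
The listing relies on the config module from the repository. A minimal sketch of what it needs to provide looks roughly like this (the XTTS model name and speaker are assumptions on my part, not necessarily the values used in the repository):

# config.py -- minimal sketch of the settings the modules above import.
import os

TELEGRAM_BOT_TOKEN = os.environ['TELEGRAM_BOT_TOKEN']  # token issued by @BotFather

LLM_MODEL = 'llama3.1:latest'  # Ollama model tag used by LangChainService

# XTTS settings: the multilingual XTTS v2 checkpoint and one of its built-in speakers (assumed).
TTS_XTTS_MODEL = 'tts_models/multilingual/multi-dataset/xtts_v2'
TTS_XTTS_SPEAKER = 'Ana Florence'
TTS_XTTS_LANGUAGE = 'en'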

Browser client. WebSockets

After working with the Telegram bot, I came to the conclusion that it is not entirely convenient for a full-fledged voice assistant. The bot provides basic interaction capabilities, but it limits the user experience. So I started thinking about how to implement a client application as painlessly as possible.
The most obvious solution for me was a WebSocket-based browser client. Pros: audio recording and playback devices are connected via the browser, and the client can run on any device.

This is the client I put together quickly. All recorded frames are continuously streamed to the backend, while audio responses are collected in a queue and played back one after another by the playNextAudio function. Below is the client code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Chekov</title>
</head>
<body>
    <button id="startBtn">Start Recording</button>
    <button id="stopBtn">Stop Recording</button>
    <button id="enableAudioBtn">Enable Audio Playback</button>
    <script>
        const TIME_SLICE = 100;
        const WS_HOST = "localhost";
        const WS_PORT = 8765;

        const WS_URL = `ws://${WS_HOST}:${WS_PORT}`;
        const ws = new WebSocket(WS_URL);

        ws.onopen = () => console.log("WebSocket connection established");
        ws.onclose = () => console.log("WebSocket connection closed");
        ws.onerror = (e) => console.error("WebSocket error:", e);
        ws.onmessage = (event) => collectVoiceAnswers(event);

        let mediaRecorder;
        let audioEnabled = false;
        const audioQueue = [];
        let isPlaying = false;

        async function startRecord() {
            const userMediaSettings = { audio: true };
            const stream = await navigator.mediaDevices.getUserMedia(userMediaSettings);
            mediaRecorder = new MediaRecorder(stream);
            mediaRecorder.ondataavailable = streamData;
            mediaRecorder.start(TIME_SLICE);
        }

        function streamData(event) {
            if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
                wsSend(event.data);
            }
        }

        function wsSend(data) {
            ws.send(data);
        }

        function stopRecord() {
            if (mediaRecorder) {
                mediaRecorder.stop();
            }
        }

        function collectVoiceAnswers(event) {
            if (!audioEnabled) return;

            const { data, type } = JSON.parse(event.data);
            const audioData = atob(data);
            const byteArray = new Uint8Array(audioData.length);

            for (let i = 0; i < audioData.length; i++) {
                byteArray[i] = audioData.charCodeAt(i);
            }

            const audioBlob = new Blob([byteArray], { type: "audio/wav" });
            const audioUrl = URL.createObjectURL(audioBlob);
            const audio = new Audio(audioUrl);
            audioQueue.push({ audio, type });
            if (!isPlaying) {
                playNextAudio();
            }
        }

        async function playNextAudio() {
            if (audioQueue.length === 0) {
                isPlaying = false;
                return;
            }

            isPlaying = true;
            const { audio, type } = audioQueue.shift();
            try {
                await new Promise((resolve, reject) => {
                    audio.onended = resolve;
                    audio.onerror = reject;
                    audio.play().catch(reject);
                });
                playNextAudio();
            } catch (e) {
                console.error("Error playing audio:", e);
                isPlaying = false;
            }
        }

        document.getElementById("startBtn").addEventListener("click", startRecord);
        document.getElementById("stopBtn").addEventListener("click", stopRecord);
    </script>
</body>
</html>

The server-side implementation that handles WebSocket connections and interacts with the rest of the assistant is available at the specified link; a quick start guide can be found at the same link in the repository.
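
For orientation, here is a minimal, self-contained sketch of the server-side protocol (not the repository's server file): it accepts the browser's audio chunks and replies with base64-encoded WAV payloads in the JSON shape that collectVoiceAnswers() on the client expects. The actual STT -> LLM -> TTS pipeline is reduced to a placeholder, and 'answer.wav' is assumed to exist.

import asyncio
import base64
import json

import websockets


async def handler(websocket):
    async for chunk in websocket:
        # In the real server these chunks are buffered, converted to WAV (PydubService),
        # transcribed (WhisperService), answered (LangChainService) and voiced (XTTSService).
        # Placeholder: send back a pre-rendered WAV file.
        with open('answer.wav', 'rb') as f:
            payload = {
                'type': 'audio/wav',
                'data': base64.b64encode(f.read()).decode('ascii'),
            }
        await websocket.send(json.dumps(payload))


async def main():
    async with websockets.serve(handler, 'localhost', 8765):
        await asyncio.Future()  # run until cancelled


if __name__ == '__main__':
    asyncio.run(main())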

Conclusion

That's it. To further improve the assistant and add new functions (namely assisting functions, so that the bot starts living up to its name) such as saving notes, searching for information, and other useful features, it is worth considering fine-tuning the LLM to produce unified answers in the format {command, message}. It would also be useful to add post-processing that handles these commands with classical algorithms on top of the LLM output.
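
As a concrete illustration of the post-processing idea, a small sketch (the {command, message} format and the command names are assumptions about a future fine-tuned model, not something the current assistant produces):

import json


def save_note(text: str) -> None:   # hypothetical stub
    print(f'[note saved] {text}')


def search_web(query: str) -> str:  # hypothetical stub
    return f'Search results for: {query}'


def handle_llm_output(raw_answer: str) -> str:
    """Route a {command, message} answer to classical post-processing."""
    try:
        parsed = json.loads(raw_answer)
        command, message = parsed['command'], parsed['message']
    except (json.JSONDecodeError, KeyError, TypeError):
        return raw_answer  # not structured output: treat it as a plain chat reply

    if command == 'save_note':
        save_note(message)
        return 'Note saved.'
    if command == 'search':
        return search_web(message)
    return message  # unknown command: just voice the message


print(handle_llm_output('{"command": "save_note", "message": "buy milk"}'))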

And that's all. Thank you for reading to the end!

Here I will leave a link to the entire assistant code
