Creating an AI Secretary with Whisper and ChatGPT

The process of recording the progress of a call (photo in color)

Greetings! My name is Gregory, and I am the head of special projects on the AllSee team. Artificial intelligence has become an indispensable assistant in many areas of our lives, but I believe we should always strive for more, automating every process we can. In this article, I will share my experience using Whisper and ChatGPT to build an AI secretary that streamlines the storage and processing of corporate calls.

Motivation

It is not only obvious but also scientifically proven that using AI in the workflow increases productivity: by up to 37% for white-collar workers!

Source: https://joshbersin.com/2023/03/new-mit-research-shows-spectacular-increase-in-white-collar-productivity-from-chatgpt/

Let’s try to “knock out” some productivity percentage points of our own by automating the routine task of reviewing call recordings.

What are we starting with?

We have a fairly large database of calls that sometimes have to be reviewed, at 30-40 minutes per recording; for some videos, especially interviews or requirements-gathering sessions, processing can take even longer than the meeting itself.

I want to make this process more efficient: retrieve meeting transcripts, label the participants for quick review, and ask an AI assistant specific questions about the call. To interact with the system, I will use a Telegram bot.

Solution

The technical structure of the solution is shown below:

Technical structure of the solution

I will omit many details related to the UX of the bot, but I will try to touch on important points that you will definitely have to deal with when developing and integrating such a project.

Squid Proxy

To work with the ChatGPT API from Russia, we need to set up a proxy server. I will use Squid, a caching proxy server for the HTTP, HTTPS, FTP, and Gopher protocols.

Instructions for setting up Squid Proxy

The installation is described for a server running Ubuntu, but the instructions are similar for other Linux distributions, differing only in the package manager used.

Installing squid

sudo apt update
sudo apt-get -y install squid

Network pre-configuration

sudo ufw allow squid
sudo iptables -P INPUT ACCEPT

Activation of the squid service

sudo systemctl enable squid

Setting up squid

Create a backup of the squid configuration, then strip the roughly 8,000 comment lines from the original file. Note that we read from the backup, since redirecting a file onto itself would truncate it before grep could read it; wrapping the command in sh -c also lets the redirect run with root privileges.

sudo cp /etc/squid/squid.conf /etc/squid/squid_back.conf
sudo sh -c "grep -v '^ *#\|^ *$' /etc/squid/squid_back.conf > /etc/squid/squid.conf"

Open the squid configuration file.

sudo apt-get -y install nano
sudo nano /etc/squid/squid.conf

The squid.conf file needs to be modified as follows:

...
http_access allow localhost
acl whitelist src 111.111.111.111   # Add the IP address of our main server (specify the real IP) to the access list
http_access allow whitelist         # Allow the main server to use our proxy
http_access deny all
...

Restarting the squid service

To apply our settings, simply restart squid.

sudo systemctl restart squid
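
Once the proxy is running, the main server can route its OpenAI traffic through it. Below is a minimal sketch, assuming Squid listens on its default port 3128 at a placeholder address; the openai client uses httpx under the hood, which honors the standard proxy environment variables.

import os

# Placeholder proxy address: substitute the real IP of the Squid server
os.environ["HTTP_PROXY"] = "http://222.222.222.222:3128"
os.environ["HTTPS_PROXY"] = "http://222.222.222.222:3128"

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any available chat model works here
    messages=[{"role": "user", "content": "Summarize this call transcript: ..."}],
)
print(response.choices[0].message.content)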

WhisperX

For our task, simply transcribing the audio is not enough; we also need to solve the diarization problem: picking out individual speakers from the audio stream. The task is complicated by the fact that several interlocutors can take part in negotiations and calls, while many algorithms are limited to two participants.

WhisperX comes to our aid: a “wrapper” around the standard Whisper model that provides transcription, word-level timestamps, and diarization of up to 100 interlocutors.
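
Before diving into the server code, it helps to see what WhisperX produces. The snippet below is an illustrative, hand-written example of the diarized output shape (field names follow the library's segment and word structure; all values are made up):

# Illustrative shape of a diarized WhisperX result (values are invented)
example_result = {
    "segments": [
        {
            "start": 0.52,
            "end": 4.10,
            "text": "Good afternoon, colleagues, shall we begin?",
            "speaker": "SPEAKER_00",
            "words": [
                {"word": "Good", "start": 0.52, "end": 0.78, "score": 0.99, "speaker": "SPEAKER_00"},
                # ... one entry per word ...
            ],
        },
        # ... one entry per speech segment, each tagged with a speaker ...
    ],
}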

To integrate WhisperX into our solution, I wrote a small FastAPI server that processes incoming audio files.

Implementation and launch of the server
Imports and settings
import asyncio
import os
import uuid
from dataclasses import dataclass
from datetime import datetime
from queue import Queue
from threading import Thread
from typing import Any

from fastapi import HTTPException, BackgroundTasks, FastAPI, status, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic_settings import BaseSettings
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget
from streaming_form_data.validators import MaxSizeValidator

import whisperx


@dataclass
class WhisperXModels:
    whisper_model: Any
    diarize_pipeline: Any
    align_model: Any
    align_model_metadata: Any


class TranscriptionAPISettings(BaseSettings):
    tmp_dir: str = "tmp"
    cors_origins: str = "*"
    cors_allow_credentials: bool = True
    cors_allow_methods: str = "*"
    cors_allow_headers: str = "*"
    whisper_model: str = "large-v2"
    device: str = "cuda"
    compute_type: str = "float16"
    batch_size: int = 16
    language_code: str = "auto"
    hf_api_key: str = ""
    file_loading_chunk_size_mb: int = 1024
    task_cleanup_delay_min: int = 60
    max_file_size_mb: int = 4096
    max_request_body_size_mb: int = 5000

    class Config:
        env_file = "env/.env.cuda"
        env_file_encoding = "utf-8"


class MaxBodySizeException(Exception):
    def __init__(self, body_len: int):
        self.body_len = body_len


class MaxBodySizeValidator:
    def __init__(self, max_size: int):
        self.body_len = 0
        self.max_size = max_size

    def __call__(self, chunk: bytes):
        self.body_len += len(chunk)
        if self.body_len > self.max_size:
            raise MaxBodySizeException(self.body_len)


settings = TranscriptionAPISettings()

app = FastAPI()
# noinspection PyTypeChecker
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.cors_origins.split(','),
    allow_credentials=settings.cors_allow_credentials,
    allow_methods=settings.cors_allow_methods.split(','),
    allow_headers=settings.cors_allow_headers.split(','),
)
Transcription logic
transcription_tasks = {}
transcription_tasks_queue = Queue()

whisperx_models = WhisperXModels(
    whisper_model=None,
    diarize_pipeline=None,
    align_model=None,
    align_model_metadata=None
)


def load_whisperx_models() -> None:
    global whisperx_models

    whisperx_models.whisper_model = whisperx.load_model(
        whisper_arch=settings.whisper_model,
        device=settings.device,
        compute_type=settings.compute_type,
        language=settings.language_code if settings.language_code != "auto" else None
    )

    whisperx_models.diarize_pipeline = whisperx.DiarizationPipeline(
        use_auth_token=settings.hf_api_key,
        device=settings.device
    )

    if settings.language_code != "auto":
        (
            whisperx_models.align_model,
            whisperx_models.align_model_metadata
        ) = whisperx.load_align_model(
            language_code=settings.language_code,
            device=settings.device
        )


def transcribe_audio(audio_file_path: str) -> dict:
    global whisperx_models

    audio = whisperx.load_audio(audio_file_path)

    transcription_result = whisperx_models.whisper_model.transcribe(
        audio,
        batch_size=int(settings.batch_size),
    )

    if settings.language_code == "auto":
        language = transcription_result["language"]
        (
            whisperx_models.align_model,
            whisperx_models.align_model_metadata
        ) = whisperx.load_align_model(
            language_code=language,
            device=settings.device
        )

    aligned_result = whisperx.align(
        transcription_result["segments"],
        whisperx_models.align_model,
        whisperx_models.align_model_metadata,
        audio,
        settings.device,
        return_char_alignments=False
    )

    diarize_segments = whisperx_models.diarize_pipeline(audio)

    final_result = whisperx.assign_word_speakers(
        diarize_segments,
        aligned_result
    )

    return final_result


def transcription_worker() -> None:
    while True:
        task_id, tmp_path = transcription_tasks_queue.get()

        try:
            result = transcribe_audio(tmp_path)
            transcription_tasks[task_id].update({"status": "completed", "result": result})

        except Exception as e:
            transcription_tasks[task_id].update({"status": "failed", "result": str(e)})

        finally:
            transcription_tasks_queue.task_done()
            os.remove(tmp_path)
FastAPI logic
@app.on_event("startup")
async def startup_event() -> None:
    os.makedirs(settings.tmp_dir, exist_ok=True)
    load_whisperx_models()
    Thread(target=transcription_worker, daemon=True).start()


async def cleanup_task(task_id: str) -> None:
    await asyncio.sleep(settings.task_cleanup_delay_min * 60)
    transcription_tasks.pop(task_id, None)


@app.post("/transcribe/")
async def create_upload_file(
        request: Request,
        background_tasks: BackgroundTasks
) -> dict:
    task_id = str(uuid.uuid4())
    tmp_path = f"{settings.tmp_dir}/{task_id}.audio"

    transcription_tasks[task_id] = {
        "status": "loading",
        "creation_time": datetime.utcnow(),
        "result": None
    }

    body_validator = MaxBodySizeValidator(settings.max_request_body_size_mb * 1024 * 1024)

    try:
        file_target = FileTarget(
            tmp_path,
            validator=MaxSizeValidator(settings.max_file_size_mb * 1024 * 1024)
        )
        parser = StreamingFormDataParser(headers=request.headers)
        parser.register('file', file_target)
        async for chunk in request.stream():
            body_validator(chunk)
            parser.data_received(chunk)

    except MaxBodySizeException as e:
        raise HTTPException(
            status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
            detail=f"Maximum request body size limit exceeded: {e.body_len} bytes"
        )

    except Exception as e:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Error processing upload: {str(e)}"
        )

    if not file_target.multipart_filename:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise HTTPException(
            status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
            detail="No file was uploaded"
        )

    transcription_tasks[task_id].update({"status": "processing"})
    transcription_tasks_queue.put((task_id, tmp_path))

    background_tasks.add_task(cleanup_task, task_id)

    return {
        "task_id": task_id,
        "creation_time": trancription_tasks[task_id]["creation_time"].isoformat(),
        "status": trancription_tasks[task_id]["status"]
    }


@app.get("/transcribe/status/{task_id}")
async def get_task_status(task_id: str) -> dict:
    task = transcription_tasks.get(task_id)

    if not task:
        raise HTTPException(status_code=404, detail="Task not found")

    return {
        "task_id": task_id,
        "creation_time": task["creation_time"],
        "status": task["status"]
    }


@app.get("/transcribe/result/{task_id}")
async def get_task_result(task_id: str) -> dict:
    task = transcription_tasks.get(task_id)

    if not task:
        raise HTTPException(status_code=404, detail="Task not found")

    if task["status"] == "pending":
        raise HTTPException(status_code=404, detail="Task not completed")

    return {
        "task_id": task_id,
        "creation_time": task["creation_time"],
        "status": task["status"],
        "result": task["result"]
    }
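
With the endpoints above in place, a client submits a recording and then polls until processing finishes. Below is a minimal sketch using requests; the base URL, file name, and polling interval are illustrative assumptions.

import time

import requests

BASE_URL = "http://127.0.0.1:8000"  # assuming the server runs locally

# Submit an audio file for transcription (sent as multipart form data
# under the "file" field, which the streaming parser above expects)
with open("call_recording.mp3", "rb") as f:
    response = requests.post(f"{BASE_URL}/transcribe/", files={"file": f})
response.raise_for_status()
task_id = response.json()["task_id"]

# Poll until the task either completes or fails
while True:
    status = requests.get(f"{BASE_URL}/transcribe/status/{task_id}").json()["status"]
    if status in ("completed", "failed"):
        break
    time.sleep(5)

# Fetch the transcript with per-word speaker labels
print(requests.get(f"{BASE_URL}/transcribe/result/{task_id}").json()["result"])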
Dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04

RUN apt-get update && apt-get install -y software-properties-common
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && apt-get install -y python3.10 python3.10-venv python3-pip ffmpeg

WORKDIR /whisperx-fastapi
COPY . .

RUN python3.10 -m venv venv
RUN /bin/bash -c "source venv/bin/activate && pip install --upgrade pip"
RUN /bin/bash -c "source venv/bin/activate && pip install -e ."
RUN /bin/bash -c "source venv/bin/activate && pip install -r fastapi/requirements-fastapi-cuda.txt"

WORKDIR /whisperx-fastapi/fastapi

EXPOSE 8000

CMD ["../venv/bin/uvicorn", "api.app:app", "--host", "0.0.0.0", "--port", "8000"]
Starting the server

To deploy the server, just clone the repository, specify your HuggingFace access token in ./fastapi/env/.env.cuda, and then build and run the Docker container.

sudo docker build -f fastapi/dockerization/dockerfile.fastapi.cuda -t whisperx-fastapi-cuda .
sudo docker run -p 8000:8000 --env-file ./fastapi/env/.env.cuda  --gpus all --name whisperx-fastapi-cuda-container whisperx-fastapi-cuda

Telegram Bot API

A big problem was the limitation that Telegram bots cannot process files larger than 50 MB, which, in the context of processing hour-long voice and video recordings of calls, is a ridiculously small figure.

This problem is solved quite simply: deploy your own local Bot API server using telegram-bot-api, which raises the limit to 2000 MB. In this setup, all of the bot's files are stored directly on our machine, which lets us work with them faster, without waiting for them to be downloaded piece by piece from Telegram's servers.

How can I do that?

It's actually very simple:

apt-get install -y --no-install-recommends \
    build-essential \
    libssl-dev \
    zlib1g-dev \
    git \
    cmake \
    gperf \
    g++

git clone --recursive https://github.com/tdlib/telegram-bot-api.git && \
    cd telegram-bot-api && \
    mkdir build && \
    cd build && \
    cmake .. && \
    cmake --build . --target install

cd telegram-bot-api/build 
./telegram-bot-api --api-id=${TELEGRAM_API_ID} --api-hash=${TELEGRAM_API_HASH} --local
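
Once the server is running, it is worth checking that the bot responds on the local port (8081 is the default for telegram-bot-api). Here is a quick sketch with requests, where the token is a placeholder:

import requests

token = "YOUR_BOT_TOKEN"  # placeholder: the real token comes from @BotFather
response = requests.get(f"http://localhost:8081/bot{token}/getMe")
print(response.json())  # expect {"ok": True, "result": {...}} if the server is up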

Python Telegram Bot

And finally, to handle user requests, I used python-telegram-bot.

There is nothing special to highlight in the bot's logic, so we'll focus on integrating the ready-made Telegram bot into our solution; a minimal connection sketch follows.
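
For reference, here is a minimal sketch of pointing python-telegram-bot (v20+) at the local Bot API server. The token and handler are placeholders; base_url and local_mode are real ApplicationBuilder options:

from telegram import Update
from telegram.ext import Application, ContextTypes, MessageHandler, filters


async def handle_media(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # In local mode, file_path points to a file on our own machine,
    # so even multi-gigabyte recordings need no extra download step
    file = await update.message.effective_attachment.get_file()
    await update.message.reply_text(f"Got the recording: {file.file_path}")


app = (
    Application.builder()
    .token("YOUR_BOT_TOKEN")                # placeholder for TELEGRAM_BOT_TOKEN
    .base_url("http://localhost:8081/bot")  # route API calls to the local server
    .local_mode(True)                       # treat returned file paths as local paths
    .build()
)
app.add_handler(MessageHandler(filters.VIDEO | filters.AUDIO | filters.VOICE, handle_media))
app.run_polling()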

Deploying a Telegram bot

Loading the repository

git clone https://github.com/allseeteam/ai-secretary.git

Building the container

docker build --build-arg HTTP_PROXY=${HTTP_PROXY} --build-arg HTTPS_PROXY=${HTTPS_PROXY} --build-arg NO_PROXY=${NO_PROXY} -t ai-secretary .
  • HTTP_PROXY — Proxy server address used to bypass geographic restrictions

  • HTTPS_PROXY — Same as HTTP_PROXY

  • NO_PROXY — Addresses to which requests are sent without the proxy

Creating a Docker volume for the SQLite database

docker volume create ai_secretary_sqlite_db

Running the container

docker run -d --network host --volume ai_secretary_sqlite_db:/ai-secretary/database --env-file env/.env --name ai-secretary-container ai-secretary
  • TELEGRAM_API_ID — Telegram application ID

  • TELEGRAM_API_HASH — Telegram application hash

  • TELEGRAM_BOT_TOKEN — Telegram bot token

  • TELEGRAM_BOT_API_BASE_URL — Base address of your bot server (for a local server: http://localhost:8081/bot)

  • OPENAI_API_KEY — OpenAI token

  • SQLITE_DB_PATH — The path where we want to store our SQLite database (default: bot/database/ai-secretary.db)

  • TRANSCRIPTION_API_BASE_URL — Base address of the transcription server (you can deploy it following the instructions from its repository; for a local server the address is http://127.0.0.1:8000, in which case the container must be started with --network host, as in the command above)

Result

Example of working with a bot

As you can see, our AI secretary processes incoming video files without any problems and discusses their contents with the user, taking into account both the content of the recording and the context of previous questions.

Conclusion

With the help of ChatGPT, Whisper, and a dash of programming magic, we were able to boost productivity when working with the corporate call database and eliminate the need to watch hour-long video calls to refresh our memory before work.

And there is good news for everyone who likes to poke around: within the next week the bot will be available for everyone to review and use. If you run into problems or have suggestions about the bot's functionality, feel free to write to the contacts in the description.

The entire codebase can be found in the project repository. If someone gets inspired by the project and decides to build on our work, I will be glad to accept your pull request.

What specific aspects of your work need automation? Share your ideas and thoughts in the comments: I look forward to discussing them with you. Be good people and leave the rest to the machines. Good luck, and we'll be in touch ✌️
