Creating an AI Secretary with Whisper and ChatGPT
Greetings! My name is Gregory, and I head special projects at AllSee. In the modern world, artificial intelligence has become an indispensable assistant in many areas of our lives. Still, I believe you should always strive for more by automating every process you can. In this article, I will share my experience using Whisper and ChatGPT to create an AI secretary that optimizes the storage and processing of corporate calls.
Motivation
It is not only obvious but also scientifically proven that using AI in the workflow increases productivity: by up to 37% for white-collar workers!
Let’s try to “knock out” our productivity percentages by automating the routine task of viewing call recordings.
What are we starting with?
There is a fairly large database of calls that sometimes have to be reviewed, spending 30-40 minutes per recording; for some videos, especially interviews or requirements-gathering sessions, processing can take even longer than the meeting itself.
I want to make this process more efficient by retrieving meeting transcripts, highlighting participants for quick review, and letting users ask an AI assistant specific questions. To interact with our system, I will use a Telegram bot.
Solution
The technical structure of the solution includes a Squid proxy server for accessing the OpenAI API, a WhisperX transcription server, a local Telegram Bot API server, and a Python Telegram bot.
I will omit many details related to the UX of the bot, but I will try to touch on important points that you will definitely have to deal with when developing and integrating such a project.
Squid Proxy
To work with the ChatGPT API from Russia, we will need to set up a proxy server. I will use Squid, a caching proxy server for the HTTP, HTTPS, FTP, and Gopher protocols.
Instructions for setting up Squid Proxy
The installation is described for a server running Ubuntu, but for other Linux distributions the instructions are similar, differing only in the package manager used.
Installing squid
sudo apt update
sudo apt-get -y install squid
Network pre-configuration
sudo ufw allow squid
sudo iptables -P INPUT ACCEPT
Activation of the squid service
sudo systemctl enable squid
Setting up squid
Create a backup of the squid configuration and strip the roughly 8000 comment lines from the original file.
sudo cp /etc/squid/squid.conf /etc/squid/squid_back.conf
sudo sh -c "grep -v '^ *#\|^ *$' /etc/squid/squid_back.conf > /etc/squid/squid.conf"
Open the squid configuration file.
sudo apt-get -y install nano
sudo nano /etc/squid/squid.conf
The squid.conf file needs to be modified as follows:
...
http_access allow localhost
acl whitelist src 111.111.111.111 # Add the IP address of our main server (specify the real IP) to the access list
http_access allow whitelist # Allow the main server to use our proxy
http_access deny all
...
Restarting the squid service
To apply our settings, simply restart squid.
sudo systemctl restart squid
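Once the proxy is up, API requests can be routed through it. Below is a minimal sketch of building the proxy configuration in the format that requests, httpx, and the standard proxy environment variables all understand; the address 111.111.111.111:3128 is a placeholder (3128 is Squid's default listening port), and the model name in the commented call is purely illustrative.

```python
def proxy_config(proxy_url: str) -> dict:
    """Proxy mapping in the format understood by requests/httpx and curl env vars."""
    return {"http": proxy_url, "https": proxy_url}

# Placeholder address from the config above; Squid listens on 3128 by default
proxies = proxy_config("http://111.111.111.111:3128")

# Example: an OpenAI chat-completions call routed through the proxy
# (requires the requests package and a real API key):
# import requests
# response = requests.post(
#     "https://api.openai.com/v1/chat/completions",
#     headers={"Authorization": "Bearer YOUR_OPENAI_API_KEY"},
#     json={"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi"}]},
#     proxies=proxies,
# )
```

The same mapping can also be exported as the HTTP_PROXY/HTTPS_PROXY environment variables that the Docker build below relies on.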
WhisperX
To solve this problem, it is not enough to simply transcribe the audio; we also need to solve the diarization problem: separating individual speakers in the audio stream. The task is complicated by the fact that several participants may take part in negotiations and calls, while many algorithms are limited to two speakers.
WhisperX comes to our aid: a wrapper around the standard Whisper model that adds word-level timestamps and diarization of up to 100 speakers.
To integrate WhisperX into our solution, I wrote a small FastAPI server that processes incoming audio files.
Implementation and launch of the server
Imports and settings
import asyncio
import os
import uuid
from dataclasses import dataclass
from datetime import datetime
from queue import Queue
from threading import Thread
from typing import Any
from fastapi import HTTPException, BackgroundTasks, FastAPI, status, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic_settings import BaseSettings
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget
from streaming_form_data.validators import MaxSizeValidator
import whisperx
@dataclass
class WhisperXModels:
    whisper_model: Any
    diarize_pipeline: Any
    align_model: Any
    align_model_metadata: Any


class TranscriptionAPISettings(BaseSettings):
    tmp_dir: str = "tmp"
    cors_origins: str = "*"
    cors_allow_credentials: bool = True
    cors_allow_methods: str = "*"
    cors_allow_headers: str = "*"
    whisper_model: str = "large-v2"
    device: str = "cuda"
    compute_type: str = "float16"
    batch_size: int = 16
    language_code: str = "auto"
    hf_api_key: str = ""
    file_loading_chunk_size_mb: int = 1024
    task_cleanup_delay_min: int = 60
    max_file_size_mb: int = 4096
    max_request_body_size_mb: int = 5000

    class Config:
        env_file = "env/.env.cuda"
        env_file_encoding = 'utf-8'


class MaxBodySizeException(Exception):
    def __init__(self, body_len: int):
        self.body_len = body_len


class MaxBodySizeValidator:
    def __init__(self, max_size: int):
        self.body_len = 0
        self.max_size = max_size

    def __call__(self, chunk: bytes):
        self.body_len += len(chunk)
        if self.body_len > self.max_size:
            raise MaxBodySizeException(self.body_len)
settings = TranscriptionAPISettings()
app = FastAPI()
# noinspection PyTypeChecker
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.cors_origins.split(','),
    allow_credentials=settings.cors_allow_credentials,
    allow_methods=settings.cors_allow_methods.split(','),
    allow_headers=settings.cors_allow_headers.split(','),
)
Transcription logic
trancription_tasks = {}
trancription_tasks_queue = Queue()
whisperx_models = WhisperXModels(
    whisper_model=None,
    diarize_pipeline=None,
    align_model=None,
    align_model_metadata=None
)


def load_whisperx_models() -> None:
    global whisperx_models
    whisperx_models.whisper_model = whisperx.load_model(
        whisper_arch=settings.whisper_model,
        device=settings.device,
        compute_type=settings.compute_type,
        language=settings.language_code if settings.language_code != "auto" else None
    )
    whisperx_models.diarize_pipeline = whisperx.DiarizationPipeline(
        use_auth_token=settings.hf_api_key,
        device=settings.device
    )
    if settings.language_code != "auto":
        (
            whisperx_models.align_model,
            whisperx_models.align_model_metadata
        ) = whisperx.load_align_model(
            language_code=settings.language_code,
            device=settings.device
        )


def transcribe_audio(audio_file_path: str) -> dict:
    global whisperx_models
    audio = whisperx.load_audio(audio_file_path)
    transcription_result = whisperx_models.whisper_model.transcribe(
        audio,
        batch_size=int(settings.batch_size),
    )
    # With automatic language detection, load the alignment model lazily
    if settings.language_code == "auto":
        language = transcription_result["language"]
        (
            whisperx_models.align_model,
            whisperx_models.align_model_metadata
        ) = whisperx.load_align_model(
            language_code=language,
            device=settings.device
        )
    aligned_result = whisperx.align(
        transcription_result["segments"],
        whisperx_models.align_model,
        whisperx_models.align_model_metadata,
        audio,
        settings.device,
        return_char_alignments=False
    )
    diarize_segments = whisperx_models.diarize_pipeline(audio)
    final_result = whisperx.assign_word_speakers(
        diarize_segments,
        aligned_result
    )
    return final_result


def transcription_worker() -> None:
    while True:
        task_id, tmp_path = trancription_tasks_queue.get()
        try:
            result = transcribe_audio(tmp_path)
            trancription_tasks[task_id].update({"status": "completed", "result": result})
        except Exception as e:
            trancription_tasks[task_id].update({"status": "failed", "result": str(e)})
        finally:
            trancription_tasks_queue.task_done()
            os.remove(tmp_path)
FastAPI logic
@app.on_event("startup")
async def startup_event() -> None:
    os.makedirs(settings.tmp_dir, exist_ok=True)
    load_whisperx_models()
    Thread(target=transcription_worker, daemon=True).start()


async def cleanup_task(task_id: str) -> None:
    await asyncio.sleep(settings.task_cleanup_delay_min * 60)
    trancription_tasks.pop(task_id, None)


@app.post("/transcribe/")
async def create_upload_file(
    request: Request,
    background_tasks: BackgroundTasks
) -> dict:
    task_id = str(uuid.uuid4())
    tmp_path = f"{settings.tmp_dir}/{task_id}.audio"
    trancription_tasks[task_id] = {
        "status": "loading",
        "creation_time": datetime.utcnow(),
        "result": None
    }
    body_validator = MaxBodySizeValidator(settings.max_request_body_size_mb * 1024 * 1024)
    try:
        file_target = FileTarget(
            tmp_path,
            validator=MaxSizeValidator(settings.max_file_size_mb * 1024 * 1024)
        )
        parser = StreamingFormDataParser(headers=request.headers)
        parser.register('file', file_target)
        # Stream the upload to disk chunk by chunk instead of buffering it in memory
        async for chunk in request.stream():
            body_validator(chunk)
            parser.data_received(chunk)
    except MaxBodySizeException as e:
        raise HTTPException(
            status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
            detail=f"Maximum request body size limit exceeded: {e.body_len} bytes"
        )
    except Exception as e:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Error processing upload: {str(e)}"
        )
    if not file_target.multipart_filename:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise HTTPException(
            status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
            detail="No file was uploaded"
        )
    trancription_tasks[task_id].update({"status": "processing"})
    trancription_tasks_queue.put((task_id, tmp_path))
    background_tasks.add_task(cleanup_task, task_id)
    return {
        "task_id": task_id,
        "creation_time": trancription_tasks[task_id]["creation_time"].isoformat(),
        "status": trancription_tasks[task_id]["status"]
    }
@app.get("/transcribe/status/{task_id}")
async def get_task_status(task_id: str) -> dict:
    task = trancription_tasks.get(task_id)
    if not task:
        raise HTTPException(status_code=404, detail="Task not found")
    return {
        "task_id": task_id,
        "creation_time": task["creation_time"],
        "status": task["status"]
    }


@app.get("/transcribe/result/{task_id}")
async def get_task_result(task_id: str) -> dict:
    task = trancription_tasks.get(task_id)
    if not task:
        raise HTTPException(status_code=404, detail="Task not found")
    # The worker only ever sets "completed" or "failed";
    # anything else means the task is still in progress
    if task["status"] not in ("completed", "failed"):
        raise HTTPException(status_code=404, detail="Task not completed")
    return {
        "task_id": task_id,
        "creation_time": task["creation_time"],
        "status": task["status"],
        "result": task["result"]
    }
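The result field returned by the server follows WhisperX's assign_word_speakers output: a dict with a "segments" list, where each segment carries start, end, text, and (when diarization succeeds) a speaker label. A small sketch of turning that structure into a readable transcript, merging consecutive segments from the same speaker; the sample data here is invented for illustration.

```python
def format_transcript(result: dict) -> str:
    """Render WhisperX segments as 'SPEAKER: text' lines,
    merging consecutive segments from the same speaker."""
    lines = []
    for segment in result.get("segments", []):
        speaker = segment.get("speaker", "UNKNOWN")
        text = segment["text"].strip()
        if lines and lines[-1][0] == speaker:
            lines[-1][1] += " " + text  # same speaker keeps talking
        else:
            lines.append([speaker, text])
    return "\n".join(f"{speaker}: {text}" for speaker, text in lines)

# Invented sample mimicking the WhisperX output shape
sample = {"segments": [
    {"start": 0.0, "end": 2.1, "text": " Hello!", "speaker": "SPEAKER_00"},
    {"start": 2.1, "end": 3.0, "text": " How are you?", "speaker": "SPEAKER_00"},
    {"start": 3.2, "end": 4.0, "text": " Fine, thanks.", "speaker": "SPEAKER_01"},
]}
print(format_transcript(sample))
# SPEAKER_00: Hello! How are you?
# SPEAKER_01: Fine, thanks.
```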
Dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04
RUN apt-get update && apt-get install -y software-properties-common
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && apt-get install -y python3.10 python3.10-venv python3-pip ffmpeg
WORKDIR /whisperx-fastapi
COPY . .
RUN python3.10 -m venv venv
RUN /bin/bash -c "source venv/bin/activate && pip install --upgrade pip"
RUN /bin/bash -c "source venv/bin/activate && pip install -e ."
RUN /bin/bash -c "source venv/bin/activate && pip install -r fastapi/requirements-fastapi-cuda.txt"
WORKDIR /whisperx-fastapi/fastapi
EXPOSE 8000
CMD ["../venv/bin/uvicorn", "api.app:app", "--host", "0.0.0.0", "--port", "8000"]
Starting the server
To deploy the server, just clone the repository, specify your HuggingFace access token in ./fastapi/env/.env.cuda, then build and run the Docker container.
sudo docker build -f fastapi/dockerization/dockerfile.fastapi.cuda -t whisperx-fastapi-cuda .
sudo docker run -p 8000:8000 --env-file ./fastapi/env/.env.cuda --gpus all --name whisperx-fastapi-cuda-container whisperx-fastapi-cuda
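With the container running, a client uploads a file and then polls for the result. Below is a sketch: the endpoint paths and the 'file' form field match the FastAPI code above, while the polling helper takes the status-fetching function as a parameter, so the loop itself can be tested without a live server. The file name and base URL are placeholders.

```python
import time

def poll_until_done(fetch_status, task_id: str,
                    interval_s: float = 5.0, max_tries: int = 720) -> str:
    """Poll fetch_status(task_id) until the task reaches a terminal state."""
    for _ in range(max_tries):
        current = fetch_status(task_id)
        if current in ("completed", "failed"):
            return current
        time.sleep(interval_s)
    raise TimeoutError(f"Task {task_id} did not finish")

# Against a live local deployment (requires the requests package):
# import requests
# base = "http://127.0.0.1:8000"
# with open("call.mp3", "rb") as f:
#     task_id = requests.post(f"{base}/transcribe/", files={"file": f}).json()["task_id"]
# fetch = lambda tid: requests.get(f"{base}/transcribe/status/{tid}").json()["status"]
# if poll_until_done(fetch, task_id) == "completed":
#     result = requests.get(f"{base}/transcribe/result/{task_id}").json()["result"]
```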
Telegram Bot API
A big problem was the 50 MB file limit for Telegram bots, which, in the context of processing voice and video recordings of hour-long calls, is a ridiculously small figure.
This problem can be solved quite simply: deploy your own local Telegram Bot API server using telegram-bot-api, which allows processing files up to 2000 MB. In this case, all the bot's files are stored directly on our machine, which lets us work with them faster, without first downloading them from Telegram's servers.
How can I do that?
It's actually very simple:
apt-get install -y --no-install-recommends \
build-essential \
libssl-dev \
zlib1g-dev \
git \
cmake \
gperf \
g++
git clone --recursive https://github.com/tdlib/telegram-bot-api.git && \
cd telegram-bot-api && \
mkdir build && \
cd build && \
cmake .. && \
cmake --build . --target install
cd telegram-bot-api/build
./telegram-bot-api --api-id=${TELEGRAM_API_ID} --api-hash=${TELEGRAM_API_HASH} --local
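Once the local server is running, the bot library must be pointed at it instead of api.telegram.org (in python-telegram-bot this is the base_url setting, which is why the env variable below defaults to http://localhost:8081/bot). A tiny sketch of how the method URL changes; the token is a placeholder, and the helper function is my own illustration, not part of any library.

```python
def method_url(base_url: str, token: str, method: str) -> str:
    """Build the URL for a Bot API method call, e.g. getMe or sendMessage."""
    return f"{base_url}{token}/{method}"

# Official cloud endpoint vs. our local server (placeholder token)
cloud = method_url("https://api.telegram.org/bot", "123456:ABC-DEF", "getMe")
local = method_url("http://localhost:8081/bot", "123456:ABC-DEF", "getMe")
print(local)  # http://localhost:8081/bot123456:ABC-DEF/getMe
```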
Python Telegram Bot
And finally, to process requests from users I used python-telegram-bot.
There is nothing special to highlight in its logic, so we will focus on integrating the ready-made Telegram bot into our solution.
Deploying a Telegram bot
Loading the repository
git clone https://github.com/allseeteam/ai-secretary.git
Container assembly
docker build --build-arg HTTP_PROXY=${HTTP_PROXY} --build-arg HTTPS_PROXY=${HTTPS_PROXY} --build-arg NO_PROXY=${NO_PROXY} -t ai-secretary .
HTTP_PROXY — proxy server address used to bypass geographic restrictions
HTTPS_PROXY — same as HTTP_PROXY
NO_PROXY — addresses that should be reached without the proxy
Creating docker volume for sqlite database
docker volume create ai_secretary_sqlite_db
Running a container
docker run -d --network host --volume ai_secretary_sqlite_db:/ai-secretary/database --env-file env/.env --name ai-secretary-container ai-secretary
TELEGRAM_API_ID — Telegram application ID
TELEGRAM_API_HASH — Hash of the Telegram application
TELEGRAM_BOT_TOKEN — Telegram bot token
TELEGRAM_BOT_API_BASE_URL — Base address of your bot server (for a local server: http://localhost:8081/bot)
OPENAI_API_KEY — OpenAI token
SQLITE_DB_PATH — The path where we want to store our SQLite database (standard address: bot/database/ai-secretary.db)
TRANSCRIPTION_API_BASE_URL — Base address of the transcription server (you can deploy it following the instructions from the repository; for a local server, use http://127.0.0.1:8000 and start the container with --network host, as shown above)
Result
As you can see, our AI secretary processes incoming video files without any problems, and also discusses their contents with the user, taking into account the content of the recording and the context of past questions.
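Answering with "the context of past questions" in mind boils down to assembling a Chat Completions message list: the transcript goes into a system message, followed by the stored question/answer history and the new question. A sketch of that assembly; the prompt wording and function name are my own illustration, not taken from the project code.

```python
def build_messages(transcript: str, history: list, question: str) -> list:
    """Assemble a chat-completions message list: transcript as system context,
    then past Q&A pairs, then the new user question."""
    messages = [{
        "role": "system",
        "content": "You are an AI secretary. Answer questions about this call transcript:\n"
                   + transcript,
    }]
    for user_q, assistant_a in history:  # past Q&A pairs kept as context
        messages.append({"role": "user", "content": user_q})
        messages.append({"role": "assistant", "content": assistant_a})
    messages.append({"role": "user", "content": question})
    return messages

msgs = build_messages(
    "SPEAKER_00: Let's ship on Friday.",
    [("Who spoke?", "One speaker, SPEAKER_00.")],
    "What was decided?",
)
```

This list is what gets sent (through the proxy) as the messages parameter of an OpenAI chat-completions request.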
Conclusion
With the help of ChatGPT, Whisper, and a dash of programmer magic, we were able to increase productivity when working with the corporate call database and eliminate the need to watch hour-long video calls to refresh our memory before work.
And some good news for everyone who likes to tinker: within the next week, the bot will be available to everyone to try out. If you run into problems or have suggestions about the bot's functionality, feel free to write to the contacts in the description.
The entire codebase can be found in the project repository. If someone gets inspired by the project and decides to build on our work, I will be glad to accept your pull request.
What specific aspects of your work require automation? Share your ideas and thoughts in the comments – I look forward to discussing them with you. Be good people and leave the rest to the machines. Good luck and we'll be in touch✌️