How a Simple Python Script Using AI Can Optimize Your Workflow

Programmers know firsthand how important it is to give tired wrists a rest now and then. Being able to dictate text, whether during long programming sessions or as part of a more ergonomic setup, can be a real lifesaver. In this tutorial, I will walk you through building a speech-to-text transcription tool in Python that is fast and accurate thanks to AI, namely the Whisper API from Groq.

Our goal is to develop a script that runs in the background and lets you activate voice input in any application by pressing and holding a key. When you release the key, the script converts the recorded speech to text and automatically inserts it into the active input field. As a result, you get voice input in almost any application.

Enjoy reading!

Prerequisites

Before we start implementing the project, we need to make sure that we have certain tools. First of all, Python must be installed on our system, and we will also need to install the following libraries:

      pip install keyboard pyautogui pyperclip groq pyaudio

Each of these libraries plays an important role in our project:

  • PyAudio, which provides processing of the audio signal coming from the microphone;

  • Keyboard, which allows you to track keyboard events and respond to keystrokes;

  • PyAutoGUI, which simulates keyboard input to automatically insert transcribed text;

  • Pyperclip, which interacts with the operating system's clipboard;

  • And finally, Groq, namely the Groq API client, which provides access to the Whisper implementation.

We will also need a Groq API key. If you don't have one yet, you can get one for free by signing up on the Groq website.

Code

If you want to see the finished project, you can find the full code in my GitHub repository. In this article, we will focus on the key components of the script and on how they work together to turn speech into text.

It is important to note that in this implementation we do not use the Atomic Agents library.

Setting up the environment

The first step is to import the necessary libraries and set up the Groq client. To initialize the client, we use an API key stored in an environment variable. This is common practice for sensitive information such as API keys, because it keeps them out of the source code. Before proceeding, make sure the GROQ_API_KEY environment variable is set. Note that the snippet below reads it with os.environ.get, so a .env file containing the key only helps if something loads it into the environment first (for example, the python-dotenv package).

import os
import tempfile
import wave
import pyaudio
import keyboard
import pyautogui
import pyperclip
from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
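
If you would rather not add another dependency, the loading step can be sketched with the standard library alone. This is a minimal illustration, not part of the original script: load_env_file and require_api_key are hypothetical helper names, and the parser only understands plain KEY=VALUE lines.

```python
import os


def load_env_file(path=".env", environ=None):
    """Hypothetical helper: copy KEY=VALUE lines from a .env file into environ."""
    environ = os.environ if environ is None else environ
    if not os.path.exists(path):
        return environ
    with open(path, encoding="utf-8") as f:
        for raw_line in f:
            line = raw_line.strip()
            # Skip blanks, comments, and anything that is not KEY=VALUE
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Values already present in the environment take precedence
            environ.setdefault(key.strip(), value.strip().strip("\"'"))
    return environ


def require_api_key(environ=None, name="GROQ_API_KEY"):
    """Hypothetical helper: return the key or fail early with a readable message."""
    environ = os.environ if environ is None else environ
    key = environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to .env")
    return key
```

With these in place, the client could be created as Groq(api_key=require_api_key()) after calling load_env_file(), so a missing key is reported at startup rather than on the first API call.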
    

Recording audio

The record_audio function, as its name suggests, captures audio from the microphone:

def record_audio(sample_rate=16000, channels=1, chunk=1024):
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=channels,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk,
    )

    print("Press and hold the PAUSE key to start recording...")
    frames = []
    keyboard.wait("pause")  # Block until the PAUSE key is pressed
    print("Recording... (release PAUSE to stop)")
    while keyboard.is_pressed("pause"):
        # The stream has been open while we waited, so its buffer may have
        # overflowed; do not raise on that, just keep reading
        data = stream.read(chunk, exception_on_overflow=False)
        frames.append(data)
    print("Recording finished.")
    stream.stop_stream()
    stream.close()
    p.terminate()
    return frames, sample_rate
    

We record at a sampling rate of 16000 Hz, which suits Whisper well. Whisper automatically downsamples audio to 16000 Hz, so a higher rate only produces a larger file and, because uploads are limited in size, shortens the maximum recording length you can transcribe.
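
The effect of the sampling rate on file size is simple arithmetic: uncompressed PCM takes sample_rate × channels × bytes_per_sample bytes per second. The sketch below works through it. The 25 MB LIMIT is an assumed placeholder rather than a figure from this article, so check the limit that applies to your account.

```python
def pcm_bytes(seconds, sample_rate=16000, channels=1, sample_width=2):
    """Size of raw PCM audio in bytes, ignoring the small WAV header."""
    return int(seconds * sample_rate * channels * sample_width)


def max_seconds(limit_bytes, sample_rate=16000, channels=1, sample_width=2):
    """Longest recording that fits under limit_bytes."""
    return limit_bytes / (sample_rate * channels * sample_width)


# Assumed upload limit, used only for illustration
LIMIT = 25 * 1024 * 1024

for rate in (16000, 44100, 48000):
    per_minute_mb = pcm_bytes(60, sample_rate=rate) / 1e6
    minutes = max_seconds(LIMIT, sample_rate=rate) / 60
    print(f"{rate} Hz: {per_minute_mb:.2f} MB per minute, about {minutes:.1f} minutes fit")
```

Tripling the rate from 16000 Hz to 48000 Hz triples the file size and cuts the maximum recording length to a third, with no gain in recognition quality.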

The function opens a PyAudio input stream and waits for the PAUSE key to be pressed. Once it is pressed, recording begins and continues for as long as the key is held down. The PAUSE key was chosen because it is rarely used in modern applications, but you can easily switch to any other key.
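
The hold-to-record logic boils down to a short loop. Here it is sketched with the key check and the audio read passed in as callables, which is an assumption made for illustration: record_while is a hypothetical helper, not part of the original script. Written this way, changing the hotkey is a one-line change at the call site.

```python
def record_while(is_pressed, read_chunk, max_chunks=None):
    """Collect audio chunks for as long as is_pressed() returns True.

    max_chunks is an optional safety cap so a stuck key cannot record forever.
    """
    frames = []
    while is_pressed():
        frames.append(read_chunk())
        if max_chunks is not None and len(frames) >= max_chunks:
            break
    return frames


# Possible use with the real libraries (not executed here):
#   frames = record_while(
#       lambda: keyboard.is_pressed("f9"),
#       lambda: stream.read(1024, exception_on_overflow=False),
#   )
```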

Saving audio to a temporary file

Once the audio recording is complete, the next step is to save it to a temporary file for further processing. For this purpose, we use the function save_audio:

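def save_audio(frames, sample_rate):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_audio:
        # Write the WAV data through the already-open temporary file
        with wave.open(temp_audio, "wb") as wf:
            wf.setnchannels(1)  # Must match the channel count used when recording
            # Module-level helper; avoids creating a PyAudio instance that is never terminated
            wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
            wf.setframerate(sample_rate)
            wf.writeframes(b"".join(frames))
        return temp_audio.name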
    

This function creates a temporary WAV file using tempfile. A temporary file is a good fit because we only need the audio data briefly, for the duration of the transcription. Because the file is created with delete=False, it is not removed automatically; we delete it ourselves in the main loop, which we will return to later.
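
One way to make sure the temporary file never outlives its use is to wrap creation and deletion in a context manager. The sketch below is an alternative arrangement, not the structure used in this script: temporary_wav is a hypothetical name, and the sample width of 2 bytes assumes 16-bit audio, as recorded above.

```python
import os
import tempfile
import wave
from contextlib import contextmanager


@contextmanager
def temporary_wav(frames, sample_rate, channels=1, sample_width=2):
    """Write frames to a temporary WAV file, yield its path, then delete it."""
    # Reserve a file name first, then close the handle before writing to it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        path = tmp.name
    try:
        with wave.open(path, "wb") as wf:
            wf.setnchannels(channels)
            wf.setsampwidth(sample_width)
            wf.setframerate(sample_rate)
            wf.writeframes(b"".join(frames))
        yield path
    finally:
        # Runs whether or not the body of the with-block raised
        os.unlink(path)
```

Used as "with temporary_wav(frames, sample_rate) as path: ...", the file exists only inside the with-block and is removed even if transcription fails.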

Transcribing audio with Groq

The core of our script is the transcription process, which is implemented in the function transcribe_audio:

def transcribe_audio(audio_file_path):
    try:
        with open(audio_file_path, "rb") as file:
            transcription = client.audio.transcriptions.create(
                file=(os.path.basename(audio_file_path), file.read()),
                model="whisper-large-v3",
                # Context that helps the model recognize domain-specific terms
                prompt="""Audio recording of a programmer discussing programming problems. The programmer mostly uses Python and may mention Python libraries or refer to code in their speech.""",
                response_format="text",
                language="en",
            )
        return transcription
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
    

The transcribe_audio function transcribes an audio file using the Groq API. For high recognition accuracy we choose the "whisper-large-v3" model, which Groq serves at impressive speed. The prompt parameter sets context for the model, which helps it interpret the recording more accurately. In our case, we state that the recording is about programming, so the model is more likely to recognize specific terms such as library names.
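
If you want to tune the context for your own vocabulary, the prompt can be assembled from a list of terms you expect to say. The sketch below is one possible approach. build_prompt is a hypothetical helper, and max_chars is an assumed stand-in for the model's prompt length limit, not a documented value.

```python
def build_prompt(context, terms, max_chars=800):
    """Append as many expected terms to the context as fit within max_chars."""
    kept = []
    for term in terms:
        candidate = f"{context} Terms that may come up: {', '.join(kept + [term])}."
        if len(candidate) > max_chars:
            break  # Adding this term would exceed the budget
        kept.append(term)
    if not kept:
        return context
    return f"{context} Terms that may come up: {', '.join(kept)}."


prompt = build_prompt(
    "Audio recording of a programmer discussing programming problems.",
    ["PyAudio", "pyperclip", "PyAutoGUI", "Groq", "Whisper"],
)
print(prompt)
```

The resulting string can be passed as the prompt argument in transcribe_audio in place of the fixed text.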

Processing transcription results

Once the transcription is complete and the text is received, it must be pasted into the active application. This task is performed by the function copy_transcription_to_clipboard:

def copy_transcription_to_clipboard(text):
    pyperclip.copy(text)
    pyautogui.hotkey("ctrl", "v")
    

The function uses the pyperclip library to copy the transcribed text to the clipboard and then uses pyautogui to simulate pressing Ctrl+V, which pastes the text into the active application. This makes the tool work with almost any text input field that accepts a paste from the clipboard, regardless of the application. Keep in mind that this overwrites whatever was on the clipboard, and that on macOS the paste shortcut is Command+V rather than Ctrl+V.
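
A slightly more careful variant picks the paste modifier by platform and puts the previous clipboard text back afterwards. This is a sketch under stated assumptions: paste_modifier and paste_and_restore are hypothetical helpers, the clipboard and hotkey functions are passed in as arguments, and the 0.1 second pause is a guess at how long the target application needs.

```python
import sys
import time


def paste_modifier(platform=sys.platform):
    """Modifier key for the paste shortcut on the given platform."""
    return "command" if platform == "darwin" else "ctrl"


def paste_and_restore(text, copy, paste, hotkey, settle_seconds=0.1):
    """Paste text into the active application, then restore the old clipboard."""
    previous = paste()             # Remember what the user had copied
    copy(text)
    hotkey(paste_modifier(), "v")  # Simulate Ctrl+V (Command+V on macOS)
    time.sleep(settle_seconds)     # Let the application read the clipboard first
    copy(previous)
```

With the real libraries this would be called as paste_and_restore(text, pyperclip.copy, pyperclip.paste, pyautogui.hotkey). Since pyperclip handles text only, non-text clipboard contents such as images would not be restored.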

Main loop

The last element of our script is the main() function, which combines all components into a single workflow:

def main():
    while True:
        # Record audio
        frames, sample_rate = record_audio()

        # Save the audio to a temporary file
        temp_audio_file = save_audio(frames, sample_rate)

        # Transcribe the audio
        print("Transcribing...")
        transcription = transcribe_audio(temp_audio_file)

        # Copy the transcription to the clipboard and paste it
        if transcription:
            print("\nTranscription:")
            print(transcription)
            print("Copying the transcription to the clipboard...")
            copy_transcription_to_clipboard(transcription)
            print("Transcription copied to the clipboard and pasted into the application.")
        else:
            print("Transcription failed.")

        # Delete the temporary file
        os.unlink(temp_audio_file)
        print("\nReady for the next recording. Press PAUSE to start.")


if __name__ == "__main__":
    main()
    

The main() function is organized as an infinite loop, so you can make multiple recordings without restarting the script. Each iteration of the loop includes the following steps:

  1. Pressing and holding the PAUSE key records audio;

  2. The recorded audio data is saved to a temporary file;

  3. Groq's Whisper API transcribes audio into text;

  4. If the transcription is successful, the text is copied to the clipboard and pasted into the active application;

  5. The temporary audio data file is deleted to free up resources.
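
The five steps above can also be sketched as a single function that cleans up even when a step fails and that skips empty recordings. Each step is passed in as a callable, an assumption made here so the flow can be exercised without a microphone or network access. run_once is a hypothetical name, not part of the original script.

```python
def run_once(record, save, transcribe, deliver, remove):
    """Run one record, save, transcribe, deliver cycle.

    Returns the transcription, or None if nothing was delivered.
    """
    frames, sample_rate = record()
    if not frames:
        # The key was released before any audio chunk was read
        return None
    path = save(frames, sample_rate)
    try:
        text = transcribe(path)
        if text:
            deliver(text)
        return text or None
    finally:
        # Runs even if transcribe() or deliver() raises
        remove(path)
```

With the functions from this article, one iteration would be run_once(record_audio, save_audio, transcribe_audio, copy_transcription_to_clipboard, os.unlink).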

Thanks for reading!
