Recognizing speech using the IBM Speech-to-Text API

Hello, Habr. As part of the “Machine Learning. Advanced” course, we have prepared a translation of an interesting article for you.

We also invite everyone to watch an open lesson on the topic “Multi-armed bandits to optimize AB testing.”

Extract conversations from audio recordings with ease using Python.

In this article, you will learn how to use the IBM Speech to Text API to recognize speech from an audio recording file. We will be using the free version of the API, which has some limitations, such as the length of the sound file. I’ll tell you more about the API later in this article. Let me start by providing you with some background information on the application of speech recognition in our daily life.


If you are reading this article, I am sure that you are familiar with the term “artificial intelligence” and understand its importance. It won’t shock anyone if I say that one of the best uses of artificial intelligence in everyday life is speech recognition.

At the very least, speech recognition saves us time: we speak instead of typing. This makes our devices easier and more fun to use. The technology also lets us interact with these devices without writing any code. Imagine if people had to know programming in order to give commands to Alexa or Siri. That would be absurd.

I can’t wait to show you the speech recognizer in action. Let’s get to work. Here are the steps we will follow in this project.


  • Cloud Speech Recognition Services

  • Step 1 – Library

  • Step 2 – Importing an audio clip

  • Step 3 – Defining a Recognizer

  • Step 4 – Speech Recognition in Action

  • Final Step – Export Result

Cloud Speech Recognition Services

Many large tech companies have their own recognition models. I’ll share some of them here so you can see the big picture. These APIs run in the cloud and can be accessed from anywhere in the world as long as you have an internet connection. Most of them are paid, but you can test them for free. For example, Microsoft offers one year of free access with an Azure cloud account.

Some of the more popular cloud-based speech-to-text services are:

Step 1 – Library

For this project we need only one library: SpeechRecognition. It is free and open source, and it supports several speech recognition engines and APIs, such as Microsoft Azure Speech, Google Cloud Speech, and the IBM Watson Speech to Text API. In this project, we will be testing the IBM Watson Speech to Text API. Feel free to explore the source code and documentation of the SpeechRecognition package.

Let’s start by installing the package. We are going to use pip, the Python package manager.

pip install SpeechRecognition

After the installation is complete, open your code editor (Jupyter Notebook also works) and import the library.

import speech_recognition as s_r
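Before going further, it can help to confirm which version of the package pip actually installed, because the arguments accepted by individual recognizer methods may differ between releases. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is my own and not part of SpeechRecognition.

```python
from importlib import metadata


def installed_version(package="SpeechRecognition"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("SpeechRecognition is not installed; run: pip install SpeechRecognition")
else:
    print(f"SpeechRecognition {version} is installed")
```

If a call later in this article fails with an unexpected-argument error, comparing this version with the one in the package documentation is a reasonable first check.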

Step 2 – Importing an audio clip

I recorded a voice memo on my computer. It was saved in m4a format, but the recognizer does not work with m4a files, so I had to convert it to wav format.

audio_file = s_r.AudioFile('my_clip.wav')
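I did not say how I converted the clip, so treat the following as one possible approach rather than the method used here. It assumes the ffmpeg command-line tool is installed and on your PATH. The choice of mono audio at 16 kHz is also an assumption on my part, and the helper names are hypothetical.

```python
import shutil
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that converts `src` to a mono 16-bit PCM WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file, for example an .m4a voice memo
        "-ac", "1",               # downmix to a single channel (assumed choice)
        "-ar", str(sample_rate),  # resample to the given rate (assumed choice)
        "-c:a", "pcm_s16le",      # encode as 16-bit PCM
        dst,
    ]


def convert_to_wav(src, dst):
    """Run ffmpeg; raises if ffmpeg is missing or the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


# Show the command without running it.
print(" ".join(build_ffmpeg_command("my_clip.m4a", "my_clip.wav")))
```

Calling `convert_to_wav("my_clip.m4a", "my_clip.wav")` would then produce the file that `s_r.AudioFile` loads above.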

Step 3 – Defining a Recognizer

At this point, all we do is define the speech recognizer. We imported the library earlier; now we create a new variable and assign a Recognizer instance to it.

rcgnzr = s_r.Recognizer()

Step 4 – Speech Recognition in Action

It’s time for action! We will run IBM Speech to Text on our audio file. Before starting the recognizer, I call the methods “adjust_for_ambient_noise” and “record”: the first calibrates the recognizer against the background noise at the start of the clip, and the second reads the audio into memory. The goal is to help the recognizer produce more accurate results.

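with audio_file as source:
   rcgnzr.adjust_for_ambient_noise(source)  # calibrates on (and consumes) the first second of the clip by default
   clean_audio = rcgnzr.record(source)      # reads the rest of the file into an AudioData object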
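Since the free plan limits how much audio you can send, it is useful to know how long a clip is and, if needed, to read it in pieces. The sketch below uses only the standard `wave` module. The helper names and the chunk size are my own assumptions, and it runs on a generated silent clip so that no real recording is required.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def chunk_windows(total_seconds, chunk_seconds):
    """Return (offset, duration) pairs that cover total_seconds in order."""
    if total_seconds < 0 or chunk_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and chunk_seconds > 0")
    windows = []
    offset = 0.0
    while offset < total_seconds:
        duration = float(min(chunk_seconds, total_seconds - offset))
        windows.append((offset, duration))
        offset += duration
    return windows


# Demo on a generated silent clip instead of a real recording.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_silence.wav")
with wave.open(demo_path, "wb") as demo:
    demo.setnchannels(1)                   # mono
    demo.setsampwidth(2)                   # 16-bit samples
    demo.setframerate(16000)               # 16 kHz
    demo.writeframes(b"\x00\x00" * 40000)  # 40000 frames / 16000 Hz = 2.5 s

clip_seconds = wav_duration_seconds(demo_path)
print(clip_seconds)                      # 2.5
print(chunk_windows(clip_seconds, 1.0))  # [(0.0, 1.0), (1.0, 1.0), (2.0, 0.5)]
```

With the objects from this article, each duration could be passed to `rcgnzr.record(source, duration=...)` inside a single `with audio_file as source:` block. As far as I can tell, consecutive calls then continue from where the previous one stopped, but please verify this against the version you have installed.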

Great, now our audio recording is ready. Let’s launch the IBM Speech Recognizer. (It took me several hours to figure out how the IBM Speech-to-Text API integrates with the Python SpeechRecognition library.) Here’s the best way I found to call the recognizer through the API:

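# With an IBM API key, the username is the literal string "apikey" and the password is the key itself.
recognized_speech_ibm = rcgnzr.recognize_ibm(clean_audio, username="apikey", password="your API Key")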

Note: The IBM API does not work without an API key, which you get from the IBM Watson page. I had to create an account to test this Speech-to-Text model. What I liked about the IBM model is that a trial account can process 500 minutes of recordings per month, which is more than enough for educational purposes.
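Rather than pasting the key into the script, you may prefer to read it from an environment variable and to handle the two exceptions the library raises for failed recognition. This is a hedged sketch: the variable name `IBM_STT_API_KEY` and both function names are hypothetical. The `recognize_ibm` call mirrors the one shown above, so adjust its arguments if your installed version expects something different.

```python
import os


def load_api_key(env=os.environ, name="IBM_STT_API_KEY"):
    """Read the API key from an environment variable (the name is hypothetical)."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


def transcribe_with_ibm(wav_path, api_key):
    """Load a WAV file and send it to IBM, translating library errors."""
    # Imported lazily so that load_api_key can be used without the package.
    import speech_recognition as s_r

    rcgnzr = s_r.Recognizer()
    with s_r.AudioFile(wav_path) as source:
        clean_audio = rcgnzr.record(source)
    try:
        # Same call as in the article; other versions may expect other arguments.
        return rcgnzr.recognize_ibm(clean_audio, username="apikey", password=api_key)
    except s_r.UnknownValueError:
        return ""  # the service could not make out any speech
    except s_r.RequestError as error:
        raise RuntimeError(f"IBM request failed: {error}") from error


# Demo of the key loader with a fake environment; no network call is made.
print(load_api_key({"IBM_STT_API_KEY": "demo-key"}))
```

A real run would look like `transcribe_with_ibm("my_clip.wav", load_api_key())` after exporting the variable in your shell.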

Final Step – Export Result

We’re almost done. It’s time to check the result. Our recognizer detected speech in the audio file in the previous step. We’ll go ahead and check how it worked. If the result suits us, we export it to a text document.

To check the recognized speech, we display the variable with the recognized text:

print(recognized_speech_ibm)

Looks nice. My audio file was recognized correctly. I read a paragraph from this article. If you are not satisfied with the result, there are many ways to preprocess an audio file for better results. Here is a good article that provides more detailed information on speech recognition and how to improve recognition quality.

I am now exporting the recognized speech to a text document. We will see the message “ready!” in our terminal when the export completes.

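with open('recognized_speech.txt', mode="w") as file:
   file.write("Recognized Speech:\n")   # heading line
   file.write(recognized_speech_ibm)    # the text returned by the recognizer
print("ready!")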
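If the transcript may contain non-ASCII characters, writing it with an explicit encoding avoids surprises on systems whose default is not UTF-8. Here is a small variation on the export step using `pathlib`. The function name `save_transcript` is hypothetical, and the demo writes to a temporary folder with placeholder text instead of a real transcript.

```python
import tempfile
from pathlib import Path


def save_transcript(text, path):
    """Write the transcript under a heading line; return the number of characters written."""
    content = f"Recognized Speech:\n{text}\n"
    Path(path).write_text(content, encoding="utf-8")
    return len(content)


# Demo with placeholder text in a temporary folder.
out_path = Path(tempfile.mkdtemp()) / "recognized_speech.txt"
save_transcript("hello world", out_path)
print(out_path.read_text(encoding="utf-8"))
```

In the project itself you would pass `recognized_speech_ibm` as the first argument and `'recognized_speech.txt'` as the second.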

Congratulations! If you are reading this paragraph, you have created a speech recognizer. I hope you enjoyed this how-to guide and learned something new today. The best way to practice your programming skills is to pursue interesting projects. I have written many other practical projects like this. Do not hesitate to contact me if you have any questions while implementing the program.

Let’s connect. Visit my blog and YouTube channel for a boost of inspiration. Thank you.

