How to Convert Text to Speech Using Google Tesseract and Arm NN on Raspberry Pi

Hello, Habr! Today, to mark the start of a new Machine Learning course stream, we are sharing a post whose author builds a text-to-speech device. A text-to-speech (TTS) engine is a key element of systems that aim to create natural interactions between humans and machines on embedded devices. Such devices can, for example, help visually impaired people read signs, letters, and documents: using optical character recognition, the device can tell the user what is visible in an image. So, let's start building…

TTS applications have been available on desktop computers for many years and are widely used on most modern smartphones and mobile devices. These applications can be found among the accessibility tools in the operating system, and are also widely used for screen readers, custom alerts, and more.

Typically, such systems start with some machine-readable text. What if you don’t have a text source ready for your document, browser, or application? Optical character recognition (OCR) software can convert scanned images to text. In the context of a TTS application, these are glyphs – individual characters. The OCR software itself is only concerned with the accurate extraction of numbers and letters.

Deep learning techniques can be used to accurately detect text in real time, recognizing sets of glyphs as spoken words. Here, to recognize words in text captured by OCR, one could use a recurrent neural network (RNN). What if this could be done on an embedded device that is lighter and more compact than even a smartphone?

Such a lightweight, powerful TTS device can help people with visual impairments. It can be embedded in tamper-proof devices for literacy or storytelling and many other uses.

In this article, I’ll show you how to do this with TensorFlow, OpenCV, Festival, and Raspberry Pi. For OCR, I will be using the TensorFlow machine learning framework along with a pre-trained Keras-OCR model. The OpenCV library will be used to capture images from the webcam. Finally, the Festival speech synthesis system will act as a TTS module. Then let’s put everything together to create a Python application for the Raspberry Pi.

Along the way, I'll cover how typical OCR models work and how to further optimize the solution with TensorFlow Lite, a set of tools for running optimized TensorFlow models in constrained environments such as embedded and IoT devices. The complete source code is available on my GitHub page.

Getting started

First, you need a Raspberry Pi to create the device and app for this tutorial. For this example, versions 2, 3, or 4 are fine. You can also use your own development computer (we tested the code for Python 3.7).

Two packages need to be installed: tensorflow (2.1.0) and keras_ocr (0.7.1).
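Assuming a standard Raspberry Pi OS image with pip available, the installation might look like the following (the pinned versions are the ones mentioned above; note that the package is named keras-ocr on PyPI but imported as keras_ocr):

```shell
# Install the pinned versions used in this article; on a Raspberry Pi this
# may take a while, since tensorflow pulls in several large dependencies.
python3 -m pip install tensorflow==2.1.0 keras-ocr==0.7.1
```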

OCR using recurrent neural networks

Here, for OCR on images, I use the keras_ocr package. It is based on the TensorFlow platform and a convolutional recurrent neural network that was originally published as an OCR example on the Keras website.

The network architecture can be divided into three important stages. The first takes an input image and extracts features using several convolutional layers. These layers split the input image horizontally; for each part, they produce a set of image-column features. This sequence of column features is consumed by recurrent layers in the second stage.

Recurrent neural networks (RNNs) usually consist of long short-term memory (LSTM) layers. LSTMs have revolutionized many applications of AI, including speech recognition, image captioning, and time-series analysis. OCR models use recurrent layers to create a so-called character probability matrix. This matrix specifies the degree of confidence that a given character appears in a specific part of the input image.

Thus, in the last step, this matrix is used to decode the text in the image. This is usually done with the Connectionist Temporal Classification (CTC) algorithm, which seeks to transform the matrix into a meaningful word or sequence of words. The transformation is not a trivial task, since the same character can appear in adjacent parts of the image, and some input parts may contain no character at all.
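To make the decoding step concrete, here is a minimal sketch of greedy (best-path) CTC decoding. The alphabet, matrix values, and blank index are illustrative assumptions, not the actual keras-ocr internals:

```python
def greedy_ctc_decode(prob_matrix, alphabet, blank_index):
    # prob_matrix: one row per image slice, one column per character class,
    # plus a special "blank" class that CTC uses as a separator.
    decoded = []
    previous = None
    for step_probs in prob_matrix:
        # Take the most likely class for this slice
        index = max(range(len(step_probs)), key=step_probs.__getitem__)
        # Collapse repeated classes, then drop blanks
        if index != previous and index != blank_index:
            decoded.append(alphabet[index])
        previous = index
    return ''.join(decoded)

# Toy example: alphabet "to" plus a blank class at index 2.
probability_matrix = [
    [0.9, 0.05, 0.05],  # 't'
    [0.9, 0.05, 0.05],  # repeated 't', collapsed into one character
    [0.1, 0.1, 0.8],    # blank
    [0.1, 0.8, 0.1],    # 'o'
]
print(greedy_ctc_decode(probability_matrix, 'to', blank_index=2))  # prints "to"
```

Real CTC decoders often use beam search, sometimes with a language model, but even this greedy version shows why collapsing repeats and removing blanks resolves characters that span adjacent image slices.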

While RNN-based OCR systems are effective, there are many challenges when trying to incorporate them into your own projects. Ideally, you perform transfer learning to tune a pretrained model to your data, then convert the model to TensorFlow Lite format to optimize it for inference on the endpoint. This approach has proven successful in mobile computer-vision applications; for example, many pretrained MobileNet networks efficiently classify images on mobile and IoT devices.

However, TensorFlow Lite supports only a subset of TensorFlow operations. This incompatibility becomes an issue when you need to run an OCR model such as the one included in the keras-ocr package on an IoT device. A list of possible solutions is provided on the official TensorFlow website.

In this article, I will show how to use the full TensorFlow model, since the bidirectional LSTM layers used in keras-ocr are not yet supported in TensorFlow Lite.

Pretrained OCR Model

First, I wrote a test script that shows how to use the neural network model from keras-ocr:

# Imports
import keras_ocr
import helpers

# Prepare OCR recognizer
recognizer = keras_ocr.recognition.Recognizer()

# Load images and their labels
dataset_folder = 'Dataset'
image_file_filter = '*.jpg'
images_with_labels = helpers.load_images_from_folder(
	dataset_folder, image_file_filter)

# Perform OCR recognition on the input images
predicted_labels = []
for image_with_label in images_with_labels:
	# Each element is an (image, label) tuple; pass the image to the recognizer
	predicted_labels.append(recognizer.recognize(image_with_label[0]))

# Display results
rows = 4
cols = 2
font_size = 14
helpers.plot_results(images_with_labels, predicted_labels, rows, cols, font_size)

This script creates an instance of the Recognizer object from the keras_ocr.recognition module. It then loads the images and their labels from the attached test set (the Dataset folder). This dataset contains eight images randomly selected from the Synth90k synthetic word dataset. The script then runs optical character recognition (OCR) on each image in the dataset and displays the prediction results.

To load the images and their labels, I use the load_images_from_folder function, which I implemented in the helpers module. This method takes two parameters: the path to the image folder and a filter. Here I assume the images are in the Dataset subfolder, and I read all images in JPEG format (with a .jpg filename extension).

In the Synth90k dataset, each image file name contains the image label between underscores, for instance 199_pulpiest_61190.jpg. So, to get the image label, the load_images_from_folder function splits the file name by the underscore character and takes the middle element of the resulting collection of strings. Also note that load_images_from_folder returns an array of tuples, each containing an image and the corresponding label. For this reason, I pass only the first element of each tuple to the OCR handler.
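The label-extraction logic described above can be sketched as follows (label_from_filename is a hypothetical helper name used for illustration; in the article's code this logic lives inside load_images_from_folder):

```python
def label_from_filename(filename):
    # Synth90k file names look like "199_pulpiest_61190.jpg": the label
    # sits between the two underscores, so a split isolates it.
    return filename.split('_')[1]

print(label_from_filename('199_pulpiest_61190.jpg'))  # prints "pulpiest"
```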

For recognition, I use the Recognizer object's recognize method. It returns the predicted label, which I store in the predicted_labels collection.

Finally, I pass the collection of predicted labels, the images, and the source labels to another helper function, plot_results, which displays the images in a rows × cols rectangular grid. The grid's appearance can be changed via the corresponding variables.


After testing the OCR model, I implemented the camera class. This class uses the OpenCV library, which was installed along with the keras-ocr module. OpenCV provides a convenient programming interface for accessing the camera: you first initialize a VideoCapture object and then call its read method to get a camera image.

import cv2 as opencv

class camera(object):
	def __init__(self):
		# Initialize the camera capture
		self.camera_capture = opencv.VideoCapture(0)

	def capture_frame(self, ignore_first_frame):
		# Get frame, ignore the first one if needed
		if ignore_first_frame:
			self.camera_capture.read()
		(capture_status, current_camera_frame) = self.camera_capture.read()
		# Verify capture status
		if capture_status:
			return current_camera_frame
		else:
			# Print error to the console
			print('Capture error')

In this code, I have created a VideoCapture object in the initializer of the camera class. I am passing 0 to the VideoCapture object to point to the system’s default camera. Then I save the resulting object in the camera_capture field of the camera class.

To get images from the camera, I implemented the capture_frame method. It has an additional parameter, ignore_first_frame. When this parameter is True, I call the read method twice but ignore the result of the first call. The idea behind this is that the first frame returned by my camera is usually empty.

The second call to read returns the capture status and the frame. If the capture was successful (capture_status is True), I return the camera frame. Otherwise, I print a "Capture error" message.

Convert text to speech

The last element of this application is the TTS module. I decided to use the Festival system here because it can work offline. Other possible approaches to TTS are well documented in the Adafruit article Speech Synthesis on the Raspberry Pi.
To install Festival on your Raspberry Pi, run the following command:

sudo apt-get install festival -y

You can verify that everything is working correctly by entering the following command:

echo "Hello, Arm" | festival --tts

Your Raspberry Pi should say “Hello Arm.”
Festival provides an API; however, for simplicity, I decided to interact with Festival through the command line. For this purpose, the helpers module is supplemented with one more method:

import os

def say_text(text):
	os.system('echo ' + text + ' | festival --tts')

Putting it all together

Finally, we can put everything together. I did this in the final script:

import keras_ocr
import camera as cam
import helpers

if __name__ == "__main__":
	# Prepare recognizer
	recognizer = keras_ocr.recognition.Recognizer()

	# Get image from the camera
	camera = cam.camera()

	# Ignore the first frame, which is typically blank on my machine
	image = camera.capture_frame(True)

	# Perform recognition
	label = recognizer.recognize(image)

	# Perform TTS (speak label)
	helpers.say_text('The recognition result is: ' + label)

First, I create the OCR recognizer. Then I create a camera object and read a frame from the default webcam. The image is passed to the recognizer, and the resulting label is spoken by the TTS helper.


So, we have created a robust system that performs optical character recognition using deep learning and then conveys the results to users through a text-to-speech engine, relying on the pretrained keras-ocr package.

In a more complex scenario, OCR may be preceded by text detection: first, lines of text are detected in the image, and then each of them is recognized. This requires the text detection capabilities of the keras-ocr package, which build on this Keras CRNN implementation and the CRAFT text detection model published by Fausto Morales.

By extending the above application with text detection, you can create an RNN-powered IoT system that performs OCR to help visually impaired people read menus in restaurants or documents in government offices. Moreover, with a translation service added, such an application could serve as an automatic translator.

I would like to conclude this material with a quote of Arthur C. Clarke's third law:

Any sufficiently advanced technology is indistinguishable from magic.

If you follow it, you can safely say that at SkillFactory we teach people real magic; it's just called data science and machine learning.

