“Reading is the head of everything!” Translating .EPUB eBooks with Python

Let’s figure out what is “under the hood” of the EPUB format and how to translate the text of a book without translating the code in it. Along the way we will get to know the EbookLib library and find out why we also need the Beautiful Soup library.

While programming in the Russian-speaking segment of the Internet, I keep running into the fact that much of the literature on topics that interest me exists only in English. Sometimes there is a translation, but the industry changes very quickly: foreign authors regularly release updated editions, while translations often lag behind by 2-3 years, which is quite critical. I fully understand that one should be able to read such books and documentation in English, and I am working hard on that. Still, when reading a monumental book in the original language, you want to open a translation in the next window and check whether you have caught the author’s idea correctly.

So what is the problem? You can drop a PDF into any translator, and the browser will even translate a page automatically. But such translators generally do not recognize code in the text and translate it along with everything else. This is the main problem that prompted me to look for a solution and automate the whole process. That’s what the Python programming language is for.

How to translate

To translate the text, I used the Googletrans library and wrote a small function to make it easier to use.

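from googletrans import Translator

def translation_func(text):
    # Translate a piece of text into Russian and return the translated string
    translator = Translator()
    result = translator.translate(text, dest="ru")
    return result.text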
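
The function above creates a new Translator for every string and sends every string to the service, including blank and repeated ones. Here is a minimal sketch of one possible refinement. It assumes a hypothetical helper, make_cached_translator, which is not part of Googletrans and wraps any text-to-text callable; a stand-in translator is used so the sketch runs offline.

```python
from typing import Callable, Dict


def make_cached_translator(translate: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a translate callable so blank and repeated strings skip the service.

    Hypothetical helper for illustration; `translate` could be translation_func.
    """
    cache: Dict[str, str] = {}

    def cached(text: str) -> str:
        if not text or not text.strip():
            return text  # nothing to translate, keep whitespace as-is
        if text not in cache:
            cache[text] = translate(text)
        return cache[text]

    return cached


# Offline stand-in for the real service (an assumption for this sketch).
calls = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return text.upper()


cached_translate = make_cached_translator(fake_translate)
print(cached_translate("Chapter 1"))
print(cached_translate("Chapter 1"))  # second call is served from the cache
print(len(calls))
```

In the real script, make_cached_translator(translation_func) would be called once, and the returned function used wherever translation_func is called.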

This brings us to the subject of our study, one of the most popular e-book formats: EPUB. The thing is that a PDF generally does not carry structural information about its text, so there is no reliable way to tell code from prose. An EPUB, by contrast, contains a set of XHTML or HTML pages, and their markup makes it much easier to translate only the text we need.
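
Under the hood, an EPUB file is a ZIP archive, so its pages can be listed with the standard library alone. The sketch below builds a tiny EPUB-like archive in memory; the file names in it are illustrative assumptions, not taken from a real book.

```python
import io
import zipfile


def list_epub_entries(epub_file):
    """Return the file names stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()


# Build a tiny EPUB-like archive in memory so the sketch runs without a real book.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/Text/Chapter_1.xhtml", "<html><body><p>Hi</p></body></html>")
    archive.writestr("OEBPS/Styles/epub.css", "p { margin: 0; }")

buffer.seek(0)
entries = list_epub_entries(buffer)
print(entries)

# The pages that hold the text are the XHTML/HTML entries.
chapters = [name for name in entries if name.endswith((".xhtml", ".html"))]
print(chapters)
```

For a real book, list_epub_entries('book.epub') accepts a path as well.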

To see the structure of the e-book, I used Sigil, an EPUB editor.

There you can see what parts the document is divided into, what formats they are in (XHTML, HTML or PDF) and, most importantly, the markup: which tags contain the code and on what basis it can be excluded from translation.

Here is an example of such tags:

tag_exeption = ['code', 'a', 'strong', 'pre', 'span', 'html', 'div', 'body', 'head']

Now let’s use the EbookLib library; examples of its use can be found in its documentation.

Using the function ebooklib.epub.read_epub(), we read the file and get an instance of the class ebooklib.epub.EpubBook.

from ebooklib import epub 
book = epub.read_epub('book.epub')

All resources in an eBook (stylesheets, images, videos, sounds, scripts, and HTML files) are elements. They can be retrieved by type using the method ebooklib.epub.EpubBook.get_items_of_type().

Here is a list of the element types that can be used:

  • ITEM_UNKNOWN = 0

  • ITEM_IMAGE = 1

  • ITEM_STYLE = 2

  • ITEM_SCRIPT = 3

  • ITEM_NAVIGATION = 4

  • ITEM_VECTOR = 5

  • ITEM_FONT = 6

  • ITEM_VIDEO = 7

  • ITEM_AUDIO = 8

  • ITEM_DOCUMENT = 9

  • ITEM_COVER = 10

We will use the book.get_items() method, which returns an iterator over all elements of the book, objects of the class ebooklib.epub.EpubItem. For translation we need the navigation elements (ITEM_NAVIGATION = 4) and the chapters of the book, which are contained in elements of type ITEM_DOCUMENT = 9. To tell them apart by type, use the item.get_type() method.

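import ebooklib

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_NAVIGATION:  # 4
        ...  # handle the navigation element
    if item.get_type() == ebooklib.ITEM_DOCUMENT:  # 9
        ...  # handle a chapter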

We can also get the element name with item.get_name(), the unique identifier of the element with item.get_id(), and its content with item.get_content().

import ebooklib

for item in book.get_items():
    # Print the name, id and raw content of every chapter
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())
        print('----------------------------------')
        print('ID : ', item.get_id())
        print('----------------------------------')
        print(item.get_content())
        print('==================================')
...
==================================
NAME :  Text/Chapter_6.xhtml
----------------------------------
ID :  Chapter_6
----------------------------------
b'<?xml version="1.0" encoding="utf-8"?>\r\n<ncx version="2005-1" xmlns="http://www.daisy.org/z3986/2005/ncx/">\r\n<head>\r\n
==================================
...

Having received the contents of the chapter in XHTML format, it remains to separate the wheat from the chaff, that is, the text from the code. The Beautiful Soup library will help us with this. We get the soup object:

soup = BeautifulSoup(item.get_content(), features="xml")

Now we need to iterate through all the elements inside this object; for this we will use the .descendants attribute. It is useful because, unlike the .contents and .children attributes, which only consider direct children, it recursively iterates over the children of children as well. You can inspect each element through its attributes: .name is the tag name, and .attrs holds the tag attributes (class, id) as a dictionary.

for child in soup.descendants:
   if child.name and child.string:
       print(child.name, '->', child.attrs)
***
h1 -> {'class': 'chapterNumber'}
h1 -> {'class': 'chapterTitle', 'id': '_idParaDest-65'}
p -> {'class': 'normal'}
li -> {'class': 'bulletList'}
a -> {'href': 'https://github.com/example/tree/main/Chapter02'}
***
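
The difference between direct children and all descendants can be seen on a small fragment. This is a minimal sketch on made-up markup, using the built-in html.parser so that it does not require lxml.

```python
from bs4 import BeautifulSoup, Tag

# Made-up markup for illustration only.
markup = "<div><p>Hello <b>world</b></p><pre>x = 1</pre></div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# .children stops at the first level; .descendants walks the whole subtree.
direct = [c.name for c in div.children if isinstance(c, Tag)]
nested = [c.name for c in div.descendants if isinstance(c, Tag)]

print(direct)
print(nested)
```

The nested <b> tag shows up only in the second list, which is why .descendants is the attribute used for the translation loop.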

The .descendants attribute iterates through every individual element that the soup contains, including strings between tags and empty tags. The condition selects the elements we need: it excludes the tags listed in tag_exeption, bare text (which has no child.name) and tags that do not directly contain text (child.string is empty). The text obtained from the .string attribute is translated by translation_func(), and the translated text is assigned back to the child element through the same .string attribute.

for child in soup.descendants:
    if child.name not in tag_exeption and child.name and child.string:
        child.string = translation_func(child.string)
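
How .string behaves decides which tags this condition picks up, so it is worth a quick check on made-up markup. The sketch below uses the built-in html.parser and three illustrative paragraphs.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
markup = (
    "<p>Only text</p>"
    "<p>Run <code>pip install</code> first</p>"
    "<p><code>print(1)</code></p>"
)
soup = BeautifulSoup(markup, "html.parser")
plain, mixed, wrapped = soup.find_all("p")

print(plain.string)    # the paragraph's own text
print(mixed.string)    # None, because the tag has several children
print(wrapped.string)  # text taken from the single nested <code> tag
```

The third case may be worth guarding against: a paragraph that wraps a single code tag reports the code as its own .string, so it passes the condition even though code is listed among the excluded tags.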

Tags that do not directly contain text are processed separately through the .contents attribute, skipping nested tags (not content.name) as well as bare newlines and spaces ['\n', ' '].

elif child.name not in tag_exeption and child.name:
    new_contents = []  # collect translated strings for the whole tag
    for content in child.contents:
        # keep only real text: skip nested tags, bare newlines and spaces
        if content.string and content.string not in ['\n', ' '] and not content.name:
            translation_text = translation_func(content.string)
            new_contents.append(NavigableString(translation_text))
            new_contents.append(" ")
    child.clear()
    child.extend(new_contents)

Beautiful Soup stores text fragments in objects of the NavigableString class, so we wrap each translated string in this class. Then we clear the contents of the child with child.clear() and add the new objects to it with child.extend(new_contents).
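
Because child.clear() removes every child and only text is added back, nested tags inside such an element are not carried over. A variation, which is not the approach used in this article, is to replace each text node in place and leave the tags around it untouched. The sketch below assumes a hypothetical helper translate_text_nodes and uses str.upper as an offline stand-in for the translator.

```python
from bs4 import BeautifulSoup, NavigableString

# Tags whose text should stay untranslated (a shortened, illustrative list).
SKIP_TAGS = {"code", "pre"}


def translate_text_nodes(soup, translate):
    """Replace every text node outside SKIP_TAGS with its translation.

    Hypothetical helper for illustration; `translate` is any str -> str callable.
    """
    # Collect the nodes first so that replacing them does not disturb iteration.
    text_nodes = [
        node for node in soup.find_all(string=True)
        if node.strip() and not any(parent.name in SKIP_TAGS for parent in node.parents)
    ]
    for node in text_nodes:
        node.replace_with(NavigableString(translate(str(node))))
    return soup


soup = BeautifulSoup("<p>Run <code>pip install</code> now</p>", "html.parser")
translate_text_nodes(soup, str.upper)
result = str(soup)
print(result)
```

A real translation service may strip leading and trailing spaces, so spacing around inline tags might need to be restored separately.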

It remains to assign the new content, our soup object, to the book element using the .set_content() method, not forgetting to encode it to bytes first.

item.set_content(soup.encode())
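
What encoding does here can be checked in isolation. This is a small sketch on a made-up fragment, using the built-in html.parser: str(soup) gives text, while soup.encode() gives bytes, UTF-8 by default.

```python
from bs4 import BeautifulSoup

# Made-up fragment with non-ASCII text, as a translated chapter would have.
soup = BeautifulSoup("<p>Привет</p>", "html.parser")

as_text = str(soup)       # str, convenient for printing
as_bytes = soup.encode()  # bytes, UTF-8 by default

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes.decode("utf-8") == as_text)
```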

Additionally, I liked viewing the content of book elements in the browser using the open_in_browser(contents) function from the lxml library. For this, you first need to parse the content with the utility utils.parse_string(item.get_content()) from EbookLib.

import lxml.html as html
from ebooklib import epub, utils
…
contents = utils.parse_string(item.get_content())
html.open_in_browser(contents)

And the last thing we need is to save the translated book.

epub.write_epub('new_book.epub', book, {})

The whole code looks like this:

from googletrans import Translator
from ebooklib import epub, utils
from bs4 import BeautifulSoup, NavigableString
import lxml.html as html

def open_epub():
    tag_exeption = ["code", 'a', 'strong', 'pre', 'span', 'html',
                    'div', 'body', "head"]
    book = epub.read_epub('Django 4 By Example 2022.epub')
    for item in book.get_items():
        if item.get_id() == "Chapter_7":
            print('NAME : ', item.get_name())
            print('----------------------------------')
            print('ID : ', item.get_id())
            print('----------------------------------')
            print('ITEM : ', item.get_type())

            soup = BeautifulSoup(item.get_content(), features="xml")

            for child in soup.descendants:
                if child.name not in tag_exeption and child.name and child.string:
                    tag_text_before = child.string
                    translation_text = translation_func(tag_text_before)
                    child.string = translation_text
                elif child.name not in tag_exeption and child.name:
                    new_contents = []
                    for content in child.contents:
                        if content.string and content.string not in ['\n', ' '] and not content.name:
                            content = NavigableString(translation_func(content.string))
                            new_contents.append(content)
                            new_contents.append(" ")
                    child.clear()
                    child.extend(new_contents)
            item.set_content(soup.encode())
            contents = utils.parse_string(item.get_content())
            html.open_in_browser(contents)
            print('==================================')
    epub.write_epub('new_book.epub', book, {})

def translation_func(text):
    translator = Translator()
    result = translator.translate(text, dest="ru")
    return result.text

def main():
    open_epub()

if __name__ == "__main__":
    main()

In addition, a note about CSS files. They can be read from the book as elements of type ITEM_STYLE = 2, or viewed in Sigil: they are linked in the <head> of each book element and are stored in the Styles folder.

<head>
  <title>Example book</title>
  <link href="Styles/epub.css" rel="stylesheet" type="text/css"/>
  <link href="Styles/syntax-highlighting.css" rel="stylesheet" type="text/css"/>
</head>

After the elements of the book are overwritten, the links to the CSS files in the <head> disappear. They can be restored in Sigil: select all the elements of the book in the Text folder, right-click, and choose “Link to style sheet …” in the context menu.

That’s all! Our book is ready!

In conclusion, I would like to say that automating processes is interesting: it broadens your general knowledge, teaches you to work with different libraries and to look “under the hood”, and simply adds variety to the routine.
