
Automate Bilingual eBook Translation with Python and DeepL API

This tutorial shows how to extract Kindle eBooks, convert them to HTML, clean the markup with BeautifulSoup, and batch‑translate the content line‑by‑line using DeepL API Pro, producing a bilingual eBook with minimal manual effort.

Python Crawling & Data Mining

Dec 10, 2021


Introduction

This article walks through a small project found on GitHub that automates eBook translation with the DeepL API. The example scripts translate English text into Chinese (target_language = "ZH"); translating in the other direction only requires a different target language code. The workflow covers extracting the eBook, converting formats, cleaning the HTML, and submitting each line for translation.

eBook Extraction and Format Conversion

First, the Kindle eBook is exported and its DRM removed with Epubor Ultimate, producing an .azw file that is then converted to .epub. Calibre then converts the .epub into an .htmlz archive, which is unpacked with the unzip command.
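The conversion and unpacking steps can also be scripted. The following is a minimal sketch, assuming Calibre's ebook-convert command-line tool is installed and on the PATH. The helper names convert_to_htmlz and unpack_htmlz and the file names in the comments are hypothetical. Because an .htmlz file is an ordinary ZIP archive, Python's zipfile module can stand in for the unzip command.

```python
import subprocess
import zipfile
from pathlib import Path


def convert_to_htmlz(epub_path, htmlz_path):
    """Run Calibre's ebook-convert; it picks the output format from the file extension."""
    subprocess.run(["ebook-convert", str(epub_path), str(htmlz_path)], check=True)


def unpack_htmlz(htmlz_path, target_dir):
    """Extract the .htmlz archive (a plain ZIP file) and return the sorted member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(htmlz_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


# Hypothetical usage (file names are for illustration only):
# convert_to_htmlz("John Law.epub", "John Law.htmlz")
# unpack_htmlz("John Law.htmlz", "John Law")
```

The unpacked folder should then contain the index.html file that the cleaning script reads.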

Why Use HTML for Translation

Preserves footnotes, endnotes, and hyperlinks.

DeepL’s tag_handling="xml" parameter correctly processes HTML tags.

CSS can control display styles flexibly.

JavaScript can be used to show language‑specific content.

The cleaned HTML can be converted to any eBook format later.
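As an illustration of the CSS point, here is a small sketch that generates a stylesheet hiding one language. It assumes the en and cn class names that the translation script attaches to each line; the function name language_css is hypothetical.

```python
def language_css(visible="both"):
    """Return CSS that shows the English (en) lines, the Chinese (cn) lines, or both."""
    hidden = {"en": ["cn"], "cn": ["en"], "both": []}
    if visible not in hidden:
        raise ValueError("visible must be 'en', 'cn', or 'both'")
    # One rule per class that should be hidden.
    return "\n".join(f".{cls} {{ display: none; }}" for cls in hidden[visible])


print(language_css("en"))  # .cn { display: none; }
```

The returned rule can be placed in the book's stylesheet to produce an English-only or Chinese-only edition from the same bilingual HTML file.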

Cleaning HTML with BeautifulSoup

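BeautifulSoup, an HTML parsing library best known from web scraping, is used to tidy the HTML. The script re-serializes the document, removes every newline, and then inserts a blank line before each heading, <div>, </div>, and <p> tag, so that each block element sits on its own line for the line-by-line translation step. Finally it writes the cleaned file.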

import re

import bs4

path = "John Law/"  # folder name ends with /
source_filename = "index.html"
target_filename = "index2.html"

with open(path + source_filename, encoding="utf-8") as html:
    htmltext = html.read()

# Re-serialize through BeautifulSoup to normalize the markup, then remove all "\n"
htmltext = str(bs4.BeautifulSoup(htmltext, "html.parser")).replace("\n", "")

# Add a blank line before headings, <div>, </div>, and <p> tags.
# Note: "<h" also matches <html>, <head>, and <hr>, and "<p" also matches <pre>.
for tag in ("<h", "<div", "</div>", "<p"):
    htmltext = re.sub(tag, r"\n\n" + tag, htmltext)

with open(path + target_filename, "w", encoding="utf-8") as file_save:
    file_save.write(htmltext)
print(htmltext)
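The plain patterns <h and <p also match tags such as <html>, <head>, <hr>, and <pre>. If that matters for a particular book, a stricter variant is possible. The sketch below is an alternative, not the original project's code. The function name add_blank_lines is hypothetical, and it assumes that only headings h1 to h6, div, and p elements should start a new block.

```python
import re

# Match <h1>..<h6>, <div>, </div>, and <p> openings, but not <html>, <head>, <hr>, or <pre>.
BLOCK_TAG = re.compile(r"<(?:h[1-6]|div|p)(?=[\s>])|</div>")


def add_blank_lines(htmltext):
    """Remove existing newlines, then put a blank line before every block tag."""
    flat = htmltext.replace("\n", "")
    return BLOCK_TAG.sub(lambda m: "\n\n" + m.group(0), flat)


sample = (
    '<html><head></head><body><div class="c"><h1>Title</h1>\n'
    '<p class="x">Text</p><pre>raw</pre></div></body></html>'
)
print(add_blank_lines(sample))
```

In this sample, the <html>, <head>, and <pre> tags stay where they are, while the <div>, <h1>, <p>, and </div> tags each start a new block.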

Translating Line‑by‑Line with DeepL API Pro

The cleaned HTML is read line by line. Each line between startline and endline is sent to DeepL via a GET request with tag_handling="xml". The script retries on errors, passes blank lines and lines outside that range through unchanged, keeps a single copy when the translation is identical to the source, and otherwise adds language-specific CSS classes (en for the original line, cn for the translated one).

import re
import time

import requests

auth_key = "<your DeepL API Pro authentication key>"
target_language = "ZH"

path = "John Law/"
source_filename = "index2.html"
target_filename = "index3.html"

def translate(text):
    # Send one line to DeepL; tag_handling="xml" keeps the HTML tags intact
    result = requests.get(
        "https://api.deepl.com/v2/translate",
        params={
            "auth_key": auth_key,
            "target_lang": target_language,
            "text": text,
            "tag_handling": "xml",
        },
    )
    return result.json()["translations"][0]["text"]

def add_language_tag_en(html):
    # Append "en" to the class list of a line-initial tag that already has a class
    pttn = re.compile(r'^<(.*?) class="(.*?)">', re.M)
    rpl = r'<\1 class="\2 en">'
    return re.sub(pttn, rpl, html)

def add_language_tag_cn(html):
    # Append "cn" to the class list of a line-initial tag that already has a class
    pttn = re.compile(r'^<(.*?) class="(.*?)">', re.M)
    rpl = r'<\1 class="\2 cn">'
    return re.sub(pttn, rpl, html)

with open(path + source_filename, "r", encoding="utf-8") as f:
    lines = f.readlines()
new_lines = []
line_count = 0
startline = 16    # first line to translate (1-based)
endline = 4032    # last line to translate

for line in lines:
    line_count += 1
    # Copy blank lines and lines outside the range without translating them
    if line_count < startline or line_count > endline or line.strip() == '':
        new_lines.append(line)
        continue
    succeeded = False
    while not succeeded:
        try:
            line_translated = translate(line)
            line_translated = line_translated.replace("\n", "")
            succeeded = True
        except Exception:
            # Retry until the request succeeds; pause briefly between attempts
            time.sleep(1)
    if line.strip() == line_translated.strip():
        # Nothing changed (e.g. markup only), so keep a single copy
        new_lines.append(line)
    else:
        line = add_language_tag_en(line)
        line_translated = add_language_tag_cn(line_translated)
        new_lines.append(line)
        new_lines.append(line_translated)

with open(path + target_filename, 'w', encoding="utf-8") as f:
    f.write("\n".join(new_lines))

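The retry loop in the script repeats forever if a request can never succeed, for example with an invalid key. A bounded retry with exponential backoff is one alternative. This is a generic sketch, not part of the original project: retry_with_backoff and its parameters are hypothetical names, and the wrapped function is passed in so that the helper can be tried out without calling DeepL.

```python
import time


def retry_with_backoff(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in for translate() that fails twice before succeeding.
calls = {"count": 0}


def flaky_translate(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return text.upper()


delays = []  # record the requested delays instead of actually sleeping
print(retry_with_backoff(flaky_translate, "hello", sleep=delays.append))  # HELLO
print(delays)  # [1.0, 2.0]
```

In the translation script, a call such as retry_with_backoff(translate, line) could take the place of the while loop, at the cost of stopping the run when a line fails on every attempt.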
Result

After running the scripts, the original HTML file is transformed into a bilingual version in which each English line (class en) is followed by its Chinese translation (class cn), ready to be repackaged into an eBook.
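To make the tagging step concrete, the sketch below applies the same regular expression as add_language_tag_en and add_language_tag_cn to one sample pair. The sample line and its translation are made up for illustration, and the single parametrized helper add_language_tag is a hypothetical refactoring. Note that the pattern only matches tags that start a line and already carry a class attribute; other lines pass through unchanged.

```python
import re

# Same pattern as the article's helpers: a tag at line start that already has a class attribute.
TAG_WITH_CLASS = re.compile(r'^<(.*?) class="(.*?)">', re.M)


def add_language_tag(html, lang):
    """Append lang to the class list of each matching tag."""
    return TAG_WITH_CLASS.sub(
        lambda m: f'<{m.group(1)} class="{m.group(2)} {lang}">', html
    )


original = '<p class="calibre1">Hello, world.</p>'
translated = '<p class="calibre1">你好，世界。</p>'  # made-up translation, not DeepL output

print(add_language_tag(original, "en"))    # <p class="calibre1 en">Hello, world.</p>
print(add_language_tag(translated, "cn"))  # <p class="calibre1 cn">你好，世界。</p>
print(add_language_tag("<p>Plain paragraph</p>", "en"))  # unchanged: no class attribute
```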

Another screenshot shows the final translated output.
