Translate Full PDFs While Preserving Layout Using LLMs – Core Code Included

This article presents a two‑stage, cache‑enabled pipeline that extracts text blocks from a PDF with PyMuPDF, translates them via a large‑language‑model API, and re‑renders each page as an image with Chinese text overlaid to keep the original layout, along with full Python code and usage instructions.

Old Zhang's AI Learning

The author compares three recent PDF‑translation pipelines. Gemini‑3‑Pro extracts the text, translates it, rebuilds a Markdown document, and renders it through HTML to PDF. Claude‑Opus‑4.5 converts PDF → DOCX, translates the DOCX, and converts it back to PDF. Both are simple but lose the original styling. The GPT‑5.2‑Codex solution proposed here aims to preserve the original layout with a “two‑stage + cache” strategy.

Two‑Stage + Cache Strategy

Text extraction and translation stage

Use PyMuPDF to open the PDF and extract each text block’s bounding box, font size, and content.

Send the block text to a model API (e.g., MiniMax‑M2) with a system prompt “Translate English to Simplified Chinese. Keep line breaks. Return JSON.”

Store translations in a JSON cache keyed by block ID to allow interruption and resume.
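
The `translate_blocks` function in the listing further down keeps its cache only in memory. A minimal sketch of a persistent, resumable cache could look like the following. The helper names `load_cache`, `save_cache`, and `translate_with_cache` are illustrative rather than the author's, and `translate_batch` is assumed to be any callable that takes and returns a list of `{"id", "text"}` items, as `call_translate` does once its API arguments are bound.

```python
import json
import os
import tempfile
from pathlib import Path


def load_cache(cache_path):
    """Return {block_id: translation}; empty if the file does not exist yet."""
    path = Path(cache_path)
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so convert them back to int IDs.
    return {int(k): v for k, v in raw.items()}


def save_cache(cache_path, cache):
    """Write to a temp file and rename it, so an interrupted run cannot corrupt the cache."""
    path = Path(cache_path)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({str(k): v for k, v in cache.items()}, f, ensure_ascii=False, indent=2)
    os.replace(tmp, path)


def translate_with_cache(blocks, translate_batch, cache_path, batch_size=8):
    """Translate only the blocks that are not in the cache yet."""
    cache = load_cache(cache_path)
    pending = [(i, b["text"]) for i, b in enumerate(blocks) if i not in cache]
    for start in range(0, len(pending), batch_size):
        items = [{"id": i, "text": t} for i, t in pending[start:start + batch_size]]
        for it in translate_batch(items):
            cache[int(it["id"])] = it["text"]
        save_cache(cache_path, cache)  # persist after every batch
    return cache
```

Saving after each batch means a crash loses at most one batch of work; a second run with the same cache file sends nothing to the API for blocks that are already translated.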

Re‑layout and export stage

Render each original page to a bitmap (avoiding font‑embedding issues that cause question‑mark glyphs).

Cover the original English block area with a white rectangle.

Scale the Chinese text to fit the original block dimensions and draw it with a suitable CJK font.

Save all processed pages as a new PDF.
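
The `build_pdf` listing further down draws each translation at the original font size and does not shrink it. One way the "scale to fit" step could be sketched is a simple step-down search, shown here under assumptions: `fit_font_size` and `make_pillow_measure` are hypothetical helpers, not part of the author's script, and sizes are in pixels (that is, already multiplied by `dpi / 72`).

```python
def fit_font_size(text, box_w, box_h, start_size, min_size, measure):
    """Return the largest integer size in [min_size, start_size] whose text fits the box.

    measure(text, size) must return (width, height) in pixels.
    Falls back to min_size when no size fits.
    """
    floor = int(min_size)
    size = max(int(start_size), floor)
    while size > floor:
        w, h = measure(text, size)
        if w <= box_w and h <= box_h:
            return size
        size -= 1
    return floor


def make_pillow_measure(fontfile):
    """Build a measure() callable backed by Pillow (imported lazily)."""
    from PIL import Image, ImageDraw, ImageFont

    draw = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    fonts = {}

    def measure(text, size):
        if size not in fonts:
            fonts[size] = ImageFont.truetype(fontfile, size)
        left, top, right, bottom = draw.multiline_textbbox((0, 0), text, font=fonts[size])
        # Extent from the drawing origin, which is what has to fit inside the box.
        return right, bottom

    return measure
```

Inside the page loop, the size passed to `get_font` could then come from `fit_font_size(text, x1 - x0, y1 - y0, size * scale, min_font_size * scale, measure)`. This sketch only shrinks the font; it does not re-wrap long lines.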

The only remaining imperfection is a persistent white background behind the Chinese text, which the author could not eliminate after multiple attempts.

Core Python Implementation

import json
import re

import fitz  # PyMuPDF
import requests
from PIL import Image, ImageDraw, ImageFont

def extract_blocks(input_pdf):
    """Collect every non-empty text block with its page, bbox, text and font size."""
    doc = fitz.open(str(input_pdf))
    blocks = []
    for p in range(len(doc)):
        d = doc[p].get_text("dict")
        for b in d.get("blocks", []):
            # Block type 0 is text; skip image blocks.
            if b.get("type") != 0:
                continue
            lines, sizes = [], []
            for line in b.get("lines", []):
                spans = line.get("spans", [])
                t = "".join(s.get("text", "") for s in spans)
                if t.strip():
                    lines.append(t)
                for s in spans:
                    if "size" in s:
                        sizes.append(s["size"])
            # Keep the block's line breaks so the translation can mirror them.
            text = "\n".join(lines).strip()
            if not text:
                continue
            blocks.append({
                "page": p,
                "bbox": b.get("bbox"),
                "text": text,
                # Middle entry of the collected span sizes (in document order).
                "font_size": sizes[len(sizes)//2] if sizes else 10
            })
    return blocks

def call_translate(api_base, model, api_key, items):
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Translate English to Simplified Chinese. Keep line breaks. Return JSON."},
            {"role": "user", "content": json.dumps({"items": items}, ensure_ascii=False)}
        ],
        "temperature": 0.2,
        "response_format": {"type": "json_object"}
    }
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    r = requests.post(f"{api_base.rstrip('/')}/chat/completions", json=payload, headers=headers, timeout=120)
    r.raise_for_status()
    content = r.json()["choices"][0]["message"]["content"]
    m = re.search(r"\{.*\}\s*$", content, re.S)
    if m:
        content = m.group(0)
    return json.loads(content)["items"]

def translate_blocks(blocks, api_base, model, api_key, batch_size=8):
    cache = {}
    pending = [(i, b["text"]) for i, b in enumerate(blocks)]
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i+batch_size]
        items = [{"id": idx, "text": txt} for idx, txt in batch]
        translated = call_translate(api_base, model, api_key, items)
        for it in translated:
            cache[int(it["id"])] = it["text"]
    return cache

def build_pdf(input_pdf, output_pdf, blocks, translations, fontfile, dpi=200, min_font_size=3):
    doc = fitz.open(str(input_pdf))
    scale = dpi / 72.0
    font_cache = {}
    def get_font(size):
        key = int(size * scale)
        if key not in font_cache:
            font_cache[key] = ImageFont.truetype(fontfile, key)
        return font_cache[key]
    pages = []
    for p in range(len(doc)):
        pix = doc[p].get_pixmap(dpi=dpi, alpha=False)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        draw = ImageDraw.Draw(img)
        for idx, b in enumerate(blocks):
            if b["page"] != p:
                continue
            rect = fitz.Rect(b["bbox"])
            x0, y0, x1, y1 = rect.x0*scale, rect.y0*scale, rect.x1*scale, rect.y1*scale
            draw.rectangle([x0, y0, x1, y1], fill=(255,255,255))
            text = translations.get(idx, b["text"]).replace("\t", "    ")
            size = max(float(b.get("font_size") or 10), min_font_size)
            font = get_font(size)
            draw.text((x0, y0), text, font=font, fill=(0,0,0))
        pages.append(img)
    first, rest = pages[0], pages[1:]
    first.save(str(output_pdf), "PDF", save_all=True, append_images=rest, resolution=dpi)
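
In the listing above, `call_translate` returns whatever items the model produces, and `build_pdf` falls back to the English text for any block ID that has no translation. If the model drops or renumbers an item, that block silently stays untranslated. A small check like the one below could catch this per batch; `check_batch` is a hypothetical helper, and how the missing IDs are retried is left open.

```python
def check_batch(requested, returned):
    """Split a model reply into usable translations and IDs that need a retry.

    requested: list of {"id": int, "text": str} items sent to the model
    returned:  list of items parsed from the model's JSON reply
    """
    wanted = {int(it["id"]) for it in requested}
    good = {}
    for it in returned:
        try:
            idx = int(it["id"])
            text = it["text"]
        except (KeyError, TypeError, ValueError):
            continue  # malformed item: ignore it, the ID stays missing
        # Accept only IDs we asked for, with a non-empty string translation.
        if idx in wanted and isinstance(text, str) and text.strip():
            good[idx] = text
    missing = sorted(wanted - set(good))
    return good, missing
```

The `missing` list can be sent again in a smaller batch or logged, so gaps show up before the PDF is rebuilt.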

Usage Instructions

Install dependencies:

python3 -m pip install pymupdf pillow requests

Set the API key (example):

export SILICONFLOW_API_KEY="YOUR_API_KEY"

Optional configuration:

export SILICONFLOW_API_BASE="https://api.siliconflow.cn/v1"
export SILICONFLOW_MODEL="MiniMaxAI/MiniMax-M2"

Run the script with the minimal command:

python3 -B /path/to/translate_pdf.py \
  --input-pdf /absolute/path/to/source.pdf

Or specify output and working directories:

python3 -B /path/to/translate_pdf.py \
  --input-pdf /absolute/path/to/source.pdf \
  --output-pdf /absolute/path/to/source-zh.pdf \
  --work-dir /absolute/path/to/tmp/pdfs

Specify a CJK font to improve rendering (recommended):

python3 -B /path/to/translate_pdf.py \
  --input-pdf /absolute/path/to/source.pdf \
  --font /System/Library/Fonts/Supplemental/Songti.ttc
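
The article does not show the script's argument handling. The flags and environment variables used in the commands above suggest something like the following `argparse` sketch. The default values are assumptions: `--dpi` and `--min-font-size` mirror the `build_pdf` signature, the output name follows the `source-zh.pdf` example, the cache file name follows the `<stem>.translations.json` pattern mentioned under Common Issues, and falling back to the listed API base and model is a guess.

```python
import argparse
import os
from pathlib import Path


def parse_args(argv=None):
    ap = argparse.ArgumentParser(description="Translate a PDF while keeping its layout")
    ap.add_argument("--input-pdf", required=True, type=Path)
    ap.add_argument("--output-pdf", type=Path, default=None)
    ap.add_argument("--work-dir", type=Path, default=None)
    ap.add_argument("--font", default=None, help="path to a CJK font file")
    ap.add_argument("--dpi", type=int, default=200)
    ap.add_argument("--min-font-size", type=float, default=3)
    args = ap.parse_args(argv)

    stem = args.input_pdf.stem
    # Assumed defaults: write the output and the cache next to the input file.
    if args.output_pdf is None:
        args.output_pdf = args.input_pdf.with_name(f"{stem}-zh.pdf")
    if args.work_dir is None:
        args.work_dir = args.input_pdf.parent
    args.cache_path = args.work_dir / f"{stem}.translations.json"

    # API settings are read from the environment variables shown earlier.
    args.api_key = os.environ.get("SILICONFLOW_API_KEY", "")
    args.api_base = os.environ.get("SILICONFLOW_API_BASE", "https://api.siliconflow.cn/v1")
    args.model = os.environ.get("SILICONFLOW_MODEL", "MiniMaxAI/MiniMax-M2")
    return args
```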

Common Issues

Question marks or garbled characters – caused by missing Chinese font support; fix by providing a suitable CJK font via --font.

Very small or cramped Chinese text – the translated text may need more room than the original English block provides, so the original block height may be insufficient; increase --dpi or adjust --min-font-size.

Slow re‑run – without a cache the script re‑translates all text; keep the <stem>.translations.json cache file so that already translated blocks are reused and an interrupted run can resume.

English text not fully hidden – the script forces a white background over detected text blocks; any remaining English is likely inside images and not recognized as text.
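
In PyMuPDF's `get_text("dict")` output, text blocks have `type` 0 and image blocks have `type` 1, and `extract_blocks` skips everything that is not type 0. To see where untranslated English might remain, the image blocks on a page can be listed. The helper below is a hypothetical addition that works on the dictionary alone, so it can be tried without opening a PDF.

```python
def image_block_boxes(page_dict):
    """Return the bounding boxes of image blocks in a get_text("dict") result."""
    return [
        tuple(b["bbox"])
        for b in page_dict.get("blocks", [])
        if b.get("type") == 1 and "bbox" in b
    ]


# Hand-written sample in the same shape as a page dictionary.
sample = {
    "blocks": [
        {"type": 0, "bbox": (10, 10, 200, 40), "lines": []},
        {"type": 1, "bbox": (10, 60, 300, 260)},
    ]
}
boxes = image_block_boxes(sample)
```

With a real document, `page_dict` would be `doc[p].get_text("dict")`, and pages with non-empty results are the ones to inspect by eye.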
