Translate Full PDFs While Preserving Layout Using LLMs – Core Code Included

This article presents a two‑stage, cache‑enabled pipeline that extracts text blocks from a PDF with PyMuPDF, translates them via a large‑language‑model API, and re‑renders each page as an image with Chinese text overlaid to keep the original layout, along with full Python code and usage instructions.

Old Zhang's AI Learning

Feb 14, 2026

The author compares three recent PDF‑translation pipelines. Gemini‑3‑Pro extracts the text, translates it, and rebuilds a Markdown document before rendering HTML to PDF. Claude‑Opus‑4.5 converts PDF → DOCX, translates, then converts DOCX → PDF. Both are simple but lose the original styling. The proposed GPT‑5.2‑Codex solution aims to preserve the original layout by following a “two‑stage + cache” strategy.

Two‑Stage + Cache Strategy

Text extraction and translation stage

Use PyMuPDF to open the PDF and extract each text block’s bounding box, font size, and content.

Send the block text to a model API (e.g., MiniMax‑M2) with a system prompt “Translate English to Simplified Chinese. Keep line breaks. Return JSON.”

Store translations in a JSON cache keyed by block ID to allow interruption and resume.
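
The resume behaviour in the last step could look roughly like the sketch below. The helper names (load_cache, save_cache, translate_with_cache) and the translate_batch callback are hypothetical and do not come from the original script. The sketch also assumes the cache is a flat JSON object, so block IDs become string keys on disk.

```python
import json
import os
import tempfile
from pathlib import Path


def load_cache(path):
    """Return the cached translations, or an empty dict if no cache exists yet."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def save_cache(path, cache):
    """Write to a temp file, then rename, so an interruption cannot leave a half-written cache."""
    p = Path(path)
    fd, tmp = tempfile.mkstemp(dir=str(p.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)
    os.replace(tmp, p)


def translate_with_cache(blocks, cache_path, translate_batch, batch_size=8):
    """Translate only blocks missing from the cache, saving after every batch.

    translate_batch is a hypothetical callback: it takes a list of
    {"id": ..., "text": ...} items and returns a list of the same shape.
    """
    cache = load_cache(cache_path)
    pending = [(i, b["text"]) for i, b in enumerate(blocks) if str(i) not in cache]
    for start in range(0, len(pending), batch_size):
        items = [{"id": i, "text": t} for i, t in pending[start:start + batch_size]]
        for it in translate_batch(items):
            cache[str(it["id"])] = it["text"]
        save_cache(cache_path, cache)  # progress so far survives a crash or Ctrl+C
    return cache
```

Saving after every batch costs a little extra I/O, but an interrupted run then loses at most one batch of API calls.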

Re‑layout and export stage

Render each original page to a bitmap (avoiding font‑embedding issues that cause question‑mark glyphs).

Cover the original English block area with a white rectangle.

Scale the Chinese text to fit the original block dimensions and draw it with a suitable CJK font.

Save all processed pages as a new PDF.
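
The text-scaling step can be sketched as a loop that shrinks the font until the wrapped text fits the block. This is only an illustration under stated assumptions, not the author's code: wrap_to_width, fit_font_size, and the make_measure callback are hypothetical names, wrapping is done per character (reasonable for Chinese, crude for mixed text), and the line height is approximated as font size × 1.2.

```python
def wrap_to_width(text, max_width, measure):
    """Greedy per-character wrap; measure(s) returns the rendered width of s."""
    lines = []
    for para in text.split("\n"):
        cur = ""
        for ch in para:
            if cur and measure(cur + ch) > max_width:
                lines.append(cur)
                cur = ch
            else:
                cur += ch
        lines.append(cur)  # keeps empty paragraphs as blank lines
    return lines


def fit_font_size(text, box_w, box_h, start_size, min_size, make_measure, line_spacing=1.2):
    """Shrink the font one unit at a time until the wrapped text fits the box.

    make_measure(size) must return a width function for that font size.
    Returns (size, lines); at min_size the text is returned even if it overflows.
    """
    size = int(start_size)
    while True:
        lines = wrap_to_width(text, box_w, make_measure(size))
        if len(lines) * size * line_spacing <= box_h or size <= min_size:
            return size, lines
        size -= 1
```

With Pillow, make_measure could be something like lambda size: ImageFont.truetype(fontfile, size).getlength, and the returned lines can be joined with newlines before drawing. All sizes here are in pixels, so the block dimensions and font size need to be multiplied by the DPI scale first.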

The only remaining imperfection is a persistent white background behind the Chinese text, which the author could not eliminate after multiple attempts.
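
One idea that might reduce the visible white patches, offered as an untested suggestion rather than anything from the original script, is to fill each block with the colour that surrounds it instead of pure white. The sketch below picks the most common pixel colour in a thin frame just outside the block. The function name sample_background and the margin parameter are hypothetical, and it only assumes a Pillow-style image object with .size and .getpixel((x, y)).

```python
from collections import Counter


def sample_background(img, box, margin=2):
    """Return the most common pixel colour in a thin frame just outside box.

    img is anything with Pillow-style .size and .getpixel((x, y));
    box is (x0, y0, x1, y1) in pixel coordinates.
    """
    width, height = img.size
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    # Expand the box by margin, clamped to the image bounds.
    ox0, oy0 = max(x0 - margin, 0), max(y0 - margin, 0)
    ox1, oy1 = min(x1 + margin, width), min(y1 + margin, height)
    counts = Counter()
    for y in range(oy0, oy1):
        if y0 <= y < y1:
            # Rows crossing the block: only the strips left and right of it.
            xs = list(range(ox0, x0)) + list(range(x1, ox1))
        else:
            # Rows above or below the block: the full expanded width.
            xs = range(ox0, ox1)
        for x in xs:
            counts[img.getpixel((x, y))] += 1
    if not counts:
        return (255, 255, 255)  # no frame available; fall back to white
    return counts.most_common(1)[0][0]
```

The result could be passed as the fill colour when covering the block. This only helps on flat-coloured backgrounds; gradients or images behind the text would still show a visible patch.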

Core Python Implementation

import json
import re
import fitz
import requests
from PIL import Image, ImageDraw, ImageFont

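def extract_blocks(input_pdf):
    """Collect every text block with its page index, bounding box, text, and a representative font size."""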
    doc = fitz.open(str(input_pdf))
    blocks = []
    for p in range(len(doc)):
        d = doc[p].get_text("dict")
        for b in d.get("blocks", []):
            if b.get("type") != 0:
                continue
            lines, sizes = [], []
            for line in b.get("lines", []):
                spans = line.get("spans", [])
                t = "".join(s.get("text", "") for s in spans)
                if t.strip():
                    lines.append(t)
                for s in spans:
                    if "size" in s:
                        sizes.append(s["size"])
            text = "\n".join(lines).strip()
            if not text:
                continue
            blocks.append({
                "page": p,
                "bbox": b.get("bbox"),
                "text": text,
                "font_size": sorted(sizes)[len(sizes)//2] if sizes else 10  # median span size, default 10
            })
    return blocks

def call_translate(api_base, model, api_key, items):
    """Send one batch of items to the /chat/completions endpoint and return the translated items."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Translate English to Simplified Chinese. Keep line breaks. Return JSON."},
            {"role": "user", "content": json.dumps({"items": items}, ensure_ascii=False)}
        ],
        "temperature": 0.2,
        "response_format": {"type": "json_object"}
    }
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    r = requests.post(f"{api_base.rstrip('/')}/chat/completions", json=payload, headers=headers, timeout=120)
    r.raise_for_status()
    content = r.json()["choices"][0]["message"]["content"]
    m = re.search(r"\{.*\}\s*$", content, re.S)
    if m:
        content = m.group(0)
    return json.loads(content)["items"]

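def translate_blocks(blocks, api_base, model, api_key, batch_size=8):
    """Translate all blocks in batches; returns {block index: translated text}, held in memory only."""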
    cache = {}
    pending = [(i, b["text"]) for i, b in enumerate(blocks)]
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i+batch_size]
        items = [{"id": idx, "text": txt} for idx, txt in batch]
        translated = call_translate(api_base, model, api_key, items)
        for it in translated:
            cache[int(it["id"])] = it["text"]
    return cache

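def build_pdf(input_pdf, output_pdf, blocks, translations, fontfile, dpi=200, min_font_size=3):
    """Rasterize each page, paint each text block white, draw the translation, and save an image-only PDF."""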
    doc = fitz.open(str(input_pdf))
    scale = dpi / 72.0
    font_cache = {}
    def get_font(size):
        key = int(size * scale)
        if key not in font_cache:
            font_cache[key] = ImageFont.truetype(fontfile, key)
        return font_cache[key]
    pages = []
    for p in range(len(doc)):
        pix = doc[p].get_pixmap(dpi=dpi, alpha=False)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        draw = ImageDraw.Draw(img)
        for idx, b in enumerate(blocks):
            if b["page"] != p:
                continue
            rect = fitz.Rect(b["bbox"])
            x0, y0, x1, y1 = rect.x0*scale, rect.y0*scale, rect.x1*scale, rect.y1*scale
            draw.rectangle([x0, y0, x1, y1], fill=(255,255,255))
            text = translations.get(idx, b["text"]).replace("\t", "    ")
            size = max(float(b.get("font_size") or 10), min_font_size)
            font = get_font(size)
            draw.text((x0, y0), text, font=font, fill=(0,0,0))
        pages.append(img)
    first, rest = pages[0], pages[1:]
    first.save(str(output_pdf), "PDF", save_all=True, append_images=rest, resolution=dpi)
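
The regular expression in call_translate assumes the reply ends with a well-formed JSON object whose items all carry a usable id and text. A more defensive parser might look like the sketch below. It is an assumption about how one could harden that step, not part of the original script, and parse_items is a hypothetical name. Blocks missing from the result would simply fall back to the original text, as build_pdf already does via translations.get.

```python
import json


def parse_items(content, expected_ids):
    """Extract {"items": [...]} from a model reply and keep only usable entries.

    Returns {id: text} for ids that were actually requested; anything
    malformed is skipped instead of raising.
    """
    # Take the outermost {...} span, which tolerates code fences or chatter around it.
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        data = json.loads(content[start:end + 1])
    except json.JSONDecodeError:
        return {}
    items = data.get("items") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return {}
    wanted = set(expected_ids)
    result = {}
    for it in items:
        if not isinstance(it, dict):
            continue
        try:
            idx = int(it.get("id"))
        except (TypeError, ValueError):
            continue
        text = it.get("text")
        if idx in wanted and isinstance(text, str) and text.strip():
            result[idx] = text
    return result
```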

Usage Instructions

Install dependencies:

python3 -m pip install pymupdf pillow requests

Set the API key (example):

export SILICONFLOW_API_KEY="YOUR_API_KEY"

Optional configuration:

export SILICONFLOW_API_BASE="https://api.siliconflow.cn/v1"
export SILICONFLOW_MODEL="MiniMaxAI/MiniMax-M2"
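
On the Python side, these variables might be read as in the sketch below. The function name load_config is hypothetical, and treating the two optional values shown above as the built-in defaults is an assumption, since the full script is not reproduced here.

```python
import os


def load_config(env=None):
    """Read API settings from the environment (or from a dict, for testing)."""
    env = os.environ if env is None else env
    api_key = env.get("SILICONFLOW_API_KEY")
    if not api_key:
        raise RuntimeError("SILICONFLOW_API_KEY is not set")
    return {
        "api_key": api_key,
        # Assumed defaults: the values from the optional exports above.
        "api_base": env.get("SILICONFLOW_API_BASE", "https://api.siliconflow.cn/v1").rstrip("/"),
        "model": env.get("SILICONFLOW_MODEL", "MiniMaxAI/MiniMax-M2"),
    }
```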

Run the script with the minimal command:

python3 -B /path/to/translate_pdf.py \
  --input-pdf /absolute/path/to/source.pdf

Or specify output and working directories:

python3 -B /path/to/translate_pdf.py \
  --input-pdf /absolute/path/to/source.pdf \
  --output-pdf /absolute/path/to/source-zh.pdf \
  --work-dir /absolute/path/to/tmp/pdfs

Specify a CJK font to improve rendering (recommended):

python3 -B /path/to/translate_pdf.py \
  --input-pdf /absolute/path/to/source.pdf \
  --font /System/Library/Fonts/Supplemental/Songti.ttc
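
The command-line interface itself is not part of the listing above. A minimal argparse sketch consistent with the flags used in this article could look like this. The flag names come from the commands and the Common Issues section. The defaults for --dpi and --min-font-size mirror build_pdf. The default output name (<stem>-zh.pdf) and the cache location are guesses based on the examples, and build_parser and resolve_paths are hypothetical helpers.

```python
import argparse
from pathlib import Path


def build_parser():
    p = argparse.ArgumentParser(description="Translate a PDF while keeping its layout")
    p.add_argument("--input-pdf", required=True, type=Path)
    p.add_argument("--output-pdf", type=Path, default=None)
    p.add_argument("--work-dir", type=Path, default=None)
    p.add_argument("--font", default=None, help="path to a CJK font file")
    p.add_argument("--dpi", type=int, default=200)
    p.add_argument("--min-font-size", type=float, default=3)
    return p


def resolve_paths(args):
    """Derive the output PDF and translation-cache paths (assumed naming scheme)."""
    src = args.input_pdf
    output = args.output_pdf or src.with_name(f"{src.stem}-zh.pdf")
    work_dir = args.work_dir or src.parent
    cache = work_dir / f"{src.stem}.translations.json"
    return output, cache
```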

Common Issues

Question marks or garbled characters – caused by missing Chinese font support; fix by providing a suitable CJK font via --font.

Very small or cramped Chinese text – the translated text may need more room than the original block provides at the original font size; increase --dpi or adjust --min-font-size.

Slow re‑run – without the cache, the script re‑translates all text; keep the <stem>.translations.json cache file so the script can resume from it.

English text not fully hidden – the script forces a white background over detected text blocks; any remaining English is likely inside images and not recognized as text.

