How BabelDOC Preserves PDF Layout While Translating & OneAIFW Shields Your Data
Two open‑source projects—BabelDOC, a Python‑based PDF translator that retains original formatting using AI models, and OneAIFW, a Zig‑and‑Rust local AI firewall that anonymizes sensitive data before LLM queries—offer practical, privacy‑preserving solutions for researchers and developers.
BabelDOC: PDF translation with layout preservation
BabelDOC is a Python‑based open‑source utility that parses the structural elements of a PDF (titles, body text, figures, captions, formulas and tables), translates the extracted text with large language models, and reinserts the translated strings into their original positions. The process preserves the original pagination, column layout and visual formatting, making it suitable for academic papers and technical reports.
Key technical features
Concurrent pipeline: structural parsing and LLM‑based translation run in parallel rather than sequentially.
Bilingual side‑by‑side view aligns the source language on the left with the target language on the right.
Supports OpenAI‑compatible APIs (e.g., GPT‑4o, DeepSeek, Qwen) for high‑quality, domain‑aware translation.
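The parse → translate → reinsert flow described above can be sketched in plain Python. This is a minimal illustration only: the block types, the translator callback, and the bounding-box representation are assumptions for the sketch, not BabelDOC's actual internal API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Block:
    """A positioned text element extracted from a PDF page."""
    kind: str    # e.g. "title", "body", "caption"
    bbox: tuple  # (x0, y0, x1, y1): original position on the page
    text: str

def translate_blocks(blocks: list[Block],
                     translate: Callable[[str], str]) -> list[Block]:
    """Translate each block's text while leaving kind and bbox untouched,
    so translated strings can be reinserted at their original positions."""
    return [Block(b.kind, b.bbox, translate(b.text)) for b in blocks]

# Toy stand-in for an LLM translator, for demonstration only.
fake_llm = lambda s: {"Introduction": "引言"}.get(s, s)

page = [Block("title", (72, 720, 540, 750), "Introduction")]
translated = translate_blocks(page, fake_llm)
print(translated[0].text, translated[0].bbox)  # 引言 (72, 720, 540, 750)
```

Keeping the geometry separate from the text is what lets the translated output preserve pagination and column layout.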
Installation
Clone the repository and inspect the CLI help:
git clone https://github.com/funstory-ai/BabelDOC
cd BabelDOC
uv run babeldoc --help

Install the package with either uv or pip:

uv tool install babeldoc
pip install babeldoc

Translation command example (DeepSeek model):
babeldoc \
--files paper.pdf \
--openai \
--openai-model "deepseek-chat" \
--openai-base-url "https://api.deepseek.com" \
--openai-api-key "sk-YOUR_KEY" \
--lang-out zh-CN

Repository: https://github.com/funstory-ai/BabelDOC (latest release v0.5.22, AGPL‑3.0 license, >480 k PyPI downloads)
OneAIFW: Local AI firewall for zero data leakage
OneAIFW is a lightweight open‑source AI firewall written in Zig and Rust. It intercepts outgoing LLM requests, replaces detected personally identifiable information (PII) with unique placeholders, forwards the anonymized prompt, and restores the original values in the model’s response. All processing occurs locally, preventing raw sensitive data from leaving the host.
Core principle
Before a prompt is sent, the engine scans for entities such as email addresses, phone numbers, bank card numbers and cryptographic keys. Each match is substituted with a token like __PII_EMAIL_ADDRESS_00000001__. After the LLM returns a reply, the firewall post‑processes the text, replacing each token with the original value.
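The substitution round trip can be sketched as follows. This is an illustrative Python version under stated assumptions: OneAIFW's real engine is Zig/Rust, covers many more entity types, and does not use this API; only the placeholder format is taken from the description above.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(prompt: str):
    """Replace each email with a numbered placeholder; return the masked
    text plus the placeholder -> original mapping needed for restoration."""
    mapping = {}
    def sub(m):
        token = f"__PII_EMAIL_ADDRESS_{len(mapping) + 1:08d}__"
        mapping[token] = m.group(0)
        return token
    return EMAIL_RE.sub(sub, prompt), mapping

def restore(reply: str, mapping: dict) -> str:
    """Put the original values back into the model's reply."""
    for token, original in mapping.items():
        reply = reply.replace(token, original)
    return reply

masked, table = mask("Contact alice@example.com about the invoice.")
print(masked)  # Contact __PII_EMAIL_ADDRESS_00000001__ about the invoice.

reply = f"I emailed {list(table)[0]} as requested."
print(restore(reply, table))  # I emailed alice@example.com as requested.
```

Because the mapping never leaves the host, the remote model only ever sees opaque tokens.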
Sensitive data detection
The detector recognizes multiple PII categories, assigning confidence scores of up to 90%. In a test string containing an email address, a phone number and a bank card number, all three entities were correctly identified and mapped to placeholders.
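A toy detector for those three entity types might look like this. It is regex-based and illustrative only; the patterns below are simplifications I am assuming for the sketch, and real-world detection (as in OneAIFW) needs far more robust patterns, validation such as Luhn checks for card numbers, and per-match confidence scoring.

```python
import re

# Deliberately simple patterns; do not use these for production PII detection.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE_NUMBER":  re.compile(r"\+\d[\d -]{7,12}\d"),
    "BANK_CARD":     re.compile(r"\b(?:\d[ -]?){13,18}\d\b"),
}

def detect(text: str):
    """Return (category, matched_text) pairs for every recognized entity."""
    hits = []
    for category, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((category, m.group(0)))
    return hits

sample = "Mail bob@test.org, call +1 555 010 2345, card 4111 1111 1111 1111"
for category, value in detect(sample):
    print(category, "->", value)
# EMAIL_ADDRESS -> bob@test.org
# PHONE_NUMBER -> +1 555 010 2345
# BANK_CARD -> 4111 1111 1111 1111
```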
Architecture
Core engine built with Zig + Rust, supporting both native execution and WebAssembly.
Language bindings: JavaScript (libs/aifw-js) and Python (libs/aifw-py).
Demo applications include a web UI, a browser extension, and backend services based on Presidio/LiteLLM.
Quick start guide
Clone the repository:
git clone https://github.com/funstory-ai/aifw.git && cd aifw

Build the core library:
zig build

Install JavaScript dependencies for the demo:
pnpm -w install

Build the JavaScript package:
pnpm -w --filter @oneaifw/aifw-js build

Run the web demonstration (open the printed local URL in a browser):
cd apps/webapp && pnpm dev

Start the backend service or CLI as described in py-origin/README.md:
python -m aifw launch

Repository: https://github.com/funstory-ai/aifw (MIT license)
Old Meng AI Explorer
Tracking global AI developments 24/7, focusing on large model iterations, commercial applications, and tech ethics. We break down hardcore technology into plain language, providing fresh news, in-depth analysis, and practical insights for professionals and enthusiasts.
