Document Rendering and Structured Data Extraction in Baidu Wenku: From Layout Data to Flow Data and Chart Metadata
The article explains Baidu Wenku's document conversion pipeline, detailing how various office formats are transformed into PDF layout data, then into adaptive flow data for mobile devices, and describes the technical methods for extracting structured content and chart metadata from PDFs and OOXML documents.
Baidu Wenku stores billions of documents (Word, PPT, Excel, PDF, etc.), and its core services are document transcoding and rendering. To handle these formats uniformly, every document is first converted to PDF, parsed using open‑source PDF structures, and then transformed into Baidu's proprietary layout format, which is used for rendering on both PC and mobile.
On PC, the layout data (coordinates, size, and metadata for each element) enables high‑fidelity display that keeps the original proportions when scaled. On mobile devices, however, a page scaled down to fit the screen renders its content too small to read. The solution is to convert layout data into flow data, which carries hierarchical structure such as sections, paragraphs, tables, and formulas and can therefore be re‑laid out adaptively for different screen sizes.
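The difference between the two representations can be sketched in a few lines. This is only an illustration: `LayoutElement`, `FlowNode`, and `render` are hypothetical names, not Baidu's format, and character columns stand in for screen width.

```python
import textwrap
from dataclasses import dataclass, field


@dataclass
class LayoutElement:
    """Layout data: a piece of content pinned to fixed page coordinates."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class FlowNode:
    """Flow data: a node in a hierarchy (section, paragraph, ...) with no coordinates."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


def render(node, columns):
    """Re-lay out a flow tree for a screen that is `columns` characters wide."""
    if node.kind == "paragraph":
        return textwrap.wrap(node.text, width=columns)
    lines = []
    for child in node.children:
        lines.extend(render(child, columns))
    return lines
```

Calling `render` on the same tree with a small and a large `columns` value yields different line breaks, which is the property that fixed‑coordinate layout data lacks.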
The first flow‑data approach, called Retype, traverses the xreader layout elements, groups them into lines based on coordinate proximity, and then merges lines into paragraphs. It also has to handle complex cases such as multi‑column papers, tables, and footnotes.
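The grouping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the Retype implementation: the `Element` type, the tolerance `y_tol`, and the `gap_factor` heuristic are hypothetical, the page is single‑column, and y coordinates grow downward.

```python
from dataclasses import dataclass


@dataclass
class Element:
    text: str
    x: float
    y: float       # top edge; assumed to grow downward
    width: float
    height: float


def group_into_lines(elements, y_tol=2.0):
    """Cluster elements whose top edges lie within y_tol of the line's first element."""
    lines = []
    for el in sorted(elements, key=lambda e: (e.y, e.x)):
        if lines and abs(lines[-1][0].y - el.y) <= y_tol:
            lines[-1].append(el)
        else:
            lines.append([el])
    # Order each line left to right.
    return [sorted(line, key=lambda e: e.x) for line in lines]


def merge_into_paragraphs(lines, gap_factor=1.5):
    """Start a new paragraph when the line pitch exceeds gap_factor * previous line height."""
    paragraphs = []
    prev = None  # (top, height) of the previous line
    for line in lines:
        top = min(e.y for e in line)
        height = max(e.height for e in line)
        text = " ".join(e.text for e in line)
        if prev is not None and top - prev[0] <= gap_factor * prev[1]:
            paragraphs[-1].append(text)
        else:
            paragraphs.append([text])
        prev = (top, height)
    return [" ".join(p) for p in paragraphs]
```

Multi‑column pages, tables, and footnotes would need additional rules (for example, splitting by column before grouping), which this sketch deliberately leaves out.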
The second approach, BDJson, targets Word documents by parsing the OOXML (docx) package directly. OOXML provides native structural information (sections, paragraphs, tables) that maps directly onto flow data. The conversion also handles headers and footers, footnotes, list numbering, and merged cells, and converts formulas to LaTeX for consistent rendering.
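As a rough illustration of why OOXML maps so directly to flow data, the sketch below reads `word/document.xml` from a docx package with the Python standard library and emits paragraph and table blocks. The function name `docx_to_flow` and the output dictionaries are assumptions made for this example, not the BDJson format; headers, footnotes, numbering, merged cells, and formulas are not handled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(node):
    """Concatenate the text of all w:t runs under a node."""
    return "".join(t.text or "" for t in node.iter(f"{W}t"))


def docx_to_flow(docx_file):
    """Map top-level body children of a docx to simple flow blocks."""
    with zipfile.ZipFile(docx_file) as pkg:
        root = ET.fromstring(pkg.read("word/document.xml"))
    body = root.find(f"{W}body")
    blocks = []
    for child in body:
        if child.tag == f"{W}p":
            blocks.append({"type": "paragraph", "text": _text(child)})
        elif child.tag == f"{W}tbl":
            # Direct children only, so nested tables are not flattened into rows.
            rows = [
                [_text(tc) for tc in tr.findall(f"{W}tc")]
                for tr in child.findall(f"{W}tr")
            ]
            blocks.append({"type": "table", "rows": rows})
    return blocks
```

`docx_to_flow` accepts a path or a file‑like object, so it can be pointed at a file on disk or at an in‑memory `io.BytesIO` buffer.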
For chart extraction, the pipeline consists of two modules: range detection and metadata extraction. Range detection scans PDF pages, merges adjacent fragments into lines, identifies blank spaces as candidate ranges, and refines them through merging and filtering. Metadata extraction then crops each range, classifies the image (e.g., bar chart, pie chart), reconstructs axes, performs OCR on sub‑ranges, and assembles the chart’s data points into structured metadata.
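The range detection idea can be sketched in one dimension: given the vertical extents of the text lines on a page, treat large blank bands as candidate ranges and then merge candidates that are close together. The function names and the thresholds `min_gap` and `max_sep` are hypothetical, and the actual merging and filtering rules are not described in the article.

```python
def candidate_ranges(line_boxes, page_height, min_gap=40.0):
    """Return vertical bands (top, bottom) of at least min_gap that contain no text line.

    line_boxes is a list of (top, bottom) extents, with y growing downward.
    """
    ranges = []
    cursor = 0.0  # bottom of the text seen so far
    for top, bottom in sorted(line_boxes):
        if top - cursor >= min_gap:
            ranges.append((cursor, top))
        cursor = max(cursor, bottom)
    if page_height - cursor >= min_gap:
        ranges.append((cursor, page_height))
    return ranges


def merge_ranges(ranges, max_sep=15.0):
    """Merge candidates separated by a thin strip, e.g. a single label line (assumed rule)."""
    merged = []
    for top, bottom in sorted(ranges):
        if merged and top - merged[-1][1] <= max_sep:
            merged[-1] = (merged[-1][0], max(merged[-1][1], bottom))
        else:
            merged.append((top, bottom))
    return merged
```

A full implementation would work in two dimensions and pass each surviving range to the classification and OCR steps; this sketch covers only the blank‑space scan.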
Figures illustrate the challenges of scaling layout data on mobile, the conversion workflow from layout to flow data, and examples of range detection and chart metadata extraction.
Future work focuses on finer‑grained element extraction and richer interactive features, building on the established document structure extraction foundation.