Understanding the Essence of Office Files and PDF Parsing for Frontend Developers
This article explains the historical background, standards, and internal structure of office formats like XLSX, DOCX, PPTX and PDF, and demonstrates how frontend developers can parse these files using XML, ZIP archives, JSZip and browser APIs to extract data or render documents.
1. Introduction
As developers we often encounter various office file formats such as XLSX, DOCX, PPTX, and PDF. While third‑party libraries like sheet.js, mammoth.js, and pptxjs can parse them, this article focuses on the underlying nature of these files so that developers can understand and solve problems without being locked into specific APIs.
2. The Essence of Office Files
History
Paper, traditionally credited to Cai Lun around 105 AD, made it possible to record, write, and disseminate information. It remained the dominant medium for nearly two millennia, until personal computers and the Internet drove the digitisation of documents in the late 20th century. That shift produced word-processing software such as WPS (first released in the late 1980s) and Microsoft Office, which dominate the market today.
Standards
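Major word processors can open the same .docx files because they implement Office Open XML (OOXML), a format originally developed by Microsoft and standardised by Ecma International as ECMA-376 and later by ISO/IEC as ISO/IEC 29500. The older binary .doc format predates OOXML and is described by a separate specification. The standard defines how paragraphs, tables, images, layout, etc. are described in XML, allowing different applications to read and write compatible files.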
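To make the idea of paragraphs described in XML concrete, here is a minimal sketch. The fragment is a hand-written, hypothetical example in the style of the word/document.xml part of a .docx file, and extractParagraphs is an illustrative helper, not part of any library. It uses regular expressions only to stay dependency-free; a real parser has to handle entities, tabs, line breaks, and much more.

```javascript
// Hypothetical fragment in the style of word/document.xml (WordprocessingML).
const documentXml = `
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t>OOXML</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>`;

// Illustrative helper: return the plain text of each <w:p> paragraph by
// joining the <w:t> text runs it contains. Formatting elements are ignored.
function extractParagraphs(xml) {
  // Matches <w:p> or <w:p ...attributes>, but not <w:pPr>.
  const paragraphPattern = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
  // Matches <w:t> or <w:t ...attributes>, but not <w:tab/> or <w:tbl>.
  const textPattern = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
  const paragraphs = [];
  for (const paragraph of xml.matchAll(paragraphPattern)) {
    const runs = Array.from(paragraph[1].matchAll(textPattern), (m) => m[1]);
    paragraphs.push(runs.join(""));
  }
  return paragraphs;
}

console.log(extractParagraphs(documentXml));
// logs an array with two strings: "Hello, OOXML" and "Second paragraph"
```

In the browser, the same traversal would more robustly be done on a DOM tree rather than with regular expressions; the point here is only that a paragraph is a w:p element made of runs (w:r) that carry text (w:t).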
Compression Packages
Modern office files are essentially ZIP archives that contain a collection of XML and auxiliary files. Changing a .docx extension to .zip and extracting reveals a folder full of .xml files; the same applies to .xlsx and .pptx. PDF, however, is not a ZIP archive.
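The difference between the two container types is visible in the first few bytes of a file. The sketch below is an illustration under simple assumptions, not a full format detector: sniffFormat is a hypothetical helper that checks for the ZIP local-file-header signature ("PK" followed by bytes 0x03 0x04) and for the "%PDF-" header.

```javascript
// Hypothetical helper: classify a file by its leading "magic" bytes.
// Accepts a Uint8Array (or any array-like of byte values).
function sniffFormat(bytes) {
  const startsWith = (signature) =>
    signature.every((byte, index) => bytes[index] === byte);

  // ZIP local file header: "PK\x03\x04" (used by .docx, .xlsx, .pptx)
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "zip";
  // PDF header: "%PDF-"
  if (startsWith([0x25, 0x50, 0x44, 0x46, 0x2d])) return "pdf";
  return "unknown";
}

const encoder = new TextEncoder();
console.log(sniffFormat(encoder.encode("%PDF-1.1\n")));                    // "pdf"
console.log(sniffFormat(new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x14]))); // "zip"
console.log(sniffFormat(encoder.encode("<?xml version='1.0'?>")));        // "unknown"
```

In a browser, the bytes could come from a user-selected File, for example via `new Uint8Array(await file.slice(0, 8).arrayBuffer())`. A "zip" result only says the container is a ZIP archive; telling .docx, .xlsx, and .pptx apart still requires looking at the parts inside it.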
XML
XML is a plain‑text markup language; browsers can parse it directly via DOMParser. The article provides a simple XML parsing example that extracts employee data from an XML string.
<code style="padding: 16px; color: #333; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menpace, monospace; font-size: 12px"><!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>XML Parsing Example</title>
</head>
<body>
<h1>XML Parsing Example</h1>
<script>
// Sample XML data
var xmlData = `
<employees>
  <employee>
    <id>1</id>
    <name>John Doe</name>
    <position>Developer</position>
  </employee>
  <employee>
    <id>2</id>
    <name>Jane Smith</name>
    <position>Designer</position>
  </employee>
</employees>
`;
// Parse the XML string into a DOM document
var parser = new DOMParser();
var xmlDoc = parser.parseFromString(xmlData, "text/xml");
// Walk every <employee> element and log its fields
var employees = xmlDoc.getElementsByTagName("employee");
for (var i = 0; i < employees.length; i++) {
  var id = employees[i].getElementsByTagName("id")[0].textContent;
  var name = employees[i].getElementsByTagName("name")[0].textContent;
  var position = employees[i].getElementsByTagName("position")[0].textContent;
  console.log("Employee ID: " + id);
  console.log("Name: " + name);
  console.log("Position: " + position);
  console.log("--------------------");
}
</script>
</body>
</html></code>
3. Parsing Office Files in the Browser
Front‑end code can read files from disk, unzip them with JSZip (available for both browser and Node.js), and then parse the contained XML to produce JSON or DOM structures. A complete JSZip compression/decompression example is shown below.
<code style="padding: 16px; color: #333; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menpace, monospace; font-size: 12px"><!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>JSZip Demo</title>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jszip/3.1.5/jszip.min.js"></script>
</head>
<body>
<script>
// Compress a string into a ZIP archive and return it as a base64 string.
// Base64 keeps the binary archive intact; reading the Blob as text would corrupt it.
function compressString(originalString) {
  const zip = new JSZip();
  zip.file("compressed.txt", originalString);
  return zip.generateAsync({ type: "base64" });
}
// Load the base64-encoded archive and return the text of compressed.txt
function decompressString(compressedString) {
  return JSZip.loadAsync(compressedString, { base64: true }).then(zipFile => {
    const compressedData = zipFile.file("compressed.txt");
    if (!compressedData) {
      throw new Error("Unable to find compressed data in the zip file.");
    }
    return compressedData.async("string");
  });
}
const originalText = "Hello, this is a sample text for compression and decompression with JSZip.";
console.log("Original Text:", originalText);
compressString(originalText)
  .then(compressedData => {
    console.log("Compressed Data:", compressedData);
    return decompressString(compressedData);
  })
  .then(decompressedText => {
    console.log("Decompressed Text:", decompressedText);
  })
  .catch(error => {
    console.error("Error:", error);
  });
</script>
</body>
</html></code>
4. PDF
PDF (Portable Document Format) was introduced by Adobe in 1993. It stores text, graphics, and layout information at fixed positions, so a page looks the same on every platform; the format is designed for faithful display rather than easy editing. Unlike office files, PDF is not a ZIP archive; it uses its own object-based syntax and a page-description model derived from PostScript.
A minimal PDF file begins with a header such as %PDF-1.1 and contains numbered objects that describe pages, fonts, and drawing commands, followed by a cross-reference table (xref) and a trailer. The tiny PDF example below shows how the syntax places text at absolute page coordinates.
%PDF-1.1
%¥±ë
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<< /Type /Pages
/Kids [3 0 R]
/Count 1
/MediaBox [0 0 300 144]
>>
endobj
3 0 obj
<< /Type /Page
/Parent 2 0 R
/Resources
<< /Font
<< /F1
<< /Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
>>
>>
>>
/Contents 4 0 R
>>
endobj
4 0 obj
<< /Length 55 >>
stream
BT
/F1 18 Tf
0 0 Td
(Hello World) Tj
ET
endstream
endobj
xref
0 5
0000000000 65535 f
0000000018 00000 n
0000000077 00000 n
0000000178 00000 n
0000000457 00000 n
trailer
<< /Root 1 0 R
/Size 5
>>
startxref
565
%%EOF
Modern browsers can render PDFs directly using <embed>, <iframe>, or the open‑source pdf.js library.
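To see what reading drawing commands means in practice, the following sketch pulls the text out of the content stream from the tiny PDF above (the part between stream and endstream). extractShownText is a hypothetical helper and a toy: it assumes an uncompressed stream and simple literal strings shown with the Tj operator. Real PDFs usually compress their streams and rely on font encodings, which is why a library such as pdf.js is the practical choice.

```javascript
// Content stream copied from object 4 of the example PDF.
const contentStream = `BT
/F1 18 Tf
0 0 Td
(Hello World) Tj
ET`;

// Hypothetical helper: collect every literal string that is followed by the
// Tj (show text) operator. Handles the escapes \( \) and \\ only; nested
// unescaped parentheses, TJ arrays, and hex strings are out of scope.
function extractShownText(stream) {
  const showTextPattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  return Array.from(stream.matchAll(showTextPattern), (match) =>
    match[1].replace(/\\([\\()])/g, "$1")
  );
}

console.log(extractShownText(contentStream)); // logs an array containing "Hello World"
```

The other operators in the stream carry the layout: Tf selects font F1 at size 18, and Td moves the text position, here to the origin of the 300 by 144 page declared by MediaBox.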
5. Conclusion
Office documents are ZIP‑based collections of XML that follow the OOXML standard, while PDF is a set of drawing commands describing absolute layout. Understanding these fundamentals enables front‑end developers to build custom parsers or effectively use existing libraries for reading and displaying office files and PDFs.
Rare Earth Juejin Tech Community
