Convert DOC to DOCX and Extract Tables with Python & LibreOffice
This tutorial explains why converting legacy .doc files to .docx is necessary, shows how to install LibreOffice and python-docx, provides a Python script to perform the conversion via LibreOffice's command‑line interface, and demonstrates reading and modifying tables in the resulting .docx files.
Purpose
Legacy .doc files often still need to be read or modified programmatically, but modern Python libraries such as python-docx handle only .docx, not .doc.
Why Convert to DOCX
.docx is XML-based (a ZIP package of XML parts), which avoids the encoding problems common with the binary .doc format.
The python-docx library can manipulate .docx files but cannot work with the older .doc format.
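To make the "XML-based" point concrete: a .docx file is a ZIP package whose main body is stored in word/document.xml. The sketch below reads paragraph text from that part using only the standard library. The helper names docx_paragraph_texts and make_minimal_package are illustrative and not part of the original article, and the stand-in package built here contains only the one part needed for the demonstration, so it is not a complete file that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraph_texts(docx_file):
    """Return the text of each paragraph found in word/document.xml.

    docx_file may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t elements.
        pieces = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


def make_minimal_package(text):
    """Build an in-memory ZIP holding only word/document.xml (illustration only)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_paragraph_texts(make_minimal_package("Hello, table")))
```

In practice python-docx hides these details; the point is only that the content is ordinary XML inside a ZIP archive, which the binary .doc format is not.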
Conversion Options
If Microsoft Word is installed on Windows, pywin32 can use Word’s COM interface. For a free, cross‑platform solution, LibreOffice provides a command‑line converter usable on Windows, Linux, and macOS.
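For comparison, the Word COM route mentioned above might look roughly like the following. This is a minimal sketch, assuming Windows with Microsoft Word and the pywin32 package installed; the function name convert_with_word is hypothetical, and the rest of this tutorial uses LibreOffice instead.

```python
import os

# Value of Word's wdFormatXMLDocument constant, which selects the .docx format.
WD_FORMAT_XML_DOCUMENT = 12


def convert_with_word(doc_path, docx_path):
    """Convert .doc to .docx through Word's COM interface (Windows only)."""
    try:
        import win32com.client  # provided by the pywin32 package
    except ImportError as exc:
        raise RuntimeError(
            "pywin32 is required (Windows with Microsoft Word installed)."
        ) from exc

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Absolute paths avoid depending on Word's own working directory.
        doc = word.Documents.Open(os.path.abspath(doc_path))
        try:
            doc.SaveAs2(os.path.abspath(docx_path),
                        FileFormat=WD_FORMAT_XML_DOCUMENT)
        finally:
            doc.Close(False)  # close without saving changes to the source file
    finally:
        word.Quit()
    return docx_path
```

This approach ties the script to one operating system and a licensed Word installation, which is the main reason to prefer the LibreOffice converter for cross-platform use.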
Environment Setup
Install LibreOffice
Install via the official installer on Windows or with a package manager on Linux:
sudo apt update && sudo apt install libreoffice
sudo dnf install libreoffice
sudo zypper install libreoffice
Verify the installation:
soffice --version
Install python-docx
pip install python-docx
Implementation
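Before running the conversion, both prerequisites can be checked from Python. The sketch below uses only the standard library. The names find_soffice and check_environment are illustrative, and the fallback paths are typical default install locations assumed here, so they may differ on a given machine.

```python
import importlib.util
import os
import shutil
import sys

# Assumed default install locations, used only when soffice is not on PATH.
COMMON_SOFFICE_PATHS = {
    "win32": [r"C:\Program Files\LibreOffice\program\soffice.exe"],
    "darwin": ["/Applications/LibreOffice.app/Contents/MacOS/soffice"],
}


def find_soffice():
    """Return a path to the soffice executable, or None if it is not found."""
    on_path = shutil.which("soffice")
    if on_path:
        return on_path
    for candidate in COMMON_SOFFICE_PATHS.get(sys.platform, []):
        if os.path.isfile(candidate):
            return candidate
    return None


def check_environment():
    """Report whether LibreOffice and python-docx appear to be available."""
    return {
        "soffice": find_soffice(),
        # python-docx is imported under the module name "docx".
        "python_docx_installed": importlib.util.find_spec("docx") is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

If soffice is found outside PATH, the returned path can be passed as the soffice_path argument of convert_doc_to_docx.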
1. Convert .doc to .docx
Use LibreOffice in headless mode from Python:
import os
import subprocess


def convert_doc_to_docx(doc_path, output_dir, soffice_path=None):
    """Convert a .doc file to .docx using LibreOffice.

    Args:
        doc_path (str): Path to the .doc file.
        output_dir (str): Directory for the converted file.
        soffice_path (str, optional): Full path to the soffice executable.

    Returns:
        str: Path to the generated .docx file.
    """
    if not doc_path.lower().endswith('.doc'):
        raise ValueError("Input file must be .doc format")

    os.makedirs(output_dir, exist_ok=True)
    # LibreOffice writes <input base name>.docx into the output directory.
    base_name = os.path.splitext(os.path.basename(doc_path))[0]
    docx_path = os.path.join(output_dir, f"{base_name}.docx")

    # Use soffice from PATH unless an explicit executable path was given.
    if soffice_path is None:
        try:
            subprocess.run(["soffice", "--version"], check=True,
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            soffice_cmd = ["soffice"]
        except (FileNotFoundError, subprocess.CalledProcessError) as e:
            raise RuntimeError("LibreOffice is not installed or not in PATH.") from e
    else:
        soffice_cmd = [soffice_path]

    try:
        subprocess.run(
            soffice_cmd + ["--headless", "--nodefault", "--nologo",
                           "--convert-to", "docx", "--outdir", output_dir, doc_path],
            check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # A zero exit code does not guarantee output, so check for the file.
        if not os.path.exists(docx_path):
            raise RuntimeError(f"Conversion failed, output not found: {docx_path}")
        return docx_path
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"Conversion failed: {e.stderr.decode()}") from e

2. Read and optionally modify tables in the converted DOCX
Use python-docx to extract table data:
from docx import Document


def clean_text(text):
    """Placeholder cleanup helper; the original article does not show its body.

    Assumption: collapse runs of whitespace into single spaces.
    """
    return " ".join(text.split())


def read_docx_tables(docx_path):
    """Extract tables from a .docx file.

    Args:
        docx_path (str): Path to the .docx file.

    Returns:
        list: List of tables, each table is a list of row lists.
    """
    try:
        doc = Document(docx_path)
    except Exception as e:
        raise RuntimeError(f"Unable to read .docx file: {e}") from e

    tables_data = []
    for i, table in enumerate(doc.tables):
        print(f"\nTable {i + 1}:")
        table_data = []
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            # Optional: clean special characters
            row_data = [clean_text(cell) for cell in row_data]
            table_data.append(row_data)
            print(row_data)
        tables_data.append(table_data)
    return tables_data

The conversion function checks for LibreOffice, runs the conversion, and returns the path of the generated .docx file. The table-reading function returns a nested list (tables, then rows, then cell strings) that can be processed further; the tables in the document itself can be modified with the same library.
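As one example of further processing, a table in the nested-list form returned by read_docx_tables can be turned into records or CSV text with the standard library. This is a minimal sketch: the helper names table_to_records and table_to_csv and the sample data are illustrative, and it assumes the first row of each table is a header row.

```python
import csv
import io


def table_to_records(table):
    """Treat the first row as the header and return one dict per data row."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def table_to_csv(table):
    """Serialize a table (list of row lists) to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


if __name__ == "__main__":
    # Sample data in the same shape as one entry of read_docx_tables' result.
    sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", "5"]]
    print(table_to_records(sample))
    print(table_to_csv(sample))
```

Tables without a header row, or with repeated header names, would need different handling than this sketch provides.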
Summary
By installing LibreOffice and python-docx, legacy .doc files can be converted to .docx on any major OS and their tables can be programmatically accessed and edited with Python.
Dunmao Tech Hub
Sharing selected technical articles synced from CSDN. Follow us on CSDN: Dunmao.
