
Convert DOC to DOCX and Extract Tables with Python & LibreOffice

This tutorial explains why converting legacy .doc files to .docx is necessary, shows how to install LibreOffice and python-docx, provides a Python script to perform the conversion via LibreOffice's command‑line interface, and demonstrates reading and modifying tables in the resulting .docx files.

Dunmao Tech Hub

Purpose

Legacy .doc files often need to be read or modified programmatically, but modern Python libraries such as python-docx handle only .docx, not .doc.

Why Convert to DOCX

.docx is XML‑based (a ZIP package of XML parts), which avoids the encoding problems that come with the binary .doc format.

The python-docx library can manipulate .docx files but cannot work with the older .doc format.
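
To make the "XML‑based" point concrete, here is a minimal sketch that opens a .docx with Python's standard zipfile module and lists the parts inside it. The function names are illustrative, and the code assumes the main body lives at word/document.xml, which is where Word and LibreOffice typically place it.

```python
import zipfile


def list_docx_parts(docx_path):
    """Return the names of all parts stored inside a .docx package."""
    with zipfile.ZipFile(docx_path) as package:
        return package.namelist()


def read_main_document_xml(docx_path):
    """Return the raw XML of the main document part as text.

    Assumes the conventional part name word/document.xml.
    """
    with zipfile.ZipFile(docx_path) as package:
        return package.read("word/document.xml").decode("utf-8")
```

Calling list_docx_parts() on a converted file should show entries such as [Content_Types].xml and word/document.xml, which is the structure python-docx reads and writes.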

Conversion Options

If Microsoft Word is installed on Windows, pywin32 can use Word’s COM interface. For a free, cross‑platform solution, LibreOffice provides a command‑line converter usable on Windows, Linux, and macOS.
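
For comparison, the Word COM route might look like the sketch below. It is not part of the original script: it assumes Windows, an installed copy of Microsoft Word, and pywin32 (pip install pywin32). The function name convert_with_word is hypothetical; 12 is the WdSaveFormat value for wdFormatXMLDocument.

```python
import os

# WdSaveFormat value for the Open XML (.docx) format in Word's object model.
WD_FORMAT_XML_DOCUMENT = 12


def convert_with_word(doc_path, docx_path):
    """Convert doc_path to docx_path by automating Microsoft Word (Windows only)."""
    # Imported here so this module still loads on systems without pywin32.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word resolves relative paths against its own working directory,
        # so pass absolute paths.
        doc = word.Documents.Open(os.path.abspath(doc_path))
        try:
            doc.SaveAs2(os.path.abspath(docx_path),
                        FileFormat=WD_FORMAT_XML_DOCUMENT)
        finally:
            doc.Close(False)  # close without saving changes to the source
    finally:
        word.Quit()
    return docx_path
```

The rest of this tutorial uses LibreOffice instead, since it needs no Word license and runs on all three platforms.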

Environment Setup

Install LibreOffice

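Install via the official installer on Windows or macOS, or with a package manager on Linux (run only the command that matches your distribution):

# Debian/Ubuntu
sudo apt update && sudo apt install libreoffice
# Fedora
sudo dnf install libreoffice
# openSUSE
sudo zypper install libreoffice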

Verify the installation:

soffice --version
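
The same check can be done from Python before attempting a conversion. The sketch below uses shutil.which to look for a launcher on PATH; the helper name find_soffice and the candidate list are assumptions (some Linux packages also ship a launcher named libreoffice).

```python
import shutil


def find_soffice(candidates=("soffice", "libreoffice")):
    """Return the full path of the first LibreOffice launcher on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If this returns None even though LibreOffice is installed, the executable is probably not on PATH (this can happen on Windows and macOS); in that case pass its full path explicitly as soffice_path in the conversion function shown later.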

Install python-docx

pip install python-docx

Implementation

1. Convert .doc to .docx

Use LibreOffice in headless mode from Python:

import os
import subprocess

def convert_doc_to_docx(doc_path, output_dir, soffice_path=None):
    """Convert a .doc file to .docx using LibreOffice.
    Args:
        doc_path (str): Path to the .doc file.
        output_dir (str): Directory for the converted file.
        soffice_path (str, optional): Full path to the soffice executable.
    Returns:
        str: Path to the generated .docx file.
    """
    if not doc_path.lower().endswith('.doc'):
        raise ValueError("Input file must be .doc format")
    os.makedirs(output_dir, exist_ok=True)
    base_name = os.path.splitext(os.path.basename(doc_path))[0]
    docx_path = os.path.join(output_dir, f"{base_name}.docx")

    if soffice_path is None:
        try:
            subprocess.run(["soffice", "--version"], check=True,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            soffice_cmd = ["soffice"]
        except (FileNotFoundError, subprocess.CalledProcessError) as e:
            raise RuntimeError("LibreOffice is not installed or not in PATH.") from e
    else:
        soffice_cmd = [soffice_path]

    try:
        subprocess.run(
            soffice_cmd + ["--headless", "--nodefault", "--nologo",
                           "--convert-to", "docx", "--outdir", output_dir, doc_path],
            check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if not os.path.exists(docx_path):
            raise RuntimeError(f"Conversion failed, output not found: {docx_path}")
        return docx_path
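    except FileNotFoundError as e:
        # Raised when an explicitly supplied soffice_path does not exist.
        raise RuntimeError(f"soffice executable not found: {soffice_cmd[0]}") from e
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"Conversion failed: {e.stderr.decode(errors='replace')}") from e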
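
Legacy documents often arrive in folders, so a batch wrapper is a natural next step. The following is a sketch only: convert_directory is a hypothetical helper that takes the converter as a parameter (for example convert_doc_to_docx from above), converts files one at a time, and collects failures instead of stopping at the first one.

```python
from pathlib import Path


def convert_directory(src_dir, output_dir, convert):
    """Convert every .doc file directly inside src_dir.

    convert is a callable taking (doc_path, output_dir) and returning the
    output path, e.g. convert_doc_to_docx.
    Returns (converted, failed): a list of output paths and a list of
    (source_path, error_message) tuples.
    """
    converted, failed = [], []
    for path in sorted(Path(src_dir).iterdir()):
        # Match .doc case-insensitively; skip .docx and everything else.
        if not path.is_file() or path.suffix.lower() != ".doc":
            continue
        try:
            converted.append(convert(str(path), str(output_dir)))
        except (RuntimeError, ValueError) as exc:
            failed.append((str(path), str(exc)))
    return converted, failed
```

Usage would look like converted, failed = convert_directory("legacy", "converted", convert_doc_to_docx), where both directory names are placeholders.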

2. Read and optionally modify tables in the converted DOCX

Use python-docx to extract table data:

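import re

from docx import Document

def clean_text(text):
    """Collapse runs of whitespace (including line breaks) into single spaces.

    Placeholder implementation (an assumption, not part of the original
    script); adjust the cleaning rules to your data.
    """
    return re.sub(r"\s+", " ", text).strip()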

def read_docx_tables(docx_path):
    """Extract tables from a .docx file.
    Args:
        docx_path (str): Path to the .docx file.
    Returns:
        list: List of tables, each table is a list of row lists.
    """
    try:
        doc = Document(docx_path)
    except Exception as e:
        raise RuntimeError(f"Unable to read .docx file: {e}") from e

    tables_data = []
    for i, table in enumerate(doc.tables):
        print(f"\nTable {i + 1}:")
        table_data = []
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            # Optional: clean special characters
            row_data = [clean_text(cell) for cell in row_data]
            table_data.append(row_data)
            print(row_data)
        tables_data.append(table_data)
    return tables_data
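
Each extracted table is a plain list of row lists, so it can be handed to ordinary Python tooling. As a sketch, assuming the first row of a table is its header, the hypothetical helpers below turn a table into dictionaries or write it to CSV using only the standard library.

```python
import csv


def table_to_records(table_data):
    """Turn a table (list of row lists) into dicts keyed by the header row.

    Assumes the first row holds unique column names.
    """
    if not table_data:
        return []
    header, *rows = table_data
    return [dict(zip(header, row)) for row in rows]


def write_table_csv(table_data, csv_path):
    """Write a table (list of row lists) to a UTF-8 CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(table_data)
```

One caveat: for merged cells, python-docx may report the same cell more than once in a row, so tables with merges can contain repeated values that need de-duplicating before this step.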

The conversion function checks for LibreOffice, runs the conversion, and returns the path of the generated .docx file. The table‑reading function returns a nested list that can be processed further or modified using the same library.

Summary

By installing LibreOffice and python-docx, legacy .doc files can be converted to .docx on any major OS and their tables can be programmatically accessed and edited with Python.
