Fed up feeding AI with docs? Microsoft’s Open‑Source MarkItDown converts any format to Markdown in a few lines
MarkItDown, an open‑source Python tool from Microsoft’s AutoGen team, converts over 20 document and media formats—including Word, Excel, PDF, images, audio and YouTube links—into standardized Markdown, offering OCR, LLM integration, Docker deployment, Azure Document Intelligence support, and extensive command‑line examples for enterprise and research pipelines.
Overview
MarkItDown is an open‑source Python utility from Microsoft’s AutoGen team that converts more than 20 document and media formats—including Word, Excel, PowerPoint, PDF, HTML, CSV, JSON, XML, images, audio, YouTube links, EPub and ZIP archives—into a standardized Markdown representation for large‑language‑model (LLM) ingestion.
Supported formats
Document: DOC/DOCX (preserves headings, lists, tables, comments), XLS/XLSX (preserves table structure and formatted cells), PPTX (extracts text, images, can generate image descriptions via LLM), HTML (extracts text and links), PDF (text and table extraction; OCR plugin for embedded images), ZIP (traverses archive and converts each file).
Multimedia: images (EXIF metadata, optional OCR), audio (EXIF metadata, optional transcription).
Structured data: CSV, JSON, XML (structure retained).
Other: YouTube links (subtitle extraction), EPub (text and chapter structure), community contributions for older PPT, EML, ODT.
Intelligent processing
Integrates LLMs such as GPT‑4o to generate image descriptions or optimise text.
Optional markitdown-ocr plugin provides OCR for embedded images without additional ML libraries.
Docker multi‑stage builds isolate Python dependencies.
Optional Azure Document Intelligence backend (formerly Form Recognizer) improves conversion accuracy for complex, multilingual PDFs.
Plugin architecture enables third‑party extensions (e.g., enhanced OCR, new file types).
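The optional features listed above are switched on through constructor arguments in the Python API. The sketch below is illustrative only: `converter_options` is a hypothetical helper, not part of MarkItDown, and it assumes the keyword names `enable_plugins`, `llm_client`, `llm_model`, and `docintel_endpoint` match the version you have installed.

```python
def converter_options(use_plugins=False, llm_client=None,
                      llm_model=None, docintel_endpoint=None):
    """Build keyword arguments for MarkItDown(...) from the features you want.

    Hypothetical helper for illustration; the keyword names are assumed to
    match the installed MarkItDown release.
    """
    options = {"enable_plugins": use_plugins}
    if llm_client is not None:
        # An LLM client is only useful together with a model name.
        options["llm_client"] = llm_client
        options["llm_model"] = llm_model or "gpt-4o"
    if docintel_endpoint:
        # Route conversion through Azure Document Intelligence.
        options["docintel_endpoint"] = docintel_endpoint
    return options


# Usage (assumes markitdown is installed):
#   from markitdown import MarkItDown
#   md = MarkItDown(**converter_options(use_plugins=True))
```

Keeping the options in a plain dictionary makes it easy to drive them from a config file or environment variables in a pipeline.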
Installation
Requires Python 3.10+. A virtual environment (venv, uv, or Conda) is recommended to avoid dependency conflicts.
pip install 'markitdown[all]'
Hatch, which the project uses for builds and tests, can be installed with any of:
pip install hatch
brew install hatch
conda install -c conda-forge hatch
Latest development version:
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
A known issue with youtube-transcript-api has been fixed in recent releases.
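Before wiring the tool into a pipeline, it can help to confirm that the interpreter and package meet the requirements above. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name, and the only facts it relies on are the Python 3.10+ requirement and the `markitdown` package name.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the installation notes


def check_environment(version_info=sys.version_info, module_name="markitdown"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required"
        )
    # find_spec returns None when a top-level package cannot be imported.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"package '{module_name}' is not importable")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("environment OK" if not issues else "\n".join(issues))
```

Running the script inside the virtual environment you installed into is the quickest way to catch a mismatch between the interpreter on your PATH and the one that has the package.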
Basic usage
Convert an Excel file:
# Write to stdout and redirect into a file
markitdown test.xlsx > test.md
# Explicit output file
markitdown test.xlsx -o test.md
Convert a PDF with OCR and LLM‑generated image descriptions:
pip install markitdown-ocr openai
python - <<'PY'
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
    enable_plugins=True,   # load installed plugins such as markitdown-ocr
    llm_client=OpenAI(),   # reads OPENAI_API_KEY from the environment
    llm_model="gpt-4o",
)
result = md.convert("image-rich.pdf")
with open("output.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)
PY
Batch processing via pipeline:
find ./docs -name '*.pdf' | xargs -I{} markitdown {} -o {}.md
Azure Document Intelligence integration (replace <document_intelligence_endpoint> with your resource URL):
markitdown target.pdf -o out.md -d -e "<document_intelligence_endpoint>"
Enterprise scenarios
CI/CD‑driven bulk conversion of PDFs, Word manuals, and PPT decks.
Pre‑processing data lakes: unify Excel reports, meeting recordings, and email attachments into Markdown for downstream analysis.
Knowledge‑base construction: feed converted documentation into Retrieval‑Augmented Generation (RAG) pipelines.
Multimodal context creation: combine image descriptions, audio transcripts, and text into a single Markdown file.
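The bulk-conversion and multimodal-context scenarios above can also be driven from Python rather than the shell. The following is a sketch under stated assumptions: `convert_tree` and `combine_markdown` are hypothetical helpers, the default suffix list is an arbitrary choice, and the converter is passed in as a plain callable so the logic does not depend on any particular MarkItDown version.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_tree(
    root: Path,
    convert: Callable[[Path], str],
    suffixes: Iterable[str] = (".pdf", ".docx", ".pptx", ".xlsx"),
) -> list[Path]:
    """Convert every matching file under root, writing <name>.md beside it."""
    wanted = {s.lower() for s in suffixes}
    written = []
    # sorted() materializes the listing before any output files are created.
    for source in sorted(Path(root).rglob("*")):
        if not source.is_file() or source.suffix.lower() not in wanted:
            continue
        target = source.with_name(source.name + ".md")
        target.write_text(convert(source), encoding="utf-8")
        written.append(target)
    return written


def combine_markdown(parts: Iterable[Path], output: Path) -> Path:
    """Concatenate Markdown files in the given order, one heading per source."""
    sections = []
    for part in parts:
        body = Path(part).read_text(encoding="utf-8").strip()
        sections.append(f"## Source: {Path(part).name}\n\n{body}\n")
    Path(output).write_text("\n".join(sections), encoding="utf-8")
    return Path(output)


# Usage with the real library (assumes markitdown is installed):
#   from markitdown import MarkItDown
#   md = MarkItDown()
#   outputs = convert_tree(Path("./docs"),
#                          lambda p: md.convert(str(p)).text_content)
#   combine_markdown(outputs, Path("multimodal_context.md"))
```

Passing the converter as a callable keeps the file-walking logic easy to test in CI with a stub, and lets the same loop wrap a converter configured with plugins or an LLM client.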
Example commands for large‑scale automation
Batch conversion of all PDFs in a directory:
find ./docs -name '*.pdf' -exec markitdown {} -o {}.md \;
Batch conversion of Excel reports:
find ./reports -name '*.xlsx' -o -name '*.xls' | xargs -I{} markitdown {} -o {}.md
Convert audio with optional timestamped transcription:
markitdown meeting.mp3 -o meeting.md --audio-timestamp true
Convert an image and generate a description using GPT‑4o:
markitdown product.jpg -o description.md --llm-model gpt-4o
Combine multimodal outputs:
cat description.md audio.md document.md > multimodal_context.md
Repository
Project repository: https://github.com/microsoft/markitdown
Star history: https://www.star-history.com/microsoft/markitdown
AI Architecture Path
Focused on AI open-source practice, sharing AI news, tools, technologies, learning resources, and GitHub projects.
