Tech Musings
Feb 7, 2026 · Fundamentals
How to Clean and Convert a Chinese Poetry Dataset for RAG Projects
This guide explains how to clean a Chinese poetry corpus for RAG use: removing special characters, filtering out short entries, and converting Traditional Chinese characters to Simplified Chinese. It covers Python validation functions, batch file processing, and OpenCC conversion run under WSL, and ends by saving the results as JSON.
JSON · RAG · data cleaning
12 min read
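
The pipeline summarized above (strip special characters, drop short entries, convert Traditional to Simplified, save as JSON) might be sketched roughly as follows. This is a minimal illustration, not the article's actual code: the character whitelist, the `MIN_LENGTH` threshold, and the function names `clean_text`, `is_valid`, `clean_corpus`, and `save_json` are all assumptions. The conversion step is passed in as a plain callable so the sketch runs without OpenCC installed.

```python
import json
import re
from pathlib import Path
from typing import Callable, Iterable, List

# Keep CJK ideographs and common Chinese punctuation; drop everything else.
# This whitelist is an assumption, not the article's exact rule.
_DISALLOWED = re.compile(r"[^\u4e00-\u9fff，。！？；：、]")

MIN_LENGTH = 10  # hypothetical cutoff for "short" entries


def clean_text(text: str) -> str:
    """Remove whitespace, Latin letters, digits, brackets, and other symbols."""
    return _DISALLOWED.sub("", text)


def is_valid(text: str, min_length: int = MIN_LENGTH) -> bool:
    """Validation check: keep only entries long enough after cleaning."""
    return len(text) >= min_length


def clean_corpus(
    poems: Iterable[str],
    convert: Callable[[str], str] = lambda s: s,
    min_length: int = MIN_LENGTH,
) -> List[str]:
    """Clean each poem, apply the Traditional-to-Simplified converter, filter.

    `convert` defaults to the identity function. With OpenCC available, the
    conversion could instead be done on whole files from the shell, e.g.
    `opencc -i input.txt -o output.txt -c t2s.json`.
    """
    cleaned = []
    for poem in poems:
        text = convert(clean_text(poem))
        if is_valid(text, min_length):
            cleaned.append(text)
    return cleaned


def save_json(records: List[str], path: str) -> None:
    """Persist results as UTF-8 JSON, keeping Chinese characters readable."""
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Passing `ensure_ascii=False` keeps the poems as readable Chinese text in the output file instead of `\uXXXX` escapes, which makes the JSON easier to inspect before it is loaded into a RAG index.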
