How to Clean and Convert a Chinese Poetry Dataset for RAG Projects
This guide explains how to clean a Chinese poetry corpus—removing special characters, filtering short entries, and converting traditional characters to simplified Chinese—using Python validation functions, batch file processing, and WSL‑based OpenCC conversion, then persisting the results as JSON.
1. Background
Links to the previous two parts of this RAG engineering series are provided. The current knowledge base uses poems from the chinese-poetry project, which contains 55,000 Tang poems, 260,000 Song poems, and 21,000 Song lyrics (ci). The raw corpus has two problems:
It contains special characters and invalid content.
It is primarily in traditional characters and includes variant forms.
To make the project usable, the raw corpus must be cleaned and converted from traditional to simplified Chinese.
2. Data Source Format
The original data is stored as JSON, with each poem containing the fields author, paragraphs, title, and id.
{
  "author": "無名氏",
  "paragraphs": [
    "神暢(一作「粲」)感寂庭,嘿思徹九重。",
    "靈歌理冥運,百和結成章。"
  ],
  "title": "道藏歌詩 四十四",
  "id": "a1758a2c-ed74-4ab5-b05d-cf9069d01978"
}

3. Data Cleaning Implementation
3.1 Content Validation Function
The function is_valid_poem_content() validates poem content.
import re

def is_valid_poem_content(paragraphs):
    """
    Check whether the poem content meets the requirements:
    1. Only Chinese characters, Latin letters, digits, and the punctuation
       marks 。 ? , . (full- and half-width) are allowed.
    2. After joining all paragraphs and stripping whitespace, the text
       must be at least 20 characters long.
    """
    combined = "".join(paragraphs)
    cleaned = re.sub(r"\s+", "", combined)
    if len(cleaned) < 20:
        print(f"Filtered poem (fewer than 20 characters): {cleaned[:50]}...")
        return False
    allowed_pattern = r"^[a-zA-Z\u4e00-\u9fa5\u3002\uff1f\uff0c\.,\?\d]+$"
    if not re.match(allowed_pattern, cleaned):
        print(f"Filtered poem (contains special characters): {cleaned[:50]}...")
        return False
    return True

Implementation notes:
Use Unicode range \u4e00-\u9fa5 to match Chinese characters.
Perform length check before regex to reduce overhead.
Log the first 50 characters of filtered content for debugging.
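A quick check of the whitelist pattern against sample lines shows why entries with editorial annotations get filtered: full-width parentheses and corner brackets fall outside the allowed character set.

```python
import re

# Same whitelist as is_valid_poem_content(): Latin letters, CJK ideographs,
# full-width 。 ? , plus half-width . , ? and digits
allowed_pattern = r"^[a-zA-Z\u4e00-\u9fa5\u3002\uff1f\uff0c\.,\?\d]+$"

clean_line = "春眠不觉晓,处处闻啼鸟。"    # CJK + full-width punctuation only
annotated = "神暢(一作「粲」)感寂庭。"  # editorial note in full-width brackets

print(bool(re.match(allowed_pattern, clean_line)))  # True
print(bool(re.match(allowed_pattern, annotated)))   # False
```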
3.2 Batch File Processing
Python's pathlib module handles file operations.
import json
from pathlib import Path

def process_tang_poems():
    """Process the Complete Tang Poems data."""
    tang_poems_dir = Path(r"c:\xxx\chinese-poetry\全唐诗")
    output_file = Path(r"c:\xxx\cleaned_tang_poems.json")
    cleaned_poems = []
    total_poems = 0
    filtered_poems = 0
    poet_files = sorted(tang_poems_dir.glob("poet.*.json"))
    print(f"Found {len(poet_files)} poem files")
    all_poems_data = []
    for file_path in poet_files:
        print(f"Reading: {file_path.name}")
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                poems = json.load(f)
            for poem in poems:
                total_poems += 1
                author = poem.get("author", "")
                title = poem.get("title", "")
                paragraphs = poem.get("paragraphs", [])
                if is_valid_poem_content(paragraphs):
                    paragraphs_str = "".join(paragraphs)
                    all_poems_data.append({"author": author, "title": title, "paragraphs": paragraphs_str})
                else:
                    filtered_poems += 1
        except Exception as e:
            print(f"Error while processing file {file_path.name}: {e}")
            continue
    # Further processing follows

Key features:
Independent exception handling for each file.
Two‑stage processing: collect valid data, then perform batch conversion.
Real‑time progress output.
4. Traditional‑to‑Simplified Conversion
4.1 Technical Approach
The project runs on Windows and invokes the Linux version of OpenCC via WSL. Alternative solutions include installing OpenCC with pip (requires pre‑compiled dictionaries) or using the pure‑Python opencc‑python‑reimplemented package (no longer maintained).
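If neither WSL nor a maintained Python binding is available, the core idea of t2s conversion can be sketched with a toy character map. This is purely illustrative: real OpenCC ships dictionaries with tens of thousands of entries plus phrase-level rules, so a per-character map like this is not a substitute.

```python
# Toy traditional-to-simplified map; OpenCC's t2s dictionaries are far larger
# and also handle multi-character phrases.
T2S_DEMO = {"無": "无", "詩": "诗", "漢": "汉", "語": "语", "萬": "万"}

def toy_t2s(text: str) -> str:
    """Convert character by character, leaving unknown characters unchanged."""
    return "".join(T2S_DEMO.get(ch, ch) for ch in text)

print(toy_t2s("無名氏"))  # 无名氏
```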
4.2 Single‑Text Conversion
The function convert_text_with_wsl_opencc() converts a single string.
import subprocess

def convert_text_with_wsl_opencc(text):
    """Convert traditional to simplified Chinese using opencc inside WSL."""
    if not text:
        return ""
    try:
        cmd = f"wsl bash -c \"echo '{text}' | opencc -c t2s\""
        result = subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8", shell=True)
        if result.returncode == 0:
            return result.stdout.strip()
        else:
            print(f"Conversion error: {result.stderr}")
            return text
    except Exception as e:
        print(f"Error calling WSL opencc: {e}")
        return text

Details:
Use wsl bash -c to run commands inside WSL.
Echo and pipe to opencc avoid PowerShell parsing issues.
UTF‑8 encoding is enforced throughout.
On failure, the original text is returned.
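One caveat with the echo-and-pipe pattern: the poem text is interpolated straight into a bash command line, so any single quote in the text breaks the quoting, and untrusted text could inject shell syntax. The standard library's shlex.quote produces a bash-safe token; a minimal sketch of the hazard and the fix:

```python
import shlex

tricky = "a'b"  # any text containing a single quote

# Naive interpolation produces broken bash: echo 'a'b'
naive = f"echo '{tricky}' | opencc -c t2s"

# shlex.quote escapes the embedded quote so bash sees one literal token
safe = f"echo {shlex.quote(tricky)} | opencc -c t2s"
print(safe)  # echo 'a'"'"'b' | opencc -c t2s
```

An even simpler route is to skip the shell layer entirely and pass the text on stdin, e.g. subprocess.run(["wsl", "opencc", "-c", "t2s"], input=text, capture_output=True, text=True), assuming opencc is on the WSL PATH.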
4.3 Batch Conversion Optimization
Batch conversion mitigates the performance bottleneck of single‑text conversion by writing all texts to a temporary file and converting them in one WSL call.
import os
import subprocess
import tempfile

def batch_convert_with_wsl_opencc(texts):
    """Batch-convert traditional to simplified Chinese using opencc inside WSL."""
    if not texts:
        return []
    try:
        separator = "|||SEPARATOR|||"
        with tempfile.NamedTemporaryFile(mode="w", encoding="utf-8", delete=False, suffix=".txt") as tmp_in:
            tmp_in.write(separator.join(texts))
            tmp_in_path = tmp_in.name
        wsl_path = tmp_in_path.replace("\\", "/").replace("C:", "/mnt/c")
        cmd = f'wsl bash -c "cat {wsl_path} | opencc -c t2s"'
        result = subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8", shell=True)
        os.unlink(tmp_in_path)
        if result.returncode == 0:
            return result.stdout.strip().split(separator)
        else:
            print(f"Batch conversion error: {result.stderr}")
            return texts
    except Exception as e:
        print(f"Error calling WSL opencc in batch: {e}")
        return texts

Optimization strategies:
Write all texts to a temporary file and convert them with a single WSL call.
Use a unique separator |||SEPARATOR||| to avoid conflicts with poem content.
Convert Windows paths to WSL format (e.g., C: → /mnt/c).
Delete the temporary file after conversion.
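The hard-coded replace("C:", "/mnt/c") only handles one drive letter and one letter case. A small helper (hypothetical, not part of the original script) generalizes the conversion; note that WSL also ships a wslpath utility that does this natively.

```python
def windows_to_wsl_path(win_path: str) -> str:
    """Map a Windows path such as C:\\Users\\me\\x.txt to /mnt/c/Users/me/x.txt."""
    drive, sep, rest = win_path.partition(":")
    if not sep:
        # No drive letter present: just normalize the separators
        return win_path.replace("\\", "/")
    return f"/mnt/{drive.lower()}" + rest.replace("\\", "/")

print(windows_to_wsl_path(r"D:\data\poems.txt"))  # /mnt/d/data/poems.txt
```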
4.4 Integration into Main Workflow
Batch conversion is integrated after data collection.
print(f"\nDone reading: {len(all_poems_data)} valid poems in total")
print("Starting batch traditional-to-simplified conversion...")
authors = [p["author"] for p in all_poems_data]
titles = [p["title"] for p in all_poems_data]
paragraphs_list = [p["paragraphs"] for p in all_poems_data]
print("Converting authors...")
authors_simplified = batch_convert_with_wsl_opencc(authors)
print("Converting titles...")
titles_simplified = batch_convert_with_wsl_opencc(titles)
print("Converting poem content...")
paragraphs_simplified = batch_convert_with_wsl_opencc(paragraphs_list)
for i, poem_data in enumerate(all_poems_data):
    cleaned_poem = {
        "author": poem_data["author"],
        "author_simplified": authors_simplified[i] if i < len(authors_simplified) else poem_data["author"],
        "title": poem_data["title"],
        "title_simplified": titles_simplified[i] if i < len(titles_simplified) else poem_data["title"],
        "paragraphs": poem_data["paragraphs"],
        "paragraphs_simplified": paragraphs_simplified[i] if i < len(paragraphs_simplified) else poem_data["paragraphs"],
    }
    cleaned_poems.append(cleaned_poem)

5. Result Output
5.1 Data Persistence
The processed results are saved as JSON with Chinese characters preserved.
# Save the cleaned data
print(f"\nSaving to {output_file}")
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(cleaned_poems, f, ensure_ascii=False, indent=2)
# Print statistics
print("\n" + "=" * 60)
print("Processing complete! Statistics:")
print(f"Total poems: {total_poems}")
print(f"Poems kept: {len(cleaned_poems)}")
print(f"Poems filtered: {filtered_poems}")
print(f"Output file: {output_file}")
print("=" * 60)

Configuration notes: ensure_ascii=False keeps Chinese characters unchanged in the output file; indent=2 uses two-space indentation for readability.
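The effect of ensure_ascii is easy to see on a single record:

```python
import json

poem = {"title": "静夜思", "author": "李白"}

escaped = json.dumps(poem)                      # default: non-ASCII escaped
readable = json.dumps(poem, ensure_ascii=False)

print(escaped)   # {"title": "\u9759\u591c\u601d", "author": "\u674e\u767d"}
print(readable)  # {"title": "静夜思", "author": "李白"}
```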