
How to Clean and Convert a Chinese Poetry Dataset for RAG Projects

This guide explains how to clean a Chinese poetry corpus—removing special characters, filtering short entries, and converting traditional characters to simplified Chinese—using Python validation functions, batch file processing, and WSL‑based OpenCC conversion, then persisting the results as JSON.


1. Background

This article continues the RAG engineering series. The current knowledge base uses poems from the chinese-poetry project, which contains 55,000 Tang poems, 260,000 Song poems, and 21,000 Song lyrics. The raw dataset has two main problems:

It contains special characters and invalid content.

It is written primarily in traditional characters and includes variant forms.

To make the project usable, the raw corpus must be cleaned and converted from traditional to simplified Chinese.

2. Data Source Format

The original data is stored as JSON, with each poem containing the fields author, paragraphs, title, and id.

{
  "author": "無名氏",
  "paragraphs": [
    "神暢(一作「粲」)感寂庭,嘿思徹九重。",
    "靈歌理冥運,百和結成章。"
  ],
  "title": "道藏歌詩 四十四",
  "id": "a1758a2c-ed74-4ab5-b05d-cf9069d01978"
}
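As a quick sanity check, a record in this shape can be parsed with Python's standard json module (the values below are copied from the example above):

```python
import json

# One record from the chinese-poetry corpus, as shown above
record = json.loads("""
{
  "author": "無名氏",
  "paragraphs": [
    "神暢(一作「粲」)感寂庭,嘿思徹九重。",
    "靈歌理冥運,百和結成章。"
  ],
  "title": "道藏歌詩 四十四",
  "id": "a1758a2c-ed74-4ab5-b05d-cf9069d01978"
}
""")

print(record["author"])           # author name, in traditional characters
print(len(record["paragraphs"]))  # number of verse lines in this poem
```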

3. Data Cleaning Implementation

3.1 Content Validation Function

The function is_valid_poem_content() validates poem content.

import re

def is_valid_poem_content(paragraphs):
    """
    Check whether a poem's content meets the requirements:
    1. Only Chinese characters, Latin letters, digits, and the punctuation
       marks 。?,(full-width)and . , ?(half-width)are allowed.
    2. After joining all paragraphs and removing whitespace, the text must
       be at least 20 characters long.
    """
    combined = "".join(paragraphs)
    cleaned = re.sub(r"\s+", "", combined)
    if len(cleaned) < 20:
        print(f"Filtered poem (fewer than 20 characters): {cleaned[:50]}...")
        return False
    allowed_pattern = r"^[a-zA-Z\u4e00-\u9fa5\u3002\uff1f\uff0c\.,\?\d]+$"
    if not re.match(allowed_pattern, cleaned):
        print(f"Filtered poem (contains special characters): {cleaned[:50]}...")
        return False
    return True

Implementation notes:

Use Unicode range \u4e00-\u9fa5 to match Chinese characters.

Perform length check before regex to reduce overhead.

Log the first 50 characters of filtered content for debugging.
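The two rules are easy to exercise in isolation. The sketch below reproduces the validator in a simplified form (without the debug logging) so the snippet is self-contained:

```python
import re

def is_valid_poem_content(paragraphs):
    # Simplified reproduction of the validator from the article
    combined = "".join(paragraphs)
    cleaned = re.sub(r"\s+", "", combined)
    if len(cleaned) < 20:
        return False  # rule 2: too short after stripping whitespace
    allowed = r"^[a-zA-Z\u4e00-\u9fa5\u3002\uff1f\uff0c\.,\?\d]+$"
    return bool(re.match(allowed, cleaned))  # rule 1: only allowed characters

# 24 allowed characters -> passes both rules
print(is_valid_poem_content(["床前明月光,疑是地上霜。", "举头望明月,低头思故乡。"]))  # True
# Only 6 characters -> fails the length rule
print(is_valid_poem_content(["床前明月光。"]))  # False
# Long enough, but contains □ placeholder glyphs -> fails the character rule
print(is_valid_poem_content(["□□明月光,疑是地上霜。举头望明月,低头思故乡。"]))  # False
```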

3.2 Batch File Processing

Python's pathlib module handles file operations.

import json
from pathlib import Path

def process_tang_poems():
    """Process the Complete Tang Poems data."""
    tang_poems_dir = Path(r"c:\xxx\chinese-poetry\全唐诗")
    output_file = Path(r"c:\xxx\cleaned_tang_poems.json")
    cleaned_poems = []
    total_poems = 0
    filtered_poems = 0
    poet_files = sorted(tang_poems_dir.glob("poet.*.json"))
    print(f"Found {len(poet_files)} poem files")
    all_poems_data = []
    for file_path in poet_files:
        print(f"Reading: {file_path.name}")
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                poems = json.load(f)
            for poem in poems:
                total_poems += 1
                author = poem.get("author", "")
                title = poem.get("title", "")
                paragraphs = poem.get("paragraphs", [])
                if is_valid_poem_content(paragraphs):
                    paragraphs_str = "".join(paragraphs)
                    all_poems_data.append({"author": author, "title": title, "paragraphs": paragraphs_str})
                else:
                    filtered_poems += 1
        except Exception as e:
            print(f"Error while processing {file_path.name}: {e}")
            continue
    # Further processing follows

Key features:

Independent exception handling for each file.

Two‑stage processing: collect valid data, then perform batch conversion.

Real‑time progress output.

4. Traditional‑to‑Simplified Conversion

4.1 Technical Approach

The project runs on Windows and invokes the Linux version of OpenCC via WSL. Alternative solutions include installing OpenCC with pip (requires pre‑compiled dictionaries) or using the pure‑Python opencc‑python‑reimplemented package (no longer maintained).

4.2 Single‑Text Conversion

The function convert_text_with_wsl_opencc() converts a single string.

import subprocess

def convert_text_with_wsl_opencc(text):
    """Convert traditional to simplified Chinese via opencc inside WSL."""
    if not text:
        return ""
    try:
        # Note: the text is interpolated into a shell string, so a single
        # quote inside the input would break the command.
        cmd = f"wsl bash -c \"echo '{text}' | opencc -c t2s\""
        result = subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8", shell=True)
        if result.returncode == 0:
            return result.stdout.strip()
        else:
            print(f"Conversion error: {result.stderr}")
            return text
    except Exception as e:
        print(f"Error calling WSL opencc: {e}")
        return text

Details:

Use wsl bash -c to run commands inside WSL.

Echo and pipe to opencc avoid PowerShell parsing issues.

UTF‑8 encoding is enforced throughout.

On failure, the original text is returned.
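A more robust variant of this pattern passes the text on stdin via subprocess.run's input parameter instead of embedding it in a shell string, so quotes and special characters in a poem cannot break the command. The sketch below demonstrates the idea with cat as a stand-in for the opencc pipeline; pipe_through is a hypothetical helper, and on Windows the command list would be something like ["wsl", "opencc", "-c", "t2s"]:

```python
import subprocess

def pipe_through(command, text):
    # Feed text to the command's stdin rather than interpolating it into a
    # shell string; no quoting or escaping of the input is needed.
    result = subprocess.run(
        command, input=text, capture_output=True, text=True, encoding="utf-8"
    )
    return result.stdout.strip() if result.returncode == 0 else text

# cat echoes stdin unchanged, so quotes survive the round trip intact
print(pipe_through(["cat"], "它's got 'quotes' in it"))
```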

4.3 Batch Conversion Optimization

Batch conversion mitigates the performance bottleneck of single‑text conversion by writing all texts to a temporary file and converting them in one WSL call.

import os
import subprocess
import tempfile

def batch_convert_with_wsl_opencc(texts):
    """Batch-convert traditional to simplified Chinese via opencc inside WSL."""
    if not texts:
        return []
    try:
        with tempfile.NamedTemporaryFile(mode="w", encoding="utf-8", delete=False, suffix=".txt") as tmp_in:
            separator = "|||SEPARATOR|||"
            tmp_in.write(separator.join(texts))
            tmp_in_path = tmp_in.name
        # Map the Windows path to its WSL equivalent (handles drive C: only)
        wsl_path = tmp_in_path.replace("\\", "/").replace("C:", "/mnt/c")
        cmd = f'wsl bash -c "cat {wsl_path} | opencc -c t2s"'
        result = subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8", shell=True)
        os.unlink(tmp_in_path)
        if result.returncode == 0:
            return result.stdout.strip().split(separator)
        else:
            print(f"Batch conversion error: {result.stderr}")
            return texts
    except Exception as e:
        print(f"Error in batch call to WSL opencc: {e}")
        return texts

Optimization strategies:

Write all texts to a temporary file and convert them with a single WSL call.

Use a unique separator |||SEPARATOR||| to avoid conflicts with poem content.

Convert Windows paths to WSL format (e.g., C:\ becomes /mnt/c).

Delete the temporary file after conversion.
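The one-line replace in the function above only handles drive C:. A slightly more general conversion can map any drive letter; win_to_wsl_path below is a hypothetical helper sketching that idea:

```python
import re

def win_to_wsl_path(win_path):
    # C:\Users\me\file.txt -> /mnt/c/Users/me/file.txt
    path = win_path.replace("\\", "/")
    match = re.match(r"^([A-Za-z]):/(.*)$", path)
    if match:
        drive, rest = match.groups()
        return f"/mnt/{drive.lower()}/{rest}"
    return path  # no drive prefix: already a POSIX-style path

print(win_to_wsl_path(r"C:\Temp\poems.txt"))   # /mnt/c/Temp/poems.txt
print(win_to_wsl_path(r"D:\data\tang.json"))   # /mnt/d/data/tang.json
```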

4.4 Integration into Main Workflow

Batch conversion is integrated after data collection.

print(f"\nReading complete: {len(all_poems_data)} valid poems")
print("Starting batch traditional-to-simplified conversion...")
authors = [p["author"] for p in all_poems_data]
titles = [p["title"] for p in all_poems_data]
paragraphs_list = [p["paragraphs"] for p in all_poems_data]

print("Converting authors...")
authors_simplified = batch_convert_with_wsl_opencc(authors)
print("Converting titles...")
titles_simplified = batch_convert_with_wsl_opencc(titles)
print("Converting poem content...")
paragraphs_simplified = batch_convert_with_wsl_opencc(paragraphs_list)

for i, poem_data in enumerate(all_poems_data):
    cleaned_poem = {
        "author": poem_data["author"],
        "author_simplified": authors_simplified[i] if i < len(authors_simplified) else poem_data["author"],
        "title": poem_data["title"],
        "title_simplified": titles_simplified[i] if i < len(titles_simplified) else poem_data["title"],
        "paragraphs": poem_data["paragraphs"],
        "paragraphs_simplified": paragraphs_simplified[i] if i < len(paragraphs_simplified) else poem_data["paragraphs"],
    }
    cleaned_poems.append(cleaned_poem)

5. Result Output

5.1 Data Persistence

The processed results are saved as JSON with Chinese characters preserved.

# Save the cleaned data
print(f"\nSaving to {output_file}")
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(cleaned_poems, f, ensure_ascii=False, indent=2)

# Print statistics
print("\n" + "=" * 60)
print("Processing complete! Statistics:")
print(f"Total poems: {total_poems}")
print(f"Poems kept: {len(cleaned_poems)}")
print(f"Poems filtered: {filtered_poems}")
print(f"Output file: {output_file}")
print("=" * 60)

Configuration notes: ensure_ascii=False keeps Chinese characters unchanged. indent=2 uses two‑space indentation for readability.
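The effect of ensure_ascii is easy to see on a one-field example:

```python
import json

poem = {"title": "静夜思"}

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps(poem))                      # {"title": "\u9759\u591c\u601d"}

# ensure_ascii=False keeps the Chinese characters readable in the file
print(json.dumps(poem, ensure_ascii=False))  # {"title": "静夜思"}
```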

Tags: RAG, JSON, data cleaning, text preprocessing, opencc
Written by Tech Musings

Capturing thoughts and reflections while coding.