Building an AI-Powered Proofreading Agent for Media: Architecture, Prompt Engineering, and Evaluation
This article details a practical case study of designing, implementing, and evaluating an AI-driven proofreading agent for a media client, covering background challenges, a three‑layer architecture, prompt engineering techniques, RAG knowledge‑base construction, model selection, fine‑tuning, automated metrics, and lessons learned.
Background
After the surge of large language models in early 2024, the media industry recognized that generative models can automate the entire content production pipeline, from event detection to article generation and proofreading. However, media customers often lack dedicated AI engineering teams and a clear understanding of model capabilities, which leads to unrealistic expectations and resistance from editorial staff.
Scenario Analysis
Article proofreading is a critical step in media workflows and can be divided into four rule categories: basic formatting, compliance risk, content‑specific terminology, and language‑style nuances. Traditional manual processes require multiple teams to verify each dimension, resulting in long cycles and high labor costs.
Intelligent Agent Solution
The solution adopts a three‑layer architecture:
Business Layer: defines the four rule groups.
Agent Layer: implements rule execution via prompt engineering, Retrieval‑Augmented Generation (RAG) for domain knowledge, and real‑time MCP services for dynamic rule updates.
Model Layer: leverages the public‑cloud Bailian platform to select a base model (e.g., Qwen‑Long) and applies domain‑specific fine‑tuning.
Prompt Engineering
Effective prompts follow standard principles—clear task definition, structured output, and examples—while addressing two common pitfalls in proofreading: rule forgetting and rule conflict.
Rule Forgetting
Repeat critical rules at the beginning, middle, and end of the prompt, using a special marker <!CRITICAL> to boost priority.
Guide attention with weighted directives (e.g., [Importance:5/5]) and chain‑of‑thought reasoning; a prompt‑assembly sketch illustrating these tactics follows this list.
Split rules into independent agents and run them in parallel to avoid interference.
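To illustrate the first two tactics, the sketch below assembles a proofreading prompt that repeats a <!CRITICAL>‑tagged rule at the beginning, middle, and end, attaches an [Importance:5/5] weight, and closes with a chain‑of‑thought cue. The rule text and helper name are hypothetical, not the production prompt.

CRITICAL_RULE = (
    '<!CRITICAL> [Importance:5/5] Replace "apartment" with "flat" (British English).'
)

def build_prompt(article_text: str, other_rules: list[str]) -> str:
    """Assemble a proofreading prompt that repeats the critical rule at the
    beginning, middle, and end of the prompt to counter rule forgetting."""
    rules_block = "\n".join(f"- {rule}" for rule in other_rules)
    return (
        "You are a proofreading assistant.\n"
        f"{CRITICAL_RULE}\n\n"
        f"Rules to apply:\n{rules_block}\n\n"
        f"Reminder before you read the article: {CRITICAL_RULE}\n\n"
        f"Article:\n{article_text}\n\n"
        f"Final reminder: {CRITICAL_RULE}\n"
        "Think step by step: check the article against each rule, "
        "then output only the corrected article."
    )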
Rule Conflict
Design rules to be mutually exclusive, preventing overlapping modifications.
Define a fallback hierarchy so that high‑priority rules override lower‑priority ones during conflict resolution, as in the sketch after this list.
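A minimal way to express such a hierarchy in code is to collect the edits each rule proposes and keep only the highest‑priority edit among any overlapping spans. The Edit fields and priority values below are illustrative assumptions, not the client's actual rule set.

from dataclasses import dataclass

@dataclass
class Edit:
    start: int          # character offsets of the span a rule wants to change
    end: int
    replacement: str
    rule: str
    priority: int       # higher value wins on conflict

def resolve_conflicts(edits: list[Edit]) -> list[Edit]:
    """Keep only the highest-priority edit among any group of overlapping
    spans, so high-priority rules override lower-priority ones."""
    kept: list[Edit] = []
    for e in sorted(edits, key=lambda e: e.priority, reverse=True):
        # Keep the edit only if it does not overlap an already-kept,
        # higher-priority edit.
        if all(e.end <= k.start or e.start >= k.end for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e.start)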
RAG Knowledge‑Base Construction
For keyword‑level corrections, the knowledge base is usually a plain‑text list of terms, synonyms, or policy entries. Two key steps are required:
Structure selection: use a structured database for exact term replacements (e.g., "apartment" → "flat"); use unstructured documents when exhaustive enumeration is impossible.
Parameter configuration: choose an embedding model (the default is DashScope text‑embedding‑v4, but switch to text‑embedding‑v2 for exact keyword matching), set similarity thresholds (0.4–0.5 for precise matches), and control which fields participate in retrieval to reduce noise. A retrieval sketch combining both steps follows.
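To make both steps concrete, the sketch below applies a structured table for exact replacements and gates unstructured retrieval behind a cosine‑similarity threshold. The embed() helper is a placeholder for the embedding call (DashScope text‑embedding‑v2 in the setup described above); the 0.45 threshold and sample term table are illustrative assumptions.

import numpy as np

# Step 1: structured, exact replacements for terms that can be enumerated.
EXACT_TERMS = {"apartment": "flat"}

def replace_exact_terms(text: str) -> str:
    """Apply deterministic term replacements from the structured table."""
    for wrong, right in EXACT_TERMS.items():
        text = text.replace(wrong, right)
    return text

def embed(text: str) -> np.ndarray:
    """Placeholder for the embedding call (e.g., DashScope text-embedding-v2)."""
    raise NotImplementedError

def retrieve(query: str, entries: list[str], threshold: float = 0.45) -> list[str]:
    """Step 2: return knowledge-base entries whose cosine similarity to the
    query clears the threshold (0.4-0.5 for precise matches), keeping
    retrieval low-noise."""
    q = embed(query)
    hits = []
    for entry in entries:
        e = embed(entry)
        sim = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        if sim >= threshold:
            hits.append(entry)
    return hits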
Model Evaluation
Metrics
Proofreading performance is measured with precision, recall, and F1 score, analogous to information‑retrieval evaluation for unordered result sets.
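Concretely, a planted mistake counts as detected when it no longer appears in the model output, and as correct when exactly one expected replacement appears. The scores then reduce to the small helper below, a sketch mirroring the definitions used in the evaluation script in the next section.

def prf(correct: int, detected: int, total_mistakes: int) -> tuple[float, float, float]:
    """precision = correct / detected, recall = correct / total mistakes,
    F1 = harmonic mean of precision and recall (0 when undefined)."""
    precision = correct / detected if detected else 0.0
    recall = correct / total_mistakes if total_mistakes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 10 planted mistakes, 8 no longer appear in the output, 7 replaced
# with exactly the expected term -> precision 0.875, recall 0.7, F1 ~ 0.78.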
Automated Evaluation Procedure
The workflow uses an Excel sheet with four columns (Original Text, Input Text, Output Text, Notes) and a Python script that computes per‑sample and micro‑averaged metrics.
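The function below assumes the sheet has already been loaded into two Python structures. Their exact shape is not shown here, but the field names the function uses imply something like the following (an assumption for illustration):

# data: one dict per evaluation sample
#   data[i]["correct"]  - the ground-truth (original) text
#   data[i]["sample"]   - the input text with mistakes planted in it
#   data[i]["response"] - the raw model output for that sample
#
# checklist: one dict per sample, mapping each planted mistake to its
# acceptable corrected word(s), e.g.
#   checklist[i] = {"apartment": ["flat"], "color": ["colour"]}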
# extract_response and save_output_to_excel are helper functions defined
# elsewhere in the project.
def process(data, checklist, out_file):
    """Process proofreading evaluation data and compute recall, precision, and F1."""
    mistake_all = 0
    detect_all = 0
    correct_all = 0
    output = []
    for i in range(len(data)):
        print(f"\n===== Test Sample {i} =====")
        response = extract_response(data[i]["response"])
        mistake_cnt = len(checklist[i])
        mistake_all += mistake_cnt
        detect_cnt = 0
        correct_cnt = 0
        # Sanity-check the sample: each planted mistake must appear exactly once
        # in the input text and not at all in the ground truth, and its expected
        # correction must appear exactly once in the ground truth only.
        for key in checklist[i]:
            correct_word_count = sum(data[i]["correct"].count(word) for word in checklist[i][key])
            sample_mistake_count = data[i]["sample"].count(key)
            correct_mistake_count = data[i]["correct"].count(key)
            sample_correct_count = sum(data[i]["sample"].count(word) for word in checklist[i][key])
            conditions_met = (
                correct_word_count == 1 and
                sample_mistake_count == 1 and
                correct_mistake_count == 0 and
                sample_correct_count == 0
            )
            if not conditions_met:
                error_info = f"{i}, {key}, {checklist[i][key]}, {correct_word_count}, {sample_mistake_count}, {correct_mistake_count}, {sample_correct_count}"
                print(error_info)
                assert False, error_info
        # Score the model output: a mistake counts as detected when it no longer
        # appears in the response, and as corrected when exactly one of the
        # expected replacement words appears.
        for mistake in checklist[i]:
            correct_words = checklist[i][mistake]
            if response.count(mistake) == 0:
                detect_cnt += 1
            else:
                print(f"Mistake not detected for: {mistake}.")
            correct_occurrences = sum(response.count(word) for word in correct_words)
            if correct_occurrences == 1:
                correct_cnt += 1
            else:
                print(f"{mistake} should be corrected to: {correct_words}.")
        detect_all += detect_cnt
        correct_all += correct_cnt
        # Per-sample metrics.
        recall = correct_cnt / mistake_cnt if mistake_cnt > 0 else 0
        precision = correct_cnt / detect_cnt if detect_cnt != 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) != 0 else 0
        print(f"{mistake_cnt} mistakes, detect {detect_cnt}, correct {correct_cnt}, f1_score: {f1}")
        output.append([data[i]["correct"], data[i]["sample"], data[i]["response"], recall, precision, f1])
    # Micro-averaged metrics over all samples.
    recall_micro = correct_all / mistake_all if mistake_all > 0 else 0
    precision_micro = correct_all / detect_all if detect_all != 0 else 0
    f1_micro = 2 * (precision_micro * recall_micro) / (precision_micro + recall_micro) if (precision_micro + recall_micro) != 0 else 0
    output.append([None, None, None, recall_micro, precision_micro, f1_micro])
    save_output_to_excel(output, out_file)
    print(f"\nTotally, {mistake_all} mistakes, detect {detect_all}, correct {correct_all}, f1_score: {f1_micro}")
    return
Model Selection and Fine‑Tuning
For long‑form articles, Qwen‑Long is recommended because of its extended context window. Fine‑tuning follows a standard pipeline: data preparation, model configuration, training monitoring, and deployment. The client’s English‑language content also required British‑English spelling adjustments and tense consistency, addressed through specialized prompts and, when necessary, model fine‑tuning.
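For the data‑preparation step, supervised fine‑tuning data is commonly organized as chat‑style prompt/response records. A hypothetical record for the British‑English spelling case might look like the sketch below; the field names are an assumption for illustration, and the platform's required schema should be checked.

# One supervised fine-tuning record in a chat-style format (illustrative only).
record = {
    "messages": [
        {"role": "system", "content": "Proofread the article. Use British English spelling and keep tense consistent."},
        {"role": "user", "content": "The color of the apartment was repainted last week."},
        {"role": "assistant", "content": "The colour of the flat was repainted last week."},
    ]
}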
Results and Lessons Learned
The end‑to‑end solution was delivered in four months, achieving an F1 score above the client’s 80‑point target. After the first successful scenario, the client expanded AI adoption to search, blogging, translation, and digital‑human projects. Key takeaways include the importance of incremental rule‑by‑rule prompt refinement, the trade‑off between RAG recall and engineered string‑matching, and the need to manage knowledge‑base field participation to avoid noisy retrieval.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.