Building an AI-Powered Proofreading Agent for Media: Architecture, Prompt Engineering, and Evaluation
This article details a practical case study of designing, implementing, and evaluating an AI-driven proofreading agent for a media client, covering background challenges, a three‑layer architecture, prompt engineering techniques, RAG knowledge‑base construction, model selection, fine‑tuning, automated metrics, and lessons learned.
Background
After the surge of large language models in early 2024, the media industry recognized that generative models can automate the entire content production pipeline, from event detection to article generation and proofreading. However, media customers often lack dedicated AI engineering teams and a clear understanding of model capabilities, which leads to unrealistic expectations and resistance from editorial staff.
Scenario Analysis
Article proofreading is a critical step in media workflows and can be divided into four rule categories: basic formatting, compliance risk, content‑specific terminology, and language‑style nuances. Traditional manual processes require multiple teams to verify each dimension, resulting in long cycles and high labor costs.
Intelligent Agent Solution
The solution adopts a three‑layer architecture:
Business Layer: defines the four rule groups.
Agent Layer: implements rule execution via prompt engineering, Retrieval‑Augmented Generation (RAG) for domain knowledge, and real‑time MCP services for dynamic rule updates.
Model Layer: leverages the public‑cloud Bailian platform to select a base model (e.g., Qwen‑Long) and applies domain‑specific fine‑tuning.
Prompt Engineering
Effective prompts follow standard principles—clear task definition, structured output, and examples—while addressing two common pitfalls in proofreading: rule forgetting and rule conflict.
Rule Forgetting
Repeat critical rules at the beginning, middle, and end of the prompt, using a special marker <!CRITICAL> to boost priority.
Guide attention with weighted directives (e.g., [Importance:5/5]) and chain‑of‑thought reasoning; a prompt‑assembly sketch illustrating these tactics follows this list.
Split rules into independent agents and run them in parallel to avoid interference.
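To illustrate the first two tactics, the sketch below assembles a proofreading prompt that repeats a <!CRITICAL>‑tagged rule at the beginning, middle, and end, attaches an [Importance:5/5] weight, and closes with a chain‑of‑thought cue. The rule text and helper name are hypothetical, not the production prompt.

CRITICAL_RULE = (
    '<!CRITICAL> [Importance:5/5] Replace "apartment" with "flat" (British English).'
)

def build_prompt(article_text: str, other_rules: list[str]) -> str:
    """Assemble a proofreading prompt that repeats the critical rule at the
    beginning, middle, and end of the prompt to counter rule forgetting."""
    rules_block = "\n".join(f"- {rule}" for rule in other_rules)
    return (
        "You are a proofreading assistant.\n"
        f"{CRITICAL_RULE}\n\n"
        f"Rules to apply:\n{rules_block}\n\n"
        f"Reminder before you read the article: {CRITICAL_RULE}\n\n"
        f"Article:\n{article_text}\n\n"
        f"Final reminder: {CRITICAL_RULE}\n"
        "Think step by step: check the article against each rule, "
        "then output only the corrected article."
    )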
Rule Conflict
Design rules to be mutually exclusive, preventing overlapping modifications.
Define a fallback hierarchy so that high‑priority rules override lower‑priority ones during conflict resolution, as in the sketch after this list.
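A minimal way to express such a hierarchy in code is to collect the edits each rule proposes and keep only the highest‑priority edit among any overlapping spans. The Edit fields and priority values below are illustrative assumptions, not the client's actual rule set.

from dataclasses import dataclass

@dataclass
class Edit:
    start: int          # character offsets of the span a rule wants to change
    end: int
    replacement: str
    rule: str
    priority: int       # higher value wins on conflict

def resolve_conflicts(edits: list[Edit]) -> list[Edit]:
    """Keep only the highest-priority edit among any group of overlapping
    spans, so high-priority rules override lower-priority ones."""
    kept: list[Edit] = []
    for e in sorted(edits, key=lambda e: e.priority, reverse=True):
        # Keep the edit only if it does not overlap an already-kept,
        # higher-priority edit.
        if all(e.end <= k.start or e.start >= k.end for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e.start)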
RAG Knowledge‑Base Construction
For keyword‑level corrections, the knowledge base is usually a plain‑text list of terms, synonyms, or policy entries. Two key steps are required:
Structure selection: use a structured database for exact term replacements (e.g., "apartment" → "flat"); use unstructured documents when exhaustive enumeration is impossible.
Parameter configuration: choose an embedding model (the default is DashScope text‑embedding‑v4, but switch to text‑embedding‑v2 for exact keyword matching), set similarity thresholds (0.4–0.5 for precise matches), and control which fields participate in retrieval to reduce noise. A retrieval sketch combining both steps follows.
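To make both steps concrete, the sketch below applies a structured table for exact replacements and gates unstructured retrieval behind a cosine‑similarity threshold. The embed() helper is a placeholder for the embedding call (DashScope text‑embedding‑v2 in the setup described above); the 0.45 threshold and sample term table are illustrative assumptions.

import numpy as np

# Step 1: structured, exact replacements for terms that can be enumerated.
EXACT_TERMS = {"apartment": "flat"}

def replace_exact_terms(text: str) -> str:
    """Apply deterministic term replacements from the structured table."""
    for wrong, right in EXACT_TERMS.items():
        text = text.replace(wrong, right)
    return text

def embed(text: str) -> np.ndarray:
    """Placeholder for the embedding call (e.g., DashScope text-embedding-v2)."""
    raise NotImplementedError

def retrieve(query: str, entries: list[str], threshold: float = 0.45) -> list[str]:
    """Step 2: return knowledge-base entries whose cosine similarity to the
    query clears the threshold (0.4-0.5 for precise matches), keeping
    retrieval low-noise."""
    q = embed(query)
    hits = []
    for entry in entries:
        e = embed(entry)
        sim = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        if sim >= threshold:
            hits.append(entry)
    return hits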
Model Evaluation
Metrics
Proofreading performance is measured with precision, recall, and F1 score, analogous to information‑retrieval evaluation for unordered result sets.
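Concretely, a planted mistake counts as detected when it no longer appears in the model output, and as correct when exactly one expected replacement appears. The scores then reduce to the small helper below, a sketch mirroring the definitions used in the evaluation script in the next section.

def prf(correct: int, detected: int, total_mistakes: int) -> tuple[float, float, float]:
    """precision = correct / detected, recall = correct / total mistakes,
    F1 = harmonic mean of precision and recall (0 when undefined)."""
    precision = correct / detected if detected else 0.0
    recall = correct / total_mistakes if total_mistakes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 10 planted mistakes, 8 no longer appear in the output, 7 replaced
# with exactly the expected term -> precision 0.875, recall 0.7, F1 ~ 0.78.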
Automated Evaluation Procedure
The workflow uses an Excel sheet with four columns (Original Text, Input Text, Output Text, Notes) and a Python script that computes per‑sample and micro‑averaged metrics.
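The function below assumes the sheet has already been loaded into two Python structures. Their exact shape is not shown here, but the field names the function uses imply something like the following (an assumption for illustration):

# data: one dict per evaluation sample
#   data[i]["correct"]  - the ground-truth (original) text
#   data[i]["sample"]   - the input text with mistakes planted in it
#   data[i]["response"] - the raw model output for that sample
#
# checklist: one dict per sample, mapping each planted mistake to its
# acceptable corrected word(s), e.g.
#   checklist[i] = {"apartment": ["flat"], "color": ["colour"]}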
# extract_response and save_output_to_excel are helper functions defined
# elsewhere in the project.
def process(data, checklist, out_file):
    """Process proofreading evaluation data and compute recall, precision, and F1."""
    mistake_all = 0
    detect_all = 0
    correct_all = 0
    output = []
    for i in range(len(data)):
        print(f"\n===== Test Sample {i} =====")
        response = extract_response(data[i]["response"])
        mistake_cnt = len(checklist[i])
        mistake_all += mistake_cnt
        detect_cnt = 0
        correct_cnt = 0
        # Sanity-check the sample: each planted mistake must appear exactly once
        # in the input text and not at all in the ground truth, and its expected
        # correction must appear exactly once in the ground truth only.
        for key in checklist[i]:
            correct_word_count = sum(data[i]["correct"].count(word) for word in checklist[i][key])
            sample_mistake_count = data[i]["sample"].count(key)
            correct_mistake_count = data[i]["correct"].count(key)
            sample_correct_count = sum(data[i]["sample"].count(word) for word in checklist[i][key])
            conditions_met = (
                correct_word_count == 1 and
                sample_mistake_count == 1 and
                correct_mistake_count == 0 and
                sample_correct_count == 0
            )
            if not conditions_met:
                error_info = f"{i}, {key}, {checklist[i][key]}, {correct_word_count}, {sample_mistake_count}, {correct_mistake_count}, {sample_correct_count}"
                print(error_info)
                assert False, error_info
        # Score the model output: a mistake counts as detected when it no longer
        # appears in the response, and as corrected when exactly one of the
        # expected replacement words appears.
        for mistake in checklist[i]:
            correct_words = checklist[i][mistake]
            if response.count(mistake) == 0:
                detect_cnt += 1
            else:
                print(f"Mistake not detected for: {mistake}.")
            correct_occurrences = sum(response.count(word) for word in correct_words)
            if correct_occurrences == 1:
                correct_cnt += 1
            else:
                print(f"{mistake} should be corrected to: {correct_words}.")
        detect_all += detect_cnt
        correct_all += correct_cnt
        # Per-sample metrics.
        recall = correct_cnt / mistake_cnt if mistake_cnt > 0 else 0
        precision = correct_cnt / detect_cnt if detect_cnt != 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) != 0 else 0
        print(f"{mistake_cnt} mistakes, detect {detect_cnt}, correct {correct_cnt}, f1_score: {f1}")
        output.append([data[i]["correct"], data[i]["sample"], data[i]["response"], recall, precision, f1])
    # Micro-averaged metrics over all samples.
    recall_micro = correct_all / mistake_all if mistake_all > 0 else 0
    precision_micro = correct_all / detect_all if detect_all != 0 else 0
    f1_micro = 2 * (precision_micro * recall_micro) / (precision_micro + recall_micro) if (precision_micro + recall_micro) != 0 else 0
    output.append([None, None, None, recall_micro, precision_micro, f1_micro])
    save_output_to_excel(output, out_file)
    print(f"\nTotally, {mistake_all} mistakes, detect {detect_all}, correct {correct_all}, f1_score: {f1_micro}")
    return
Model Selection and Fine‑Tuning
For long‑form articles, Qwen‑Long is recommended because of its extended context window. Fine‑tuning follows a standard pipeline: data preparation, model configuration, training monitoring, and deployment. The client’s English‑language content also required British‑English spelling adjustments and tense consistency, addressed through specialized prompts and, when necessary, model fine‑tuning.
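For the data‑preparation step, supervised fine‑tuning data is commonly organized as chat‑style prompt/response records. A hypothetical record for the British‑English spelling case might look like the sketch below; the field names are an assumption for illustration, and the platform's required schema should be checked.

# One supervised fine-tuning record in a chat-style format (illustrative only).
record = {
    "messages": [
        {"role": "system", "content": "Proofread the article. Use British English spelling and keep tense consistent."},
        {"role": "user", "content": "The color of the apartment was repainted last week."},
        {"role": "assistant", "content": "The colour of the flat was repainted last week."},
    ]
}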
Results and Lessons Learned
The end‑to‑end solution was delivered in four months, achieving an F1 score above the client’s 80‑point target. After the first successful scenario, the client expanded AI adoption to search, blogging, translation, and digital‑human projects. Key takeaways include the importance of incremental rule‑by‑rule prompt refinement, the trade‑off between RAG recall and engineered string‑matching, and the need to manage knowledge‑base field participation to avoid noisy retrieval.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.