From Bad Prompts to 9.5 Scores: A Step‑by‑Step Prompt Engineering Guide
This article walks through an iterative prompt-engineering workflow: start with a weak baseline, apply four concrete techniques (clarity & directness, specificity, XML structuring, and examples), evaluate each change with a PromptEvaluator, and watch the score climb from 3.4 to over 9.5, illustrated with real code snippets and concrete data.
Prompt‑Engineering Overview
Prompt engineering means repeatedly improving a prompt to obtain more reliable, higher‑quality model output. The process follows a clear loop: set a goal, write an initial prompt, evaluate it, apply engineering tricks, re‑evaluate, and repeat until the evaluation score meets expectations.
Set a goal – define what you want the model to accomplish.
Write the initial prompt – create a simple baseline.
Evaluate the prompt – run a test suite with a scoring model.
Apply engineering tricks – use specific techniques to boost performance.
Re‑evaluate – verify that the change actually improves the score.
The loop repeats until the score stabilises (sketched in code below).
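Put into code, the loop is straightforward. The sketch below uses hypothetical stand-ins (evaluate_prompt, improve_prompt, TARGET_SCORE); it illustrates the workflow, not a real API:

# Sketch of the iteration loop; evaluate_prompt and improve_prompt
# are hypothetical stand-ins for the evaluator and the engineering step.
TARGET_SCORE = 9.0

prompt = "Initial baseline prompt"
score = evaluate_prompt(prompt)             # run the test suite, get a 0-10 score
while score < TARGET_SCORE:
    candidate = improve_prompt(prompt)      # apply one technique at a time
    new_score = evaluate_prompt(candidate)  # re-evaluate
    if new_score <= score:
        break                               # score has stabilised; keep the old prompt
    prompt, score = candidate, new_score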
Building an Evaluation Pipeline
To demonstrate the workflow, the article builds a concrete example: a prompt that generates a one‑day diet plan for an athlete. The PromptEvaluator class creates a synthetic dataset and scores each run. Concurrency is controlled with the max_concurrent_tasks argument; a safe starting value is 3, but a Pro user can raise it to 5 or 10 without hitting rate limits.
evaluator = PromptEvaluator(max_concurrent_tasks=5)

Dataset generation uses a small number of cases (2‑3) for fast iteration:
dataset = evaluator.generate_dataset(
    task_description="Write a compact, concise one-day diet plan for a single athlete",
    prompt_inputs_spec={
        "height": "The athlete's height (in centimeters, cm)",
        "weight": "The athlete's weight (in kilograms, kg)",
        "goal": "The athlete's goal",
        "restrictions": "Dietary restrictions"
    },
    output_file="dataset.json",
    num_cases=3
)

Initial Prompt and First Score
The baseline prompt simply asks "What should this person eat?" and inserts the four input fields. Running the evaluator yields a score of 3.4/10, a typical starting point.
def run_prompt(prompt_inputs):
    prompt = f"""
What should this person eat?
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}
"""
    messages = []
    add_user_message(messages, prompt)
    return chat(messages)
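The article does not show the scoring call itself. Assuming PromptEvaluator exposes a method along these lines (run_evaluation is a hypothetical name and signature), the baseline run might look like:

# Hypothetical call; the real PromptEvaluator API may differ.
results = evaluator.run_evaluation(run_prompt, dataset_file="dataset.json")
print(results["average_score"])  # 3.4 for this baseline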
Technique 1 – Clarity & Directness

Rewrite the first line to be explicit and imperative:
Generate a one-day diet plan for the athlete that satisfies their dietary restrictions.

This change alone raises the score from 3.4 to 5.4 because the model now knows exactly what action to take, what object to produce, and which constraints apply.
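Applied to the baseline, the revision touches only the first line of the f-string; everything else in run_prompt stays as before:

def run_prompt(prompt_inputs):
    # Only the first line changes: an explicit, imperative instruction.
    prompt = f"""
Generate a one-day diet plan for the athlete that satisfies their dietary restrictions.
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}
"""
    messages = []
    add_user_message(messages, prompt)
    return chat(messages)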
Technique 2 – Specificity
Adding a quality-guidelines block (daily calories, macro breakdown, meal times, and so on) tells Claude exactly what the output must contain. After inserting the guidelines, the score climbs from 5.4 to 7.4.
Guidelines:
1. Include an accurate total daily calorie count
2. Show the amounts of protein, fat, and carbohydrates
3. Specify the eating time for each meal
4. Use only ingredients that comply with the dietary restrictions
5. List all portion sizes (in grams)
6. If a budget is mentioned, keep the plan economical
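In code, the guidelines block is simply appended to the prompt body. A sketch of the Technique 2 version (the article prints only the guideline text itself):

    # Technique 2: same prompt as before, with the guidelines appended.
    prompt = f"""
Generate a one-day diet plan for the athlete that satisfies their dietary restrictions.
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}

Guidelines:
1. Include an accurate total daily calorie count
2. Show the amounts of protein, fat, and carbohydrates
3. Specify the eating time for each meal
4. Use only ingredients that comply with the dietary restrictions
5. List all portion sizes (in grams)
6. If a budget is mentioned, keep the plan economical
"""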
Technique 3 – XML‑Based Structuring

Wrapping related sections in custom XML tags (<athlete_information>, <my_code>, etc.) creates clear boundaries, especially when the prompt mixes code, data, and instructions. Adding these tags pushes the score above 9.0.
prompt = f"""
Generate a one-day diet plan that satisfies the dietary restrictions, based on the athlete information below:

<athlete_information>
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}
</athlete_information>

Guidelines:
1. Include an accurate total daily calorie count
2. Show the amounts of protein, fat, and carbohydrates
…
"""

Technique 4 – Providing Examples (One‑Shot / Multi‑Shot)
Supplying a concrete input-output pair demonstrates the desired format and helps the model avoid edge-case failures, such as missed sarcasm in sentiment analysis. The article shows a sentiment-analysis example in which a sarcastic tweet is correctly labelled negative after one positive and one sarcastic example are added.
When the best-scoring output (10/10) is used as an in-prompt example, the evaluation score stabilises around 9.5. However, a mismatch between the example's length and the dataset's "concise" requirement can cause a drop (e.g., a verbose 10-point example reduced the score to 7.5). Selecting a concise, high-scoring example resolves the conflict.
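The article does not print the final prompt with the example embedded. A common pattern, sketched here with a hypothetical best_scoring_plan variable and an assumed <example> tag, looks like:

    # Sketch: embed a concise, high-scoring output as a one-shot example.
    # best_scoring_plan is a hypothetical variable holding that saved output.
    prompt = f"""
Generate a one-day diet plan that satisfies the dietary restrictions, based on the athlete information below:

<athlete_information>
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}
</athlete_information>

Here is an example of a well-formatted plan for a different athlete:
<example>
{best_scoring_plan}
</example>
"""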
Practical Takeaways
Iterate one change at a time and measure the impact.
Use clear, imperative language for the first line.
Specify output quality criteria or step‑by‑step instructions.
Wrap large or mixed content with descriptive XML tags.
Include one‑shot or multi‑shot examples that illustrate edge cases and ideal formatting.
By following this systematic approach, prompt engineers can reliably lift a low‑scoring baseline into a high‑performing prompt that consistently yields the desired output.