Boosting A/B Experiment Automation: Prompt Engineering Achieves 80% Accuracy

This article details how a production‑grade prompt system powered by large language models was designed to replace manual A/B experiment inspection. It introduces a layered priority decision tree, robust data preprocessing, and systematic bad‑case analysis that lifted automation accuracy from 68% to over 80% while providing clear, explainable recommendations.

Alibaba Cloud Developer

In the era of intense traffic competition, traditional manual inspection of thousands of A/B experiments becomes a bottleneck, consuming 4–6 hours daily and still yielding high error rates. To address this, a production‑level prompt automation system was built on large language models, combining domain expertise with LLM reasoning to evaluate experiments efficiently and transparently.

System Background and Business Pain Points

The workflow for large‑scale strategy optimization involves rapid rollout of N‑person strategies, immediate data collection, and quick offline decisions. However, the existing rule‑engine approach suffers from three major limitations: low efficiency of manual checks, rigid regular‑expression‑based thresholds that cannot capture complex trends, and reliance on single statistical metrics that ignore small‑sample volatility.

Prompt‑Driven Decision Logic

The core of the system is a hierarchical priority decision tree that evaluates experiment data layer by layer, from Priority 0 through Priority 6. Each layer defines explicit conditions in pseudo‑code, ensuring the LLM follows a deterministic path without creative deviation. The layers are:

Priority 0 – Data sufficiency check (days < 3).

Priority 1 – Strict continuous negative trend.

Priority 2 – Accelerating negative trend (trend worsening).

Priority 3 – Failure to rebound after a negative period.

Priority 4 – Positive trend decay with exclusion rules.

Priority 5 – High volatility without a clear positive direction.

Priority 6 – Default hold (no rule triggered).
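The layered short‑circuit evaluation described above can be sketched in a few lines of Python. This is an illustrative skeleton only: the field name dauRelativeChangePct follows the article, but the rule bodies are simplified placeholders standing in for the production prompt's conditions.

```python
# Illustrative sketch: evaluate priority rules in order and stop at the
# first one that fires (higher-priority rules short-circuit the rest).
def evaluate(days):
    """days: list of dicts sorted by date, pct values already parsed to floats."""
    pcts = [d['dauRelativeChangePct'] for d in days]

    def longest_negative_run(xs):
        best = cur = 0
        for x in xs:
            cur = cur + 1 if x < 0 else 0
            best = max(best, cur)
        return best

    rules = [
        # (priority, predicate, decision) -- simplified stand-ins for the real rules
        (0, lambda: len(days) < 3,                        'hold: insufficient data'),
        (1, lambda: longest_negative_run(pcts) >= 3
                    and pcts[-1] <= 0
                    and all(p <= 0 for p in pcts[-2:]),   'offline: strict negative trend'),
        (2, lambda: len(pcts) >= 2 and pcts[-1] < pcts[-2] < 0,
                                                          'offline: accelerating decline'),
    ]
    for priority, fires, decision in rules:
        if fires():
            return priority, decision
    return 6, 'hold: no rule triggered'   # default hold
```

Because each predicate is checked in priority order, a higher‑risk rule firing means the lower layers are never consulted, which is exactly the short‑circuit behavior the tree is designed to enforce.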

Each rule is expressed in clear pseudo‑code, for example:

IF longest_consecutive_negative_days >= 3
   AND last_day <= 0%
   AND last_2_days_have_no_positive
THEN RETURN isRecommendOffline = true

A global benefit check is inserted before lower‑priority rules to prevent false negatives when cumulative absolute gain remains positive.
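A minimal version of that cumulative‑gain guard might look like the following; the function name and sample numbers are assumptions for illustration, not the production code.

```python
def global_benefit_positive(days):
    """Guard inserted before lower-priority rules: if the cumulative
    absolute gain across all observed days is still positive, skip the
    offline recommendation even when a local negative pattern matches."""
    return sum(d['dauAbsoluteChange'] for d in days) > 0

# Example: two recent bad days do not outweigh one very good day.
history = [
    {'dauAbsoluteChange': 120},
    {'dauAbsoluteChange': -35},
    {'dauAbsoluteChange': -42},
]
print(global_benefit_positive(history))  # True -> lower-priority rules are skipped
```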

Data Pre‑Processing and Validation

Before rule evaluation, the system sorts input records by the dt field, parses percentage strings, validates sign consistency, and computes traffic normalization when the traffic ratio difference exceeds 5%.

# Step 2.1: Sort input records by date (avoid shadowing the built-in `input`)
sorted_days = sorted(input_records, key=lambda x: x['dt'])
# Step 2.2: Parse percentage strings such as "-1.2%" into floats
for d in sorted_days:
    d['dauRelativeChangePct'] = float(d['dauRelativeChangePct'].strip('%')) / 100
    # Step 2.3: Validate that relative and absolute changes agree in sign
    assert (d['dauRelativeChangePct'] >= 0) == (d['dauAbsoluteChange'] >= 0)
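The traffic normalization mentioned above (triggered when the traffic‑share gap exceeds 5%) could be sketched as follows. The field names expTraffic and baseTraffic are assumptions; the article does not name the traffic fields.

```python
def normalize_change(day):
    """If the experiment and control buckets receive noticeably different
    traffic (>5 percentage-point gap in share), rescale the absolute change
    so the two arms are comparable. Field names are illustrative."""
    exp, base = day['expTraffic'], day['baseTraffic']
    gap = abs(exp - base) / (exp + base)
    if gap > 0.05:
        # Rescale the experiment arm's absolute change to the control arm's size.
        return day['dauAbsoluteChange'] * base / exp
    return day['dauAbsoluteChange']

print(normalize_change({'expTraffic': 60, 'baseTraffic': 40, 'dauAbsoluteChange': 30}))
```

With a 60/40 split the 20‑point gap exceeds the 5% threshold, so the absolute change of 30 is rescaled to 20.0; a near‑even split passes through unchanged.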

Bad‑Case Analysis and Prompt Refinement

Thirty‑plus bad cases were collected where the model’s decision differed from human judgment. Three representative cases illustrate common failure modes:

Unsorted dates causing the model to misinterpret the trend.

Sign‑misinterpretation leading to incorrect continuous‑negative detection.

Ignoring cumulative absolute gain, resulting in premature down‑selection.

Each case was analyzed using a three‑level Root Cause Analysis (RCA) framework, and the prompt was iteratively refined by adding explicit data‑validation steps, pseudo‑code definitions for “continuous negative segment”, and a global benefit flag. After each fix, targeted regression tests on the bad‑case set, the full historical dataset, and a fresh hold‑out set ensured no new regressions were introduced.
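The regression step on the bad‑case set can be sketched as a simple loop over labeled examples. The case structure and the decide_offline stub below are illustrative; a real run would invoke the prompted model rather than a local function.

```python
# Illustrative regression check: re-run the (stubbed) decision function on
# labeled bad cases and report accuracy.
bad_cases = [
    {'days': [-0.01, -0.02, -0.03], 'expected_offline': True},
    {'days': [0.02, -0.01, 0.03],   'expected_offline': False},
]

def decide_offline(pcts):
    """Stub standing in for the prompted model: offline iff the last
    three days are all negative (mirrors the Priority-1 rule)."""
    return len(pcts) >= 3 and all(p < 0 for p in pcts[-3:])

hits = sum(decide_offline(c['days']) == c['expected_offline'] for c in bad_cases)
accuracy = hits / len(bad_cases)
print(f'regression accuracy: {accuracy:.0%}')
```

Running the same harness against the full historical dataset and a fresh hold‑out set gives the three‑tier regression signal the article describes.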

Iterative Improvement Workflow

The engineering process follows a disciplined branch‑fix‑test‑merge cycle:

Create a feature branch (e.g., v1.2‑continuity).

Run the updated prompt on the bad‑case subset to verify fixes.

Validate on the full historical dataset (no accuracy loss) and on a new hold‑out set (accuracy gain).

Merge into the mainline once all tests pass.

Progress is tracked in a version‑impact table, showing how each modification contributed a 1–2% accuracy improvement, moving the overall system from 68% to over 80%.

Explainable Recommendations

Beyond the binary isRecommendOffline flag, the system now returns a human‑readable recommendation that includes concrete dates, user counts, and the specific rule triggered, while avoiding technical jargon. Example:

{"isRecommendOffline": true, "recommendation": "Users have declined on each of the last two days, losing 35 and 42 users respectively; the downward trend is worsening, so stopping the experiment is recommended"}

This format enables product analysts to verify the decision quickly and trust the automation.
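Assembling that explainable payload can be as simple as filling a template with the triggered rule's evidence. The function name, parameters, and wording below are illustrative assumptions, not the system's actual rendering code.

```python
import json

def build_recommendation(last_two_losses):
    """Render the binary decision plus a plain-language explanation with
    concrete numbers, as the article describes. Wording is illustrative."""
    a, b = last_two_losses
    return json.dumps({
        'isRecommendOffline': True,
        'recommendation': (
            f'Users declined on each of the last two days, losing {a} and {b} '
            f'users respectively; the downward trend is worsening, so stopping '
            f'the experiment is recommended'
        ),
    }, ensure_ascii=False)

print(build_recommendation((35, 42)))
```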

Key Takeaways for Prompt Engineering

Explicit pseudo‑code beats implicit natural language. Defining metrics and loops removes ambiguity.

Priority layering prevents rule conflicts. Higher‑risk rules short‑circuit evaluation.

Targeted exclusion conditions improve precision. Each rule includes carefully crafted “but not” clauses.

Global checks safeguard against over‑fitting to local patterns. Cumulative gain flags prevent premature down‑selection.

Systematic bad‑case collection and RCA turn errors into knowledge. The process yields a reusable knowledge base for future prompts.

Future Directions

Planned extensions include scaling the system to additional experiment channels, further modularizing prompt components for dynamic composition, and continuing to refine the engineering workflow to capture more domain knowledge as a reusable LLM‑driven knowledge base.
