Why Pixel Diff Failed and How VLM Fine‑Tuning Became the Eyes of UI Automation

Traditional pixel-by-pixel UI comparison breaks down on complex CAD drawings because it cannot separate rendering noise from semantic changes. To address this, a team built a visual-language-model (VLM) fine-tuning pipeline that turns failure cases into training data. The pipeline achieves about 95% AI accuracy, improves regression efficiency by over 40%, and now powers hundreds of daily automation tests.

Qunhe Technology Quality Tech

Problem with pixel‑level comparison

Pixel‑by‑pixel diff works on simple pages but fails on dense line drawings, annotations, and anti‑aliasing or scaling artifacts common in CAD and construction schematics. The false‑positive rate remains high, forcing engineers to manually review many useless cases. The core question is whether the comparison should judge visual similarity or business‑level semantic changes.
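The mismatch between pixel similarity and semantic change can be illustrated with a toy example. This is a minimal sketch, not the team's tooling: the grids, the `diff_ratio` helper, and the 5% tolerance are all assumptions made for illustration. A harmless one-pixel shift of a long line changes many pixels, while a lost annotation changes almost none.

```python
# Illustrative only: tiny binary "drawings" as lists of rows (1 = ink, 0 = blank).

def diff_ratio(expected, actual):
    """Fraction of pixels whose value differs between two same-sized grids."""
    total = len(expected) * len(expected[0])
    changed = sum(
        1
        for row_e, row_a in zip(expected, actual)
        for px_e, px_a in zip(row_e, row_a)
        if px_e != px_a
    )
    return changed / total


def blank(width, height):
    """Create an empty grid."""
    return [[0] * width for _ in range(height)]


def draw_vline(grid, x):
    """Draw a full-height vertical line at column x."""
    for row in grid:
        row[x] = 1


WIDTH, HEIGHT = 20, 10

# Baseline: one long vertical line plus a single-pixel "annotation" mark.
expected = blank(WIDTH, HEIGHT)
draw_vline(expected, 5)
expected[2][15] = 1

# Case A: the line is rendered one pixel to the right (harmless shift).
shifted = blank(WIDTH, HEIGHT)
draw_vline(shifted, 6)
shifted[2][15] = 1

# Case B: the line is intact but the annotation mark is gone (semantic loss).
missing = blank(WIDTH, HEIGHT)
draw_vline(missing, 5)

shift_score = diff_ratio(expected, shifted)    # 20 of 200 pixels differ
missing_score = diff_ratio(expected, missing)  # 1 of 200 pixels differs

THRESHOLD = 0.05  # hypothetical tolerance, chosen only for this demo
print(f"shift:   {shift_score:.3f} flagged={shift_score > THRESHOLD}")
print(f"missing: {missing_score:.3f} flagged={missing_score > THRESHOLD}")
```

Under these assumptions the harmless shift is flagged and the lost annotation passes, which is the opposite of what a reviewer wants. Any single threshold trades one error for the other.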

Key decisions

Decision 1: Adjusting pixel thresholds cannot solve semantic‑level differences such as missing annotations or shifted dimensions; thresholds only mask the problem.

Decision 2: Generic large models perform poorly on domain‑specific UI differences; a vertically‑focused model with domain knowledge is needed.

Decision 3: A closed-loop pipeline is required. Online failures become hard examples for the next training round; without this data feedback, accuracy plateaus quickly.

Closed‑loop pipeline

Data collection → fine‑tuning → model evaluation → production integration → monitoring → feedback. Failed or false‑positive UI comparison cases are automatically stored in a “hard‑example” repository, which fuels the next training iteration.
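The feedback step could be sketched as follows. The source does not describe the repository's storage or schema, so the `ComparisonCase` fields, the JSONL file, and the rule "model and reviewer disagree" are assumptions for illustration only.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ComparisonCase:
    """One online comparison result (hypothetical schema)."""
    case_id: str
    expected_image: str  # path or URL of the baseline screenshot
    actual_image: str    # path or URL of the screenshot under test
    model_verdict: str   # what the model concluded
    human_verdict: str   # what the reviewer concluded


def is_hard_example(case: ComparisonCase) -> bool:
    """Assumed rule: a case is 'hard' when model and reviewer disagree."""
    return case.model_verdict != case.human_verdict


def append_hard_examples(cases, repo_path: Path) -> int:
    """Append disagreeing cases to a JSONL file; return how many were stored."""
    hard = [c for c in cases if is_hard_example(c)]
    with repo_path.open("a", encoding="utf-8") as fh:
        for case in hard:
            fh.write(json.dumps(asdict(case), ensure_ascii=False) + "\n")
    return len(hard)


# Usage: two online results, one of which the reviewer overturned.
cases = [
    ComparisonCase("c1", "exp/1.png", "act/1.png", "abnormal", "abnormal"),
    ComparisonCase("c2", "exp/2.png", "act/2.png", "abnormal", "normal"),
]
repo = Path(tempfile.mkdtemp()) / "hard_examples.jsonl"
stored = append_hard_examples(cases, repo)
print(f"stored {stored} hard example(s) in {repo.name}")
```

An append-only JSONL file keeps each case self-describing, so the next training round can read it back without a separate schema migration.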

Platformization for test engineers

Dataset management and training submission are wrapped in a web platform, allowing a newcomer to go from zero knowledge to a successful training run in about three days (down from two weeks). After GPU driver updates, testers can submit training jobs without deep technical involvement.

Training data format

Each training sample is a JSON‑structured dialogue where the user role provides a system prompt, the expected image, and the actual image; the assistant role returns a JSON‑structured conclusion.

{
  "messages": [
    {"role": "user", "content": [
      {"type": "text", "text": "System prompt: compare the business-level differences between the two images..."},
      {"type": "image", "image": "expected image (expect)"},
      {"type": "image", "image": "actual image (actual)"}
    ]},
    {"role": "assistant", "content": [
      {"type": "text", "text": "Reasoning conclusion in JSON format (see the example below)"}
    ]}
  ]
}
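A sample in this shape could be assembled with a small helper. This is a sketch under assumptions: the function name `build_sample` and the file paths are hypothetical, and the image values here are plain strings, whereas a real pipeline would typically pass loaded image objects.

```python
import json


def build_sample(prompt: str, expected_image, actual_image, verdict: dict) -> dict:
    """Build one dialogue-style training sample in the structure shown above."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image", "image": expected_image},
                {"type": "image", "image": actual_image},
            ]},
            {"role": "assistant", "content": [
                # The label is serialized so the model learns to emit JSON text.
                {"type": "text", "text": json.dumps(verdict, ensure_ascii=False)},
            ]},
        ]
    }


sample = build_sample(
    prompt="Compare the business-level differences between the two images.",
    expected_image="expect/case_001.png",  # hypothetical path
    actual_image="actual/case_001.png",    # hypothetical path
    verdict={"analysis": [], "conclusion": "normal", "reason": "no differences"},
)
print(json.dumps(sample, indent=2)[:80])
```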

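The model must output parsable JSON, for example:

{
  "analysis": [
    "Difference 1: [annotation 800] Type: content change | image 1 has 3 -> image 2 has only 2. Verdict: a dimension annotation is missing, so abnormal.",
    "Difference 2: [annotation S-29 Moonlight White] Type: position change | on the left in image 1 -> on the right in image 2. Verdict: the leader line points to the same target, so normal."
  ],
  "conclusion": "abnormal",
  "reason": "Dimension annotation 800 is missing, which is a key engineering-semantic change"
}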
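Because downstream automation depends on this output being parsable, a consumer would likely validate it before use. The checker below is a minimal sketch: the three required keys come from the example above, while the function name `parse_verdict` and the exact validation rules are assumptions.

```python
import json

REQUIRED_KEYS = {"analysis", "conclusion", "reason"}


def parse_verdict(raw: str) -> dict:
    """Parse the model's text output and check it has the expected shape.

    Raises ValueError when the output is unusable, so the caller can
    retry the inference or route the case to manual review.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    analysis = data["analysis"]
    if not isinstance(analysis, list) or not all(isinstance(a, str) for a in analysis):
        raise ValueError("'analysis' must be a list of strings")

    for key in ("conclusion", "reason"):
        if not isinstance(data[key], str) or not data[key].strip():
            raise ValueError(f"'{key}' must be a non-empty string")

    return data


good = '{"analysis": ["Difference 1: ..."], "conclusion": "abnormal", "reason": "annotation missing"}'
print(parse_verdict(good)["conclusion"])
```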

Training design highlights

Quantization fallback: If 4-bit quantized loading fails because of CUDA compatibility problems on a given GPU, training automatically falls back to 16-bit precision.

Validation and early-stop: Eval loss is checked every 10 optimizer steps; training stops after 5 consecutive evaluations (50 steps) without improvement.

Dual model output: Both a LoRA adapter (lightweight for incremental fine‑tuning) and a merged 16‑bit model (ready for inference) are saved.
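The quantization fallback and the dual output could look roughly like the sketch below. The source does not show this code, so everything here is an assumption: `load_fn` stands in for a loader such as Unsloth's `FastVisionModel.from_pretrained`, the caught exception types are a guess at how an incompatible GPU would fail, and `save_pretrained_merged(..., save_method="merged_16bit")` follows Unsloth's documented saving API as I understand it. Stand-in objects keep the example runnable without a GPU.

```python
def load_with_fallback(load_fn, model_name):
    """Try 4-bit loading first; on failure, retry in 16-bit precision.

    `load_fn(model_name, load_in_4bit=...)` is a hypothetical loader that
    returns a (model, tokenizer) pair.
    """
    try:
        return load_fn(model_name, load_in_4bit=True), "4bit"
    except (RuntimeError, ImportError) as exc:  # assumed failure modes
        print(f"4-bit load failed ({exc}); falling back to 16-bit")
        return load_fn(model_name, load_in_4bit=False), "16bit"


def save_both(model, tokenizer, out_dir):
    """Save the LoRA adapter and a merged 16-bit model side by side."""
    adapter_dir = f"{out_dir}/lora_adapter"
    merged_dir = f"{out_dir}/merged_16bit"
    model.save_pretrained(adapter_dir)      # adapter weights only
    tokenizer.save_pretrained(adapter_dir)
    model.save_pretrained_merged(merged_dir, tokenizer, save_method="merged_16bit")
    return adapter_dir, merged_dir


# --- Stand-ins so the sketch runs without Unsloth or a GPU ---
class _FakeModel:
    def __init__(self):
        self.calls = []

    def save_pretrained(self, path):
        self.calls.append(("adapter", path))

    def save_pretrained_merged(self, path, tokenizer, save_method):
        self.calls.append((save_method, path))


class _FakeTokenizer:
    def save_pretrained(self, path):
        pass


def _fake_loader(name, load_in_4bit):
    if load_in_4bit:
        raise RuntimeError("4-bit kernels unavailable on this GPU")
    return _FakeModel(), _FakeTokenizer()


(model, tokenizer), precision = load_with_fallback(_fake_loader, "demo-model")
paths = save_both(model, tokenizer, "outputs")
print(precision, paths)
```

Passing the loader in as a parameter keeps the fallback logic testable on machines without the target GPU.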

Training is launched with SFTTrainer and Unsloth's UnslothVisionDataCollator, using a learning rate of 1e‑4, a maximum sequence length of 6144 (visual tokens plus JSON output), and an early‑stopping callback.

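# Unsloth recommends importing it before transformers/trl.
from unsloth.trainer import UnslothVisionDataCollator
from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    # Collates the image + text dialogue samples into model-ready batches.
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    # Callbacks are a trainer argument, not an SFTConfig field.
    # Stop after 5 consecutive evaluations without a better eval_loss.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
    args=SFTConfig(
        learning_rate=1e-4,
        optim=_resolve_optimizer(),  # project helper, not shown in the source
        max_seq_length=6144,         # visual tokens + JSON output
        eval_strategy="steps",
        eval_steps=10,               # evaluate every 10 optimizer steps
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        ...
    )
)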

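from unsloth import FastVisionModel

def apply_lora(model):
    """Wrap the base model with LoRA adapters on both vision and language parts."""
    return FastVisionModel.get_peft_model(
        model,
        finetune_vision_layers=True,
        finetune_language_layers=True,
        finetune_attention_modules=True,
        finetune_mlp_modules=True,
        r=16,  # LoRA rank
        ...
    )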
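The interaction between `eval_steps=10` and `early_stopping_patience=5` in the trainer configuration is easy to misread, because patience counts evaluations rather than optimizer steps. The pure-Python sketch below mimics that counting rule as I understand it from the Transformers callback, with the default improvement threshold of zero; `stop_step` is a hypothetical helper, not part of any library.

```python
def stop_step(eval_losses, eval_steps=10, patience=5):
    """Return the optimizer step at which training would stop, or None.

    `eval_losses` holds one loss per evaluation, in order. An evaluation
    counts as an improvement only if its loss is strictly lower than the
    best seen so far; any improvement resets the patience counter.
    """
    best = float("inf")
    bad_evals = 0
    for eval_index, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return eval_index * eval_steps
    return None


# Best loss is reached at the 3rd evaluation (step 30); the next five
# evaluations never beat it, so training stops at the 8th (step 80).
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.75, 0.73]
print(stop_step(losses))  # 80
```

With these settings the run continues for up to 50 optimizer steps past the best checkpoint, which is why `load_best_model_at_end=True` matters for recovering it.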

Performance target and latency

The goal is to keep per‑case inference under 3 seconds (≈60 cases in 3 minutes) while handling daily regression traffic. Current inference time meets the 3‑second target; further quantization may affect accuracy and will be continuously validated.
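The budget arithmetic assumes cases run one after another: 60 cases at 3 seconds each is 180 seconds, or 3 minutes. A simple way to check a batch against that budget is sketched below. The `measure_latency` helper and the `infer_fn` callable are hypothetical placeholders for whatever client calls the deployed model.

```python
import time


def measure_latency(infer_fn, cases, budget_s=3.0):
    """Time each inference call and summarize against a per-case budget."""
    timings = []
    for case in cases:
        start = time.perf_counter()
        infer_fn(case)
        timings.append(time.perf_counter() - start)
    return {
        "cases": len(timings),
        "total_s": sum(timings),
        "max_s": max(timings, default=0.0),
        "over_budget": sum(1 for t in timings if t > budget_s),
    }


def fake_infer(case):
    """Stand-in for a real model call; returns immediately."""
    return {"case": case, "conclusion": "normal"}


report = measure_latency(fake_infer, ["case-1", "case-2", "case-3"])
print(report["cases"], report["over_budget"])

# Sequential budget implied by the target: 60 cases * 3 s = 180 s.
batch_budget_s = 60 * 3.0
print(batch_budget_s / 60, "minutes")
```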

Evaluation and monitoring

After each training round, a dedicated evaluation set compares the new model against the online baseline. Scores are aggregated, and the outcome falls into one of three cases: the new model passes if its score is at least the baseline, is held as pending if it trails by no more than a predefined threshold, and is rejected otherwise.

New ≥ baseline → pass.

Baseline > new but gap ≤ threshold → pending (often due to judge variance).

Gap > threshold → reject.
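The three-way gate above translates directly into a small function. The function name `gate` and the example scores are illustrative assumptions; the source does not state the actual threshold or scoring scale.

```python
def gate(new_score: float, baseline_score: float, threshold: float) -> str:
    """Classify a candidate model against the online baseline.

    Returns "pass", "pending", or "reject" following the rules above.
    """
    if new_score >= baseline_score:
        return "pass"
    if baseline_score - new_score <= threshold:
        return "pending"
    return "reject"


# Scores in percentage points, with a hypothetical 2-point tolerance.
for candidate in (96.0, 94.0, 90.0):
    print(candidate, gate(candidate, baseline_score=95.0, threshold=2.0))
```

Scores that sit exactly on the threshold are sensitive to floating-point rounding (for example, 0.95 - 0.93 is slightly more than 0.02 in binary floating point), so comparing in integer or percentage-point units, as here, avoids surprises at the boundary.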

Monitoring dashboards track accuracy and inference latency to detect any regression after deployment.

Results

AI comparison accuracy ≈ 95 % (agreement with human judgment on AI‑filtered cases).

Regression efficiency gain +40 % (reduced manual verification workload).

Deployment scale: 2 core business lines, >100 use cases, thousands of runs (Q2 2026).

Case studies show the model correctly classifies pixel‑shift false positives as normal differences and precisely identifies content changes, matching human conclusions.

Practical takeaways for test engineers

Define the comparison goal: pixel similarity ≠ business correctness; build a hard‑example repository.

Structure online failure cases as JSON for downstream processing; raw natural language is less actionable.

Quantitative “new vs. baseline” comparison is mandatory before full rollout.

Enable test engineers to contribute data and validation, not just consume AI outputs.

Remaining challenges

Occasional mismatch between the model's reasoning (the analysis) and its final conclusion.

Inconsistent results on repeated inference of the same image (being addressed).

Complex layouts are sometimes split across two images, which causes the comparison to fail every time.

These issues reinforce the need for a complete loop of training, gate‑keeping evaluation, monitoring, and continuous feedback.

