Why Pixel Diff Failed and How VLM Fine‑Tuning Became the Eyes of UI Automation
Traditional pixel-by-pixel UI comparison breaks down on complex CAD drawings because it cannot tell rendering noise from semantic changes. A team therefore built a vision-language-model (VLM) fine-tuning pipeline that turns failure cases into training data, reaches roughly 95% agreement with human judgment, improves regression efficiency by over 40%, and now powers hundreds of daily automation tests.
Problem with pixel‑level comparison
Pixel‑by‑pixel diff works on simple pages but fails on dense line drawings, annotations, and anti‑aliasing or scaling artifacts common in CAD and construction schematics. The false‑positive rate remains high, forcing engineers to manually review many useless cases. The core question is whether the comparison should judge visual similarity or business‑level semantic changes.
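The threshold problem can be illustrated with a toy example. The sketch below is not the team's tooling: it uses tiny binary grids in place of real screenshots, and all names (`diff_ratio`, `draw_vline`, `draw_label`) are hypothetical. It shows that a harmless one-pixel shift can change far more pixels than a lost annotation, so no single threshold separates the two cases.

```python
def diff_ratio(expected, actual):
    """Fraction of pixels that differ between two equally sized binary images."""
    total = len(expected) * len(expected[0])
    changed = sum(
        e != a
        for row_e, row_a in zip(expected, actual)
        for e, a in zip(row_e, row_a)
    )
    return changed / total


def blank(width, height):
    return [[0] * width for _ in range(height)]


def draw_vline(img, x):
    # A full-height vertical line, standing in for a wall line in a drawing.
    for row in img:
        row[x] = 1


def draw_label(img, x, y, w=3):
    # A tiny horizontal mark, standing in for a dimension annotation.
    for dx in range(w):
        img[y][x + dx] = 1


W, H = 40, 40

# Expected drawing: three lines plus one annotation.
expected = blank(W, H)
for x in (5, 15, 25):
    draw_vline(expected, x)
draw_label(expected, 30, 20)

# Case A: the same drawing rendered one pixel to the right (harmless).
shifted = blank(W, H)
for x in (6, 16, 26):
    draw_vline(shifted, x)
draw_label(shifted, 31, 20)

# Case B: identical rendering, but the annotation is gone (a real defect).
missing = blank(W, H)
for x in (5, 15, 25):
    draw_vline(missing, x)

shift_ratio = diff_ratio(expected, shifted)
missing_ratio = diff_ratio(expected, missing)
print(f"harmless shift: {shift_ratio:.4f}, lost annotation: {missing_ratio:.4f}")
```

Any threshold loose enough to accept the shifted rendering also accepts the drawing with the missing annotation, which is the sense in which thresholds only mask the problem.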
Key decisions
Decision 1: Adjusting pixel thresholds cannot solve semantic‑level differences such as missing annotations or shifted dimensions; thresholds only mask the problem.
Decision 2: Generic large models perform poorly on domain‑specific UI differences; a vertically‑focused model with domain knowledge is needed.
Decision 3: A closed‑loop pipeline is required—online failures become hard examples for the next training round; without data feedback accuracy plateaus quickly.
Closed‑loop pipeline
Data collection → fine‑tuning → model evaluation → production integration → monitoring → feedback. Failed or false‑positive UI comparison cases are automatically stored in a “hard‑example” repository, which fuels the next training iteration.
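A minimal sketch of such a hard-example repository follows. It assumes a JSON Lines file as storage and uses disagreement between the model verdict and a human verdict as the trigger. The source does not describe the actual storage or trigger logic, so `HardExample`, `record_if_hard`, and `load_hard_examples` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class HardExample:
    case_id: str
    expected_image: str  # path or URL of the expected screenshot
    actual_image: str    # path or URL of the actual screenshot
    model_verdict: str   # what the model concluded, e.g. "normal" / "abnormal"
    human_verdict: str   # what the reviewer decided


def record_if_hard(example, repo_path):
    """Append the case to the repository only when model and human disagree."""
    if example.model_verdict == example.human_verdict:
        return False
    with Path(repo_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(example), ensure_ascii=False) + "\n")
    return True


def load_hard_examples(repo_path):
    """Read back every stored case for the next training round."""
    path = Path(repo_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [HardExample(**json.loads(line)) for line in fh if line.strip()]
```

Keeping the repository append-only makes each training round reproducible from a known snapshot of the file.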
Platformization for test engineers
Dataset management and training submission are wrapped in a web platform, allowing a newcomer to go from zero knowledge to a successful training run in about three days (down from two weeks). After GPU driver updates, testers can submit training jobs without deep technical involvement.
Training data format
Each training sample is a JSON‑structured dialogue where the user role provides a system prompt, the expected image, and the actual image; the assistant role returns a JSON‑structured conclusion.
{
  "messages": [
    {"role": "user", "content": [
      {"type": "text", "text": "System prompt: compare the business-level differences between the two images..."},
      {"type": "image", "image": "expected image"},
      {"type": "image", "image": "actual image"}
    ]},
    {"role": "assistant", "content": [
      {"type": "text", "text": "Reasoning conclusion in JSON format (see the example below)"}
    ]}
  ]
}
The model must output parsable JSON, e.g.
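{
  "analysis": [
    "Difference 1: [Annotation 800] Type: content change | Image 1 has 3 -> Image 2 has only 2. Verdict: a dimension annotation is missing, so abnormal.",
    "Difference 2: [Annotation S-29 Moonlight White] Type: position change | On the left in Image 1 -> on the right in Image 2. Verdict: the leader line points to the same target, so normal."
  ],
  "conclusion": "abnormal",
  "reason": "Dimension annotation 800 is missing, which is a critical engineering-semantic change"
}
Training design highlights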
Quantization fallback: 4‑bit quantization automatically degrades to 16‑bit on GPUs where CUDA compatibility fails.
Validation and early-stop: Eval loss is checked every 10 optimizer steps; training stops after 5 consecutive evaluations without improvement.
Dual model output: Both a LoRA adapter (lightweight for incremental fine‑tuning) and a merged 16‑bit model (ready for inference) are saved.
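The first two highlights can be sketched in plain Python. This is an illustration under assumptions, not the team's code: `loader` is a hypothetical callable standing in for whatever loads the model (with Unsloth, something like `FastVisionModel.from_pretrained`, which accepts `load_in_4bit`), and the sketch assumes an incompatible GPU surfaces as a `RuntimeError` or `ImportError`. `early_stop_index` mirrors the patience logic described above rather than calling the real callback.

```python
def load_with_fallback(loader, model_name):
    """Try 4-bit loading first; fall back to 16-bit if the loader raises."""
    try:
        return loader(model_name, load_in_4bit=True), "4bit"
    except (RuntimeError, ImportError) as exc:
        # Assumption: CUDA/quantization incompatibility raises one of these.
        print(f"4-bit load failed ({exc}); retrying in 16-bit")
        return loader(model_name, load_in_4bit=False), "16bit"


def early_stop_index(eval_losses, patience=5):
    """Index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best, bad_evals = loss, 0   # improvement resets the counter
        else:
            bad_evals += 1              # one more evaluation without improvement
            if bad_evals >= patience:
                return i
    return None
```

With an evaluation every 10 optimizer steps, a patience of 5 means training continues for up to 50 steps past the best checkpoint before stopping.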
Training is launched with SFTTrainer and Unsloth's UnslothVisionDataCollator, using a learning rate of 1e-4, a maximum sequence length of 6144 (visual tokens plus JSON output), and an early-stopping callback. LoRA adapters are attached by a separate apply_lora helper.
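from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer
from unsloth.trainer import UnslothVisionDataCollator

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    # Callbacks are an SFTTrainer argument, not an SFTConfig field.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
    args=SFTConfig(
        learning_rate=1e-4,
        optim=_resolve_optimizer(),  # project helper, defined elsewhere
        max_seq_length=6144,         # visual tokens + JSON output
        eval_strategy="steps",
        eval_steps=10,               # evaluate every 10 optimizer steps
        # Needs a save strategy aligned with the evaluation schedule.
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        ...
    ),
)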
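from unsloth import FastVisionModel

def apply_lora(model):
    # Attach LoRA adapters to both the vision and language parts of the model.
    return FastVisionModel.get_peft_model(
        model,
        finetune_vision_layers=True,
        finetune_language_layers=True,
        finetune_attention_modules=True,
        finetune_mlp_modules=True,
        r=16,  # LoRA rank
        ...
    )
Performance target and latency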
The goal is to keep per‑case inference under 3 seconds (≈60 cases in 3 minutes) while handling daily regression traffic. Current inference time meets the 3‑second target; further quantization may affect accuracy and will be continuously validated.
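The budget arithmetic is simple: 60 cases at 3 seconds each is 180 seconds, or 3 minutes. A hedged sketch of checking a batch against that budget is shown below. `measure_latency` and the `infer` callable are hypothetical; the source does not describe how latency is actually measured.

```python
import time


def measure_latency(infer, cases, budget_s=3.0):
    """Time each case sequentially and flag those that exceed the budget."""
    latencies = []
    for case in cases:
        start = time.perf_counter()
        infer(case)  # hypothetical: runs one expected/actual comparison
        latencies.append(time.perf_counter() - start)
    return {
        "total_s": sum(latencies),
        "mean_s": sum(latencies) / len(latencies) if latencies else 0.0,
        "max_s": max(latencies, default=0.0),
        # Indices of cases that took longer than the per-case budget.
        "over_budget": [i for i, t in enumerate(latencies) if t > budget_s],
    }
```

Tracking the worst case as well as the mean matters here, because a single slow comparison can hide behind a healthy average.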
Evaluation and monitoring
After each training round, a dedicated evaluation set compares the new model against the online baseline. Scores are aggregated, and the new model is gated on the result as follows:
New ≥ baseline → pass.
Baseline > new but gap ≤ threshold → pending (often due to judge variance).
Gap > threshold → reject.
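The three-way gate above can be written as a small function. This is a sketch: the source does not say how scores are aggregated, so the mean used in `aggregate` is an assumption, and both function names are hypothetical.

```python
def aggregate(scores):
    """Assumed aggregation: the mean of per-case scores."""
    return sum(scores) / len(scores)


def gate(new_score, baseline_score, threshold):
    """Return 'pass', 'pending', or 'reject' for a candidate model."""
    if new_score >= baseline_score:
        return "pass"
    if baseline_score - new_score <= threshold:
        return "pending"  # small shortfall, possibly judge variance
    return "reject"
```

For example, with a baseline of 0.95 and a threshold of 0.02, a candidate scoring 0.94 is held as pending rather than rejected, which leaves room for a re-run or a human look before a decision.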
Monitoring dashboards track accuracy and inference latency to detect any regression after deployment.
Results
AI comparison accuracy ≈ 95 % (agreement with human judgment on AI‑filtered cases).
Regression efficiency gain +40 % (reduced manual verification workload).
Deployment scale: 2 core business lines, >100 use cases, thousands of runs (Q2 2026).
Case studies show the model correctly classifies pixel‑shift false positives as normal differences and precisely identifies content changes, matching human conclusions.
Practical takeaways for test engineers
Define the comparison goal: pixel similarity ≠ business correctness; build a hard‑example repository.
Structure online failure cases as JSON for downstream processing; raw natural‑language is less actionable.
Quantitative “new vs. baseline” comparison is mandatory before full rollout.
Enable test engineers to contribute data and validation, not just consume AI outputs.
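As one way to act on the second takeaway, a reviewed failure case can be converted directly into the dialogue format described under "Training data format". The sketch below is an assumption-laden illustration: `to_training_sample` is a hypothetical helper, and `SYSTEM_PROMPT` is a placeholder because the source elides the real prompt.

```python
import json

# Placeholder only: the real system prompt is elided in the source.
SYSTEM_PROMPT = "System prompt: compare the business-level differences between the two images..."


def to_training_sample(expected_image, actual_image, label):
    """Build one training sample from a reviewed case.

    `label` is the corrected conclusion as a dict with the keys
    "analysis", "conclusion", and "reason".
    """
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": SYSTEM_PROMPT},
                {"type": "image", "image": expected_image},
                {"type": "image", "image": actual_image},
            ]},
            {"role": "assistant", "content": [
                # The target is serialized so the model learns to emit parsable JSON.
                {"type": "text", "text": json.dumps(label, ensure_ascii=False)},
            ]},
        ]
    }
```

Because the label is already structured, no natural-language parsing is needed between review and training.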
Remaining challenges
Occasional mismatch between the model's reasoning (the analysis entries) and its final conclusion.
Inconsistent results on repeated inference of the same image (being addressed).
Complex layouts are sometimes split across two images, so those cases fail every time.
These issues reinforce the need for a complete loop of training, gate‑keeping evaluation, monitoring, and continuous feedback.
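The first challenge lends itself to an automatic check in the monitoring stage. The sketch below is one possible approach, not something the source describes. It assumes English verdict words ("normal" / "abnormal") at the end of each analysis entry, and it assumes that any abnormal difference should make the overall conclusion abnormal. All function names are hypothetical.

```python
import json
import re

REQUIRED_KEYS = {"analysis", "conclusion", "reason"}
# \b keeps "normal" from matching inside "abnormal".
VERDICT_RE = re.compile(r"\b(abnormal|normal)\b", re.IGNORECASE)


def parse_output(raw):
    """Parse model output; raise ValueError if it is not the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        raise ValueError("output is missing required keys")
    return data


def entry_verdict(entry):
    """Last verdict word mentioned in one analysis entry, or None."""
    found = VERDICT_RE.findall(entry)
    return found[-1].lower() if found else None


def is_consistent(data):
    """Assumed rule: any abnormal difference implies an abnormal conclusion."""
    verdicts = [entry_verdict(e) for e in data["analysis"]]
    expected = "abnormal" if "abnormal" in verdicts else "normal"
    return data["conclusion"].lower() == expected
```

Outputs that fail either check are natural candidates for the hard-example repository, since they point at cases where the model's reasoning and verdict diverge.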
