Building a Scalable Evaluation Platform for Loading/Unloading Point Recommendations
This article describes how Huolala created a data‑driven, automated testing platform to evaluate and improve loading and unloading recommendation points, covering background challenges, a multi‑layer evaluation framework, offline and online testing methods, metric design, result analysis, smart alerting, and future CI/CD integration.
Background and Challenges
Huolala, a leading intra‑city freight platform, launched loading/unloading recommendation points to reduce handling time and communication cost. The quality of these points directly impacts user experience and operational efficiency, but evaluating and improving them faces three major problems: vague evaluation standards, the lack of a unified testing platform, and static offline simulation that diverges from the live environment.
Vague Evaluation Standards: No unified, quantifiable metrics exist, leading to reliance on manual judgment and inconsistent results.
Testing Platform Gap: There is no systematic platform for comprehensive verification and quality analysis, causing fragmented processes and low efficiency.
Static Environment Simulation: Offline metrics suffer from a "data greenhouse" effect because offline features differ from real‑time production features.
Platform‑Centric Solution Framework
We propose a "multi‑layer capability + dual‑driver" evaluation framework supporting the full lifecycle from assessment to optimization. The framework consists of four core layers and two supporting engines.
Goals
Develop a comprehensive metric system covering accuracy, accessibility, convenience, and safety.
Build an evaluation platform integrating task management, execution, analysis, and issue tracing.
Support both offline batch analysis and real‑time online monitoring.
Create an intelligent feedback loop that automatically detects anomalies, classifies issues, and triggers strategy adjustments.
Evaluation Platform Capability Construction
Architecture Design
The platform consists of presentation, business, storage, and data layers, enabling automatic model updates for recommendation points.
Design Process
Algorithm testing is challenging for three reasons: outputs are estimated values rather than exact expected results, the high level of abstraction makes debugging hard, and model versions drift over time, requiring continuous monitoring.
Offline Evaluation
Targets the recommendation service, samples the last request in each user's behavior sequence, and uses the "non‑point rate" (derived from the distance between the user‑created address and the actual recommended point) as the primary metric. Evaluation runs on each model, strategy, or data release before testing.
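The article does not spell out the exact formula for the non‑point rate; as a rough sketch, it can be read as the share of samples whose user‑created address lies beyond a distance threshold from the recommended point. The class, field names, and 50 m threshold below are illustrative assumptions, not the platform's actual code.
import java.util.List;

// Illustrative sketch: non-point rate as the share of samples whose created
// address is farther than a threshold from the recommended point.
public class NonPointRateCalculator {

    private static final double THRESHOLD_METERS = 50.0; // assumed threshold

    public double nonPointRate(List<Sample> samples) {
        long nonPoint = samples.stream()
                .filter(s -> haversineMeters(s.createdLat, s.createdLng,
                                             s.recommendedLat, s.recommendedLng) > THRESHOLD_METERS)
                .count();
        return samples.isEmpty() ? 0.0 : (double) nonPoint / samples.size();
    }

    // Great-circle distance between two coordinates, in meters.
    private double haversineMeters(double lat1, double lng1, double lat2, double lng2) {
        double r = 6_371_000.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }

    // Minimal sample holder for this sketch.
    public static class Sample {
        double createdLat, createdLng, recommendedLat, recommendedLng;
    }
}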
Online Evaluation
Relies on the algorithm side to auto‑generate models, building a complete online testing platform that supports automatic model updates.
Core Function Implementation
We detail evaluation metrics, sample selection, evaluation methods, result analysis, and feedback mechanisms.
Evaluation Metrics
Traditional metrics have blind spots: an accuracy of 85% can coexist with rising complaints, a response time under 200 ms with falling conversion, and a recall of 90% with a rising non‑point rate.
Business metric design: three dimensions with weighted allocation (user experience 45, business benefit 40, system performance 15) and scenario‑specific customization.
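As a minimal sketch of how such a weighting could be applied, the composite below assumes each dimension is first normalized to a sub‑score in [0, 1]; the helper class is an assumption, not the platform's actual scoring code.
// Illustrative composite score using the 45/40/15 weighting described above.
public class CompositeScore {
    public static double score(double userExperience, double businessBenefit, double systemPerformance) {
        // Each input is assumed to be a normalized sub-score in [0, 1].
        return 0.45 * userExperience + 0.40 * businessBenefit + 0.15 * systemPerformance;
    }
}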
Sample Selection Principles
Close to real scenarios: use actual online order data.
Ensure sufficient scale: default ~100 k samples, adjustable.
Avoid training set overlap and broaden coverage across cities and vehicle types.
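A minimal sketch of selection along these principles, assuming a hypothetical Order record with an order ID and a city ID; stratification by city stands in for the broader coverage dimensions (cities, vehicle types) mentioned above.
import java.util.*;
import java.util.stream.Collectors;

// Illustrative sample selection: drop orders seen in training, then take a
// city-stratified sample capped at a target size (default ~100,000 per the text).
public class SampleSelector {

    public List<Order> select(List<Order> onlineOrders, Set<String> trainingOrderIds, int targetSize) {
        Map<String, List<Order>> byCity = onlineOrders.stream()
                .filter(o -> !trainingOrderIds.contains(o.orderId))   // avoid training-set overlap
                .collect(Collectors.groupingBy(o -> o.cityId));

        int perCity = Math.max(1, targetSize / Math.max(1, byCity.size()));
        List<Order> result = new ArrayList<>();
        for (List<Order> group : byCity.values()) {
            List<Order> cityOrders = new ArrayList<>(group);
            Collections.shuffle(cityOrders);                          // random pick within each city
            result.addAll(cityOrders.subList(0, Math.min(perCity, cityOrders.size())));
        }
        return result;
    }

    public static class Order {
        String orderId;
        String cityId;
    }
}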
Distributed Evaluation Framework
Data is sharded by spatio‑temporal dimensions and processed with Kafka + Spark Streaming, handling millions of samples with P99 latency under 300 ms.
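The pipeline code itself is not shown in the article; the sketch below only illustrates how a Kafka + Spark Streaming consumer could be wired with the Kafka 0.10 direct‑stream API. The topic, broker address, consumer group, and the per‑sample evaluate call are placeholders.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class EvalStreamJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("eval-sample-stream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");          // placeholder broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "eval-platform");                 // placeholder group

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(Arrays.asList("eval-samples"), kafkaParams));

        // Each record is one evaluation sample; the shard key encodes the spatio-temporal bucket.
        stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
            while (records.hasNext()) {
                ConsumerRecord<String, String> record = records.next();
                // evaluate(record.value());  // hypothetical per-sample evaluation step
            }
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}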
Evaluation Methods
Offline stage: two‑stage verification.
Offline Simulation
Replay historical data to build a benchmark test set.
Shadow Mode
Run new strategy models in parallel with the live model, automatically compute metrics, filter bad cases, and manually confirm high‑risk results.
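A minimal sketch of the shadow‑mode idea, assuming a hypothetical RecommendStrategy interface: the live strategy keeps serving traffic, the candidate strategy runs on the same request, and both results are recorded for metric computation and bad‑case filtering.
import java.util.List;

// Illustrative shadow-mode wrapper: the live strategy serves traffic while the
// candidate strategy runs on the same input for comparison only.
public class ShadowModeRunner {

    public interface RecommendStrategy {
        List<Point> recommend(Request request);
    }

    public interface CompareSink {
        void record(Request request, List<Point> live, List<Point> shadow);
    }

    private final RecommendStrategy liveStrategy;
    private final RecommendStrategy candidateStrategy;
    private final CompareSink compareSink; // hypothetical sink that stores result pairs for diffing

    public ShadowModeRunner(RecommendStrategy live, RecommendStrategy candidate, CompareSink sink) {
        this.liveStrategy = live;
        this.candidateStrategy = candidate;
        this.compareSink = sink;
    }

    public List<Point> recommend(Request request) {
        List<Point> liveResult = liveStrategy.recommend(request);
        try {
            List<Point> shadowResult = candidateStrategy.recommend(request);
            compareSink.record(request, liveResult, shadowResult); // diffed later into metrics and bad cases
        } catch (Exception e) {
            // A shadow failure must never affect live traffic.
        }
        return liveResult;
    }

    public static class Request { }
    public static class Point { }
}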
Online stage: A/B testing.
Isolate traffic to compare new and old strategies, ensuring no regression before deployment.
// Aggregate per-sample comparison results into an overall metric object.
private LalamapMetric calcMetric(List<LalamapCompareResult> compareResults) {
    LalamapMetric lalamapMetric = new LalamapMetric();
    lalamapMetric.addTotal(compareResults.size());
    for (LalamapCompareResult compareResult : compareResults) {
        // Classify each comparison result into the corresponding metric bucket.
        metric(compareResult, lalamapMetric);
    }
    return lalamapMetric;
}
Result Analysis
Bad‑case analysis moves from manual sampling to rule‑engine based detection, automatically diffing result files to highlight missing recommendations, low‑heat areas, or overly distant points, and provides actionable improvement suggestions.
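The rules themselves are not listed in the article; as an illustration, they could be expressed as simple predicates over the diffed results. The thresholds and field names below are assumptions.
import java.util.ArrayList;
import java.util.List;

// Illustrative bad-case rules applied to diffed comparison results.
public class BadCaseDetector {

    private static final double MAX_DISTANCE_METERS = 300.0; // assumed "too far" threshold
    private static final int MIN_HEAT = 5;                   // assumed low-heat threshold

    public List<String> detect(List<DiffResult> results) {
        List<String> badCases = new ArrayList<>();
        for (DiffResult r : results) {
            if (r.recommendedPoints == 0) {
                badCases.add(r.requestId + ": missing recommendation");
            } else if (r.pointHeat < MIN_HEAT) {
                badCases.add(r.requestId + ": low-heat area");
            } else if (r.distanceToCreatedAddress > MAX_DISTANCE_METERS) {
                badCases.add(r.requestId + ": recommended point too far from created address");
            }
        }
        return badCases;
    }

    public static class DiffResult {
        String requestId;
        int recommendedPoints;
        int pointHeat;
        double distanceToCreatedAddress;
    }
}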
Smart QA Feedback & Alert Architecture
Multi‑channel notification via structured Feishu messages and state‑machine‑driven alert strategies.
Anomaly detection with real‑time alerts.
Alert Type | Trigger Condition | Response Time | Escalation
QA Task Timeout | Task not finished within window (+30 min) | 15 min | @owner every 30 min
QA Failure | Accuracy < 0.85 for 3 consecutive cycles | Real-time | Automatic rollback
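As a sketch of the Feishu notification path, a structured alert can be pushed to a group chat through a custom‑bot webhook; the webhook URL, message wording, and naive JSON escaping below are simplified placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative alert push to a Feishu group via a custom-bot webhook.
public class FeishuAlerter {

    private static final String WEBHOOK_URL =
            "https://open.feishu.cn/open-apis/bot/v2/hook/xxxx"; // placeholder token

    public void sendTextAlert(String text) throws Exception {
        // Minimal text message payload; real messages would carry structured fields.
        String body = "{\"msg_type\":\"text\",\"content\":{\"text\":\""
                + text.replace("\"", "\\\"") + "\"}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(WEBHOOK_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // A non-200 status would be retried or escalated by the alert state machine.
        System.out.println("Feishu webhook status: " + response.statusCode());
    }
}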
Evaluation Benefits
Quality: increased daily data capacity from 10 k to over 1 M, reducing metric gap from 0.38 pp to 0.12 pp.
Testing quality: supported 30 features, uncovered 45 bugs pre‑release.
Efficiency: processing time for 1 M samples dropped from 33.3 h to 2.7 h, saving 30.6 h and 119 person‑days.
Business impact: improved driver satisfaction (loading +0.06 pp, unloading stable) and reduced 30 min non‑point rates.
Stability: replay success rate reached 99.06 % with no platform‑induced blocks.
Future Plans
Push evaluation capability down into the model core, enabling direct assessment of models.
Deeply integrate the evaluation workflow into CI/CD pipelines as a quality gate before model release.
Upgrade bad‑case mining with clustering, anomaly detection, and causal inference to provide precise optimization guidance.