Building a Scalable Evaluation Platform for Loading/Unloading Point Recommendations
This article describes how Huolala created a data‑driven, automated testing platform to evaluate and improve loading and unloading recommendation points, covering background challenges, a multi‑layer evaluation framework, offline and online testing methods, metric design, result analysis, smart alerting, and future CI/CD integration.
Background and Challenges
Huolala, a leading intra‑city freight platform, launched loading/unloading recommendation points to reduce handling time and communication cost. The quality of these points directly impacts user experience and operational efficiency, but evaluating and improving them faces three major problems: vague evaluation standards, the lack of a unified testing platform, and static offline simulation that diverges from the live environment.
Vague Evaluation Standards: No unified, quantifiable metrics exist, leading to reliance on manual judgment and inconsistent results.
Testing Platform Gap: There is no systematic platform for comprehensive verification and quality analysis, causing fragmented processes and low efficiency.
Static Environment Simulation: Offline metrics suffer from a "data greenhouse" effect because offline features differ from real‑time production features.
Platform‑Centric Solution Framework
We propose a "multi‑layer capability + dual‑driver" evaluation framework supporting the full lifecycle from assessment to optimization. The framework consists of four core layers and two supporting engines.
Goals
Develop a comprehensive metric system covering accuracy, accessibility, convenience, and safety.
Build an evaluation platform integrating task management, execution, analysis, and issue tracing.
Support both offline batch analysis and real‑time online monitoring.
Create an intelligent feedback loop that automatically detects anomalies, classifies issues, and triggers strategy adjustments.
Evaluation Platform Capability Construction
Architecture Design
The platform consists of presentation, business, storage, and data layers, enabling automatic model updates for recommendation points.
Design Process
Algorithm testing is challenging for three reasons: outputs are estimated values rather than exact expected results, the high level of abstraction makes debugging hard, and model versions drift over time, requiring continuous monitoring.
Offline Evaluation
Targets the recommendation service, samples the last request in each user's behavior sequence, and uses the "non‑point rate" (derived from the distance between the user‑created address and the actual recommended point) as the primary metric. Evaluation runs on each model, strategy, or data release before testing.
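The article does not spell out the exact formula for the non‑point rate; as a rough sketch, it can be read as the share of samples whose user‑created address lies beyond a distance threshold from the recommended point. The class, field names, and 50 m threshold below are illustrative assumptions, not the platform's actual code.
import java.util.List;

// Illustrative sketch: non-point rate as the share of samples whose created
// address is farther than a threshold from the recommended point.
public class NonPointRateCalculator {

    private static final double THRESHOLD_METERS = 50.0; // assumed threshold

    public double nonPointRate(List<Sample> samples) {
        long nonPoint = samples.stream()
                .filter(s -> haversineMeters(s.createdLat, s.createdLng,
                                             s.recommendedLat, s.recommendedLng) > THRESHOLD_METERS)
                .count();
        return samples.isEmpty() ? 0.0 : (double) nonPoint / samples.size();
    }

    // Great-circle distance between two coordinates, in meters.
    private double haversineMeters(double lat1, double lng1, double lat2, double lng2) {
        double r = 6_371_000.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }

    // Minimal sample holder for this sketch.
    public static class Sample {
        double createdLat, createdLng, recommendedLat, recommendedLng;
    }
}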
Online Evaluation
Relies on the algorithm side to auto‑generate models, building a complete online testing platform that supports automatic model updates.
Core Function Implementation
We detail evaluation metrics, sample selection, evaluation methods, result analysis, and feedback mechanisms.
Evaluation Metrics
Traditional metrics have blind spots: an accuracy of 85% can coexist with rising complaints, a response time under 200 ms with falling conversion, and a recall of 90% with a rising non‑point rate.
Business metric design: three dimensions with weighted allocation (user experience 45, business benefit 40, system performance 15) and scenario‑specific customization.
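As a minimal sketch of how such a weighting could be applied, the composite below assumes each dimension is first normalized to a sub‑score in [0, 1]; the helper class is an assumption, not the platform's actual scoring code.
// Illustrative composite score using the 45/40/15 weighting described above.
public class CompositeScore {
    public static double score(double userExperience, double businessBenefit, double systemPerformance) {
        // Each input is assumed to be a normalized sub-score in [0, 1].
        return 0.45 * userExperience + 0.40 * businessBenefit + 0.15 * systemPerformance;
    }
}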
Sample Selection Principles
Close to real scenarios: use actual online order data.
Ensure sufficient scale: default ~100 k samples, adjustable.
Avoid training set overlap and broaden coverage across cities and vehicle types.
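A minimal sketch of selection along these principles, assuming a hypothetical Order record with an order ID and a city ID; stratification by city stands in for the broader coverage dimensions (cities, vehicle types) mentioned above.
import java.util.*;
import java.util.stream.Collectors;

// Illustrative sample selection: drop orders seen in training, then take a
// city-stratified sample capped at a target size (default ~100,000 per the text).
public class SampleSelector {

    public List<Order> select(List<Order> onlineOrders, Set<String> trainingOrderIds, int targetSize) {
        Map<String, List<Order>> byCity = onlineOrders.stream()
                .filter(o -> !trainingOrderIds.contains(o.orderId))   // avoid training-set overlap
                .collect(Collectors.groupingBy(o -> o.cityId));

        int perCity = Math.max(1, targetSize / Math.max(1, byCity.size()));
        List<Order> result = new ArrayList<>();
        for (List<Order> group : byCity.values()) {
            List<Order> cityOrders = new ArrayList<>(group);
            Collections.shuffle(cityOrders);                          // random pick within each city
            result.addAll(cityOrders.subList(0, Math.min(perCity, cityOrders.size())));
        }
        return result;
    }

    public static class Order {
        String orderId;
        String cityId;
    }
}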
Distributed Evaluation Framework
Data is sharded by spatio‑temporal dimensions and processed with Kafka + Spark Streaming, handling millions of samples with P99 latency under 300 ms.
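The pipeline code itself is not shown in the article; the sketch below only illustrates how a Kafka + Spark Streaming consumer could be wired with the Kafka 0.10 direct‑stream API. The topic, broker address, consumer group, and the per‑sample evaluate call are placeholders.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class EvalStreamJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("eval-sample-stream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");          // placeholder broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "eval-platform");                 // placeholder group

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(Arrays.asList("eval-samples"), kafkaParams));

        // Each record is one evaluation sample; the shard key encodes the spatio-temporal bucket.
        stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
            while (records.hasNext()) {
                ConsumerRecord<String, String> record = records.next();
                // evaluate(record.value());  // hypothetical per-sample evaluation step
            }
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}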
Evaluation Methods
Offline stage: two‑stage verification.
Offline Simulation
Replay historical data to build a benchmark test set.
Shadow Mode
Run new strategy models in parallel with the live model, automatically compute metrics, filter bad cases, and manually confirm high‑risk results.
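A minimal sketch of the shadow‑mode idea, assuming a hypothetical RecommendStrategy interface: the live strategy keeps serving traffic, the candidate strategy runs on the same request, and both results are recorded for metric computation and bad‑case filtering.
import java.util.List;

// Illustrative shadow-mode wrapper: the live strategy serves traffic while the
// candidate strategy runs on the same input for comparison only.
public class ShadowModeRunner {

    public interface RecommendStrategy {
        List<Point> recommend(Request request);
    }

    public interface CompareSink {
        void record(Request request, List<Point> live, List<Point> shadow);
    }

    private final RecommendStrategy liveStrategy;
    private final RecommendStrategy candidateStrategy;
    private final CompareSink compareSink; // hypothetical sink that stores result pairs for diffing

    public ShadowModeRunner(RecommendStrategy live, RecommendStrategy candidate, CompareSink sink) {
        this.liveStrategy = live;
        this.candidateStrategy = candidate;
        this.compareSink = sink;
    }

    public List<Point> recommend(Request request) {
        List<Point> liveResult = liveStrategy.recommend(request);
        try {
            List<Point> shadowResult = candidateStrategy.recommend(request);
            compareSink.record(request, liveResult, shadowResult); // diffed later into metrics and bad cases
        } catch (Exception e) {
            // A shadow failure must never affect live traffic.
        }
        return liveResult;
    }

    public static class Request { }
    public static class Point { }
}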
Online stage: A/B testing.
Isolate traffic to compare new and old strategies, ensuring no regression before deployment.
// Aggregate per-sample comparison results into an overall metric object.
private LalamapMetric calcMetric(List<LalamapCompareResult> compareResults) {
    LalamapMetric lalamapMetric = new LalamapMetric();
    lalamapMetric.addTotal(compareResults.size());
    for (LalamapCompareResult compareResult : compareResults) {
        // Classify each comparison result into the corresponding metric bucket.
        metric(compareResult, lalamapMetric);
    }
    return lalamapMetric;
}
Result Analysis
Bad‑case analysis moves from manual sampling to rule‑engine based detection, automatically diffing result files to highlight missing recommendations, low‑heat areas, or overly distant points, and provides actionable improvement suggestions.
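The rules themselves are not listed in the article; as an illustration, they could be expressed as simple predicates over the diffed results. The thresholds and field names below are assumptions.
import java.util.ArrayList;
import java.util.List;

// Illustrative bad-case rules applied to diffed comparison results.
public class BadCaseDetector {

    private static final double MAX_DISTANCE_METERS = 300.0; // assumed "too far" threshold
    private static final int MIN_HEAT = 5;                   // assumed low-heat threshold

    public List<String> detect(List<DiffResult> results) {
        List<String> badCases = new ArrayList<>();
        for (DiffResult r : results) {
            if (r.recommendedPoints == 0) {
                badCases.add(r.requestId + ": missing recommendation");
            } else if (r.pointHeat < MIN_HEAT) {
                badCases.add(r.requestId + ": low-heat area");
            } else if (r.distanceToCreatedAddress > MAX_DISTANCE_METERS) {
                badCases.add(r.requestId + ": recommended point too far from created address");
            }
        }
        return badCases;
    }

    public static class DiffResult {
        String requestId;
        int recommendedPoints;
        int pointHeat;
        double distanceToCreatedAddress;
    }
}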
Smart QA Feedback & Alert Architecture
Multi‑channel notification via structured Feishu messages and state‑machine‑driven alert strategies.
Anomaly detection with real‑time alerts.
Alert Type | Trigger Condition | Response Time | Escalation
QA Task Timeout | Task not finished within window (+30 min) | 15 min | @owner every 30 min
QA Failure | Accuracy < 0.85 for 3 consecutive cycles | Real-time | Automatic rollback
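As a sketch of the Feishu notification path, a structured alert can be pushed to a group chat through a custom‑bot webhook; the webhook URL, message wording, and naive JSON escaping below are simplified placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative alert push to a Feishu group via a custom-bot webhook.
public class FeishuAlerter {

    private static final String WEBHOOK_URL =
            "https://open.feishu.cn/open-apis/bot/v2/hook/xxxx"; // placeholder token

    public void sendTextAlert(String text) throws Exception {
        // Minimal text message payload; real messages would carry structured fields.
        String body = "{\"msg_type\":\"text\",\"content\":{\"text\":\""
                + text.replace("\"", "\\\"") + "\"}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(WEBHOOK_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // A non-200 status would be retried or escalated by the alert state machine.
        System.out.println("Feishu webhook status: " + response.statusCode());
    }
}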
Evaluation Benefits
Quality: increased daily data capacity from 10 k to over 1 M, reducing metric gap from 0.38 pp to 0.12 pp.
Testing quality: supported 30 features, uncovered 45 bugs pre‑release.
Efficiency: processing time for 1 M samples dropped from 33.3 h to 2.7 h, saving 30.6 h and 119 person‑days.
Business impact: improved driver satisfaction (loading +0.06 pp, unloading stable) and reduced 30 min non‑point rates.
Stability: replay success rate reached 99.06 % with no platform‑induced blocks.
Future Plans
Push evaluation capability down into the model core, enabling direct assessment of models.
Deeply integrate the evaluation workflow into CI/CD pipelines as a quality gate before model release.
Upgrade bad‑case mining with clustering, anomaly detection, and causal inference to provide precise optimization guidance.