How Huolala Built an End‑to‑End AI Evaluation Platform for Logistics
This article explains how Huolala designed and implemented a one‑stop AI evaluation platform—Lala Zhiping—to select and assess large language models for logistics scenarios, detailing its business background, architecture, configurable workflow, data isolation, permission system, and future development plans.
Business Background
Huolala, founded in 2013 in the Greater Bay Area, provides intra‑city and inter‑city freight, corporate logistics, moving, car sales, and after‑market services. By June 2023 it operated in 11 global markets, covering 360 Chinese cities with 900k active drivers and 10.5 million active users.
As of November 2023, over 200 large‑model products have been released in China, divided into general‑purpose models and vertical models targeting specific industries.
General models focus on foundational capabilities, e.g., ChatGPT, Baidu Wenxin, Alibaba Tongyi, iFlytek Spark. Vertical models specialize in domains such as freight, finance, healthcare, education, and transportation.
The need arose to find the most suitable large model for Huolala’s business scenarios—AI customer‑service scripts, intelligent data analysis, knowledge‑base evaluation—and to provide a unified online evaluation platform.
Thus, the AI one‑stop evaluation platform Lala Zhiping was created, offering a three‑in‑one “out‑test‑evaluate” capability.
Lala Zhiping Overview
The platform aims to become the leading evaluation system for the logistics industry, covering all testing scenarios and accelerating digital transformation.
Before the platform, evaluation required extensive manual effort to generate test sets, conduct offline testing, and collect results, leading to low efficiency. Existing workflows could not keep up with rapid model releases.
By abstracting the evaluation process, the product architecture includes:
Usage Flow: Define a project space, output a question bank, configure AI answer sheets, generate random exams, automatic or manual scoring, define evaluation methods, and visualize results via drag‑and‑drop.
Key process explanations:
Project space defines evaluation domain, collects questions, and binds scoring methods.
Lala Zhiping can integrate API test sources to assess AI capabilities or directly evaluate AI model performance.
Visualization relies on the company’s data‑platform “Yuntai”, allowing custom result displays based on detailed data.
Technical Architecture Design
Based on business characteristics, the architecture consists of three layers:
Access Layer: Web entry via Kong gateway forwards requests to backend services.
Application Layer: Lala Zhiping integrates question generation, question management, exam management, testing, and evaluation, encapsulating core models with configurable APIs.
Infrastructure Layer: The platform relies on third‑party visualization, permission management, and push mechanisms, and includes a secondary wrapper around GPT gateways for model access.
System requirements derived from real‑world scenarios include:
Universal Configuration System
The design extracts recognizable fields from custom APIs, enabling:
Configurable question creators via third‑party APIs.
Configurable answerers (e.g., gpt‑3.5, gpt‑4) for AI‑driven answering.
Configurable graders using third‑party APIs for automatic scoring.
Example of answerer configuration:
Data Space Isolation
Permissions are divided into functional permissions (module and page access) and data permissions (visibility of specific data). Proper permission design reduces feature clutter and aligns roles—question creators, evaluators, scorers, administrators—with appropriate system capabilities.
Different users see different homepages, all configurable without front‑end or back‑end releases.
Universal Evaluation Assessment
Evaluation rules are flexible per task. Scoring can be automated by AI or performed manually for subjective items. Results can be visualized in multiple dimensions using the company’s BI tool “Yuntai”.
Supports AI automatic/manual scoring.
Supports multi‑dimensional visual score dashboards via Yuntai.
Industry comparison shows platforms like Hugging Face and FlagEval; Lala Zhiping, though early, offers advantages:
Security: All data stays within the corporate network.
Generality: Designed as a universal evaluation platform, not limited to large models.
Customizability: Easy internal API integration with short development cycles.
Domain specificity: Tailored to freight‑industry Q&A.
Future development focuses on:
Completing core capabilities such as model deployment, standardized evaluation, and broader test collections.
Enhancing AI scenarios, moving from basic model testing to full AI‑driven question generation and broader task evaluation.
Authors: Li Ming – Big Data expert (formerly Tencent, now at Huolala); Shi Wenjing – Senior Data Product (AI data products at Huolala).
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
