How Huolala Built an End‑to‑End AI Evaluation Platform for Logistics

This article explains how Huolala designed and implemented a one‑stop AI evaluation platform—Lala Zhiping—to select and assess large language models for logistics scenarios, detailing its business background, architecture, configurable workflow, data isolation, permission system, and future development plans.

Huolala Tech
Huolala Tech
Huolala Tech
How Huolala Built an End‑to‑End AI Evaluation Platform for Logistics

Business Background

Huolala, founded in 2013 in the Greater Bay Area, provides intra‑city and inter‑city freight, corporate logistics, moving, car sales, and after‑market services. By June 2023 it operated in 11 global markets, covering 360 Chinese cities with 900k active drivers and 10.5 million active users.

As of November 2023, over 200 large‑model products have been released in China, divided into general‑purpose models and vertical models targeting specific industries.

General models focus on foundational capabilities, e.g., ChatGPT, Baidu Wenxin, Alibaba Tongyi, iFlytek Spark. Vertical models specialize in domains such as freight, finance, healthcare, education, and transportation.

The need arose to find the most suitable large model for Huolala’s business scenarios—AI customer‑service scripts, intelligent data analysis, knowledge‑base evaluation—and to provide a unified online evaluation platform.

Thus, the AI one‑stop evaluation platform Lala Zhiping was created, offering a three‑in‑one “out‑test‑evaluate” capability.

Lala Zhiping Overview

The platform aims to become the leading evaluation system for the logistics industry, covering all testing scenarios and accelerating digital transformation.

Before the platform, evaluation required extensive manual effort to generate test sets, conduct offline testing, and collect results, leading to low efficiency. Existing workflows could not keep up with rapid model releases.

By abstracting the evaluation process, the product architecture includes:

Usage Flow: Define a project space, output a question bank, configure AI answer sheets, generate random exams, automatic or manual scoring, define evaluation methods, and visualize results via drag‑and‑drop.

Key process explanations:

Project space defines evaluation domain, collects questions, and binds scoring methods.

Lala Zhiping can integrate API test sources to assess AI capabilities or directly evaluate AI model performance.

Visualization relies on the company’s data‑platform “Yuntai”, allowing custom result displays based on detailed data.

Technical Architecture Design

Based on business characteristics, the architecture consists of three layers:

Access Layer: Web entry via Kong gateway forwards requests to backend services.

Application Layer: Lala Zhiping integrates question generation, question management, exam management, testing, and evaluation, encapsulating core models with configurable APIs.

Infrastructure Layer: The platform relies on third‑party visualization, permission management, and push mechanisms, and includes a secondary wrapper around GPT gateways for model access.

System requirements derived from real‑world scenarios include:

Universal Configuration System

The design extracts recognizable fields from custom APIs, enabling:

Configurable question creators via third‑party APIs.

Configurable answerers (e.g., gpt‑3.5, gpt‑4) for AI‑driven answering.

Configurable graders using third‑party APIs for automatic scoring.

Example of answerer configuration:

Data Space Isolation

Permissions are divided into functional permissions (module and page access) and data permissions (visibility of specific data). Proper permission design reduces feature clutter and aligns roles—question creators, evaluators, scorers, administrators—with appropriate system capabilities.

Different users see different homepages, all configurable without front‑end or back‑end releases.

Universal Evaluation Assessment

Evaluation rules are flexible per task. Scoring can be automated by AI or performed manually for subjective items. Results can be visualized in multiple dimensions using the company’s BI tool “Yuntai”.

Supports AI automatic/manual scoring.

Supports multi‑dimensional visual score dashboards via Yuntai.

Industry comparison shows platforms like Hugging Face and FlagEval; Lala Zhiping, though early, offers advantages:

Security: All data stays within the corporate network.

Generality: Designed as a universal evaluation platform, not limited to large models.

Customizability: Easy internal API integration with short development cycles.

Domain specificity: Tailored to freight‑industry Q&A.

Future development focuses on:

Completing core capabilities such as model deployment, standardized evaluation, and broader test collections.

Enhancing AI scenarios, moving from basic model testing to full AI‑driven question generation and broader task evaluation.

Authors: Li Ming – Big Data expert (formerly Tencent, now at Huolala); Shi Wenjing – Senior Data Product (AI data products at Huolala).
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

System ArchitectureData IsolationAI Evaluationpermission managementlogistics platformmodel testing
Huolala Tech
Written by

Huolala Tech

Technology reshapes logistics

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.