
How Huolala Guarantees Cloud‑Native Stability at Scale

This account of Huolala's talk at the 2021 Cloud Operations Best Practices forum covers the company's multi‑cloud architecture, service‑oriented governance, capacity testing, monitoring, and risk prediction, which together provide high availability and efficient scaling for its diverse logistics services.

Huolala Tech

Oct 29, 2021

On February 22, 2021, at the Cloud Operations Best Practices forum of the Yunqi Conference, Huolala's Vice President of Technology, Chen Yongting, delivered a talk titled "Cloud‑Based Huolala Technology Stability Assurance Practices," sharing how Huolala achieves technical stability and offering insights for peers with similar business scenarios.

1. Huolala Business Model

Huolala, founded in 2013, initially focused on intra‑city freight and international freight. Today it operates multiple business lines, including enterprise, inter‑city, moving, and less‑than‑truckload (LTL) services. The platform supplies driver capacity and matches it to user orders, so its system capacity requirements are tightly linked to business characteristics. When supply and demand fall out of balance, computational load spikes, often to several times the normal peak, yet under normal conditions the system carries no idle capacity to absorb it.

Traffic on each business line follows a typical peak/off‑peak distribution, with occasional abnormal spikes (for example, higher weekend traffic or driver shortages before holidays). These spikes push up cancellation and pending‑pair metrics (orders still waiting to be paired with a driver) and trigger frequent dispatch alerts. Over the past one to two years, Huolala has focused on how to design system capacity and evaluate stress‑testing methods under these business patterns.

2. Infrastructure Governance

Huolala’s infrastructure governance aims to improve baseline stability. Early efforts focused on two aspects: (1) service‑oriented governance tailored to Huolala’s context, and (2) providing developers with a controllable, reliable environment for rapid releases, even during peak periods.

The earliest architecture was simple, supporting rapid development and delivery. However, rapid growth in business, data, and services (millions of orders) exposed hidden risks: unreliable service links, unclear dependencies between core and non‑core services, oversized core services, weak self‑healing under high load, and low incident‑resolution efficiency.

To address these, Huolala adopted service‑oriented governance using mature open‑source micro‑service frameworks. Because many legacy services (Java, PHP) could not be rewritten overnight, Huolala introduced a "generic service‑ification" approach that retains original HTTP URL APIs while adding lightweight service registration, discovery, and routing, enabling gradual migration without massive code changes.
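The registration, discovery, and routing layer described above can be pictured with a small sketch. This is only an illustration under assumptions: the class name `ServiceRegistry`, its methods, and the round‑robin choice are hypothetical and are not taken from Huolala's implementation. The point it shows is that callers keep using the original HTTP paths while the registry decides which instance receives the request.

```python
import itertools
from collections import defaultdict


class ServiceRegistry:
    """In-memory stand-in for a service registry (hypothetical, illustration only)."""

    def __init__(self):
        self._instances = defaultdict(list)  # service name -> list of base URLs
        self._cursors = {}                   # service name -> round-robin iterator

    def _rebuild_cursor(self, service):
        # Restart round-robin over a snapshot of the current instance list.
        self._cursors[service] = itertools.cycle(list(self._instances[service]))

    def register(self, service, base_url):
        """Record that an instance of `service` is reachable at `base_url`."""
        if base_url not in self._instances[service]:
            self._instances[service].append(base_url)
            self._rebuild_cursor(service)

    def deregister(self, service, base_url):
        """Remove an instance, for example when it fails a health check."""
        if base_url in self._instances[service]:
            self._instances[service].remove(base_url)
            self._rebuild_cursor(service)

    def resolve(self, service, path):
        """Pick the next instance and return a full URL; the HTTP path is unchanged."""
        if not self._instances.get(service):
            raise LookupError(f"no instances registered for {service!r}")
        base = next(self._cursors[service])
        return base.rstrip("/") + "/" + path.lstrip("/")
```

A legacy PHP or Java service would keep exposing a path such as `/api/v1/orders`; only the host part is resolved at call time, which is what makes gradual migration possible in this sketch.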

During the transition, Java and PHP services co‑exist and communicate via a sidecar that provides service registration, discovery, and configuration, allowing them to run in the same pool. Once services are under governance, service links become clearer, monitoring can visualize the topology, and emergency mechanisms (degradation, rate‑limiting, circuit‑breaking) can be applied swiftly.
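To make the emergency mechanisms more concrete, here is a minimal circuit‑breaker sketch with a degradation fallback. It is a generic textbook pattern, not Huolala's code: the class name, the `failure_threshold` and `reset_after` parameters, and the injectable `clock` are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker with a fallback (illustrative, hypothetical names)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to stay open before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, fallback):
        """Run `func`; return `fallback()` when the circuit is open or `func` fails."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()  # open: skip the dependency and degrade
            # Otherwise half-open: let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, the failing dependency is not called at all, which gives it room to recover while callers receive the degraded response.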

Huolala evolved from a single‑IDC single‑link architecture with poor fault tolerance to a single‑IDC multi‑link design, creating isolated logical lanes ("swim lanes") that enable safe gray‑release during peak traffic and provide an emergency path when capacity bottlenecks occur.
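A rough sketch of lane‑aware routing follows. The lane names, the `baseline` default, and the rule of falling back to the baseline lane when a lane does not deploy a service are assumptions made for illustration; the talk does not describe the actual routing rules.

```python
DEFAULT_LANE = "baseline"  # hypothetical name for the main production lane


def pick_instance(lanes, service, request_lane=None):
    """Return (lane, address) for a request tagged with `request_lane`.

    `lanes` maps lane name -> {service name -> list of instance addresses}.
    A request stays inside its own lane when that lane deploys the service;
    otherwise it falls back to the baseline lane (an assumed policy).
    """
    for lane in (request_lane, DEFAULT_LANE):
        instances = lanes.get(lane, {}).get(service)
        if instances:
            return lane, instances[0]
    raise LookupError(f"no instance of {service!r} in lane {request_lane!r} or baseline")


lanes = {
    "baseline": {"order": ["10.0.0.1"], "pricing": ["10.0.0.2"]},
    "gray": {"order": ["10.0.1.1"]},  # only the service under gray release
}
```

In this sketch a request tagged `gray` reaches the gray `order` instance but still uses the baseline `pricing` instance, so a release can be exercised on a slice of traffic without duplicating every service.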

3. Building Technical Assurance Capabilities

After completing service‑oriented reconstruction, Huolala established a global stability team responsible for a comprehensive technical assurance platform that covers proactive preparation, fault detection, response, mitigation, and post‑mortem review.

The platform includes a NOC team that must detect issues within 1 minute and respond within 5 minutes, acting as the gatekeeper of stability. On the organizational side, dedicated structures are responsible for fault reviews and for continuously tracking stability metrics.

The monitoring platform (AI‑Monitor) provides standard monitoring, alerting, data storage, query, and visualization, as well as advanced capabilities such as intelligent discovery, health scanning, and automated analysis. It monitors over 1,000 applications and 9,000 nodes, generating more than 5,000 alerts daily (pre‑noise reduction).
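The article does not say how AI‑Monitor reduces alert noise. One common and simple technique is suppressing repeats of the same alert within a time window, sketched below. The function name, the `(timestamp, app, rule)` tuple shape, and the 300‑second window are all assumptions for illustration.

```python
def suppress_duplicates(alerts, window_seconds=300):
    """Drop repeats of the same (app, rule) alert inside a suppression window.

    `alerts` is an iterable of (timestamp_seconds, app, rule) tuples sorted by
    timestamp. An alert is kept if its (app, rule) pair has not been emitted
    during the previous `window_seconds`.
    """
    last_emitted = {}  # (app, rule) -> timestamp of the last alert that was kept
    kept = []
    for ts, app, rule in alerts:
        key = (app, rule)
        previous = last_emitted.get(key)
        if previous is None or ts - previous >= window_seconds:
            kept.append((ts, app, rule))
            last_emitted[key] = ts
    return kept
```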

Risk‑prediction in the AI‑OPS module continuously assesses OS metrics, process metrics, upstream/downstream services, QPS/latency, infrastructure health, and JVM information, issuing early warnings when indicators deviate from defined thresholds. Over 70% of smoke events are predicted in advance, allowing early mitigation.
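Threshold‑based early warning of this kind can be sketched in a few lines. The metric names, threshold values, and the two‑level warning/critical split below are hypothetical; the source only states that warnings fire when indicators deviate from defined thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float      # value at which an early warning is raised
    critical: float  # value at which the indicator is considered critical


def assess(metrics, thresholds):
    """Compare current metric values with thresholds and return findings.

    Returns a list of (metric_name, level, value) for every metric that has a
    configured threshold and has reached at least the warning level.
    """
    findings = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is None:
            continue  # no threshold configured for this metric
        if value >= limit.critical:
            findings.append((name, "critical", value))
        elif value >= limit.warn:
            findings.append((name, "warning", value))
    return findings


# Hypothetical thresholds for one service instance.
THRESHOLDS = {
    "cpu_percent": Threshold(warn=70, critical=90),
    "p99_latency_ms": Threshold(warn=300, critical=800),
    "jvm_old_gen_percent": Threshold(warn=75, critical=92),
}
```

Raising a warning level well below the critical level is what leaves time for mitigation before users notice a problem.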

Automated root‑cause analysis aggregates abnormal metrics, uses service‑link relationships to narrow down where the fault originates, and provides initial diagnostic suggestions, which greatly speeds up troubleshooting and reduces business impact.
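One simple way to use service‑link relationships for root‑cause hints is to walk the call graph from the alerting entry point and report the deepest abnormal services, that is, abnormal services with no abnormal downstream dependency. The sketch below implements that heuristic only; the function name and data shapes are assumptions, and real analysis would weigh many more signals.

```python
def root_cause_candidates(dependencies, abnormal, entry):
    """Return abnormal services reachable from `entry` with no abnormal dependency.

    `dependencies` maps each service to the services it calls (downstream).
    `abnormal` is the set of services whose metrics currently look abnormal.
    Only abnormal services are traversed, on the assumption that a fault
    propagates upstream through a chain of abnormal callers.
    """
    seen = set()
    stack = [entry]
    candidates = []
    while stack:
        service = stack.pop()
        if service in seen or service not in abnormal:
            continue
        seen.add(service)
        abnormal_downstream = [d for d in dependencies.get(service, []) if d in abnormal]
        if abnormal_downstream:
            stack.extend(abnormal_downstream)  # the cause likely lies deeper
        else:
            candidates.append(service)         # deepest abnormal node on this path
    return sorted(candidates)


deps = {
    "gateway": ["order", "user"],
    "order": ["pricing", "mysql-order"],
    "pricing": [],
    "user": [],
    "mysql-order": [],
}
```

With `gateway`, `order`, and `mysql-order` all abnormal, the heuristic points at `mysql-order` rather than at the gateway where the alert surfaced.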

Full‑link capacity stress testing is a core focus. Huolala conducts bi‑weekly capacity tests that simulate 1.5‑2× traffic on all services to uncover design and performance bottlenecks, and it runs regular fault‑simulation drills to keep the team sensitive to stability issues.
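The arithmetic behind a 1.5‑2× test plan is straightforward and can be sketched as follows. The service names, QPS figures, and function names are invented for illustration; only the 1.5× and 2× multipliers come from the text.

```python
def stress_targets(peak_qps, multipliers=(1.5, 2.0)):
    """Map each service to the QPS levels a stress test should reach."""
    return {
        service: tuple(round(qps * m) for m in multipliers)
        for service, qps in peak_qps.items()
    }


def bottlenecks(peak_qps, measured_capacity, multiplier=1.5):
    """Return {service: shortfall_qps} for services that cannot sustain the target.

    `measured_capacity` holds the highest QPS each service sustained in the
    last test; a service with no measurement is treated as capacity 0
    (an assumption of this sketch).
    """
    shortfalls = {}
    for service, qps in peak_qps.items():
        target = qps * multiplier
        capacity = measured_capacity.get(service, 0)
        if capacity < target:
            shortfalls[service] = target - capacity
    return shortfalls


# Hypothetical numbers: normal peak QPS and capacity measured under test.
peak = {"order": 1000, "dispatch": 400}
measured = {"order": 1800, "dispatch": 500}
```

With these made‑up numbers, `dispatch` falls short of the 1.5× target by 100 QPS, while `order` passes 1.5× but falls short at 2×.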

Data platform construction (DMS) focuses on data isolation, rate‑limiting, and auto‑scaling; DAL provides flexible sharding, read/write separation; and resource‑ID design abstracts physical connection details, enabling developers to use services without knowing underlying cluster configurations, while operations can quickly locate problematic resources.
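The resource‑ID idea can be illustrated with a small lookup layer: application code refers to a logical ID, and only the catalog knows the physical endpoint. The class names, the ID format, and the endpoint fields below are hypothetical, not DMS or DAL internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int
    database: str


class ResourceCatalog:
    """Maps logical resource IDs to physical endpoints (illustrative only)."""

    def __init__(self):
        self._endpoints = {}

    def bind(self, resource_id, endpoint):
        """Create or update a binding; callers of `resolve` need no code change."""
        self._endpoints[resource_id] = endpoint

    def resolve(self, resource_id):
        """Developer-facing lookup: logical ID in, connection details out."""
        try:
            return self._endpoints[resource_id]
        except KeyError:
            raise LookupError(f"unknown resource id: {resource_id!r}") from None

    def ids_on_host(self, host):
        """Operations-facing lookup: which logical resources live on `host`?"""
        return sorted(rid for rid, ep in self._endpoints.items() if ep.host == host)
```

Because applications hold only the ID, operations can move a database to another cluster by rebinding the ID, and can go from a misbehaving host back to the affected resources with the reverse lookup.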

4. Cross‑Cloud Thinking and Implementation

Huolala balances efficiency, cost, and stability in multi‑cloud environments through three measures: smoothing cloud‑provider differences, mitigating cloud‑induced jitter, and controlling IT costs.

To smooth differences, Huolala built the LCloud tool platform that unifies APIs across cloud providers, giving developers a consistent experience.
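A unified API layer of this kind is typically built with the adapter pattern, sketched below. Nothing here reflects LCloud's real interface: the provider names, the `create_instance` signature, and the vendor‑specific quirks (a "flavor" string, memory in MiB) are made up to show how differences can be hidden behind one call.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Common interface every provider adapter implements (hypothetical)."""

    @abstractmethod
    def create_instance(self, cpu, memory_gb):
        """Return a dict with keys: provider, instance_id, cpu, memory_gb."""


class VendorAAdapter(CloudProvider):
    # Pretend vendor A sizes machines with a "flavor" string.
    def create_instance(self, cpu, memory_gb):
        flavor = f"c{cpu}.m{memory_gb}"
        return {"provider": "vendor-a", "instance_id": f"a-{flavor}",
                "cpu": cpu, "memory_gb": memory_gb}


class VendorBAdapter(CloudProvider):
    # Pretend vendor B expects memory in MiB rather than GB.
    def create_instance(self, cpu, memory_gb):
        memory_mib = memory_gb * 1024
        return {"provider": "vendor-b", "instance_id": f"b-{cpu}x{memory_mib}",
                "cpu": cpu, "memory_gb": memory_gb}


class UnifiedCloud:
    """Single entry point that hides per-vendor request formats."""

    def __init__(self, adapters):
        self._adapters = dict(adapters)

    def create_instance(self, provider, cpu, memory_gb):
        try:
            adapter = self._adapters[provider]
        except KeyError:
            raise ValueError(f"unknown provider: {provider!r}") from None
        return adapter.create_instance(cpu, memory_gb)
```

Callers pass the same arguments and receive the same result shape regardless of vendor, which is the "consistent experience" the platform aims for.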

Cloud services can jitter, and even brief network jitter can raise latency enough to make a service unavailable. Huolala therefore ensures that its services have sufficient elasticity to absorb such disturbances.

Cost control leverages Kubernetes for elastic resource scheduling, spot instances, reserved instances, and container‑team efforts to reclaim idle resources for offline tasks.
