How Huolala Ensures Doris Stability: Real-World Big Data Practices
This article describes Huolala's big-data architecture and the practical measures the company uses to keep Doris reliable and performant on its rapidly growing logistics platform. It covers the background and stability challenges, case studies, discovery mechanisms, capacity planning, high availability, and automation.
Background and Challenges
Huolala, founded in 2013 in the Greater Bay Area, provides intra‑city and inter‑city freight, enterprise logistics, moving services, less‑than‑truckload, and vehicle sales. By December 2022 it operated in 360 Chinese cities with 680,000 active drivers and 9.5 million active users, entering a rapid growth phase.
The company’s big‑data system consists of a foundational layer (computation, storage, cluster management), a data‑research and data‑warehouse layer, and business‑oriented application and service layers that support and empower its services.
Doris Overview
Since early 2022 Huolala has been using Doris as its primary OLAP engine. Doris ingests data from IoT devices, event tracking, and databases via both real-time and offline data warehouses. It powers the AB-testing, user-profiling, growth-analysis (Luopan, "Compass"), and data-visualization (Yuntai, the cloud dashboard) platforms.
Stability Challenges
Two major challenges exist: (1) Business demands high service stability because Doris underpins many core services; (2) Gaps between open‑source capabilities and production needs, such as limited monitoring, alerting, and operational control, lead to frequent issues as Doris evolves.
Stability goals include maintaining >99.45% accuracy for core‑link data, detecting issues within 5 minutes, and restoring P0 incidents within 5 minutes (P1 within 10 minutes).
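The detection and recovery targets above can be expressed as a simple check to run against incident records. The following is a minimal sketch: the `Incident` record and its field names are illustrative assumptions, and only the minute thresholds come from the text.

```python
from dataclasses import dataclass

# Targets quoted in the text (minutes). Everything else here is hypothetical.
DETECTION_TARGET_MIN = 5
RECOVERY_TARGET_MIN = {"P0": 5, "P1": 10}


@dataclass
class Incident:
    priority: str              # "P0" or "P1"
    minutes_to_detect: float   # time from fault to first alert
    minutes_to_recover: float  # time from alert to restored service


def meets_targets(incident: Incident) -> bool:
    """Return True when detection and recovery both fall within target."""
    recovery_limit = RECOVERY_TARGET_MIN[incident.priority]
    return (
        incident.minutes_to_detect <= DETECTION_TARGET_MIN
        and incident.minutes_to_recover <= recovery_limit
    )
```

Running a check like this over the incident log during reviews is one way to turn the stated goals into a number that can be tracked over time.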
Stability Capability Cases
Case 1 – Query Performance: Intermittent errors in cloud-dashboard queries were caused by thread-pool saturation; mitigated by increasing cache size, lowering query timeouts, and adding large-query interception.
Case 2 – Derivation Performance: Real-time tasks were delayed by out-of-order jobs; solved by optimizing Doris task parameters, tightening change control, and implementing multi-tenant isolation.
Case 3 – Data Quality: Spark Load imports into a Unique model produced inconsistent results; resolved by converting to a Duplicate model, rewriting tasks, and adding usage guidelines.
Case 4 – Version Upgrade: An OOM occurred during the upgrade from 1.1 to 1.2 because of missing predicate push-down; fixed by optimizing SQL predicates and planning HA solutions.
Case 5 – Business Change: Schema changes triggered a Doris 1.0 bug that corrupted segments; an emergency Spark Load recovery plan and stricter change governance were introduced.
Stability Construction Approach
Three dimensions guide the effort: Fewer Incidents, Fast Detection, and Rapid Recovery. Key capabilities include discovery, capacity planning, high availability, automation, and additional safeguards.
Discovery Capability
Doris monitoring uses Zabbix for service health. Monitoring covers three levels:
Table‑level metrics (capacity, status)
Task‑level metrics (derivation jobs)
Component metrics (queries, processes, machine resources)
Metrics are tiered: Level 1 indicates a service outage, while Levels 2 and 3 support routine troubleshooting.
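One way to picture the tiering is a small registry that maps each metric to a level and routes alerts accordingly. This is a sketch under stated assumptions: the metric names, the `Tier` labels for Levels 2 and 3, and the routing targets (`page_on_call`, `dashboard`) are hypothetical and do not come from the article or from Zabbix.

```python
from enum import IntEnum


class Tier(IntEnum):
    OUTAGE = 1      # Level 1: service outage
    DEGRADED = 2    # Level 2: routine troubleshooting
    DIAGNOSTIC = 3  # Level 3: routine troubleshooting


# Hypothetical metric names, one or more per monitoring scope
# (component, task, table).
METRIC_TIERS = {
    "component.fe_process_alive": Tier.OUTAGE,
    "component.query_error_rate": Tier.DEGRADED,
    "task.job_delay_seconds": Tier.DEGRADED,
    "table.disk_usage_bytes": Tier.DIAGNOSTIC,
}


def route_alert(metric: str) -> str:
    """Page a human only for Level 1; send everything else to a dashboard."""
    tier = METRIC_TIERS.get(metric, Tier.DIAGNOSTIC)
    return "page_on_call" if tier == Tier.OUTAGE else "dashboard"
```

Unknown metrics default to the lowest tier here so that a newly added metric cannot page anyone by accident; a stricter setup might choose the opposite default.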
Capacity Planning
Capacity is assessed by business needs, data volume, hardware resources, and cluster size. Thresholds for disk, CPU, memory, and task queues determine whether the cluster is healthy or requires scaling.
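The threshold-based health check described above can be sketched as follows. The threshold values and key names are placeholders chosen for illustration, since the article does not state the actual limits.

```python
from typing import Dict, List, Tuple

# Hypothetical limits; the real values would come from capacity reviews.
THRESHOLDS: Dict[str, float] = {
    "disk_pct": 80.0,
    "cpu_pct": 70.0,
    "memory_pct": 75.0,
    "queued_tasks": 100.0,
}


def capacity_status(usage: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return ("healthy" | "needs_scaling", sorted list of breached metrics)."""
    breached = sorted(
        name for name, limit in THRESHOLDS.items()
        if usage.get(name, 0.0) > limit
    )
    return ("needs_scaling" if breached else "healthy", breached)
```

Returning the list of breached metrics alongside the verdict keeps the result actionable: a disk breach and a task-queue breach usually call for different kinds of scaling.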
High‑Availability
FE nodes are deployed in a three-node HA configuration. BE data is kept in three replicas on four-node clusters, so no single node is a point of failure. Load balancers distribute connections to provide read/write HA.
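The reasoning behind three replicas on four nodes can be checked with a small simulation. The sketch below assumes a tablet stays usable while a majority of its replicas are alive; that rule, the node names, and the `placement` structure are assumptions of this example, not details taken from the article.

```python
from typing import Dict, List


def survives_single_node_loss(
    placement: Dict[str, List[str]], replicas: int = 3
) -> bool:
    """Check that every tablet keeps a replica majority if any one node fails.

    placement maps a tablet id to the BE nodes holding its replicas.
    """
    majority = replicas // 2 + 1
    nodes = {node for hosts in placement.values() for node in hosts}
    for failed in nodes:
        for hosts in placement.values():
            # Replicas on the same node count once: they fail together.
            alive = {h for h in hosts if h != failed}
            if len(alive) < majority:
                return False
    return True
```

With three replicas spread over distinct nodes, losing one node leaves two of three alive, and the fourth node leaves room to rebuild the lost replica elsewhere.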
Automation
An automated operations platform built on Conductor and Ansible orchestrates Doris deployment, scaling, and upgrades, improving stability and operational efficiency.
Other Safeguards
Query interception rules to kill abusive queries.
Fast data and tablet recovery mechanisms.
Business isolation via multi‑tenant clusters.
RBAC for fine‑grained user permissions.
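The query interception rule in the list above can be sketched as a filter over running queries. The `RunningQuery` fields and both limits are hypothetical; in practice the inputs would come from the cluster's own query statistics, and the selected queries would then be cancelled through whatever mechanism the platform uses.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RunningQuery:
    query_id: str
    user: str
    elapsed_seconds: float
    scanned_rows: int


def queries_to_kill(
    queries: Iterable[RunningQuery],
    max_seconds: float = 60.0,             # hypothetical runtime limit
    max_scanned_rows: int = 1_000_000_000,  # hypothetical scan limit
) -> List[str]:
    """Return ids of queries that exceed either the runtime or the scan limit."""
    return [
        q.query_id
        for q in queries
        if q.elapsed_seconds > max_seconds or q.scanned_rows > max_scanned_rows
    ]
```

Keeping the rule a pure function of observed statistics makes it easy to replay against historical query logs before enabling it on a production cluster.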
Stability Process Standards
Three procedural standards ensure consistent stability:
Doris Admission Guidelines: Evaluate new demands against stability, ingestion, and storage criteria; recommend alternatives (e.g., MySQL, HBase) when unsuitable.
Doris Usage Guidelines: Provide test clusters, best-practice documentation, and anti-patterns (e.g., small bucket sizes, uncontrolled writes, full-table scans).
Doris Change Management: Define change windows, release communication, review processes, and functional/stability acceptance testing.
Summary and Planning
Stability targets are quantified, incidents are recorded and reviewed, and processes are refined continuously. Future plans include multi‑cluster HA, multi‑tenant isolation, cold‑hot storage strategies, expanding OLAP platform capabilities, and leveraging upcoming Doris 2.0 features such as high‑concurrency point‑queries and integrated text search.
Thank you for reading.
