How Huolala Cuts Big Data Costs with Hybrid Cloud Strategies
This article details Huolala's comprehensive big‑data cost‑control system—covering data‑asset measurement, budgeting, auxiliary governance, storage tiering, and elastic compute management—to dramatically reduce both storage and compute expenses while maintaining service quality across diverse workloads.
Preface
Enterprise cost reduction is a perennial topic; beyond blunt across-the-board cuts, companies can systematically trim big‑data, marketing, and operational expenses. At ArchSummit Shenzhen, Wang Haihua, Huolala's big‑data architecture lead, shared how the company built a hybrid‑cloud big‑data cost‑control system, outlining the challenges it faced and the solutions it adopted.
Background and Challenges
Huolala operates many business lines (freight, moving, large‑vehicle, cross‑city, car sales), covering over 350 cities with 7.6 million monthly active users. Its big‑data platform spans three IDC locations (Huawei Cloud, Alibaba Cloud, and self‑built data centers) and manages a large fleet of machines, massive storage, and a heavy daily task load.
Huolala's big‑data architecture consists of six layers from bottom to top:
Foundation and access layer – basic storage, compute, and ingestion capabilities.
Platform layer – data development, operations, and governance.
Data‑warehouse layer – layered models, data‑mart tables, and tag/metric/feature systems for applications.
Service and application layers.
Although the layers are described bottom‑up, they are tightly inter‑dependent, which makes cost governance complex.
Cost‑control challenges fall into three categories: scenario diversity, data‑asset diversity, and cost‑control difficulty.
Scenario Diversity
Huolala's big‑data workloads span offline, online, and real‑time scenarios, each with its own 24‑hour compute‑volume profile.
All three exhibit clear peaks and valleys: offline compute peaks between midnight and 6 am, while real‑time compute peaks during the day. Fixed‑size clusters waste resources during off‑peak periods, and because each scenario runs on a separate cluster, idle capacity cannot be shared between them.
Data‑Asset Diversity
The data‑flow architecture consists of data collection, storage & compute, and data services.
Data collection includes real‑time and scheduled offline ingestion, generating asset information.
Real‑time storage and compute involve HBase, Kafka, MySQL, etc., producing tables and streams that may be pushed to services.
Offline tasks generate Hive tables, warehouse tables, metrics, tags, and features.
Data services expose APIs and reports.
All of these, more than ten asset types in total, consume cost and must be measured and governed.
Cost‑Control Challenges
Key issues include:
Per‑order big‑data cost rising even as the business grows, when scale should be driving it down.
Unclear cost consumers—costs should be charged to the business units that use the data platform.
Uncertainty about whether cost usage is reasonable or healthy.
Overall Approach
The strategy consists of:
Data‑asset inventory and cost allocation, establishing budget requests and control mechanisms for business units.
After budgeting, provide auxiliary governance (table deletion, lifecycle, archiving) to help business manage costs.
Big‑Data Cost‑Control System
The system has four components: data‑asset measurement, resource budgeting, auxiliary governance, and continuous operation.
Data‑asset measurement: collect resource‑pool, cost, and health data to support downstream control.
Resource budgeting: budget requests, usage tracking, alerts, and limits.
Auxiliary governance: offline storage and compute governance capabilities.
Continuous operation: health‑based red/black lists with rewards and penalties.
Budget control starts with cost allocation: the top five consuming resource types are placed under budget caps, usage is tracked monthly, and business units that exceed their cap for three consecutive quarters are warned or restricted.
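The tracking‑and‑alert loop can be sketched as a simple policy function. This is a minimal illustration, not Huolala's actual implementation; the `budget_status` helper and its thresholds are ours, with only the "three consecutive periods" rule taken from the text.

```python
def budget_status(period_spend, cap, restrict_after=3):
    """Classify a business unit's budget health.

    An ongoing overrun triggers a warning; `restrict_after`
    consecutive over-cap periods trigger a restriction.
    Illustrative policy only.
    """
    consecutive = 0
    for spend in period_spend:
        consecutive = consecutive + 1 if spend > cap else 0
    if consecutive >= restrict_after:
        return "restrict"
    if consecutive > 0:
        return "warn"
    return "ok"
```

A unit that recovers under its cap resets the streak, so one‑off spikes warn without restricting.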
Storage Cost Optimization
Rapid storage growth (tripled from Feb to Sep) and increasing cold data prompted a three‑step solution:
Identify and label cold data.
Build capabilities: archiving, temperature display, lifecycle expiration.
Data‑team‑led governance: archive cold data, lifecycle cleanup, drop unused tables.
Data Hot/Cold Tiering
Guided by the cost‑benefit curves of cloud archive storage, data accessed zero times in the last 90 days is classified as "ice data" and data accessed once or twice as "cold data"; more frequently accessed data is treated as hot.
Ice data is prioritized for deletion or archiving; cold data is recommended for archiving; hot data may be cached for faster computation.
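The tiering rules above are simple enough to write down directly; the function name and return labels are ours, only the thresholds come from the text.

```python
def classify_temperature(accesses_last_90_days: int) -> str:
    """Label data by access frequency over the last 90 days.

    0 accesses -> "ice", 1-2 accesses -> "cold", more -> "hot".
    """
    if accesses_last_90_days == 0:
        return "ice"    # prioritized for deletion or archiving
    if accesses_last_90_days <= 2:
        return "cold"   # recommended for archiving
    return "hot"        # may be cached for faster computation
```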
Data Archiving
Each table carries two attributes: a lifecycle and an archiving period. For example, lifecycle = 180 days means data older than 180 days is deleted, and archiving = 90–180 days means data aged between 90 and 180 days is archived. For partitioned tables, age is computed from the partition's business date; for non‑partitioned tables, from max(modification time, last‑access time).
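A minimal sketch of the age‑based decision, assuming the 180‑day lifecycle and 90–180‑day archiving window from the example (the helper name and defaults are ours):

```python
from datetime import date, timedelta

def partition_action(business_date: date, today: date,
                     lifecycle_days: int = 180,
                     archive_from_days: int = 90) -> str:
    """Decide what to do with one partition based on its age.

    age >= lifecycle -> delete; archive window -> archive; else keep.
    For non-partitioned tables the caller would derive the date from
    max(modification time, last-access time) instead.
    """
    age = (today - business_date).days
    if age >= lifecycle_days:
        return "delete"
    if age >= archive_from_days:
        return "archive"
    return "keep"
```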
Additional storage optimizations include lifecycle management, archiving, file‑compression upgrades, and deep warehouse governance.
Snappy is the default compression codec; switching to Zlib saves 25–30% of space. Zlib is therefore used for latency‑insensitive workloads, while latency‑sensitive core tables use Zstd, which delivers high compression ratios with decompression speeds close to Snappy's.
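The underlying trade‑off is compression ratio versus CPU. Python's standard library has no Snappy or Zstd codec, so as a rough stand‑in this sketch shows the same trade‑off using zlib's compression levels:

```python
import zlib

# Illustrative only: levels trade CPU for output size, just as
# Snappy (fast, larger) vs Zlib/Zstd (slower, smaller) do in Hive.
payload = b"order_id,city,price\n" + b"12345,shenzhen,99.50\n" * 5000

fast = zlib.compress(payload, level=1)   # less CPU, larger output
small = zlib.compress(payload, level=9)  # more CPU, smaller output

assert zlib.decompress(small) == payload  # both levels are lossless
```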
Result: after optimization, storage growth halted for eight months and then declined, achieving a 54 % cost reduction and saving tens of millions of RMB annually.
Compute Cost Optimization
Offline and real‑time clusters show clear peak‑valley patterns, leaving 20–30% of resources wasted during off‑peak periods.
Elastic compute resource management addresses this via three mechanisms:
In‑house elastic scaling service for dynamic high/low‑peak resource management.
Public‑cloud on‑demand and spot instances to build an elastic pool.
YARN scheduler enhancements to guarantee high‑priority jobs.
Elastic Scaling
Public‑cloud offers reserved, on‑demand, and spot instances. Reserved instances are cheap for fixed capacity; on‑demand is pricier but flexible; spot is cheapest but preemptible. Huolala combines all three to form an elastic pool, scaling down with on‑demand/spot during low load and scaling up when needed.
High‑priority tasks are scheduled to reserved or on‑demand instances to avoid spot preemption.
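The mixing strategy can be sketched as a small capacity planner. The 70/30 spot/on‑demand split below is an assumed tuning knob, not a figure from the talk; only the idea of covering the base with reserved instances and the burst with spot plus on‑demand comes from the text.

```python
def plan_elastic_pool(demand_cores: int, reserved_cores: int,
                      spot_share: float = 0.7) -> dict:
    """Split required capacity across instance types.

    Reserved instances cover the steady base; the burst above it is
    split between spot (cheap, preemptible) and on-demand (stable,
    pricier). High-priority jobs stay off the spot capacity.
    """
    burst = max(0, demand_cores - reserved_cores)
    spot = int(burst * spot_share)
    return {
        "reserved": min(demand_cores, reserved_cores),
        "spot": spot,
        "on_demand": burst - spot,
    }
```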
These measures reduced overall cluster cost by 20‑30 % without affecting high‑priority jobs.
Compute Over‑Commit
Logical resources were fully allocated, yet physical CPU utilization stayed below 50% (around 20% on real‑time clusters). By over‑committing logical resources at the YARN layer, raising NodeManager resource limits in 10–30% increments while monitoring CPU utilization and OOM events, Huolala reached a 25% over‑commit ratio on both offline and real‑time clusters, effectively turning ¥1 M of monthly compute spend into ¥1.25 M of capacity.
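The gradual roll‑out can be pictured as a small control loop. Apart from the 25% end state, every threshold below is an assumption for illustration:

```python
def next_overcommit_ratio(current: float, cpu_util: float,
                          oom_events: int, step: float = 0.10,
                          cpu_ceiling: float = 0.80,
                          target: float = 1.25) -> float:
    """One iteration of a stepped over-commit roll-out (hypothetical).

    Raise the logical-resource ratio by `step` while physical CPU
    stays under `cpu_ceiling` and no containers were OOM-killed;
    back off one step otherwise. Never drop below 1.0 (no
    over-commit) or exceed `target` (the 25% end state).
    """
    if oom_events > 0 or cpu_util >= cpu_ceiling:
        return max(1.0, current - step)
    return min(target, current + step)
```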
OOM protection comes from two directions: adjusting the kernel OOM scores of critical containers so they are killed last, and using YARN NodeManager unhealthy‑node back‑pressure to throttle scheduling onto overloaded nodes.
Compute Task Memory Optimization
Standard Hive containers are allocated 4 GB of memory per core, more than most jobs actually need. Reducing the default to 2 GB for normal queues (rolled out gradually with gray‑scale parameter tuning) and 3 GB for critical queues raised CPU utilization and cut Hive task resource consumption by 15%, equivalent to a 15% cluster expansion.
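To see why smaller containers free up CPU, assume YARN schedules purely on memory (the behavior of the capacity scheduler's `DefaultResourceCalculator`) and a hypothetical 128 GB node; both the node size and the 2x figure here are illustrative, while the article reports a 15% real‑world gain:

```python
def containers_per_node(node_mem_gb: int, container_mem_gb: int) -> int:
    """With memory-only YARN scheduling, node capacity is simply
    total memory divided by container size."""
    return node_mem_gb // container_mem_gb

before = containers_per_node(128, 4)  # 32 concurrent tasks per node
after = containers_per_node(128, 2)   # 64 concurrent tasks per node
```

In practice the gain is smaller than this idealized doubling, since CPU cores and actual job memory needs eventually bind.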
Additional CPU headroom can be gained by switching to instance types with a different CPU‑to‑memory ratio (e.g., 8 GB of memory per core).
Summary
Huolala's cost‑control framework consists of budgeting & control, auxiliary governance, continuous operation, and technical optimization.
Budgeting enables asset measurement, cost allocation, business‑unit visibility, budget caps, and regular tracking/alerts. Auxiliary governance provides tools for cold‑data cleanup, task waste detection, and cost‑based decisions. Technical optimizations—elastic resource management, compute over‑commit, memory tuning, and compression upgrades—further reduce expenses. Continuous operation keeps cost health in check.
Reflection
Systematic cost governance uncovers hidden spenders (e.g., HBase) and drives significant savings. Early‑stage, hands‑on governance works for small teams, but as scale grows, budget‑based self‑service governance frees the data team to focus on deeper technical optimizations.
The goal is transparent, low‑impact cost reduction rather than blunt resource cuts, using health‑driven operations and technology.
Future work includes expanding over‑commit to I/O and kernel dimensions, leveraging spot instances for latency‑tolerant workloads, and applying low‑frequency storage or deep‑archive tiers for cold data.
Speaker Introduction
Wang Haihua leads Huolala's big‑data stability, security, cost, and infrastructure team. With over seven years of experience at Didi, Ele.me, and Pinduoduo, he has built large‑scale data platforms and specializes in big‑data security, platform products, computer architecture, and distributed systems. He has spoken at QCon, SACC, DTCC, and other major conferences.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.