
How Alibaba Cloud Hologres Scales to 8,192 Nodes with Cloud‑Native Architecture

Alibaba Cloud's real‑time data warehouse Hologres passed a large‑scale performance test of up to 8,192 nodes, demonstrating how cloud‑native design, automated operations, and intelligent monitoring enable ultra‑large‑scale deployment and high availability for enterprise data platforms.

Alibaba Cloud Developer

Feb 9, 2022


From November 23 to December 3, 2021, the China Academy of Information and Communications Technology (CAICT) evaluated 27 distributed analytical database products. Alibaba Cloud's real‑time data warehouse Hologres (formerly Interactive Analysis) passed the large‑scale performance test and set a new record with 8,192 nodes.

At 8,192 nodes, Hologres is the largest MPP data warehouse to have passed this evaluation. The result shows that it can serve as foundational infrastructure for data warehouses and big‑data platforms, supporting massive scale and critical industry workloads.

Challenges of Ultra‑Large‑Scale Deployment

Exponential data growth makes single‑node databases insufficient, especially for analytical queries that may need to scan full datasets. Enterprises also demand fresh, real‑time data and high‑performance analytics. Together, these requirements create several challenges for a distributed data warehouse:

Meeting fast delivery and elastic scaling requirements.

Defining service availability indicators and an SLA framework.

Choosing hardware models and planning capacity when storage and compute are integrated.

Overcoming weak monitoring, slow fault recovery, and a lack of self‑healing.

Scaling to tens of thousands of nodes further amplifies the scheduling, deployment, and operational difficulties:

Starting instances and scaling elastically within seconds on clusters with tens of thousands of nodes.

Ensuring capacity planning, stability, and self‑healing for massive clusters.

Providing minute‑level monitoring and rapid issue resolution.

Cloud‑Native Large‑Scale Scheduling Architecture

Hologres adopts a cloud‑native containerized deployment using Kubernetes as the resource scheduler, supporting clusters of over 10,000 servers and single instances with 8,192 nodes or more.

Kubernetes Ten‑Thousand‑Node Scheduling

Upstream Kubernetes officially supports clusters of up to 5,000 nodes. Alibaba Cloud extended this to ten‑thousand‑node clusters through deep optimizations of the etcd, API Server, controller, and scheduler components. At that scale, the following problems had to be addressed:

Etcd read/write latency and service‑denial incidents.

High API Server query latency causing etcd OOM.

Controller processing delays and slow recovery.

Scheduler high latency and low throughput.

The optimizations included a new free‑page management algorithm for etcd, a lightweight heartbeat mechanism, improved load balancing across highly available API Servers, hot‑standby failover for the controller and scheduler, and enhanced scheduling algorithms that use equivalence‑class handling and random relaxation.
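As a rough illustration of the two scheduler ideas named above, the following Python sketch caches feasibility checks per equivalence class of pods (pods with identical resource requests share one cached result per node) and scores only a random sample of the feasible nodes rather than all of them. It is a toy model under stated assumptions, not Alibaba Cloud's scheduler code: `Node`, `Pod`, `ToyScheduler`, `sample_size`, and the scoring rule are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: int  # millicores still unallocated
    free_mem: int  # MiB still unallocated


@dataclass(frozen=True)
class Pod:
    name: str
    cpu: int
    mem: int

    def equivalence_class(self):
        # Pods with identical requests are interchangeable for filtering.
        return (self.cpu, self.mem)


class ToyScheduler:
    """Hypothetical scheduler showing equivalence caching and sampled scoring."""

    def __init__(self, nodes, sample_size=3, seed=0):
        self.nodes = {n.name: n for n in nodes}
        self.sample_size = sample_size
        self.rng = random.Random(seed)
        self.fit_cache = {}    # (equivalence class, node name) -> bool
        self.filter_calls = 0  # number of uncached feasibility checks

    def _fits(self, pod, node):
        key = (pod.equivalence_class(), node.name)
        if key not in self.fit_cache:
            self.filter_calls += 1
            self.fit_cache[key] = (
                node.free_cpu >= pod.cpu and node.free_mem >= pod.mem
            )
        return self.fit_cache[key]

    def schedule(self, pod):
        feasible = [n for n in self.nodes.values() if self._fits(pod, n)]
        if not feasible:
            return None
        # Relaxation: score a random sample, not every feasible node.
        count = min(self.sample_size, len(feasible))
        candidates = self.rng.sample(feasible, count)
        best = max(candidates, key=lambda n: (n.free_cpu, n.free_mem))
        best.free_cpu -= pod.cpu
        best.free_mem -= pod.mem
        # The chosen node's capacity changed, so its cached results are stale.
        stale = [key for key in self.fit_cache if key[1] == best.name]
        for key in stale:
            del self.fit_cache[key]
        return best.name
```

With four identical nodes, scheduling two pods of the same class costs four real feasibility checks for the first pod but only one for the second, because only the node that just changed has to be re-evaluated. The trade-off of sampling is that the chosen node is not guaranteed to be the globally best one.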

Hologres Operations System

Overview

Leveraging Alibaba Cloud's big‑data operations platform, Hologres built a system for automated delivery, real‑time observability, and intelligent self‑healing in order to meet production‑grade SLAs.

Automated Cluster Delivery

Hologres separates storage and compute and deploys its compute nodes via Kubernetes. The ABM management system abstracts resources and business clusters, which enables rapid provisioning and capacity maintenance.

Observability System

Hologres implements a rich metric monitoring system using Alibaba's Emon platform, supporting billions of metric writes per second, automatic down‑sampling, and aggregation. Metrics are also exported to cloud monitoring for user‑side inspection.
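To make "automatic down‑sampling and aggregation" concrete, here is a minimal Python sketch that collapses raw (timestamp, value) samples into fixed‑width time buckets and keeps the minimum, maximum, mean, and count for each bucket. It does not reflect Emon's actual API or storage format, which this article does not describe; the `downsample` function and the 60‑second bucket width are assumptions chosen for illustration.

```python
from collections import defaultdict


def downsample(points, bucket_seconds=60):
    """Aggregate (unix_ts, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each sample to the start of the bucket it falls into.
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        {
            "ts": start,
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    ]
```

Keeping the minimum and maximum alongside the mean preserves short spikes that an average alone would hide, which matters when the down‑sampled series is later used for alerting.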

Log collection utilizes Alibaba Cloud SLS with modular and tiered ingestion, enabling cost‑effective large‑scale log handling and keyword‑based alerts.
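Keyword‑based alerting of the kind described above can be sketched in a few lines of Python. The sketch does not call the SLS API; the keyword list, the `scan_logs` name, and the threshold are hypothetical and stand in for whatever a real deployment would configure.

```python
import re
from collections import Counter

# Hypothetical alert keywords; a real deployment would configure its own.
ALERT_PATTERN = re.compile(r"\b(oom|core dump|timeout)\b", re.IGNORECASE)


def scan_logs(lines, threshold=2):
    """Return keywords seen on at least `threshold` lines, sorted by name."""
    hits = Counter()
    for line in lines:
        # Only the first keyword on each line is counted.
        match = ALERT_PATTERN.search(line)
        if match:
            hits[match.group(1).lower()] += 1
    return sorted(word for word, count in hits.items() if count >= threshold)
```

A threshold above one avoids alerting on a single stray log line, at the cost of reacting slightly later to a real incident.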

To assess overall instance health, a meta‑warehouse aggregates multi‑dimensional data (metadata, module status, events) for comprehensive availability monitoring and supports slow‑query diagnostics.

Intelligent Operations for SLA Improvement

Building on the observability foundation, Hologres implements automated fault diagnosis, self‑healing, and intelligent inspection. Known issues trigger automatic recovery, while unknown issues generate tickets for manual handling, and the set of issues that can be self‑healed expands gradually over time.
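The split between known and unknown issues can be pictured as a small dispatcher: issue types with a registered recovery action are healed automatically, and everything else becomes a ticket. The Python sketch below is an assumption about the general shape of such a system, not Hologres code; `SelfHealer`, the issue type strings, and the target names are hypothetical.

```python
class SelfHealer:
    """Hypothetical dispatcher: auto-recover known issues, ticket the rest."""

    def __init__(self):
        self.playbooks = {}  # issue type -> recovery callable
        self.tickets = []    # (issue type, target) pairs escalated to humans

    def register(self, issue_type, action):
        # Registering a playbook turns an unknown issue into a known one.
        self.playbooks[issue_type] = action

    def handle(self, issue_type, target):
        action = self.playbooks.get(issue_type)
        if action is None:
            self.tickets.append((issue_type, target))
            return "ticket"
        action(target)
        return "recovered"
```

For example, after `healer.register("pod_crash_loop", restart_pod)` with some hypothetical `restart_pod` function, a crash loop is handled automatically, while an unregistered issue type still lands in `healer.tickets` until someone writes a playbook for it.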

Product‑Level Operational Capabilities

Hologres employs a high‑availability architecture proven in Alibaba's major events (e.g., Double 11). It features storage‑compute separation, multiple replicas for read/write separation, and a robust scheduling system for rapid node failover.

Observability includes multi‑dimensional metrics (CPU, memory, connections, I/O), slow‑query logs with detailed diagnostics, and execution‑plan visualizations to guide performance tuning.

Conclusion

Through targeted analysis and optimization of scheduling bottlenecks, Hologres achieves deployment and scaling to 8,192 nodes and beyond. Its cloud‑native intelligent operations, in turn, ensure high performance, high availability, and rapid issue resolution, supporting enterprise digital transformation.
