Practices in Building Distributed Technologies for Large‑Scale Cloud Computing Platforms
The article summarizes Dr. Zhang Wensong’s 2014 ArchSummit keynote on the challenges, architectural design, storage strategies, performance optimizations, monitoring, and future directions of Alibaba Cloud’s large‑scale distributed cloud computing platform, covering ECS, SLB, RDS, OCS and full‑link analytics.
Based on Dr. Zhang Wensong’s keynote at the 2014 ArchSummit titled “Practices in Building Distributed Technologies for Large‑Scale Cloud Computing Platforms,” this article presents the design and operational insights of Alibaba Cloud’s large‑scale platform. Slides are available from the ArchSummit website.
Speaker Biography
Dr. Zhang is a senior researcher and vice‑president at Alibaba Group, leading core software and cloud product R&D, performance optimization of network hardware/software, and the construction of next‑generation, highly scalable, low‑cost e‑commerce infrastructure. He is a Linux kernel contributor and the founder of the Linux Virtual Server (LVS) project, which runs in millions of clusters worldwide. Before Alibaba, he was chief scientist and co‑founder of TelTel and a former associate professor at the National University of Defense Technology.
Challenges and Requirements of Cloud Computing
Unlike Taobao, whose traffic is spread across many small, loosely coupled services, a cloud platform concentrates each customer's workload on individual VMs, so a single VM failure means 100 % unavailability for that customer; this makes reliability, data integrity, low latency, and cost efficiency critical. The platform must match bare‑metal performance, detect and resolve issues before users notice them, and remain cheaper than on‑premise servers.
ECS Distributed Storage Design
ECS (Elastic Compute Service) is Alibaba Cloud’s flagship VM product, backed by a distributed file system that supports snapshots, custom images, failover, network isolation, attack mitigation, and dynamic upgrades. A single control system currently manages about 3,600 physical machines, with a roadmap to 5,000‑20,000.
Data reliability is ensured by a hybrid write strategy: two copies are written synchronously before the write is acknowledged, with a third copy written asynchronously. Performance work focuses on reducing write latency by shortening the I/O path while preserving reliability, through:
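The two‑sync‑plus‑async idea can be sketched as follows. This is a minimal illustration, not Alibaba Cloud's actual interfaces: the names (`ReplicatedChunk`, `write`, `flush`) are invented, and real chunk servers persist to disk rather than in‑memory lists. The point is the shape of the latency/durability trade‑off: the client is acknowledged once two copies are durable, while a background worker catches the third copy up.

```python
import queue
import threading

class ReplicatedChunk:
    """Sketch of a 2-sync + 1-async replication write path (illustrative)."""

    def __init__(self):
        self.replicas = [[], [], []]      # three copies of the chunk's blocks
        self._async_q = queue.Queue()     # pending writes for the third copy
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, block):
        # Synchronous path: ack only after two copies are durable,
        # keeping write latency low while still tolerating one failure.
        self.replicas[0].append(block)
        self.replicas[1].append(block)
        # Asynchronous path: the third copy catches up in the background.
        self._async_q.put(block)
        return "acked"

    def _drain(self):
        while True:
            block = self._async_q.get()
            self.replicas[2].append(block)
            self._async_q.task_done()

    def flush(self):
        # Block until the third replica has caught up.
        self._async_q.join()
```

Acknowledging after two copies instead of three shortens the critical write path; the asynchronous third copy restores full redundancy shortly after.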
Hybrid SSD+SATA storage on chunk servers (≈5,500 IOPS for 4 KB random writes).
Cache mechanisms.
Multithreaded, event‑driven redesign of TDC and Chunk Server so a single thread handles an I/O request, eliminating locks and context switches.
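The run‑to‑completion idea behind that redesign can be sketched as below. This is an illustrative structure only, not the real TDC/Chunk Server code: each request is handled start to finish on one thread, so no locks are taken and no cross‑thread handoffs occur on the I/O path.

```python
from collections import deque

class EventLoop:
    """Run-to-completion sketch: one thread owns a request for its whole
    lifetime, so the hot path needs no locks or context switches."""

    def __init__(self):
        self.ready = deque()
        self.completed = []

    def submit(self, request, handler):
        self.ready.append((request, handler))

    def run_once(self):
        # Drain the queue on a single thread; a request never migrates
        # between threads while it is being processed.
        while self.ready:
            request, handler = self.ready.popleft()
            self.completed.append(handler(request))

loop = EventLoop()
loop.submit({"op": "write", "len": 4096}, lambda r: ("done", r["op"]))
loop.run_once()
```

In the real system, several such loops run in parallel, one per core, with requests partitioned between them so the loops never share mutable state.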
IO Path Cache Layers and Write Modes
Three cache layers exist from application to disk: user‑level cache (e.g., MySQL buffer pool), OS page cache, and hardware cache. Four write modes are discussed:
Buffer write: writes into the OS page cache and is flushed to disk later; fast, but data safety depends on the flush policy.
Direct write: bypasses the OS page cache and writes to the hardware cache; avoids flush stalls but is not durable until a sync.
Write+sync: writes to the page cache and immediately syncs to storage; safe but slower.
O_SYNC: opens the file so that every write is synchronously flushed to disk.
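The distinctions above map directly onto POSIX open flags. The sketch below (a hypothetical helper, `write_block`, assuming a Linux host) shows buffer write, write+sync, and O_SYNC; O_DIRECT is omitted because it additionally requires aligned buffers.

```python
import os
import tempfile

def write_block(path, data, mode="buffer"):
    """Illustrative helper showing three of the write modes via POSIX flags."""
    flags = os.O_WRONLY | os.O_CREAT
    if mode == "o_sync":
        flags |= os.O_SYNC          # kernel flushes durably on every write()
    fd = os.open(path, flags, 0o644)
    try:
        n = os.write(fd, data)      # "buffer": data lands in the page cache
        if mode == "write+sync":
            os.fsync(fd)            # explicit flush right after the write
        return n
    finally:
        os.close(fd)

# Hypothetical scratch file for demonstration.
demo = os.path.join(tempfile.mkdtemp(), "blk")
```

The buffer path returns as soon as the page cache accepts the data; the other two pay for durability on every call, which is exactly the gap the benchmark below quantifies.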
A benchmark on a local SAS disk with a 4 KB block size shows average throughput of ≈212 MB/s for buffer writes, ≈68 MB/s for direct writes, and ≈257 KB/s for direct+sync. In practice, buffer and direct writes account for about 97 % of operations.
IO in a Cloud Environment
In the cloud, the I/O path extends to a distributed storage layer, effectively replacing local disk cache with a distributed cache system while preserving POSIX semantics across the full VM‑to‑storage pipeline:
VM SYNC → PV front‑end FLUSH → backend → host → cache system → distributed storage.
Cache System Effects
Performance tests using fio illustrate the benefits of an SSD cache layer.
./fio -direct=1 -iodepth=8 -rw=randwrite -ioengine=libaio -bs=16k -numjobs=2 -runtime=30 -group_reporting -size=30G -name=/mnt/test30G

With iodepth=1, pure SATA storage yields ~200 IOPS with 8 ms average latency and ~7 ms jitter; adding an SSD cache raises IOPS to ~600 and cuts latency to ~3 ms and jitter to ~2 ms.
./fio -direct=1 -iodepth=8 -rw=randwrite -ioengine=libaio -bs=16k -numjobs=2 -runtime=30 -group_reporting -size=30G -name=/mnt/test30G

At iodepth=8, SATA alone reaches ~2,100 IOPS at 7 ms latency; with the SSD cache, IOPS climb to ~2,900 and latency drops to ~5 ms with ~1 ms jitter. The cache thus both accelerates writes and reduces latency jitter, which is crucial for high‑concurrency workloads (cf. Jeff Dean's "The Tail at Scale").
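Why jitter matters as much as the mean can be shown with a small percentile sketch. The helper `latency_profile` and the two sample distributions are illustrative only (shaped roughly like the fio results above, not the measured data): when a request fans out to many backends, the slowest responder, i.e. the tail, dominates user‑visible latency.

```python
def latency_profile(samples_ms):
    """Return (mean, p99) latency for a list of samples in milliseconds."""
    xs = sorted(samples_ms)
    mean = sum(xs) / len(xs)
    p99 = xs[min(len(xs) - 1, int(len(xs) * 0.99))]
    return round(mean, 2), p99

# Illustrative distributions: similar means, very different tails.
sata = [8] * 95 + [30] * 5        # occasional slow flushes inflate the tail
ssd_cached = [3] * 99 + [5] * 1   # the cache absorbs most slow paths
```

Here the SATA-like distribution has a p99 nearly four times its mean, while the cached one keeps the tail close to the average, which is the behavior the SSD cache layer buys.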
ECS Storage Options
Customers can choose pure SATA clusters for general workloads, hybrid clusters for higher I/O needs, and upcoming pure‑SSD clusters that can deliver up to 180,000 IOPS on physical machines and ~90,000 IOPS on VMs, though CPU consumption remains a challenge.
For workloads like Hadoop, HBase, or MongoDB that already replicate data, Alibaba Cloud offers SATA or SSD local disks to reduce unnecessary redundancy and cost.
SLB, RDS, and OCS
SLB (Server Load Balancer) provides both Layer‑4 (LVS‑based) and Layer‑7 (Tengine‑based) balancing, Anycast cross‑region failover, and DDoS protection. A 12‑core SLB instance can forward ~12 million packets per second (pps), far surpassing a comparable ECS instance (~0.7 million pps).
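Layer‑4 balancing in the spirit of LVS can be sketched with a weighted round‑robin picker. This is a toy user‑space illustration (LVS implements its schedulers, such as `rr` and `wrr`, inside the Linux kernel, and this naive weight expansion is not its actual algorithm): servers with higher weight simply receive proportionally more connections.

```python
class RoundRobin:
    """Minimal weighted round-robin backend picker (illustrative only)."""

    def __init__(self, servers):
        # servers: list of (name, weight); expand by weight for simplicity.
        self.pool = [name for name, weight in servers for _ in range(weight)]
        self.i = 0

    def pick(self):
        # Cycle through the expanded pool, one backend per new connection.
        name = self.pool[self.i % len(self.pool)]
        self.i += 1
        return name

# Hypothetical real servers: rs1 gets twice the share of rs2.
lb = RoundRobin([("rs1", 2), ("rs2", 1)])
order = [lb.pick() for _ in range(6)]
```

A kernel implementation keeps the same decision off the data copy path, which is one reason an SLB instance forwards an order of magnitude more packets per second than a general‑purpose VM.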
RDS (Relational Database Service) runs on bare metal, offering three‑tier redundancy, stateless design, and hot‑upgrade capability. It integrates with many internal components (SLB, ODPS, SLS, OSS, OAS, etc.) and achieves 2‑3× higher TPS than equivalent ECS instances.
OCS (Open Cache Service) is built on Tair, with a proxy handling security, QoS, and flow control. It delivers sub‑2 ms response for 99 % of requests and handled over 100 million requests per second during peak events, at roughly half the cost of a self‑managed Memcached deployment.
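Per‑tenant flow control of the kind such a proxy applies is commonly built on a token bucket. The sketch below is a generic illustration with made‑up parameters, not the actual Tair/OCS implementation: each tenant gets a refill rate and a burst allowance, and requests beyond that are rejected.

```python
class TokenBucket:
    """Generic token-bucket rate limiter (illustrative parameters)."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens replenished per second
        self.burst = burst        # maximum tokens that can accumulate
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1      # spend one token for this request
            return True
        return False              # over quota: reject or queue the request
```

Passing the clock in as `now` keeps the sketch deterministic; a real proxy would read a monotonic clock and keep one bucket per tenant.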
Full‑Link Monitoring and Analysis System
The monitoring platform collects SQL logs, network traces, and user behavior, funnels them into a Kafka cluster, and processes them with JStorm and Spark for real‑time analysis, while ODPS handles offline analytics. Daily SQL log volume reaches tens of terabytes, enabling second‑level detection of slow queries, missing indexes, or network anomalies, and providing alerts to users.
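The real‑time slow‑query path can be reduced to a simple stream filter. The sketch below is a toy version with invented field names (`instance`, `sql`, `latency_ms`) and threshold; the production pipeline applies the same logic continuously over Kafka via JStorm/Spark rather than over an in‑memory list.

```python
def detect_slow_queries(log_stream, threshold_ms=500):
    """Flag queries whose latency exceeds the threshold, for alerting."""
    alerts = []
    for record in log_stream:
        if record["latency_ms"] > threshold_ms:
            # Keep the instance and a truncated SQL text for the alert.
            alerts.append((record["instance"], record["sql"][:60]))
    return alerts

# Hypothetical sample of one slow and one fast query.
logs = [
    {"instance": "rds-01", "sql": "SELECT * FROM orders", "latency_ms": 1200},
    {"instance": "rds-02", "sql": "SELECT 1", "latency_ms": 2},
]
```

Because the check is stateless per record, it parallelizes trivially across stream-processing workers, which is what makes second‑level detection feasible at tens of terabytes of logs per day.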
Initially deployed for RDS, the system is being extended to all cloud products for comprehensive end‑to‑end observability.
Future Work Outlook
Planned enhancements include further I/O path optimization (advanced cache policies, SSD read/write caches, storage‑compute separation, 10 Gbps pure‑SSD clusters, batch flushing, GPU acceleration, LXC/cgroups support), dynamic hot‑spot migration, and broader product migration (e.g., moving Alibaba’s e‑commerce services to the cloud).
RDS will add PostgreSQL support, explore low‑cost disaster recovery using high‑speed non‑volatile network storage for redo logs, and consider trade‑offs between availability and cost.
The monitoring framework will be applied across all cloud services, and new offerings such as wireless network acceleration, AliBench QoS monitoring, OCR, and deep‑learning inference services are slated for release.