Technical Evolution and Optimization of Kuaishou HDFS
Over the past four years Kuaishou's data grew dozens of times, prompting scalability and storage‑cost challenges, and this article details the architectural evolution, performance and cost optimizations, cross‑region expansion, and future plans of Kuaishou's HDFS system.
Guest: Ding Dinghua (Kuaishou) Editor: Brandon OPPO Platform: DataFunTalk
Overview: In the last four years Kuaishou's data volume increased by dozens of times, reaching exabyte scale with tens of thousands of nodes and hundreds of PB written per day. The cluster faces two main problems: scalability and storage cost. This talk introduces the technical evolution of Kuaishou HDFS, the thinking behind it, and future plans.
01 Technical Evolution and Thinking
1. Kuaishou HDFS Overview
Kuaishou stores data that has grown dozens of times in the past four years, now at the EB level with tens of thousands of nodes and daily writes of hundreds of PB. Kuaishou HDFS supports a wide range of workloads, including traditional data warehouses, media storage, AI training, and machine‑learning data, covering offline, near‑line, and real‑time scenarios. The BlobStore built on HDFS provides object storage for trillions of objects.
A large distributed storage system can be evaluated on four dimensions: availability, reliability, cost, and performance. Kuaishou's improvements over recent years have focused on these dimensions.
Scalability : Beyond cross‑region clusters, Kuaishou built hierarchical protection, degradation, and throttling mechanisms.
Cost : Introduced storage‑cost reduction capabilities, data‑usage insights, governance suggestions, and quota mechanisms.
Performance : Split the large read‑write lock in both NameNode and DataNode into fine‑grained locks, greatly improving metadata and data‑path performance; added slow‑node detection and circuit‑break mechanisms; enhanced reliability with end‑to‑end checksum, silent‑error handling, DataNode indexing, and multiple offline verification methods.
2. Scalability Optimizations
2.1 Metadata Path Scalability
The original architecture used a single large read‑write lock in the NameNode, becoming a bottleneck at 200k‑300k QPM. Metadata was stored entirely in memory, limiting per‑namespace capacity, and client‑side routing updates were costly.
Optimization was performed in two stages: first, hierarchical protection to guarantee high‑priority job latency; second, redesign of the metadata path.
Hierarchical Protection : RPC requests are prioritized based on job priority, with a fetcher that skews resources toward high‑priority jobs to keep their RPC latency low.
Router‑Based Federation (RBF) : Introduced an access layer between clients and NameNode, moving routing tables to a Router for easier operation and horizontal scaling.
RBF challenges: synchronous call model can be exhausted by a few slow namespaces. Added a thread‑permit mechanism and a bypass queue; an asynchronous call model is under development.
StandbyRead : Enables read‑write separation and horizontal scaling of read services by handling read requests in the Router, reducing load on primary NameNode.
After enabling StandbyRead, hotspot namespaces achieve nearly 10 million QPM reads and can scale horizontally.
Fine‑Grained Lock Splitting
NameNode functions: namespace management, file layout & block management, cluster state & replica consistency. The large lock was split into a namespace lock and a block‑manager lock, then the namespace lock was further split into inode‑level locks, and the block‑manager lock into DataNode‑level locks.
Benefits: write throughput increased 5.7×, read throughput 16.6× in high‑load Yarn shuffle namespaces.
3. Cross‑Region Cluster Construction
Motivation: regional resource limits and cost‑effective data‑center placement. Challenges include network latency, bandwidth constraints, and maintaining availability and performance.
Tenant‑level orchestration determines which region a tenant (business department or project) resides in, allowing data locality and reducing cross‑region traffic.
Storage Layer Optimizations
Fine‑grained, global bandwidth control for cross‑region traffic using a Zone Service SDK that reports lease information (tenant, platform, traffic type, priority, DAG ID) and enforces throttling based on resource availability.
Location‑aware read/write routing and regional caching to reduce cross‑region accesses.
Transparent data migration mechanisms that leverage replica‑based caching across regions.
Zone Service operates as a stateless service; when unavailable or overloaded, bandwidth decisions fall back to local enforcement.
4. Cost Optimization System
Rapid growth of offline and object‑storage data increases storage cost pressure. Kuaishou offers multiple storage types to balance availability, durability, cost, and performance:
Object storage: standard (local‑redundant, intra‑city redundant) and low‑frequency (local‑redundant, intra‑city redundant).
HDFS offline storage: standard and low‑frequency, both with local redundancy only.
Low‑frequency storage uses erasure coding (EC) with flexible K+M configurations, supporting RS, XOR, and LRC algorithms, and can place stripe fragments across AZs for fault tolerance.
EC data integrity is ensured by immediate parity verification, checksum protection, and CRC storage in a distributed KV system for offline verification.
Future Plans
Kuaishou will continue to optimize the four dimensions:
Cost: Separate storage from compute, increase storage density, share resource pools with other storage products, explore cheaper cold‑storage solutions, and improve EC coverage and direct‑write EC.
Availability: Enhance multi‑tenant mechanisms and provide varied SLA guarantees.
Reliability: Optimize copyset placement, shorten recovery time, and improve data durability.
Performance: Further optimize router and NameNode paths, improve single‑node storage engines (XFS, SSD/HDD hybrid), and enhance caching in compute‑storage separation scenarios.
Q&A Highlights
Q: How are slow nodes handled? A: Two approaches – avoidance (monitoring physical and service metrics, routing around slow nodes) and circuit‑break (voting among suspected slow nodes based on client reports).
Q: What cost dimensions are considered for HDFS? A: Data volume, per‑GB storage cost, and average replica count. Strategies include storage‑type selection, EC to reduce replicas, and data‑usage insights for lifecycle management.
Q: Are there EC bugs causing data loss? A: Loss occurs only when a stripe loses more than M blocks; high‑risk stripes are monitored, and offline verification is performed. EC is mainly for low‑frequency data, with limited impact on bandwidth and I/O.
Thank you for attending.
Share, like, and give a three‑click boost at the end of the article!
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
