
JD Retail HDFS Unified Storage: Cross‑Region and Tiered Storage Practices

This article presents JD Retail's large‑scale HDFS deployment, detailing its unified storage architecture, cross‑region data replication challenges and solutions, tiered storage strategies for hot, warm and cold data, and the operational modules that together improve performance, reliability and cost efficiency in a big‑data environment.

DataFunSummit

01 Overview

With the arrival of the big‑data era, massive data storage and processing have become critical challenges for enterprises. JD Retail relies on Hadoop Distributed File System (HDFS) as a highly reliable and scalable distributed file system that underpins data analysis tools, downstream services, and massive offline jobs. The platform operates tens of thousands of servers, stores data at the exabyte level, and handles daily growth of tens of petabytes, supported by visual management tools that simplify monitoring and operations.

02 Cross‑Region Storage

1. Existing Problems

Single‑datacenter deployments can no longer meet JD's multi‑datacenter expansion needs. The resulting problems include insufficient disaster‑recovery capability, inconsistent metadata across sites, redundant data storage, and uncontrolled traffic on inter‑datacenter links.

2. Storage Architecture

JD adopted a full‑storage plus full‑network‑topology strategy in which all DataNodes (DN) in a region report to a common NameNode. This unifies metadata management, eliminates metadata inconsistency across sites, and reduces migration costs. The new architecture improved migration efficiency by 350%, and read‑only nodes with read/write separation increased read performance by over 70%.

3. Challenges

Rapid cluster expansion, cross‑region heartbeat stability, and traffic control between datacenters required new management mechanisms and dynamic throttling to avoid queue backlogs and ensure reliable data synchronization.
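One common way to implement dynamic throttling of this kind is a token bucket whose refill rate is adjusted from the observed queue backlog. The sketch below is illustrative only and is not taken from JD's implementation: the class `TokenBucket`, the function `adjust_rate`, and all thresholds and rates are hypothetical.

```python
import time


class TokenBucket:
    """Token-bucket limiter whose refill rate can be changed at runtime."""

    def __init__(self, rate_bytes_per_s, capacity_bytes, clock=time.monotonic):
        self.rate = float(rate_bytes_per_s)
        self.capacity = float(capacity_bytes)
        self.tokens = float(capacity_bytes)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, capped at capacity.
        now = self._clock()
        earned = (now - self._last) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self._last = now

    def set_rate(self, rate_bytes_per_s):
        # Settle tokens at the old rate before switching to the new one.
        self._refill()
        self.rate = float(rate_bytes_per_s)

    def try_acquire(self, nbytes):
        """Consume nbytes of budget and return True if they may be sent now."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


def adjust_rate(bucket, queue_depth, high_water=1000, low_water=100,
                min_rate=10e6, max_rate=1e9):
    """Hypothetical policy: halve the rate on high backlog, grow it 25% on low."""
    if queue_depth > high_water:
        bucket.set_rate(max(min_rate, bucket.rate / 2))
    elif queue_depth < low_water:
        bucket.set_rate(min(max_rate, bucket.rate * 1.25))
    return bucket.rate
```

A cross‑datacenter transfer loop would call `try_acquire` before sending each chunk and call `adjust_rate` periodically with the current backlog, so the link is throttled quickly when queues build up and reopened gradually as they drain.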

03 Tiered Storage

1. Storage Comparison

Hot data is stored on high‑performance SSDs, warm data on standard HDDs, and cold data on high‑density HDDs. This hierarchy addresses the waste caused by treating hot and cold data alike and leverages hardware differences for optimal performance.

2. JD Tiered Storage Strategy

Three‑level tiering (hot on SSD, warm on HDD, cold on high‑density HDD) is enforced by labeling directories with extended‑attribute (XATTR) tags. Automatic conversion modules in the NameNode move data between tiers based on access patterns and TTL, and cold data is additionally converted to erasure coding. Block placement is guided by read/write weight, storage usage, and node health.
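The tiering decision described above can be sketched as a small rule function. This is a minimal illustration under stated assumptions, not JD's actual policy: the `DirStats` fields, the function `choose_tier`, and the thresholds (100 reads in 7 days, 180 days of inactivity) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HOT = "SSD"
    WARM = "HDD"
    COLD = "HIGH_DENSITY_HDD"


@dataclass
class DirStats:
    """Access statistics for one tagged directory (hypothetical fields)."""
    path: str
    days_since_last_access: int
    reads_last_7d: int


def choose_tier(stats, hot_reads=100, cold_after_days=180):
    """Return (tier, use_erasure_coding) for a directory.

    Assumed rules: frequent recent reads make a directory hot, a long
    idle period (TTL) makes it cold and erasure-coded, the rest is warm.
    """
    if stats.reads_last_7d >= hot_reads:
        return Tier.HOT, False
    if stats.days_since_last_access >= cold_after_days:
        return Tier.COLD, True
    return Tier.WARM, False
```

For example, `choose_tier(DirStats("/logs/2020", 400, 0))` selects the cold tier with erasure coding, because the directory has been idle longer than the assumed 180‑day TTL.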

3. Core Modules

Data Access Monitor – uses LRU to identify hot files and provides APIs for policy changes.

Tier Management Module – scans tagged directories, creates conversion tasks, and submits them to the Task Management Module.

Task Management Module – a distributed scheduler that dispatches block delete, copy, or recovery tasks to DataNodes, extending community task types.

These modules together improved overall performance by 10%, increased erasure‑coded data coverage to 30%, and reduced cold‑data storage cost by 90%.
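The LRU idea behind the Data Access Monitor can be sketched with an ordered map that keeps only the most recently accessed paths. The class name `AccessMonitor`, its methods, and the capacity value are assumptions made for illustration; the article does not describe the real module's interface.

```python
from collections import OrderedDict


class AccessMonitor:
    """Track the most recently accessed files with LRU eviction."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Maps path -> access count while tracked; order is oldest to newest.
        self._recent = OrderedDict()

    def record_access(self, path):
        # Remove and re-insert so the path moves to the most-recent end.
        count = self._recent.pop(path, 0) + 1
        self._recent[path] = count
        if len(self._recent) > self.capacity:
            # Evict the least recently used path.
            self._recent.popitem(last=False)

    def hot_files(self, min_count=1):
        """Return tracked paths, most recent first, seen at least min_count times."""
        return [p for p, c in reversed(self._recent.items()) if c >= min_count]
```

Paths that fall out of the LRU window are candidates for demotion, while paths that stay in it with a high count are candidates for the hot tier. Note that a path's count restarts from zero if it is evicted and later accessed again.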

04 Practical Integration

1. Cross‑Region Lifecycle Management

Data that remains unread for long periods is migrated from multi‑datacenter storage to a single‑datacenter tier and then converted to erasure‑coded cold storage, dramatically lowering redundancy and cost.
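To see why this lifecycle lowers redundancy, it helps to compare raw storage overhead at each stage. The numbers below are assumptions chosen for illustration (three replicas per datacenter, two datacenters, and a Reed‑Solomon layout with 6 data and 3 parity units); the article does not state JD's replication factors or erasure‑coding policy.

```python
def replication_overhead(replicas, datacenters=1):
    """Raw bytes stored per logical byte under plain replication."""
    return replicas * datacenters


def erasure_coding_overhead(data_units, parity_units):
    """Raw bytes stored per logical byte under a (data, parity) EC layout."""
    return (data_units + parity_units) / data_units


# Assumed lifecycle stages for one dataset (illustrative values only).
multi_dc = replication_overhead(replicas=3, datacenters=2)   # 6x raw storage
single_dc = replication_overhead(replicas=3, datacenters=1)  # 3x raw storage
cold_ec = erasure_coding_overhead(data_units=6, parity_units=3)  # 1.5x

# Fraction of raw capacity freed by moving from multi-DC replicas to cold EC.
saving = 1 - cold_ec / multi_dc

print(f"multi-DC: {multi_dc}x, single-DC: {single_dc}x, cold EC: {cold_ec}x")
print(f"raw capacity saved: {saving:.0%}")
```

Under these assumptions each logical byte drops from 6 raw bytes to 1.5, which shows how the capacity saving compounds across the two steps. Any actual cost figure also depends on hardware prices, which this sketch does not model.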

2. Data Scheduling

By monitoring access, hot data is “re‑heated” and distributed across regions for faster reads, while task‑drift mechanisms allocate jobs to clusters with available resources, improving execution timeliness.
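The task‑drift idea can be illustrated with a simple placement rule: run a job in its home cluster when capacity allows, otherwise move it to the cluster with the most free resources. The function `pick_cluster`, the use of free vcores as the only resource signal, and the cluster names are hypothetical; the article does not describe the real scheduler.

```python
def pick_cluster(free_vcores, required_vcores, home_cluster):
    """Choose a cluster for a job, drifting away from home only when needed.

    free_vcores maps cluster name -> currently free vcores (assumed metric).
    Returns the chosen cluster name, or None if no cluster can fit the job.
    """
    if free_vcores.get(home_cluster, 0) >= required_vcores:
        return home_cluster

    # Consider only clusters that can actually fit the job.
    candidates = {name: free for name, free in free_vcores.items()
                  if free >= required_vcores}
    if not candidates:
        return None

    # Drift to the cluster with the most headroom.
    return max(candidates, key=candidates.get)


clusters = {"dc-a": 50, "dc-b": 400, "dc-c": 250}
print(pick_cluster(clusters, required_vcores=200, home_cluster="dc-a"))
```

Drifting a job only pays off if its input is already readable in the target region, which is why this mechanism is paired with distributing re‑heated data across regions.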

Conclusion

JD Retail's unified HDFS storage solution, combining cross‑region replication and tiered storage, achieves both performance gains and significant cost reductions, offering a reference architecture for large‑scale distributed storage systems.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
