Shopee Data Infra Presentation: Storage Status, Acceleration, Serviceization, and Future Plans
The Shopee Data Infra talk details the current storage architecture, Presto‑based acceleration with Alluxio caching, service‑oriented storage solutions using Alluxio Fuse and S3 APIs, and outlines future enhancements for Spark/Hive integration and CSI/Fuse optimizations, providing a comprehensive view of large‑scale big data storage engineering.
The presentation, delivered by Ding Tianbao and Sun Haoning from Shopee Data Infra, covered four main topics: storage status, storage acceleration, storage serviceization, and future planning.
Storage status: Shopee’s storage stack consists of a storage layer, a scheduling layer (YARN), a compute-engine layer (Spark, Flink, Presto), and a platform-management layer. The underlying storage uses HDFS and Ozone, spanning thousands of nodes, hundreds of petabytes of data, billions of files, and peak QPS in the hundreds of thousands.
Storage acceleration: The focus is on Presto, which runs a cluster of several thousand instances with a TP90 of about two minutes and processes tens of petabytes of data daily. Inconsistent HDFS performance and query jitter motivated a cache-centric solution built on Alluxio. The classic Alluxio + Presto design mounts HDFS through Alluxio, but it faces challenges such as the cache-size mismatch (PB-scale storage vs. TB-scale cache) and long initial data-import times.
Solution: Set flags in the Hive Metastore (HMS) to indicate whether a table’s data resides in Alluxio or only in HDFS, design a Cache Manager with custom caching policies and pre-loading, and let Presto read directly from HDFS when the data is not cached. The Cache Manager issues load/unload/mount commands to Alluxio, selects hot tables from Presto query logs, applies cache policies, exposes APIs for mount/unmount/load/query, integrates with Kafka for cache updates, and writes the flags to HMS so that engines are cache-aware.
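The hot-table selection and cache-action planning described above can be sketched as follows. All names here are illustrative assumptions; the real Cache Manager drives Alluxio and HMS through their own APIs rather than returning action tuples.

```python
from collections import Counter

def select_hot_tables(query_log, top_n=2):
    """Rank tables by access count in the daily query log; return the hottest."""
    counts = Counter(entry["table"] for entry in query_log)
    return [table for table, _ in counts.most_common(top_n)]

def plan_cache_actions(hot_tables, cached_tables):
    """Load newly hot tables into the cache; unload tables that went cold."""
    actions = [("load", t) for t in hot_tables if t not in cached_tables]
    actions += [("unload", t) for t in cached_tables if t not in hot_tables]
    return actions

# Toy daily query log: 'orders' is hottest, 'users' second.
log = [
    {"table": "orders"}, {"table": "orders"}, {"table": "orders"},
    {"table": "users"}, {"table": "users"}, {"table": "clicks"},
]
hot = select_hot_tables(log)
actions = plan_cache_actions(hot, cached_tables={"users", "stale_table"})
```

In this sketch `plan_cache_actions` would be followed by issuing the corresponding load/unload commands to Alluxio and updating the HMS flags and the mapping database.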
The implementation identifies hot tables from daily query logs, loads the most frequently accessed partitions into Alluxio, records the mapping in a database, and sets HMS flags (key-value pairs) so that Presto can read from Alluxio directly. This approach yielded up to a 55.5% performance improvement over pure HDFS reads.
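On the engine side, the flag lookup can be as simple as the sketch below. The `cache.enabled` and `cache.path` key names are assumptions; the talk only states that the flags are key-value pairs stored in HMS.

```python
def resolve_read_path(table_params: dict, hdfs_path: str) -> str:
    """Return the Alluxio path when the HMS flag marks the table as cached;
    otherwise fall back to a direct HDFS read."""
    # Hypothetical flag names; the talk only says the flags are HMS key-value pairs.
    if table_params.get("cache.enabled") == "true":
        alluxio_path = table_params.get("cache.path")
        if alluxio_path:
            return alluxio_path
    return hdfs_path

cached = {"cache.enabled": "true",
          "cache.path": "alluxio://master:19998/warehouse/orders"}
uncached = {}
```

The key property is the fallback: a missing or stale flag degrades to a plain HDFS read rather than a failed query.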
Storage serviceization: Business pain points include HDFS-only storage, diverse programming languages across teams, and the limited availability of non-Java HDFS clients. Two service-oriented solutions were introduced: (1) Alluxio Fuse, deployed either on physical machines or via the Kubernetes CSI, providing POSIX-compatible file access; (2) S3-for-HDFS, which leverages Alluxio’s S3-compatible proxy so that any language’s S3 SDK can access the data.
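The appeal of the Fuse route is that, once the Alluxio namespace is mounted, any language’s ordinary file I/O works unchanged, with no HDFS client or Java runtime involved. A minimal sketch, in which a temporary directory stands in for the real mountpoint (e.g. a path like /mnt/alluxio):

```python
import os
import tempfile

# Stand-in for the Alluxio Fuse mountpoint; in production this would be the
# directory where Alluxio Fuse exposes the namespace.
mountpoint = tempfile.mkdtemp()

# Plain POSIX file operations against the mounted namespace.
path = os.path.join(mountpoint, "warehouse", "orders", "part-00000.csv")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    f.write("order_id,amount\n1,9.99\n")
with open(path) as f:
    header = f.readline().strip()
```

Whether these calls go through a physical-machine Fuse daemon, a CSI-provisioned mount, or a sidecar container is transparent to the application code.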
Three deployment modes for Fuse were compared: physical-machine deployment, Kubernetes CSI (NodeServer-based), and Kubernetes sidecar (a per-Pod container). Physical deployment offers the highest independence but higher operational cost; CSI reduces node-level overhead; the sidecar provides per-Pod isolation at the expense of additional container resources.
S3 integration: Alluxio’s proxy implements the S3 API, so clients can use any S3 SDK. Authentication was added by verifying the signature generated from the client ID and secret against a signature recomputed on the server side. The proxy maps bucket names to Alluxio top-level directories and object keys to sub-paths.
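A minimal sketch of that server-side check, assuming an HMAC-SHA256 signature over some canonical request string (the talk does not specify the exact canonicalization, and the credential values here are made up):

```python
import hashlib
import hmac

# Server-side credential store: client ID -> secret (illustrative values).
SECRETS = {"client-123": b"s3cr3t"}

def sign(secret: bytes, string_to_sign: str) -> str:
    """HMAC-SHA256 signature, as a client SDK would compute it."""
    return hmac.new(secret, string_to_sign.encode(), hashlib.sha256).hexdigest()

def verify_request(client_id: str, string_to_sign: str, presented: str) -> bool:
    """Recompute the signature server-side and compare in constant time."""
    secret = SECRETS.get(client_id)
    if secret is None:
        return False
    return hmac.compare_digest(sign(secret, string_to_sign), presented)

def to_alluxio_path(bucket: str, key: str) -> str:
    """Map an S3 bucket to a top-level Alluxio directory, the key to a sub-path."""
    return f"/{bucket}/{key}"
```

Using `hmac.compare_digest` rather than `==` avoids leaking signature bytes through timing differences.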
The overall service architecture includes S3‑SDK clients behind a load balancer, Alluxio proxy services, physical‑machine Fuse instances, and Kubernetes sidecar containers, all accessing the same HDFS‑backed data pool.
Future plans: (1) Storage acceleration – integrate Spark and Hive with Alluxio and add adaptive cache strategies; (2) Storage serviceization – optimize CSI, decouple Fuse from the NodeServer, and enhance POSIX support to handle write-heavy workloads.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.