NetEase Real-Time Computing Platform (Sloth): Architecture, Practices, and Future Outlook
This article introduces NetEase's real-time computing platform, Sloth: its architecture and component layers, the integrated IDE, operational tooling, and unified metadata management; the challenges it faces, such as Kudu write amplification; and a proposed tiered real‑time data‑warehouse model with a vision for storage‑compute separation and a unified batch‑stream API.
NetEase's real‑time computing platform, named Sloth, has grown since its launch in December 2017 to over 50,000 elastic compute units, 15,110 CPU cores, and more than 34 TB of memory, providing a scalable foundation for low‑latency data processing.
Platform Architecture is divided into functional modules (Admin for asynchronous services and Server for stateless PaaS services) and four data‑flow layers: Source (relational data via NDC and log data via Datastream), Message Queue (Kafka), Compute (Flink for cleaning, transformation, and aggregation), and Sink (Kudu as the primary columnar store, with MySQL/Redis for smaller volumes).
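To make the Compute layer's role concrete, the following Python sketch (illustrative only, not the actual Flink job; record fields and function names are hypothetical) walks through the clean → transform → aggregate steps such a pipeline performs on log events between Kafka and a columnar sink like Kudu:

```python
from collections import defaultdict

def clean(raw_line):
    """Parse a raw log line 'ts,user,action'; drop malformed records (cleaning)."""
    parts = raw_line.strip().split(",")
    if len(parts) != 3:
        return None
    ts, user, action = parts
    try:
        # Transformation: bucket the epoch-second timestamp into a minute window.
        return {"minute": int(ts) // 60, "user": user, "action": action}
    except ValueError:
        return None

def aggregate(events):
    """Count actions per (minute, action) -- a tumbling one-minute window aggregation."""
    counts = defaultdict(int)
    for e in events:
        counts[(e["minute"], e["action"])] += 1
    # Rows shaped for a columnar sink such as Kudu.
    return [{"minute": m, "action": a, "cnt": c}
            for (m, a), c in sorted(counts.items())]

raw = ["60,u1,click", "61,u2,click", "125,u1,view", "garbage"]
events = [e for e in map(clean, raw) if e is not None]
rows = aggregate(events)
# rows: [{'minute': 1, 'action': 'click', 'cnt': 2},
#        {'minute': 2, 'action': 'view', 'cnt': 1}]
```

In the real platform the same cleaning and windowed aggregation would be expressed as a Flink job (SQL or JAR), with Kafka providing the unbounded input and Kudu absorbing the aggregated rows.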
One‑stop Real‑time Development IDE supports both SQL‑based and JAR‑based development, offering offline/online debugging, version control, diff, and configuration management.
Operational Management includes three aspects: task operation (viewing task info, parameters, logs), server monitoring (Grafana‑based dashboards for throughput, latency, IO, QPS), and alert configuration (rules for failure counts, latency thresholds, notification channels).
Unified Metadata Center standardizes metadata registration across relational databases, NoSQL sources, and message queues, simplifying development, enabling reuse, and allowing flexible field changes.
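The idea behind a unified metadata center can be sketched as a small registry (names and methods here are illustrative, not Sloth's API): each source is registered once with its connection properties and schema, every job resolves it by logical name, and a field change is made in one place rather than in every job:

```python
class MetadataCenter:
    """Toy registry: one logical name per source, shared by all jobs."""

    def __init__(self):
        self._sources = {}

    def register(self, name, kind, props, schema):
        """Register a source (relational DB, NoSQL store, or message queue) once."""
        self._sources[name] = {"kind": kind, "props": props, "schema": schema}

    def resolve(self, name):
        """Jobs look sources up by logical name instead of re-declaring them."""
        return self._sources[name]

    def alter_schema(self, name, schema):
        # A field change happens here once and is picked up by every consumer.
        self._sources[name]["schema"] = schema

meta = MetadataCenter()
meta.register("user_events", "kafka",
              {"brokers": "kafka:9092", "topic": "events"},
              ["ts", "user", "action"])

# Two different jobs reuse the same registration instead of duplicating it.
src = meta.resolve("user_events")
assert src["schema"] == ["ts", "user", "action"]

# Flexible field change: add a column once, visible to all jobs on next resolve.
meta.alter_schema("user_events", ["ts", "user", "action", "device"])
assert meta.resolve("user_events")["schema"][-1] == "device"
```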
Challenges and Future Directions highlight Kudu's write‑amplification due to compaction, the exponential cost growth of stream processing with data volume, and propose storage‑compute separation by externalizing compaction, offering merge‑on‑read and copy‑on‑write strategies, and delivering a unified batch‑stream API.
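The two update strategies mentioned above can be contrasted with a minimal sketch (a toy model, not Kudu's internals): copy‑on‑write rewrites the base data on every update, paying the merge cost at write time, while merge‑on‑read appends updates to a cheap delta log and pays the merge cost at query time:

```python
class CopyOnWrite:
    """Rewrite the base on every update: expensive writes, cheap reads."""

    def __init__(self, rows):
        self.base = dict(rows)

    def update(self, key, value):
        merged = dict(self.base)   # copy (rewrite) the base file
        merged[key] = value
        self.base = merged

    def read(self):
        return dict(self.base)    # nothing left to merge at read time


class MergeOnRead:
    """Append updates to a delta log: cheap writes, reads pay the merge."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.deltas = []

    def update(self, key, value):
        self.deltas.append((key, value))   # cheap sequential append

    def read(self):
        merged = dict(self.base)
        for key, value in self.deltas:     # merge base + deltas at query time
            merged[key] = value
        return merged


cow = CopyOnWrite({"a": 1}); cow.update("a", 2)
mor = MergeOnRead({"a": 1}); mor.update("a", 2)
assert cow.read() == mor.read() == {"a": 2}
```

Externalizing compaction amounts to running the merge step of the merge‑on‑read path as a separate service on its own resources, so the write path of the storage engine no longer competes with queries.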
Data‑Warehouse Tiering suggests classifying warehouses by latency requirement, by analogy with transport modes: millisecond‑to‑second (the "private car" tier), minute‑level (the "subway" tier), and hour‑to‑day (the "high‑speed rail" tier), each with distinct cost and flexibility trade‑offs.
Real‑time Warehouse Trade‑offs emphasize balancing latency, availability, and cost, noting that stream processing incurs exponential cost due to random I/O and compaction, while batch processing scales linearly.
The article concludes by summarizing NetEase's real‑time warehouse product shape, analyzing current pain points, and outlining a vision for a tiered, storage‑compute‑separated, batch‑stream unified architecture.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.