
Huya Real-Time Computing SLA Practices: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook

This article details Huya's real‑time computing platform evolution, its core SLA definition centered on latency compliance, capability enhancements such as demand management, task analysis, and dynamic resource scaling, and future directions emphasizing usability, stability, openness, and unified batch‑stream processing.

DataFunTalk

As real‑time computing matures, more businesses depend on low‑latency, highly reliable platforms. Huya's real‑time computing SLA practice addresses these demands; this article covers the platform introduction, core SLA definition, capability building, and future outlook.

Platform Introduction: The platform evolved through four stages—Chaos (pre‑2019 fragmented engines), Unification (standardized on Flink with jar/config modes), Improvement (FlinkSQL, containerization, real‑time warehouse), and Transformation (service‑oriented and intelligent features). Images illustrate each stage.

Platform Architecture Overview: Data flows from Datahub to a data lake, then branches to offline and real‑time warehouses, with the real‑time computing platform spanning the entire pipeline.

Core SLA Definition: A shift from platform‑centric to service‑centric thinking led to defining the latency compliance rate as the core SLA, with different guarantee levels for different latency requirements. Monitoring estimates end‑to‑end latency as the source's consumption time minus the queue write time, plus checkpoint duration, providing lightweight latency tracking across the pipeline.
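The latency estimate and compliance rate described above can be sketched as follows. This is a minimal illustration, not Huya's implementation; the `Record` fields, function names, and threshold values are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Record:
    queue_write_ts: float      # epoch seconds when the message was written to the queue
    source_consume_ts: float   # epoch seconds when the source operator consumed it

def latency_compliance_rate(records, checkpoint_duration_s, threshold_s):
    """Fraction of records whose estimated end-to-end latency meets the SLA.

    Per the article, latency is estimated as:
        source consumption time - queue write time + checkpoint duration
    """
    if not records:
        return 1.0
    within = sum(
        1 for r in records
        if (r.source_consume_ts - r.queue_write_ts) + checkpoint_duration_s <= threshold_s
    )
    return within / len(records)

# Hypothetical sample: three records, 2 s checkpoint duration, 10 s SLA threshold.
recs = [Record(100.0, 103.0), Record(100.0, 106.0), Record(100.0, 112.0)]
rate = latency_compliance_rate(recs, checkpoint_duration_s=2.0, threshold_s=10.0)
```

Adding checkpoint duration accounts for the replay window after a failure: a record may be reprocessed up to one checkpoint interval later, so the pessimistic estimate keeps the SLA honest under exactly-once recovery.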

Capability Building: Includes latency demand management and monitoring, task analysis (exceptions, latency, resources), resource evaluation, and dynamic scaling. Debugging and pressure testing comprise configuration, resource allocation, source sampling, and sink simulation. Runtime resource evaluation uses a rule‑engine‑driven diagnosis engine to recommend scaling actions. Optimizations such as delayed task scheduling and group sorting improve load balancing, raising the SLA from 70% to 99% and resource utilization from 12% to 21%.
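A rule‑engine‑driven diagnosis engine of the kind mentioned above can be sketched as an ordered list of (condition, action) rules evaluated against task metrics. This is an illustrative sketch only; the metric names, thresholds, and doubling/halving policy are assumptions, not Huya's actual rules.

```python
def recommend_scaling(metrics, rules):
    """Evaluate diagnosis rules in priority order; the first matching rule wins."""
    for condition, action in rules:
        if condition(metrics):
            return action(metrics)
    return {"action": "none"}

# Hypothetical rule set: scale out under backpressure or consumer lag,
# scale in when the task is clearly over-provisioned.
RULES = [
    (lambda m: m["backpressure"] > 0.5 or m["consumer_lag"] > m["lag_limit"],
     lambda m: {"action": "scale_out", "parallelism": m["parallelism"] * 2}),
    (lambda m: m["cpu_util"] < 0.10 and m["backpressure"] == 0,
     lambda m: {"action": "scale_in", "parallelism": max(1, m["parallelism"] // 2)}),
]

busy = {"backpressure": 0.7, "consumer_lag": 0, "lag_limit": 1000,
        "cpu_util": 0.5, "parallelism": 4}
decision = recommend_scaling(busy, RULES)  # recommends scaling out
```

Keeping the rules as plain data makes the diagnosis auditable and lets operators tune thresholds without redeploying the engine, which matches the article's emphasis on horizontal scaling with safeguards.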

Future Outlook: Four directions—Usability (enhanced SQL, end‑to‑end products), Stability (aiming for 99.9%+ SLA), Openness (community interaction), and Unification (integrated batch‑stream storage, compute, and metadata).

Q&A Highlights: Resource utilization is calculated per node by the container platform; upstream/downstream connectivity abstracts logical topics and Flink tables; dynamic eviction is controlled by host utilization with safeguards; the performance diagnosis engine combines business‑side latency with resource metrics; scaling is primarily horizontal.

Tags: Big Data, Flink, stream processing, resource management, SLA, real-time computing
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
