Huya Real-Time Computing SLA Practice: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook
The talk details Huya's real‑time computing platform evolution from chaotic early stages to a unified, containerized system; defines core SLA metrics focused on latency compliance; describes capability enhancements such as demand monitoring, task analysis, and dynamic scaling; and outlines future goals for usability, stability, openness, and unified stream‑batch processing.
Speaker Chen Jian, a big‑data architect at Huya, presented the evolution and SLA practice of Huya’s real‑time computing platform.
Platform Evolution: The platform progressed through four phases: Chaotic (pre‑2019, fragmented engines), Unified (post‑2019, Flink with jar/config modes), Mature (FlinkSQL, containerization, real‑time warehouse, monitoring), and Transforming (service‑oriented and intelligent features).
Architecture Overview: Data flows from DataHub to a data lake, then to offline and real‑time warehouses; the real‑time engine (Flink) spans the entire pipeline, handling ingestion, processing, and output.
Core SLA Definition: Emphasis shifted from platform availability to user‑centric latency compliance. The platform defines "latency compliance rate" as the core SLA, measuring end‑to‑end delay (source consumption time − queue write time + checkpoint time) and prioritising latency‑related issues.
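The compliance metric can be illustrated with a minimal sketch. All field names, thresholds, and the sample data below are hypothetical, assumed for illustration; only the delay formula itself comes from the talk:

```python
from dataclasses import dataclass

# Hypothetical per-message timestamp record; field names are
# illustrative, not Huya's actual schema.
@dataclass
class Record:
    queue_write_ms: int      # when the message was written to the queue
    source_consume_ms: int   # when the Flink source consumed it

def end_to_end_delay_ms(r: Record, checkpoint_ms: int) -> int:
    # Per the talk: delay = source consumption time - queue write time
    # + checkpoint time (checkpointing bounds when output becomes visible).
    return (r.source_consume_ms - r.queue_write_ms) + checkpoint_ms

def latency_compliance_rate(records, checkpoint_ms, sla_threshold_ms):
    # Fraction of records whose end-to-end delay stays within the SLA.
    within = sum(
        1 for r in records
        if end_to_end_delay_ms(r, checkpoint_ms) <= sla_threshold_ms
    )
    return within / len(records)

records = [
    Record(queue_write_ms=0, source_consume_ms=800),
    Record(queue_write_ms=0, source_consume_ms=4500),
    Record(queue_write_ms=0, source_consume_ms=1200),
    Record(queue_write_ms=0, source_consume_ms=9000),
]
rate = latency_compliance_rate(records, checkpoint_ms=1000, sla_threshold_ms=5000)
print(f"latency compliance rate: {rate:.0%}")  # prints "latency compliance rate: 50%"
```

Measuring per-record compliance rather than platform uptime is what makes the SLA user-centric: a job can be "up" while still breaching the user's latency expectation.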
Capability Building: Includes demand‑driven monitoring, lightweight end‑to‑end latency tracking, task analysis (exception, latency, resource), resource evaluation with pre‑deployment stress testing, a runtime diagnostic engine, dynamic scaling (horizontal), task disaster recovery (input, compute, output layers), and compute‑power balancing across TaskManagers.
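The latency-driven horizontal-scaling idea above can be sketched as a simple decision rule. The function name, thresholds, and scale-out/scale-in factors are assumptions for illustration, not Huya's actual policy:

```python
# Minimal sketch of a latency-driven horizontal-scaling decision, assuming
# observed delay comes from the diagnostic engine described in the talk.
def decide_parallelism(current: int,
                       observed_delay_ms: float,
                       sla_ms: float,
                       max_parallelism: int = 64,
                       min_parallelism: int = 1) -> int:
    """Scale out when delay breaches the SLA; scale in when there is headroom."""
    if observed_delay_ms > sla_ms:
        # Breach: double parallelism (capped) to recover quickly.
        return min(current * 2, max_parallelism)
    if observed_delay_ms < 0.3 * sla_ms and current > min_parallelism:
        # Large headroom: shrink by one slot at a time to reclaim resources.
        return max(current - 1, min_parallelism)
    return current

print(decide_parallelism(current=8, observed_delay_ms=7000, sla_ms=5000))  # 16
print(decide_parallelism(current=8, observed_delay_ms=1000, sla_ms=5000))  # 7
```

The asymmetry (double on breach, step down slowly) is a common design choice: it trades some resource cost for fast SLA recovery and avoids oscillation.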
Future Outlook: Focus on improving usability (SQL, end‑to‑end products), stability (higher SLA targets, up to 99.99%), openness (community collaboration), and unification (stream‑batch integration across storage, compute, and metadata).
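To make the 99.99% target concrete, a quick error-budget calculation shows how little breach time each availability level permits (the 30-day month is an assumed accounting period, not one stated in the talk):

```python
# Error-budget arithmetic: allowed downtime (or SLA-breach time) per period
# at a given availability target.
def downtime_budget(availability: float, period_minutes: float) -> float:
    return (1 - availability) * period_minutes

for target in (0.999, 0.9999):
    monthly = downtime_budget(target, 30 * 24 * 60)
    print(f"{target:.2%}: {monthly:.1f} min of breach budget per 30-day month")
# 99.90%: 43.2 min of breach budget per 30-day month
# 99.99%: 4.3 min of breach budget per 30-day month
```

Moving from three nines to four nines cuts the monthly budget by a factor of ten, which is why the talk pairs the higher target with diagnostics, scaling, and disaster recovery rather than treating it as a monitoring change alone.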
Q&A Highlights: Discussed resource utilization calculation, upstream‑downstream connectivity, dynamic container eviction, and the performance‑diagnostic engine that drives scaling decisions based on business‑side latency.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.