Huya Real-Time Computing SLA Practice: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook
The talk details Huya's real‑time computing platform evolution from chaotic early stages to a unified, containerized system; defines core SLA metrics focused on latency compliance; describes capability enhancements such as demand monitoring, task analysis, and dynamic scaling; and outlines future goals for usability, stability, openness, and unified stream‑batch processing.
Speaker Chen Jian, a big‑data architect at Huya, presented the evolution and SLA practice of Huya’s real‑time computing platform.
Platform Evolution: The platform progressed through four phases: Chaotic (pre‑2019, fragmented engines), Unified (post‑2019, Flink with jar/config modes), Mature (FlinkSQL, containerization, real‑time warehouse, monitoring), and Transforming (service‑oriented and intelligent features).
Architecture Overview: Data flows from DataHub to a data lake, then to offline and real‑time warehouses; the real‑time engine (Flink) spans the entire pipeline, handling ingestion, processing, and output.
Core SLA Definition: Emphasis shifted from platform availability to user‑centric latency compliance. The platform defines "latency compliance rate" as the core SLA, measuring end‑to‑end delay (source consumption time − queue write time + checkpoint time) and prioritising latency‑related issues.
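The compliance metric can be illustrated with a minimal sketch. All field names, thresholds, and the sample data below are hypothetical, assumed for illustration; only the delay formula itself comes from the talk:

```python
from dataclasses import dataclass

# Hypothetical per-message timestamp record; field names are
# illustrative, not Huya's actual schema.
@dataclass
class Record:
    queue_write_ms: int      # when the message was written to the queue
    source_consume_ms: int   # when the Flink source consumed it

def end_to_end_delay_ms(r: Record, checkpoint_ms: int) -> int:
    # Per the talk: delay = source consumption time - queue write time
    # + checkpoint time (checkpointing bounds when output becomes visible).
    return (r.source_consume_ms - r.queue_write_ms) + checkpoint_ms

def latency_compliance_rate(records, checkpoint_ms, sla_threshold_ms):
    # Fraction of records whose end-to-end delay stays within the SLA.
    within = sum(
        1 for r in records
        if end_to_end_delay_ms(r, checkpoint_ms) <= sla_threshold_ms
    )
    return within / len(records)

records = [
    Record(queue_write_ms=0, source_consume_ms=800),
    Record(queue_write_ms=0, source_consume_ms=4500),
    Record(queue_write_ms=0, source_consume_ms=1200),
    Record(queue_write_ms=0, source_consume_ms=9000),
]
rate = latency_compliance_rate(records, checkpoint_ms=1000, sla_threshold_ms=5000)
print(f"latency compliance rate: {rate:.0%}")  # prints "latency compliance rate: 50%"
```

Measuring per-record compliance rather than platform uptime is what makes the SLA user-centric: a job can be "up" while still breaching the user's latency expectation.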
Capability Building: Includes demand‑driven monitoring, lightweight end‑to‑end latency tracking, task analysis (exception, latency, resource), resource evaluation with pre‑deployment stress testing, a runtime diagnostic engine, dynamic scaling (horizontal), task disaster recovery (input, compute, output layers), and compute‑power balancing across TaskManagers.
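The latency-driven horizontal-scaling idea above can be sketched as a simple decision rule. The function name, thresholds, and scale-out/scale-in factors are assumptions for illustration, not Huya's actual policy:

```python
# Minimal sketch of a latency-driven horizontal-scaling decision, assuming
# observed delay comes from the diagnostic engine described in the talk.
def decide_parallelism(current: int,
                       observed_delay_ms: float,
                       sla_ms: float,
                       max_parallelism: int = 64,
                       min_parallelism: int = 1) -> int:
    """Scale out when delay breaches the SLA; scale in when there is headroom."""
    if observed_delay_ms > sla_ms:
        # Breach: double parallelism (capped) to recover quickly.
        return min(current * 2, max_parallelism)
    if observed_delay_ms < 0.3 * sla_ms and current > min_parallelism:
        # Large headroom: shrink by one slot at a time to reclaim resources.
        return max(current - 1, min_parallelism)
    return current

print(decide_parallelism(current=8, observed_delay_ms=7000, sla_ms=5000))  # 16
print(decide_parallelism(current=8, observed_delay_ms=1000, sla_ms=5000))  # 7
```

The asymmetry (double on breach, step down slowly) is a common design choice: it trades some resource cost for fast SLA recovery and avoids oscillation.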
Future Outlook: Focus on improving usability (SQL, end‑to‑end products), stability (higher SLA targets, up to 99.99%), openness (community collaboration), and unification (stream‑batch integration across storage, compute, and metadata).
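To make the 99.99% target concrete, a quick error-budget calculation shows how little breach time each availability level permits (the 30-day month is an assumed accounting period, not one stated in the talk):

```python
# Error-budget arithmetic: allowed downtime (or SLA-breach time) per period
# at a given availability target.
def downtime_budget(availability: float, period_minutes: float) -> float:
    return (1 - availability) * period_minutes

for target in (0.999, 0.9999):
    monthly = downtime_budget(target, 30 * 24 * 60)
    print(f"{target:.2%}: {monthly:.1f} min of breach budget per 30-day month")
# 99.90%: 43.2 min of breach budget per 30-day month
# 99.99%: 4.3 min of breach budget per 30-day month
```

Moving from three nines to four nines cuts the monthly budget by a factor of ten, which is why the talk pairs the higher target with diagnostics, scaling, and disaster recovery rather than treating it as a monitoring change alone.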
Q&A Highlights: Discussed resource utilization calculation, upstream‑downstream connectivity, dynamic container eviction, and the performance‑diagnostic engine that drives scaling decisions based on business‑side latency.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.