How Huya Unified Its Monitoring Platform with OpenTelemetry for Zero‑Cost Integration

This article details Huya's transition from fragmented, non‑standard monitoring solutions to a unified OpenTelemetry‑based platform, covering project background, pain points, design decisions, SDK architecture, data pipeline, storage, alerting, root‑cause analysis, and future plans, highlighting the benefits of standardization and zero‑cost service integration.

Efficient Ops

Jun 4, 2024

Project Background

Huya operated multiple legacy monitoring systems that met early business needs but lacked standardization, causing growing challenges in stability and development efficiency.

Pain Points

Teams collected logs and metrics through a variety of custom methods with no common SDK. This duplicated effort across languages, required invasive changes to business code, and made open‑source solutions difficult to adopt.

Solution Overview

Based on the OpenTelemetry industry standard, Huya built a unified monitoring platform offering standardized ingestion, storage, analysis, and alerting. The design ensures compatibility, smooth migration, and zero‑cost integration for services.

SDK Architecture Design

A componentized SDK defines Huya‑specific metric, trace, and log conventions and provides language‑agnostic instrumentation (e.g., Java Agent, C++/Go framework injection) for transparent, non‑intrusive integration.
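As a rough illustration of non‑intrusive instrumentation, the sketch below wraps RPC handlers at framework registration time, so handler code stays unchanged. It uses only the Python standard library; `RPC_METRICS`, `instrument_rpc`, and `register_handlers` are hypothetical names chosen for this example, not part of Huya's SDK or the OpenTelemetry API.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory sink; a real SDK would export these values instead.
RPC_METRICS = defaultdict(lambda: {"count": 0, "errors": 0, "total_seconds": 0.0})


def instrument_rpc(service, method):
    """Wrap a handler so the framework, not the handler, records metrics."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            key = (service, method)
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                RPC_METRICS[key]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                RPC_METRICS[key]["count"] += 1
                RPC_METRICS[key]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


def register_handlers(service, handlers):
    """Framework-side registration: every handler is instrumented automatically."""
    return {name: instrument_rpc(service, name)(fn) for name, fn in handlers.items()}
```

Because the wrapping happens inside the registration step, a service owner only passes plain functions to the framework. This is the same idea that a Java Agent or framework injection applies at a lower level.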

Unified Data Pipeline (OTel Collector)

The pipeline consists of Receivers (supporting legacy and OTLP protocols), Processors (validation, cleaning, protocol conversion), and Exporters (output to Kafka). It ensures data compatibility and high‑performance ingestion.
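The receiver, processor, and exporter stages can be sketched in a few lines of Python. This is a simplified model and not the OpenTelemetry Collector's real (Go) component API: the legacy `name|value|k=v` text format, the function names, and `ListExporter` (which stands in for a Kafka producer) are all assumptions made for illustration.

```python
import json


def receive_legacy(line):
    """Receiver for a hypothetical legacy 'name|value|k=v,k=v' text format."""
    name, value, raw_tags = line.split("|")
    tags = dict(pair.split("=", 1) for pair in raw_tags.split(",") if pair)
    return {"name": name, "value": float(value), "attributes": tags}


def receive_structured(payload):
    """Receiver for an already-structured record (a stand-in for OTLP)."""
    return dict(payload)


def process(record):
    """Processor: validate and clean a record; return None to drop it."""
    if not record.get("name") or not isinstance(record.get("value"), (int, float)):
        return None
    cleaned = dict(record)
    cleaned["name"] = cleaned["name"].strip().lower()
    cleaned["attributes"] = {
        k.strip(): v.strip() for k, v in cleaned.get("attributes", {}).items()
    }
    return cleaned


class ListExporter:
    """Exporter stand-in: collects serialized messages instead of sending to Kafka."""

    def __init__(self):
        self.messages = []

    def export(self, record):
        self.messages.append(json.dumps(record, sort_keys=True))


def run_pipeline(inputs, exporter):
    """Run (receiver, raw_input) pairs through the pipeline; return the drop count."""
    dropped = 0
    for receiver, raw in inputs:
        try:
            record = process(receiver(raw))
        except (ValueError, AttributeError):
            record = None  # malformed input is dropped, not propagated
        if record is None:
            dropped += 1
        else:
            exporter.export(record)
    return dropped
```

Both input formats converge on a single record shape before export, which is what lets legacy services keep reporting while they migrate.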

Storage Layer

Data is written to VM (VictoriaMetrics) agents, which perform pre‑aggregation and roll‑up. The results are stored in two VM clusters, one for detailed data and one for aggregated data. Exporter plugins can also write to MySQL and HugeGraph for metric and graph storage.
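To make the two storage-side steps concrete, here is a minimal sketch that treats pre‑aggregation as dropping high-cardinality labels and roll‑up as summing into coarser time buckets. The sample layout `(timestamp, labels, value)` and both function names are assumptions for this example, and a real agent would support aggregations other than sum.

```python
from collections import defaultdict


def pre_aggregate(samples, keep_labels):
    """Sum samples that share a timestamp once labels outside keep_labels are dropped."""
    out = defaultdict(float)
    for ts, labels, value in samples:
        reduced = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        out[(ts, reduced)] += value
    return dict(out)


def roll_up(series, bucket_seconds):
    """Sum a {(timestamp, labels): value} series into fixed-width time buckets."""
    out = defaultdict(float)
    for (ts, labels), value in series.items():
        bucket = ts - ts % bucket_seconds  # start of the bucket containing ts
        out[(bucket, labels)] += value
    return dict(out)
```

For example, dropping an `instance` label and rolling 30-second points up to 60 seconds leaves one value per service per minute, which is much cheaper to query over long ranges than the detailed data.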

Product Features

Grafana‑based visualizations with custom dashboards for RPC and resource metrics.

Alerting with rule‑based thresholds, dynamic policies, and multi‑dimensional analysis.

Root‑cause drilling from metrics to traces, supporting multi‑dimensional exploration.

Self‑Observability

Collectors emit metadata about SDK versions, data volume, and errors. Sinkers handle retries and back‑pressure to ensure that data is stored reliably.
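One way to picture a sinker's retry and back‑pressure behaviour is a bounded queue in front of a storage writer. The `Sinker` class below is a hypothetical, single-threaded sketch: the `write` callable, capacity, and attempt limit are assumptions, and a production version would add delays between retries.

```python
import queue


class Sinker:
    """Bounded buffer in front of a storage writer, with retries and self-metrics."""

    def __init__(self, write, capacity=1000, max_attempts=3):
        self._write = write
        self._queue = queue.Queue(maxsize=capacity)
        self._max_attempts = max_attempts
        self.stats = {"accepted": 0, "rejected": 0, "written": 0, "failed": 0, "retries": 0}

    def offer(self, batch):
        """Non-blocking enqueue; a False return signals back-pressure to the caller."""
        try:
            self._queue.put_nowait(batch)
        except queue.Full:
            self.stats["rejected"] += 1
            return False
        self.stats["accepted"] += 1
        return True

    def drain(self):
        """Write every queued batch, retrying each up to max_attempts times."""
        while True:
            try:
                batch = self._queue.get_nowait()
            except queue.Empty:
                return
            for attempt in range(1, self._max_attempts + 1):
                try:
                    self._write(batch)
                    self.stats["written"] += 1
                    break
                except Exception:
                    if attempt < self._max_attempts:
                        self.stats["retries"] += 1
            else:
                # All attempts failed for this batch.
                self.stats["failed"] += 1
```

The `stats` dictionary plays the role of self‑observability data: rejected, retried, and failed counts are the signals an operator would watch to see whether the pipeline itself is healthy.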

Alarm & Root‑Cause Analysis

Scheduled rule checks evaluate expressions (e.g., reqCount>1000 && errorRate>20%). Detected anomalies generate events that are aggregated, stored, and sent to alert channels with detailed root‑cause dimensions.
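A rule such as `reqCount>1000 && errorRate>20%` can be evaluated without a general-purpose expression engine. The sketch below assumes a tiny grammar (comparisons joined by `&&`, optional `%` thresholds) and assumes that rate metrics are stored as fractions, so `20%` becomes `0.2`. The parser and the event shape are illustrative, not Huya's actual rule language.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}


def parse_rule(expr):
    """Parse 'a>1 && b<=20%' into a list of (metric, comparator, threshold)."""
    conditions = []
    for clause in expr.split("&&"):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        for symbol in (">=", "<=", ">", "<"):
            if symbol in clause:
                name, raw = clause.split(symbol, 1)
                raw = raw.strip()
                threshold = float(raw[:-1]) / 100 if raw.endswith("%") else float(raw)
                conditions.append((name.strip(), OPS[symbol], threshold))
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return conditions


def evaluate(conditions, sample):
    """True only if every condition holds for the given metric sample."""
    return all(op(sample[name], threshold) for name, op, threshold in conditions)


def check(expr, samples_by_dimension):
    """Return one event per dimension combination whose sample violates the rule."""
    conditions = parse_rule(expr)
    return [
        {"dimensions": dims, "values": sample}
        for dims, sample in samples_by_dimension.items()
        if evaluate(conditions, sample)
    ]
```

Combining a volume condition with a rate condition, as in the example rule, keeps low-traffic services from firing alerts on a handful of failed requests.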

Use Cases

Multi‑dimensional root‑cause drilling for custom business metrics (e.g., gift revenue drops).

RPC call analysis showing service‑level success rates and pinpointing problematic instances.
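Multi‑dimensional drilling for a metric drop can be approximated by asking, for each dimension, which value lost the most and what share of the total drop it explains. The sketch below is one simple heuristic under that assumption; the row layout (dicts with dimension keys plus `value`) and the `drill_down` name are hypothetical, and the article does not describe Huya's actual algorithm.

```python
from collections import defaultdict


def drill_down(baseline, current, dimensions):
    """Rank dimensions by how much of the total drop their worst value explains."""
    total_delta = sum(r["value"] for r in current) - sum(r["value"] for r in baseline)
    findings = []
    for dim in dimensions:
        delta = defaultdict(float)
        for row in current:
            delta[row[dim]] += row["value"]
        for row in baseline:
            delta[row[dim]] -= row["value"]
        # The dimension value with the largest decrease.
        value, change = min(delta.items(), key=lambda kv: kv[1])
        share = change / total_delta if total_delta else 0.0
        findings.append({"dimension": dim, "value": value, "change": change, "share": share})
    return sorted(findings, key=lambda f: f["share"], reverse=True)
```

A dimension whose single worst value explains nearly all of the drop (share close to 1) is a strong candidate for the root cause, while a drop spread evenly across values suggests the cause lies in another dimension.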

Future Outlook

Plans include further open‑source contributions of generic components such as Meta Analyzer, Pre‑Agg, and Alarm & Cause Analyzer to the OpenTelemetry ecosystem.

Q&A Highlights

Effectiveness is measured by improved integration efficiency, coverage, stability, and faster feature iteration. SDK design focuses on broad language support, zero intrusion, and performance reporting.

