How Huya Unified Its Monitoring Platform with OpenTelemetry for Zero‑Cost Integration
This article describes how Huya moved from fragmented, non‑standard monitoring solutions to a unified platform built on OpenTelemetry. It covers the project background, pain points, design decisions, SDK architecture, data pipeline, storage, alerting, root‑cause analysis, and future plans, and highlights the benefits of standardization and zero‑cost service integration.
Project Background
Huya operated multiple legacy monitoring systems that met early business needs but lacked standardization, causing growing challenges in stability and development efficiency.
Pain Points
Teams collected logs and metrics through a variety of custom methods, with no common SDK. The result was duplicated effort across languages, highly intrusive instrumentation, and difficulty adopting open‑source solutions.
Solution Overview
Based on the OpenTelemetry industry standard, Huya built a unified monitoring platform offering standardized ingestion, storage, analysis, and alerting. The design ensures compatibility, smooth migration, and zero‑cost integration for services.
SDK Architecture Design
A componentized SDK defines Huya‑specific metric, trace, and log conventions. It provides instrumentation across languages (for example, a Java Agent, and framework‑level injection for C++ and Go) so that services integrate transparently and non‑intrusively.
Unified Data Pipeline (OTel Collector)
The pipeline consists of Receivers (supporting legacy and OTLP protocols), Processors (validation, cleaning, protocol conversion), and Exporters (output to Kafka). It ensures data compatibility and high‑performance ingestion.
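The receiver, processor, and exporter stages can be modeled in a few lines. The sketch below is illustrative Python, not the Collector's Go API. The legacy text format (name|value|k=v,...), the Metric shape, and every function and class name are assumptions made for the example, and an in‑memory list stands in for Kafka.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metric:
    """Normalized internal record (hypothetical shape, not the OTLP schema)."""
    name: str
    value: float
    attributes: dict = field(default_factory=dict)


def receive_legacy(line: str) -> Metric:
    """Receiver for an assumed legacy text format: 'name|value|k1=v1,k2=v2'."""
    name, value, raw_attrs = line.split("|")
    attrs = dict(pair.split("=", 1) for pair in raw_attrs.split(",") if pair)
    return Metric(name=name, value=float(value), attributes=attrs)


def receive_otlp_like(payload: dict) -> Metric:
    """Receiver for an already structured payload (a stand-in for OTLP)."""
    return Metric(payload["name"], float(payload["value"]),
                  dict(payload.get("attributes", {})))


def validate(metric: Metric) -> Optional[Metric]:
    """Processor: drop records with an empty name or a NaN value."""
    if not metric.name or metric.value != metric.value:
        return None
    return metric


def clean(metric: Metric) -> Optional[Metric]:
    """Processor: normalize the name and remove empty attribute values."""
    attrs = {k: v for k, v in metric.attributes.items() if v != ""}
    return Metric(metric.name.strip().lower(), metric.value, attrs)


class ListExporter:
    """Exporter stand-in: appends to a list where a real one would produce to Kafka."""

    def __init__(self):
        self.sent = []

    def export(self, metric: Metric) -> None:
        self.sent.append(metric)


def run_pipeline(metrics, processors, exporter):
    """Pass each received metric through the processors in order, then export it."""
    for metric in metrics:
        for process in processors:
            metric = process(metric)
            if metric is None:
                break  # a processor rejected the record
        else:
            exporter.export(metric)
```

Both receivers produce the same Metric type, so the processors and the exporter never need to know which protocol a record arrived on. That is the property that lets legacy and OTLP traffic share one pipeline during migration.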
Storage Layer
Data is written through VM agents, which apply pre‑aggregation and roll‑up, and is stored in two sets of VM clusters: one for detailed data and one for aggregated data. Exporter plugins can also write to MySQL and HugeGraph for metric and graph storage.
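As a rough illustration of the two operations, pre‑aggregation can be read as collapsing labels that are not needed for querying, and roll‑up as merging samples into coarser time windows. The Python sketch below assumes counter‑style metrics where summing is the right aggregation. The sample tuple shape, the label names, and both function names are hypothetical and do not describe the VM agent configuration Huya uses.

```python
from collections import defaultdict


def pre_aggregate(samples, keep_labels):
    """Sum values of samples that share a timestamp and the kept labels.

    Each sample is (timestamp_seconds, labels_dict, value). Labels not listed
    in keep_labels (for example a per-instance label) are dropped.
    """
    totals = defaultdict(float)
    for ts, labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        totals[(ts, kept)] += value
    return [(ts, dict(kept), total) for (ts, kept), total in sorted(totals.items())]


def roll_up(samples, window_seconds):
    """Sum samples into fixed windows aligned to multiples of window_seconds."""
    totals = defaultdict(float)
    for ts, labels, value in samples:
        bucket = ts - ts % window_seconds
        totals[(bucket, tuple(sorted(labels.items())))] += value
    return [(ts, dict(kept), total) for (ts, kept), total in sorted(totals.items())]
```

Keeping the detailed and the aggregated series in separate stores means dashboards that only need service‑level totals can read the smaller data set, while drill‑down queries still have the per‑instance samples available.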
Product Features
Grafana‑based visualizations with custom dashboards for RPC and resource metrics.
Alerting with rule‑based thresholds, dynamic policies, and multi‑dimensional analysis.
Root‑cause drilling from metrics to traces, supporting multi‑dimensional exploration.
Self‑Observability
Collectors emit metadata about SDK versions, data volume, and errors. Sinkers handle retries and back‑pressure to ensure reliable storage.
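One common way to combine retries with back‑pressure is a bounded buffer in front of a retrying writer. The sketch below shows that pattern in Python. It is an assumption about how such a sinker could work, not Huya's implementation, and the Sinker class, its parameters, and its counters are hypothetical names.

```python
import queue
import time


class Sinker:
    """Bounded buffer plus retrying writer (illustrative only)."""

    def __init__(self, write, capacity=1000, max_attempts=3, base_delay=0.1):
        self._write = write  # callable that persists one batch and raises on failure
        self._buffer = queue.Queue(maxsize=capacity)
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self.dropped = 0  # batches refused because the buffer was full
        self.failed = 0   # batches abandoned after all attempts

    def offer(self, batch):
        """Back-pressure: refuse new work instead of growing memory without bound."""
        try:
            self._buffer.put_nowait(batch)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """Write every buffered batch, retrying with exponential backoff."""
        while True:
            try:
                batch = self._buffer.get_nowait()
            except queue.Empty:
                return
            for attempt in range(self._max_attempts):
                try:
                    self._write(batch)
                    break
                except Exception:
                    # Wait before the next attempt, but not after the last one.
                    if attempt < self._max_attempts - 1:
                        time.sleep(self._base_delay * (2 ** attempt))
            else:
                self.failed += 1
```

The dropped and failed counters are the kind of signal a self‑observing pipeline would itself report as metrics, so that data loss in the monitoring path is visible rather than silent.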
Alarm & Root‑Cause Analysis
Scheduled rule checks evaluate expressions such as reqCount>1000 && errorRate>20%. Each detected anomaly generates an event. Events are aggregated, stored, and sent to alert channels together with the dimensions that point to the root cause.
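A single pass of such a rule check can be sketched as follows. This Python example supports only conditions joined by && and treats a trailing % as a fraction, so 20% becomes 0.2. That reading of the syntax, the function names, and the event fields are assumptions for illustration. The real rule language is not specified in the article.

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
_CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)(%?)\s*$")


def parse_rule(expression):
    """Split 'a>1 && b>20%' into (metric, comparator, threshold) triples."""
    conditions = []
    for part in expression.split("&&"):
        match = _CONDITION.match(part)
        if match is None:
            raise ValueError(f"unsupported condition: {part!r}")
        name, op, number, percent = match.groups()
        threshold = float(number) / 100 if percent else float(number)
        conditions.append((name, _OPS[op], threshold))
    return conditions


def evaluate(conditions, sample):
    """Return True only if every condition holds for the sample."""
    return all(compare(sample[name], threshold)
               for name, compare, threshold in conditions)


def check(expression, samples_by_target):
    """One scheduled pass: emit an event per target whose sample matches the rule."""
    conditions = parse_rule(expression)
    return [
        {"target": target, "rule": expression, "sample": sample}
        for target, sample in samples_by_target.items()
        if evaluate(conditions, sample)
    ]
```

Pairing the error‑rate condition with a request‑count floor, as the example rule does, keeps low‑traffic services from alerting on a handful of failed requests.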
Use Cases
Multi‑dimensional root‑cause drilling for custom business metrics (e.g., gift revenue drops).
RPC call analysis showing service‑level success rates and pinpointing problematic instances.
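Multi‑dimensional drilling of this kind can be approximated by comparing a baseline window with the current window and asking which dimension value accounts for most of the change. The Python sketch below uses a simple contribution‑share heuristic chosen for illustration. The article does not describe the algorithm Huya actually uses, and the row format, dimension names, and drill_down function are hypothetical.

```python
from collections import defaultdict


def drill_down(baseline_rows, current_rows, dimensions, metric):
    """Rank (dimension, value) pairs by how much of the total change they explain.

    Rows are dicts holding dimension values and one numeric metric. Each
    result is (dimension, value, delta, share), where share is the pair's
    change divided by the overall change.
    """
    total_change = (sum(row[metric] for row in current_rows)
                    - sum(row[metric] for row in baseline_rows))
    if total_change == 0:
        return []
    findings = []
    for dim in dimensions:
        change = defaultdict(float)
        for row in baseline_rows:
            change[row[dim]] -= row[metric]
        for row in current_rows:
            change[row[dim]] += row[metric]
        for value, delta in change.items():
            findings.append((dim, value, delta, delta / total_change))
    # Highest share first: the pair that best explains the change leads the list.
    return sorted(findings, key=lambda finding: finding[3], reverse=True)
```

For a gift revenue drop, the rows might carry dimensions such as region and platform. If the top entry is one platform with a share near 1.0 while every region has a similar share, the drop is concentrated on that platform rather than in any region.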
Future Outlook
Plans include further open‑source contributions of generic components such as Meta Analyzer, Pre‑Agg, and Alarm & Cause Analyzer to the OpenTelemetry ecosystem.
Q&A Highlights
Effectiveness is measured by improved integration efficiency, coverage, stability, and faster feature iteration. SDK design focuses on broad language support, zero intrusion, and performance reporting.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.