How Huya Built a Scalable APM Platform for Full‑Stack Observability

Facing explosive growth and increasingly complex distributed services, Huya designed and deployed a custom APM platform that unifies metric, trace, and log collection, provides zero‑cost integration, supports real‑time root‑cause analysis, and offers open APIs for cross‑team empowerment.

dbaplus Community

Jul 6, 2023

Project Background

As Huya’s live‑streaming business scaled, the diversity of monitoring solutions across teams (custom log collectors, various open‑source tracing tools) created fragmented traceability and hindered efficient troubleshooting. To address these pain points, Huya built an end‑to‑end APM platform that transparently integrates with services, aggregates Metric, Trace, and Log data, and enables full‑link root‑cause identification.

Solution Practice

Design Considerations

Data collection must be zero‑cost to encourage adoption.

The observability model must link Metric, Trace, and Log data across services.

Coverage must span the full link: client → Nginx/Signal → distributed services → DB/Cache.

Observability Model

Handle model: captures request type, handling method, resources, and associated Metric/Span/Log data; Trace context is passed via thread‑local storage.

RPC model: records RPC/HTTP/DB/cache calls, propagating Trace context via request headers.

Correlation: vertical linking of intra‑process and inter‑process calls; horizontal linking of queues, thread pools, and connection pools.
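
The two propagation paths described above can be sketched in a few lines. This is a minimal illustration, not Huya's SDK: the header name `x-trace-context`, the `trace_id:span_id` encoding, and the function names `start_handle` and `inject` are all assumptions made for the example.

```python
import threading
import uuid

# Hypothetical header name; the article does not say which header is used.
TRACE_HEADER = "x-trace-context"

# Thread-local slot holding (trace_id, span_id) for the request on this thread.
_local = threading.local()


def start_handle(headers):
    """Handle model: adopt the caller's trace from headers, or start a new one."""
    raw = headers.get(TRACE_HEADER)
    if raw:
        trace_id, parent_span_id = raw.split(":", 1)
    else:
        trace_id, parent_span_id = uuid.uuid4().hex, None
    span_id = uuid.uuid4().hex[:16]
    _local.ctx = (trace_id, span_id)
    return {"trace_id": trace_id, "span_id": span_id, "parent": parent_span_id}


def inject(headers):
    """RPC model: copy this thread's context into outgoing request headers."""
    ctx = getattr(_local, "ctx", None)
    if ctx is not None:
        headers[TRACE_HEADER] = f"{ctx[0]}:{ctx[1]}"
    return headers


upstream = start_handle({})            # entry service: no incoming context
outgoing = inject({})                  # headers for the downstream call
downstream = start_handle(outgoing)    # downstream service joins the trace
```

Here `downstream` shares the trace ID of `upstream` and records the upstream span as its parent, which is the vertical inter‑process link. Horizontal links across queues and thread pools would additionally need the context copied when work is handed to another thread, which this sketch does not cover.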

SDK Architecture

Zero‑cost integration using bytecode weaving for Java and framework hooks for C++ and client SDKs.

Built on the OpenTracing standard with a Jaeger implementation, extending it for custom Metric aggregation.

Layered, plug‑in design allows optional features and remote control (sampling, logging, feature toggles).

Server‑Side Architecture

Collector service handles data ingestion, flow control, validation, cleaning, and transformation.

Sinkers store and analyze data:

Trace Sinker – parses span details, adds custom tags and logs.

Metric Sinker – parses metric data, adds dimension tags.

Meta Data Analyzer – extracts URI, interface name, service name, instance info.

Error Analyzer – detects timeline errors, correlates metrics and traces for root‑cause analysis.
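
As a rough sketch of the collector-to-sinker flow listed above, the snippet below validates a record, cleans one field, and routes it by kind. The record layout (`kind`, `service`, `payload`) and the `collect` function are hypothetical; the article does not describe the wire format, and flow control is omitted here.

```python
# Hypothetical record fields; the real collector schema is not given in the article.
REQUIRED_FIELDS = ("kind", "service", "payload")


def collect(record, sinkers):
    """Validate, clean, and route one record; return True if it was accepted."""
    # Validation: reject records that lack any required field.
    if any(field not in record for field in REQUIRED_FIELDS):
        return False
    # Routing: reject kinds that have no registered sinker.
    sinker = sinkers.get(record["kind"])
    if sinker is None:
        return False
    # Cleaning: normalize the service name so later aggregation groups correctly.
    cleaned = dict(record, service=record["service"].strip().lower())
    sinker(cleaned)
    return True


# Lists stand in for the Trace Sinker and Metric Sinker stores.
traces, metrics = [], []
sinkers = {"trace": traces.append, "metric": metrics.append}

accepted = collect({"kind": "trace", "service": " User-Service ", "payload": {}}, sinkers)
rejected = collect({"kind": "trace"}, sinkers)
```

Keeping validation and cleaning in the collector means each sinker can assume well‑formed input and focus on parsing and tagging.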

Error Analysis – Application Performance Root‑Cause

By visualizing thread‑load rates and request latency, the platform pinpointed high load on /hikari/getConnection as the cause of downstream timeouts in /getUser. Multi‑node aggregation revealed that a single downstream node (sim‑second:10.66.109.165:8001) was the bottleneck, enabling rapid isolation of the problematic service.

Error Analysis – Performance Bottleneck

Thread‑pool snapshots showed near‑full utilization with many threads blocked on connection acquisition. Detailed request‑level metrics confirmed that /getUser → /hikari/getConnection accounted for the majority of load, linking request processing, thread usage, and connection resources.
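
The kind of aggregation behind this analysis can be approximated by summing busy time per request path and per node, then ranking each group by its share of the total. The sketch below assumes a simple span record with `path`, `node`, and `duration_ms` fields; the sample values are invented for illustration and are not Huya's measurements.

```python
from collections import defaultdict


def load_share(spans, key):
    """Return (group, share of total busy time) pairs, largest share first."""
    busy = defaultdict(float)
    for span in spans:
        busy[span[key]] += span["duration_ms"]
    total = sum(busy.values())
    if total == 0:
        return []
    shares = ((group, value / total) for group, value in busy.items())
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Illustrative sample only: two paths served by two hypothetical nodes.
spans = [
    {"path": "/getUser", "node": "node-a", "duration_ms": 900.0},
    {"path": "/getUser", "node": "node-b", "duration_ms": 50.0},
    {"path": "/health", "node": "node-b", "duration_ms": 50.0},
]

by_path = load_share(spans, "path")
by_node = load_share(spans, "node")
```

Grouping the same spans by `path` and then by `node` mirrors the two steps in the text: first find the request that dominates thread time, then check whether one node accounts for most of it.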

Open Empowerment

The APM platform exposes APIs for other teams, enabling:

Metric time‑series data for AIOps anomaly detection.

Trace detail data for automated test request tracing.

Trace context propagation for user‑level request coloring.

Q&A

Q1: Does the platform impact application performance? Metric collection runs in asynchronous threads with minimal overhead. Trace spans are sampled (pre‑sampling and lossy post‑sampling) to limit runtime overhead and storage cost.
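
The article does not detail the sampling rules, so the following is only one plausible reading: pre‑sampling decides when a trace starts, keyed on the trace ID so every service makes the same choice, and post‑sampling decides after the trace finishes, keeping only traces that erred or ran slowly. The function names, the CRC32 bucketing, and the `slow_ms` threshold are assumptions.

```python
import zlib


def pre_sample(trace_id, rate):
    """Head decision: keep a stable fraction of traces, keyed on the trace ID."""
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10000
    return bucket < int(rate * 10000)


def post_sample(spans, slow_ms):
    """Tail decision: keep a finished trace only if it erred or ran slowly."""
    return any(s.get("error") or s["duration_ms"] >= slow_ms for s in spans)


fast_ok = [{"duration_ms": 12.0}, {"duration_ms": 3.0}]
with_error = [{"duration_ms": 12.0, "error": True}]

keep_fast = post_sample(fast_ok, slow_ms=500.0)
keep_error = post_sample(with_error, slow_ms=500.0)
```

Hashing the trace ID rather than drawing a random number keeps the decision consistent across processes, so a trace is either collected end to end or not at all.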

Q2: How are trace spans correlated with logs and metrics? The SDK tags Metric dimensions to match Trace span tags, allowing joint analysis of request paths, resource usage, and error logs.

Q3: Can custom application logs be collected? The Java SDK adds trace IDs to logs via MDC; the C++ SDK attaches similar identifiers. Custom logs are ingested into the log platform and can be queried together with trace data.
