How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break data silos across its numerous operations systems, enabling standardized data ingestion, association, visualization and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑control for SRE teams.

Efficient Ops

1. Project Background

As Huya’s business grew rapidly, its operations ecosystem expanded to cover resource provisioning, containerization, build and release, monitoring, and alerting. Each system maintained its own metadata, creating data islands that severely hindered data utilization and led to three problems:

Severe data‑island problems due to lack of a unified metadata model.

Cost‑control difficulties because business resource usage and cross‑data‑center traffic could not be analyzed effectively.

Root‑cause location challenges caused by missing associations among monitoring metrics, trace links and alerts.

2. Solution Overview

Huya built a unified metadata platform that integrates metadata from all systems, provides a standard model, and offers capabilities for data ingestion, association, storage, visualization and analysis, thereby supporting resource governance, root‑cause diagnosis and architectural optimization.

3. Design Details

The platform treats application services as the core of the metadata network, linking horizontal (application‑to‑application) and vertical (application‑to‑resource) relationships to form a comprehensive metadata association graph.
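As a rough illustration of that graph, the sketch below keeps applications at the center and records both kinds of relationship as labeled edges. It is a toy in-memory structure, not the platform's actual model: the class name `MetadataGraph` and the edge labels `calls` and `runs_on` are assumptions made for this example, and the service names are borrowed from the gift-sending case later in the article.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Toy association graph; the class and edge labels are hypothetical."""
    kinds: dict = field(default_factory=dict)  # vertex id -> kind
    edges: dict = field(default_factory=lambda: defaultdict(list))  # src -> [(label, dst)]

    def add_vertex(self, vid, kind):
        self.kinds[vid] = kind

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def neighbors(self, vid, label):
        """Return the targets of all edges with the given label leaving vid."""
        return [dst for lbl, dst in self.edges[vid] if lbl == label]


graph = MetadataGraph()
graph.add_vertex("GiftServer", "application")
graph.add_vertex("MoneyServer", "application")
graph.add_vertex("192.168.1.2:8080", "instance")

# Horizontal relationship: application -> application
graph.add_edge("GiftServer", "calls", "MoneyServer")
# Vertical relationship: application -> resource
graph.add_edge("MoneyServer", "runs_on", "192.168.1.2:8080")

print(graph.neighbors("GiftServer", "calls"))
print(graph.neighbors("MoneyServer", "runs_on"))
```

Keeping the label on every edge lets a single structure answer both "what does this service call?" and "where does this service run?".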

Metadata Types

Application services: service name, IP/Port, API, dependencies, framework, code repository, etc.

Monitoring metrics: CPU, memory, network utilization, request volume, latency, error rates, etc.

Infrastructure: containers, data centers, domains, network types, resource usage, etc.

Middleware: databases, caches, message queues, real‑time and batch compute components.

All metadata are synchronized or reported to the platform and transformed according to a unified association model.
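The transformation step might look something like the following sketch, which maps records from two source systems into one vertex-plus-edges shape. The article does not publish the ingestion specification, so the source names (`service_registry`, `container_platform`), the record fields, and the id prefixes are all hypothetical.

```python
def to_unified(source, record):
    """Map a source-specific record to (vertex, edges) in one common shape.

    Source names and record fields are illustrative assumptions only.
    """
    if source == "service_registry":
        vertex = {
            "id": f"app:{record['service']}",
            "kind": "application",
            "props": {"endpoint": record["ip_port"]},
        }
        edges = [(vertex["id"], "depends_on", f"app:{dep}")
                 for dep in record.get("deps", [])]
    elif source == "container_platform":
        vertex = {
            "id": f"container:{record['pod']}",
            "kind": "container",
            "props": {"data_center": record["dc"]},
        }
        # The owning application points at the container it runs on.
        edges = [(f"app:{record['app']}", "runs_on", vertex["id"])]
    else:
        raise ValueError(f"no conversion rule for source {source!r}")
    return vertex, edges


app_vertex, app_edges = to_unified(
    "service_registry",
    {"service": "MoneyServer", "ip_port": "192.168.1.2:8080", "deps": ["Mysql"]},
)
pod_vertex, pod_edges = to_unified(
    "container_platform",
    {"pod": "money-0", "app": "MoneyServer", "dc": "dc-a"},
)
```

Because both sources emit ids in the same `app:` namespace, the edges they produce connect to each other without any extra join logic.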

4. Implementation Practices

Key steps include:

Design metadata ingestion specifications and a unified association model.

Connect application, resource and middleware metadata to build a complete association network.

Provide visualization, retrieval and analysis capabilities for downstream users.

Core modules: data conversion, association, query (SDK/OpenAPI/Gremlin) and resource replay.

Storage engines: Graph DB stores the vertex/edge model of the association network for high‑performance multi‑level queries; OLAP DB stores multi‑dimensional, time‑granular snapshots for large‑scale statistical analysis.
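To make the split between the two engines concrete, here is a minimal sketch of the two query shapes in plain Python: a multi-hop traversal (the kind of question a graph store answers) and a grouped aggregate over time-stamped snapshots (the kind an OLAP store answers). It does not use any real database client; the function names, the snapshot fields, and the sample values are assumptions for illustration.

```python
from collections import defaultdict, deque


def downstream(edges, start, max_depth):
    """Breadth-first walk: return {vertex: hop count} within max_depth hops."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]
    return seen


def avg_cpu_by_hour(snapshots):
    """Group snapshots by (app, hour) and average the cpu field."""
    totals = defaultdict(lambda: [0.0, 0])
    for snap in snapshots:
        key = (snap["app"], snap["hour"])
        totals[key][0] += snap["cpu"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


edges = [("Huya App", "GiftServer"), ("GiftServer", "MoneyServer"),
         ("MoneyServer", "Mysql")]
snapshots = [
    {"app": "MoneyServer", "hour": 10, "cpu": 80.0},
    {"app": "MoneyServer", "hour": 10, "cpu": 90.0},
    {"app": "GiftServer", "hour": 10, "cpu": 40.0},
]
two_hops = downstream(edges, "Huya App", 2)
hourly = avg_cpu_by_hour(snapshots)
```

The traversal cost grows with the number of hops followed, while the aggregate cost grows with the number of snapshot rows scanned, which is one reason to store the two kinds of data separately.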

5. Application Scenarios

Multi‑dimensional Resource Analysis

The platform automatically aggregates resource usage and utilization across multiple dimensions, enabling rationality analysis of business resources and detection of cross‑data‑center traffic.
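A minimal sketch of both analyses, assuming usage rows that carry hypothetical `business`, `dc`, `cpu_requested` and `cpu_used` fields and a placement map from service to data center. The real platform's dimensions and metrics are not listed in the article, so treat every field name and number here as illustrative.

```python
from collections import defaultdict


def utilization_by(usage, dims):
    """Sum requested and used CPU per dimension tuple, then return used/requested."""
    totals = defaultdict(lambda: {"requested": 0, "used": 0})
    for row in usage:
        key = tuple(row[d] for d in dims)
        totals[key]["requested"] += row["cpu_requested"]
        totals[key]["used"] += row["cpu_used"]
    return {
        key: (t["used"] / t["requested"] if t["requested"] else None)
        for key, t in totals.items()
    }


def cross_dc_calls(calls, placement):
    """Return the call edges whose two endpoints sit in different data centers."""
    return [(src, dst) for src, dst in calls if placement[src] != placement[dst]]


usage = [
    {"business": "gift", "dc": "dc-a", "cpu_requested": 8, "cpu_used": 2},
    {"business": "gift", "dc": "dc-b", "cpu_requested": 8, "cpu_used": 4},
]
calls = [("GiftServer", "MoneyServer"), ("MoneyServer", "Mysql")]
placement = {"GiftServer": "dc-a", "MoneyServer": "dc-b", "Mysql": "dc-b"}

per_business = utilization_by(usage, ["business"])
per_business_dc = utilization_by(usage, ["business", "dc"])
crossing = cross_dc_calls(calls, placement)
```

Passing the dimension list as a parameter is what makes the same rows reusable for per-business, per-data-center, or combined views.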

Multi‑label Classification

A graph‑based hierarchical tag system, generated from trace data and AIOps, allows custom tagging of applications and flexible queries across various label dimensions.
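One simple way to picture hierarchical tags is as slash-separated paths, where querying a parent tag also matches everything beneath it. The sketch below uses that simplification rather than a real graph store, and the tag names and the `apps_with_tag` helper are hypothetical.

```python
def apps_with_tag(app_tags, prefix):
    """Return apps carrying a tag equal to, or nested under, the given tag path."""
    wanted = prefix.split("/")
    return sorted(
        app
        for app, tags in app_tags.items()
        if any(tag.split("/")[:len(wanted)] == wanted for tag in tags)
    )


# Hypothetical tags; one app may sit in several label dimensions at once.
app_tags = {
    "GiftServer": ["business/live/gift", "tier/core"],
    "MoneyServer": ["business/payment", "tier/core"],
    "ReportJob": ["business/live/report"],
}

live_apps = apps_with_tag(app_tags, "business/live")
core_apps = apps_with_tag(app_tags, "tier/core")
```

Comparing path segments rather than raw string prefixes avoids false matches such as `business/live` matching `business/livestock`.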

Full‑link Root‑cause Localization

By correlating business, application and infrastructure metrics with trace data, the platform pinpoints failure origins, illustrated by the “gift‑sending” success‑rate case.
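The article does not describe the localization algorithm itself. One plausible heuristic, sketched below, is to start at the entry point and keep following downstream dependencies that are themselves flagged as anomalous, stopping at the deepest one. The service names and symptoms are taken from the gift-sending trace shown in this article; the function name and the decision to leave `Mysql` unflagged (its details are elided in the trace) are assumptions.

```python
def locate_root_cause(calls, anomalies, entry):
    """Follow anomalous downstream dependencies from entry; return the path taken."""
    path = [entry]
    visited = {entry}
    current = entry
    while True:
        flagged = [dep for dep in calls.get(current, [])
                   if dep in anomalies and dep not in visited]
        if not flagged:
            return path  # the last node is the deepest anomalous service
        current = flagged[0]
        visited.add(current)
        path.append(current)


calls = {
    "Huya App": ["GiftServer"],
    "GiftServer": ["MoneyServer"],
    "MoneyServer": ["Mysql"],
}
# Symptoms per service; Mysql is left out because the trace elides its details.
anomalies = {
    "Huya App": "/api/sendGift success rate 70%",
    "GiftServer": "/api/payMoney timeout rate 30%",
    "MoneyServer": "/api/payMoney avg latency 5100ms, CPU usage 90%",
}

path = locate_root_cause(calls, anomalies, "Huya App")
suspect = path[-1]
```

With this sample data the walk ends at `MoneyServer`, whose slow SQL call and high CPU usage would then be the place to look first.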

Future Outlook

The roadmap extends the platform to cover the entire DevOps lifecycle (code repository → build & release → runtime), enabling rapid detection of services that need security‑related library upgrades (e.g., Log4j) and of change‑induced anomalies across the full service lifecycle.
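Once library dependencies are linked to applications, a query for affected services could be as simple as the sketch below. It assumes plain dotted numeric versions and a caller-supplied minimum safe version; the application names, library versions, and the `apps_needing_upgrade` helper are hypothetical.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '2.14.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def apps_needing_upgrade(app_libs, library, min_safe):
    """Return apps that use the library at a version below min_safe."""
    floor = parse_version(min_safe)
    return sorted(
        app
        for app, libs in app_libs.items()
        if library in libs and parse_version(libs[library]) < floor
    )


# Hypothetical dependency metadata collected at build time.
app_libs = {
    "GiftServer": {"log4j-core": "2.14.1"},
    "MoneyServer": {"log4j-core": "2.17.1"},
    "ReportJob": {"logback-classic": "1.2.3"},
}

# The minimum safe version is an input chosen by the security team.
affected = apps_needing_upgrade(app_libs, "log4j-core", "2.17.1")
```

Comparing integer tuples instead of strings matters here: as strings, `"2.9.0"` would sort after `"2.17.1"`.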

Q&A

Answers address topology complexity mitigation, trace‑based alarm root‑cause analysis, and the timing of metadata standardization (initial bulk sync followed by incremental event‑driven updates).
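The sync pattern mentioned in the last answer, an initial bulk load followed by event-driven increments, can be sketched roughly as follows. The event shape (`op`, `record`, `id`) and the use of a plain dict as the store are assumptions for illustration only.

```python
def bulk_sync(store, records):
    """Initial load: replace the store contents with a full snapshot."""
    store.clear()
    for record in records:
        store[record["id"]] = record


def apply_event(store, event):
    """Incremental update: apply one change event to the store."""
    if event["op"] == "upsert":
        store[event["record"]["id"]] = event["record"]
    elif event["op"] == "delete":
        store.pop(event["id"], None)  # deleting a missing id is a no-op
    else:
        raise ValueError(f"unknown operation {event['op']!r}")


store = {}
bulk_sync(store, [
    {"id": "app:GiftServer", "dc": "dc-a"},
    {"id": "app:MoneyServer", "dc": "dc-a"},
])
apply_event(store, {"op": "upsert", "record": {"id": "app:MoneyServer", "dc": "dc-b"}})
apply_event(store, {"op": "delete", "id": "app:GiftServer"})
```

Making both operations idempotent means a replayed or duplicated event leaves the store in the same state.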

Example trace for the “gift‑sending” root‑cause case described in the application scenarios above:

<code>Huya App{/api/sendGift, Success Rate: 70%}
    -> GiftServer{/api/payMoney, Timeout Rate: 30%}
    -> MoneyServer{/api/payMoney, Latency avg: 5100ms, 192.168.1.2:8080, CPU Usage: 90%,
        /mysql/SELECT, Latency avg: 5000ms,
        sql detail: select * from money where uid=?}
    -> Mysql{more...}</code>
Tags: operations, graph database, SRE, root cause analysis, resource governance, metadata platform
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
