Evolution of Xiaohongshu Metrics System: Cloud‑Native Observability, High Availability, and Performance Optimizations
Xiaohongshu’s observability team rebuilt its Prometheus‑based metrics platform around vmagent, dual‑active HA clusters, query push‑down, high‑cardinality governance, and a multi‑cloud active‑active design, delivering a ten‑fold improvement in collection performance, up to 70× more query capacity, large savings in CPU, memory, and storage, and fully automated scaling.
In the cloud‑native era, metrics are the cornerstone of monitoring, alerting, performance tuning, and capacity planning. Rapid business growth pushed Xiaohongshu’s metric volume to billions of data points, exposing stability, cost, and operational problems in the original Prometheus‑based architecture.
The team performed a comprehensive overhaul, achieving ten‑fold improvements in collection and query performance, dramatically reducing CPU, memory, and storage costs, and delivering minute‑level scaling and high availability.
1. Background and Problems
The single‑replica architecture caused data loss and high latency during traffic spikes.
Prometheus instances routinely exceeded 85% memory utilization, triggering frequent alerts.
Deployment was fragmented across many K8s clusters and VMs, leading to complex operations and lack of rapid scaling.
Slow query responses and the fallback path through Thanos made for a poor user experience.
2. Evolution Roadmap
2.1 Collection
Replaced Prometheus agents with vmagent and eliminated Thanos.
Implemented dynamic configuration via a central config center, enabling hot‑reload without service disruption.
Supported sharding and smooth horizontal scaling by adjusting a shard_count parameter.
Added sample‑limit and label‑length validation to protect the collector from time‑series explosion.
Optimized start‑up and deletion of large numbers of scrape targets, reducing CPU spikes and OOM risk.
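The sample‑limit and label‑length guardrails described above map onto Prometheus‑compatible scrape‑configuration options that vmagent also honors. A minimal sketch (the job name and limit values here are illustrative, not Xiaohongshu’s actual settings):

```yaml
scrape_configs:
  - job_name: app-pods            # illustrative job name
    kubernetes_sd_configs:
      - role: pod
    sample_limit: 50000           # fail the scrape if a target exposes more series
    label_limit: 64               # cap the number of labels per series
    label_name_length_limit: 128  # reject oversized label names
    label_value_length_limit: 512 # reject oversized label values
```

Limits like these let the collector reject a misbehaving target outright instead of absorbing a time‑series explosion into memory.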
2.2 High‑Availability Refactor
Deployed full‑link dual‑active clusters for both collection and storage, providing fault tolerance for single‑node or whole‑cluster failures.
Replaced Thanos‑style Reroute with a local queue mechanism and a Meta service for service discovery, eliminating cascade failures.
Introduced cloud‑disk storage, backup/restore snapshots, and automated migration to ensure data safety.
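The local‑queue idea corresponds to vmagent’s built‑in persistent buffering: when a remote storage cluster is unreachable, samples spill to local disk rather than being rerouted through other collectors. A sketch of a dual‑active launch configuration (hostnames, paths, and sizes are illustrative):

```shell
# Illustrative vmagent launch: every sample is written to two independent
# storage clusters (dual-active); on remote failure, data buffers on
# local disk instead of cascading to other collectors.
vmagent \
  -promscrape.config=/etc/vmagent/scrape.yml \
  -remoteWrite.url=http://vmstorage-a.example.internal:8480/insert/0/prometheus \
  -remoteWrite.url=http://vmstorage-b.example.internal:8480/insert/0/prometheus \
  -remoteWrite.tmpDataPath=/var/lib/vmagent/queue \
  -remoteWrite.maxDiskUsagePerURL=10GiB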
2.3 Query Optimization
Implemented computation push‑down: aggregation (sum, count, avg, max, min) is performed on storage nodes, cutting data transfer by orders of magnitude.
Added query‑data‑size limits and memory protection on both query and storage nodes.
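The core of push‑down is that each storage node collapses its raw samples into a small partial aggregate, and the query node only merges those partials. The subtlety is that avg cannot be pushed down directly; it must travel as a (sum, count) pair. A minimal sketch (function names are illustrative, not Xiaohongshu's actual API):

```python
def partial_aggregate(samples):
    """Runs on a storage node: collapse raw samples into one small record."""
    return {
        "sum": sum(samples),
        "count": len(samples),
        "max": max(samples),
        "min": min(samples),
    }

def merge_partials(partials):
    """Runs on the query node: merge per-node partials into final answers."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {
        "sum": total,
        "count": count,
        "avg": total / count,  # avg is derived from merged sum/count
        "max": max(p["max"] for p in partials),
        "min": min(p["min"] for p in partials),
    }

# Two storage nodes aggregate locally; only four numbers per node cross
# the network instead of every raw sample.
node_a = partial_aggregate([1.0, 2.0, 3.0])
node_b = partial_aggregate([10.0, 20.0])
result = merge_partials([node_a, node_b])
```

Since each node's contribution is constant-sized regardless of how many samples it scanned, the data transferred shrinks by orders of magnitude as the source describes.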
2.4 High Cardinality Governance
Adopted a differential dynamic cardinality management strategy with label white‑lists and Bloom‑filter based detection.
Provided a separate high‑cardinality pipeline to isolate noisy metrics.
2.5 Multi‑Cloud Active‑Active
Deployed unit‑level collection, ingestion, and storage per cloud region, reducing cross‑cloud bandwidth by ~80%.
Unified query across regions with push‑down aggregation, ensuring resilience when a single cloud experiences outages.
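The unified-query behavior above can be sketched as a fan-out that merges per-region partial aggregates and skips unreachable regions instead of failing the whole query (region names and the fetch function below are hypothetical):

```python
def fan_out_sum(regions, fetch):
    """Query every region, merge partial sums, tolerate region outages."""
    total, answered = 0.0, []
    for region in regions:
        try:
            total += fetch(region)   # region-local partial aggregate
            answered.append(region)
        except ConnectionError:
            continue                 # degrade gracefully on a region outage
    if not answered:
        raise RuntimeError("all regions unavailable")
    return total, answered

def fake_fetch(region):
    """Stand-in for a region-local query endpoint."""
    data = {"cloud-a": 100.0, "cloud-b": 250.0}
    if region not in data:
        raise ConnectionError(region)
    return data[region]

# cloud-c is down; the query still answers from the surviving regions.
total, answered = fan_out_sum(["cloud-a", "cloud-b", "cloud-c"], fake_fetch)
```

Because each region ships only its partial aggregate, cross-cloud traffic stays small and the loss of one cloud degrades coverage rather than availability.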
3. Results
Collection performance improved ~10×, saving tens of thousands of CPU cores and hundreds of TB of SSD.
Query latency dropped by anywhere from several‑fold to dozens of times, and query capacity increased by up to 70×.
Operational overhead dropped dramatically: configuration changes are now made self‑service through a web console, scaling is automated, and migrations that once took weeks now finish in half a day.
4. Authors
Han Bai, Bu Ke, and A Pu – members of Xiaohongshu’s Observability Technology Group, with backgrounds in distributed systems, cloud‑native infrastructure, and performance engineering.
Xiaohongshu Tech (REDtech) is the official account of the Xiaohongshu technology team, sharing technical innovations and engineering insights.