Evolution of Xiaohongshu Metrics System: Cloud‑Native Observability, High Availability, and Performance Optimizations
Xiaohongshu’s observability team rebuilt its Prometheus‑based metrics platform around vmagent, dual‑active HA clusters, query push‑down, high‑cardinality governance, and a multi‑cloud active‑active design, delivering a ten‑fold improvement in collection performance, up to 70× more query capacity, large savings in CPU, memory, and storage, and fully automated scaling.
In the cloud‑native era, metrics are the cornerstone of monitoring, alerting, performance tuning, and capacity planning. Rapid business growth pushed Xiaohongshu’s metric volume to billions of data points, exposing stability, cost, and operational problems in the original Prometheus‑based architecture.
The team performed a comprehensive overhaul, achieving ten‑fold improvements in collection and query performance, dramatically reducing CPU, memory, and storage costs, and delivering minute‑level scaling and high availability.
1. Background and Problems
The single‑replica architecture caused data loss and high latency during traffic spikes.
Prometheus instances routinely exceeded 85% memory utilization, triggering frequent alerts.
Deployment was fragmented across many K8s clusters and VMs, leading to complex operations and lack of rapid scaling.
Slow query responses and the fallback path through Thanos made for a poor user experience.
2. Evolution Roadmap
2.1 Collection
Replaced Prometheus agents with vmagent and eliminated Thanos.
Implemented dynamic configuration via a central config center, enabling hot‑reload without service disruption.
Supported sharding and smooth horizontal scaling by adjusting a shard_count parameter.
Added sample‑limit and label‑length validation to protect the collector from time‑series explosion.
Optimized start‑up and deletion of large numbers of scrape targets, reducing CPU spikes and OOM risk.
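The sample‑limit and label‑length guardrails described above map onto Prometheus‑compatible scrape‑configuration options that vmagent also honors. A minimal sketch (the job name and limit values here are illustrative, not Xiaohongshu’s actual settings):

```yaml
scrape_configs:
  - job_name: app-pods            # illustrative job name
    kubernetes_sd_configs:
      - role: pod
    sample_limit: 50000           # fail the scrape if a target exposes more series
    label_limit: 64               # cap the number of labels per series
    label_name_length_limit: 128  # reject oversized label names
    label_value_length_limit: 512 # reject oversized label values
```

Limits like these let the collector reject a misbehaving target outright instead of absorbing a time‑series explosion into memory.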
2.2 High‑Availability Refactor
Deployed full‑link dual‑active clusters for both collection and storage, providing fault tolerance for single‑node or whole‑cluster failures.
Replaced Thanos‑style Reroute with a local queue mechanism and a Meta service for service discovery, eliminating cascade failures.
Introduced cloud‑disk storage, backup/restore snapshots, and automated migration to ensure data safety.
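The local‑queue idea corresponds to vmagent’s built‑in persistent buffering: when a remote storage cluster is unreachable, samples spill to local disk rather than being rerouted through other collectors. A sketch of a dual‑active launch configuration (hostnames, paths, and sizes are illustrative):

```shell
# Illustrative vmagent launch: every sample is written to two independent
# storage clusters (dual-active); on remote failure, data buffers on
# local disk instead of cascading to other collectors.
vmagent \
  -promscrape.config=/etc/vmagent/scrape.yml \
  -remoteWrite.url=http://vmstorage-a.example.internal:8480/insert/0/prometheus \
  -remoteWrite.url=http://vmstorage-b.example.internal:8480/insert/0/prometheus \
  -remoteWrite.tmpDataPath=/var/lib/vmagent/queue \
  -remoteWrite.maxDiskUsagePerURL=10GiB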
2.3 Query Optimization
Implemented computation push‑down: aggregation (sum, count, avg, max, min) is performed on storage nodes, cutting data transfer by orders of magnitude.
Added query‑data‑size limits and memory protection on both query and storage nodes.
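The core of push‑down is that each storage node collapses its raw samples into a small partial aggregate, and the query node only merges those partials. The subtlety is that avg cannot be pushed down directly; it must travel as a (sum, count) pair. A minimal sketch (function names are illustrative, not Xiaohongshu's actual API):

```python
def partial_aggregate(samples):
    """Runs on a storage node: collapse raw samples into one small record."""
    return {
        "sum": sum(samples),
        "count": len(samples),
        "max": max(samples),
        "min": min(samples),
    }

def merge_partials(partials):
    """Runs on the query node: merge per-node partials into final answers."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {
        "sum": total,
        "count": count,
        "avg": total / count,  # avg is derived from merged sum/count
        "max": max(p["max"] for p in partials),
        "min": min(p["min"] for p in partials),
    }

# Two storage nodes aggregate locally; only four numbers per node cross
# the network instead of every raw sample.
node_a = partial_aggregate([1.0, 2.0, 3.0])
node_b = partial_aggregate([10.0, 20.0])
result = merge_partials([node_a, node_b])
```

Since each node's contribution is constant-sized regardless of how many samples it scanned, the data transferred shrinks by orders of magnitude as the source describes.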
2.4 High Cardinality Governance
Adopted a differential dynamic cardinality management strategy with label white‑lists and Bloom‑filter based detection.
Provided a separate high‑cardinality pipeline to isolate noisy metrics.
2.5 Multi‑Cloud Active‑Active
Deployed unit‑level collection, ingestion, and storage per cloud region, reducing cross‑cloud bandwidth by ~80%.
Unified query across regions with push‑down aggregation, ensuring resilience when a single cloud experiences outages.
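The unified-query behavior above can be sketched as a fan-out that merges per-region partial aggregates and skips unreachable regions instead of failing the whole query (region names and the fetch function below are hypothetical):

```python
def fan_out_sum(regions, fetch):
    """Query every region, merge partial sums, tolerate region outages."""
    total, answered = 0.0, []
    for region in regions:
        try:
            total += fetch(region)   # region-local partial aggregate
            answered.append(region)
        except ConnectionError:
            continue                 # degrade gracefully on a region outage
    if not answered:
        raise RuntimeError("all regions unavailable")
    return total, answered

def fake_fetch(region):
    """Stand-in for a region-local query endpoint."""
    data = {"cloud-a": 100.0, "cloud-b": 250.0}
    if region not in data:
        raise ConnectionError(region)
    return data[region]

# cloud-c is down; the query still answers from the surviving regions.
total, answered = fan_out_sum(["cloud-a", "cloud-b", "cloud-c"], fake_fetch)
```

Because each region ships only its partial aggregate, cross-cloud traffic stays small and the loss of one cloud degrades coverage rather than availability.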
3. Results
Collection performance improved ~10×, saving tens of thousands of CPU cores and hundreds of TB of SSD.
Query latency dropped by anywhere from several‑fold to dozens of times, and query capacity increased by up to 70×.
Operational overhead dropped dramatically: configuration changes are now made self‑service through a web console, scaling is automated, and migrations that once took weeks now finish in half a day.
4. Authors
Han Bai, Bu Ke, and A Pu – members of Xiaohongshu’s Observability Technology Group, with backgrounds in distributed systems, cloud‑native infrastructure, and performance engineering.
Xiaohongshu Tech (REDtech) is the official account of the Xiaohongshu technology team, sharing technical innovations and engineering insights.