
How Xiaohongshu Scaled Its Metrics System Tenfold with Cloud‑Native Architecture

Facing exploding metric volumes, high resource consumption, and fragile operations, Xiaohongshu's observability team rebuilt its metrics pipeline on VictoriaMetrics. The rebuild delivered roughly ten‑fold performance gains, minute‑level scaling, high availability, lower cost, and a multi‑cloud active‑active deployment, while preserving data safety and query speed.

dbaplus Community

Jan 2, 2024

Background

In the cloud‑native era, Xiaohongshu’s observability team needed to upgrade its metrics system. The original Prometheus‑based stack could not keep up with the rapidly growing volume of metrics (billions per day) and suffered from high resource usage, poor stability, and operational complexity.

Evolution Overview

Over the past year the team rebuilt the entire pipeline (collection, high availability, query optimisation, high‑cardinality handling and multi‑cloud active‑active deployment) using VictoriaMetrics (VMS) as the core storage and processing engine.

Key Improvements

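Initial Cloud‑Native Refactoring: Migrated most VictoriaMetrics instances from virtual machines to containers and deployed them through a release platform, so that changes are made through a UI ("white‑screen" operations) rather than by hand on the command line.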

Collection Layer Redesign: Replaced Prometheus agents with vmagent, introduced a configuration centre for dynamic reloads, added sharding (Shard_count) so that collection scales horizontally, and enforced sample limits and metric/label length validation to protect against time‑series explosion.
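
The sharding and validation ideas in this item can be sketched roughly as follows. This is a minimal illustration rather than vmagent's actual implementation: the function names, the hashing choice, and every limit value are assumptions made for the example.

```python
import hashlib

SHARD_COUNT = 4            # hypothetical number of collector shards
SAMPLE_LIMIT = 3           # hypothetical per-scrape cap (tiny, for the demo)
MAX_LABEL_NAME_LEN = 64    # hypothetical length limits
MAX_LABEL_VALUE_LEN = 256


def shard_for(target, shard_count=SHARD_COUNT):
    """Map a scrape target to exactly one shard using a stable hash."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def validate_scrape(samples, sample_limit=SAMPLE_LIMIT):
    """Reject an oversized scrape, then drop samples with overlong labels.

    Each sample is represented only by its label set (a dict of str -> str).
    """
    if len(samples) > sample_limit:
        raise ValueError(f"scrape exceeds sample limit of {sample_limit}")
    return [
        labels for labels in samples
        if all(len(name) <= MAX_LABEL_NAME_LEN and len(value) <= MAX_LABEL_VALUE_LEN
               for name, value in labels.items())
    ]
```

Each collector would scrape only the targets whose `shard_for` value equals its own shard index, so adding shards spreads the load without any coordination between collectors.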

Performance Optimisations: Added delayed start for massive scrape targets, introduced object pools for deletions, and tuned the Go runtime memory limit (SetMemoryLimit), which halved CPU usage and reduced OOM risk.

Standardised K8s Metrics Collection: Deployed cAdvisor, node_exporter and kube‑state‑metrics as sharded StatefulSets, removed the dedicated per‑cluster Prometheus instances, and used vmagent to scrape the exporters instead.

Host Monitoring Revamp: Replaced Consul‑based discovery with a custom registration service, introduced a configuration platform for self‑service metric collection, and cut support tickets from more than two per day to fewer than one every two months.

SDK Enhancements: Added a meter cache, dynamic metric disabling, and low‑frequency series cleanup to the Micrometer‑based Prometheus SDK, reducing CPU overhead and enabling rapid emergency mitigation.

High‑Availability Refactor: Adopted full‑link dual‑active deployment for both collection and storage clusters, introduced local queues that buffer writes while storage is unavailable, and built a Meta service for service discovery and query routing.
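
The local-queue behaviour can be pictured with a small sketch. It assumes an in-memory queue and a hypothetical `send` callable that raises `ConnectionError` when storage is down; the article does not describe the real queue's format or whether it is persisted to disk.

```python
from collections import deque


class BufferedWriter:
    """Forward samples to storage; hold them locally while storage is down."""

    def __init__(self, send, max_queued=10_000):
        self._send = send                        # hypothetical: raises ConnectionError on failure
        self._queue = deque(maxlen=max_queued)   # when full, the oldest samples are dropped

    def write(self, sample):
        """Enqueue a sample and try to drain the queue immediately."""
        self._queue.append(sample)
        return self.flush()

    def flush(self):
        """Send queued samples in order; stop at the first failure."""
        while self._queue:
            try:
                self._send(self._queue[0])
            except ConnectionError:
                return False                     # keep the sample for the next attempt
            self._queue.popleft()
        return True

    def pending(self):
        return len(self._queue)
```

A sample leaves the queue only after `send` succeeds, so an outage delays data rather than losing it, up to the queue's capacity.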

Node‑Level Adjustments: Switched the hash key from node IP to instance name so that time‑series placement stays stable when IPs change, and added gradual (grey‑scale) expansion for storage scaling.
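
Why the hash key matters can be shown with a toy placement function. The sketch below uses rendezvous hashing purely for illustration; the article does not say which hashing scheme is used, and the node names and IPs are made up.

```python
import hashlib


def _score(series, node_identity):
    """Deterministic pseudo-random score for a (node, series) pair."""
    digest = hashlib.sha256(f"{node_identity}|{series}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(series, nodes, identity_by="name"):
    """Return the name of the node that should store `series`.

    `nodes` maps instance name -> current IP. `identity_by` selects which of
    the two is fed into the hash: "name" or "ip".
    """
    def identity(name):
        return name if identity_by == "name" else nodes[name]

    return max(nodes, key=lambda name: _score(series, identity(name)))


# Hypothetical cluster: one node is rescheduled and comes back with a new IP.
before = {"vmstorage-0": "10.0.0.1", "vmstorage-1": "10.0.0.2", "vmstorage-2": "10.0.0.3"}
after = dict(before)
after["vmstorage-1"] = "10.0.9.9"
```

With `identity_by="name"`, `pick_node` returns the same node for every series under `before` and `after`, because the IP never enters the hash. With `identity_by="ip"`, the rescheduled node's scores all change, so some series would be reassigned.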

Data Safety: Migrated storage from local disks to cloud disks, implemented snapshot‑based backup/restore with hourly, daily and monthly retention, and used the backup mechanism to accelerate cross‑cloud migrations, cutting migration time from 30 days to half a day.
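
A tiered retention policy of this kind might be expressed as below. The article gives no retention windows, so the defaults here (24 hours, 7 days, 6 months) and the function name are purely illustrative assumptions.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(snapshots, now, hourly_hours=24, daily_days=7, monthly_months=6):
    """Return the set of snapshot timestamps retained by a three-tier policy."""
    keep = set()
    newest_per_day = {}
    newest_per_month = {}
    for ts in sorted(snapshots):
        if timedelta(0) <= now - ts <= timedelta(hours=hourly_hours):
            keep.add(ts)                            # hourly tier: keep everything recent
        newest_per_day[ts.date()] = ts              # sorted input: last write is the newest
        newest_per_month[(ts.year, ts.month)] = ts
    for day, ts in newest_per_day.items():
        if 0 <= (now.date() - day).days < daily_days:
            keep.add(ts)                            # daily tier: newest snapshot of each day
    for (year, month), ts in newest_per_month.items():
        if 0 <= (now.year - year) * 12 + (now.month - month) < monthly_months:
            keep.add(ts)                            # monthly tier: newest snapshot of each month
    return keep
```

Anything not in the returned set is a candidate for deletion, which keeps recent history dense and older history sparse.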

Query Optimisation: Pushed aggregation (sum, count, avg, max, min) down to storage nodes and introduced query data‑size limits and memory protection, making queries several times to dozens of times faster and extending the queryable range 70× for high‑traffic services.
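
The gain from push-down comes from shipping small partial results instead of raw samples. A minimal sketch, with hypothetical names and assuming each node holds at least one sample for the queried series:

```python
from dataclasses import dataclass


@dataclass
class Partial:
    total: float
    count: int
    maximum: float
    minimum: float


def aggregate_on_node(values):
    """Runs on a storage node: reduce raw samples to one small partial result."""
    return Partial(sum(values), len(values), max(values), min(values))


def merge(partials):
    """Runs on the query node: combine per-node partials into final aggregates."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return {
        "sum": total,
        "count": count,
        "avg": total / count,   # avg must be derived from sum and count, not averaged per node
        "max": max(p.maximum for p in partials),
        "min": min(p.minimum for p in partials),
    }
```

However many samples a node holds, it returns four numbers, so the query node's memory use and the network traffic no longer grow with the raw data volume.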

High‑Cardinality Management: Deployed a pluggable Label_manage module in vminsert that detects, samples and caps label cardinality using Bloom filters and white‑list/black‑list strategies, and isolated high‑cardinality streams to protect the main storage.
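
One way a Bloom filter can support a cardinality cap is sketched below. This is not the Label_manage module itself, whose internals the article does not describe: the class names, filter size, and the exact cap and white-list semantics are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Compact set membership with no false negatives and rare false positives."""

    def __init__(self, size_bits=1 << 16, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class LabelLimiter:
    """Cap the number of distinct values accepted per label name."""

    def __init__(self, max_values_per_label, whitelist=()):
        self.max_values = max_values_per_label
        self.whitelist = set(whitelist)   # labels exempt from the cap
        self.seen = BloomFilter()
        self.counts = {}

    def accept(self, label, value):
        if label in self.whitelist:
            return True
        key = f"{label}={value}"
        if key in self.seen:
            return True                   # already counted (or a rare false positive)
        if self.counts.get(label, 0) >= self.max_values:
            return False                  # new value beyond the cap: reject, do not record
        self.seen.add(key)
        self.counts[label] = self.counts.get(label, 0) + 1
        return True
```

The Bloom filter keeps memory use fixed regardless of how many label values arrive, at the cost of occasionally letting a new value through when it collides with one already seen.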

Cross‑Cloud Multi‑Active: Deployed collection and storage units per cloud region and ran joint queries across regions. With computation push‑down, inter‑region bandwidth fell by about 80% while fault isolation was preserved.

Results and Outlook

After more than a year of work, Xiaohongshu now operates around 30 VMS data sources, each handling up to a billion metrics per collection cycle. Total CPU consumption has dropped by tens of thousands of cores, and storage by hundreds of terabytes. Stability has improved dramatically: OOM incidents are rare, high availability is in place with minute‑level scaling, and queries run several times to dozens of times faster. Future plans include longer‑term metric retention, capacity‑driven auto‑scaling, and keeping in step with upstream open‑source releases.

