Design and Implementation of a Continuous Performance Optimization and Tracking Platform for Xiaohongshu Services

To curb rising resource costs as Xiaohongshu scales, engineers built a Continuous Performance Optimization & Tracking Platform that continuously profiles services, stores parsed profiling data in ClickHouse, applies diff analysis to detect even tiny regressions automatically, and links them to code changes. It has already saved more than 10,000 CPU cores and flagged regressions affecting roughly another 10,000 across search, recommendation, and advertising workloads.

Xiaohongshu Tech REDtech

As Xiaohongshu's business has grown rapidly, resource consumption and cost pressure have risen sharply. To cut costs and improve efficiency, the team built the Continuous Performance Optimization & Tracking Platform, which helps product teams diagnose and solve performance problems systematically and continuously monitors for performance degradation as business systems evolve.

The platform currently covers the S0 services of Xiaohongshu search, recommendation, and advertising. After more than two months of operation, it has helped teams save over 10,000 CPU cores in existing workloads and has identified performance regressions affecting another 10,000 CPU cores, which are queued for further optimization.

Key challenges that motivated the platform include the lack of low‑learning‑cost tools for developers to analyze their own modules, the reliance on personal experience for performance analysis, and the difficulty of detecting small, incremental regressions (e.g., a 1% CPU increase per commit) that are hidden from regular load‑testing.
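A quick back-of-the-envelope calculation shows why such small increments matter. The sketch below assumes, purely for illustration, that each of 50 consecutive commits adds 1% CPU on top of the previous one; the commit count and the function name `cumulative_growth` are hypothetical and not taken from the platform.

```python
def cumulative_growth(per_commit: float, commits: int) -> float:
    """Total relative CPU increase after `commits` compounding increments."""
    return (1 + per_commit) ** commits - 1


# Hypothetical scenario: 50 commits, each adding 1% CPU.
growth = cumulative_growth(0.01, 50)
print(f"cumulative CPU increase: {growth:.1%}")
```

Under these assumptions the cumulative increase is roughly 64%, even though no single step would stand out in a routine load test.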

The platform focuses on three problem areas:

Existing performance optimization: deep, systematic analysis, diagnosis, and optimization of current services, with knowledge sharing to enable generic optimizations.

Incremental performance regression interception: proactive detection of resource-usage increases caused by frequent business iterations.

Performance stability issues: rapid localization of sudden performance degradations.

Technical approach:

Data collection: low‑frequency, continuous profiling of processes (On‑CPU and Off‑CPU) using tools such as Linux perf for C++, async‑profiler for Java, and pprof for Go. Sampling can be timed, on‑demand, or condition‑triggered.

Data processing & storage: parsed samples are stored in ClickHouse along with metadata (application name, version, data center, etc.). Flame graphs are generated per pod and saved in object storage (e.g., Tencent COS).

Analysis: merge‑plus‑diff techniques are applied vertically (different versions) and horizontally (across applications) to pinpoint regression points. Custom diff flame graphs highlight affected functions and associated CPU core consumption.

Automation: daily automated inspections detect potential regressions and push alerts to internal groups; integration with QA pipelines and release platforms enables early interception.

Traceability: regression points are linked to change events (code commits, experiment flags, configuration updates) to facilitate root‑cause analysis.
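The merge-plus-diff idea in the analysis step can be illustrated with a minimal sketch. It assumes profiles are available in the folded-stack text format (`frame;frame;frame count`) and that the total CPU cores used by each version are known from monitoring. The function names, the inclusive attribution of samples to functions, and the `min_cores` threshold are illustrative assumptions, not the platform's actual implementation.

```python
from collections import Counter


def parse_folded(text):
    """Parse folded-stack lines ('a;b;c 12') into a Counter keyed by stack."""
    stacks = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        stacks[stack] += int(count)
    return stacks


def merge(profiles):
    """Merge per-pod profiles of one version into a single Counter."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return total


def function_share(stacks):
    """Fraction of samples in which each function appears on the stack."""
    total = sum(stacks.values())
    if total == 0:
        return {}
    hits = Counter()
    for stack, count in stacks.items():
        for func in set(stack.split(";")):  # count recursive frames once
            hits[func] += count
    return {func: n / total for func, n in hits.items()}


def diff_cores(base, new, base_cores, new_cores, min_cores=0.0):
    """Estimate the per-function change in CPU cores between two versions."""
    b, n = function_share(base), function_share(new)
    deltas = {}
    for func in set(b) | set(n):
        delta = n.get(func, 0.0) * new_cores - b.get(func, 0.0) * base_cores
        if abs(delta) >= min_cores:
            deltas[func] = delta
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: two pods of the old version, one pod of the new version.
old = merge([
    parse_folded("main;rank;score 30\nmain;rank;parse 20"),
    parse_folded("main;rank;score 30\nmain;rank;parse 20"),
])
new = parse_folded("main;rank;score 60\nmain;rank;parse 40\nmain;rank;log 25")

# Assumed totals from monitoring: 100 cores before, 125 cores after.
changes = diff_cores(old, new, base_cores=100, new_cores=125, min_cores=1.0)
for func, delta in changes:
    print(f"{func}: {delta:+.1f} cores")
```

Converting sample shares into cores makes the output directly comparable across services of different sizes, and a threshold such as `min_cores` is one simple way a daily inspection could decide which changes are worth an alert.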

Continuous optimization workflow:

Detect regression at function level.

Correlate with change events.

Track the regression lifecycle, push optimizations, and verify resolution.
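The workflow above can be sketched as a small data model. This is a simplified illustration under stated assumptions: the `ChangeEvent` and `Regression` classes, the 24-hour look-back window, the status names, and the verification tolerance are all hypothetical and are not described in the original article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    kind: str        # "commit", "experiment", or "config"
    ref: str         # hypothetical identifier of the change
    service: str
    at: datetime     # when the change went live


@dataclass
class Regression:
    service: str
    function: str
    delta_cores: float
    first_seen: datetime
    status: str = "open"


def candidate_causes(reg, events, window=timedelta(hours=24)):
    """Changes to the same service that landed within `window` before the
    regression was first observed, newest first."""
    hits = [
        e for e in events
        if e.service == reg.service
        and timedelta(0) <= reg.first_seen - e.at <= window
    ]
    return sorted(hits, key=lambda e: e.at, reverse=True)


def verify(reg, current_delta_cores, tolerance=0.5):
    """Mark the regression verified once the re-measured extra cost is
    within `tolerance` cores; otherwise leave its status unchanged."""
    if current_delta_cores <= tolerance:
        reg.status = "verified"
    return reg.status


# Hypothetical example data.
reg = Regression("search", "rank::score", 120.0, datetime(2024, 1, 10, 12))
events = [
    ChangeEvent("commit", "abc123", "search", datetime(2024, 1, 10, 9)),
    ChangeEvent("config", "cfg-7", "search", datetime(2024, 1, 8, 9)),
    ChangeEvent("commit", "def456", "ads", datetime(2024, 1, 10, 10)),
    ChangeEvent("experiment", "exp-1", "search", datetime(2024, 1, 10, 11)),
]

causes = candidate_causes(reg, events)
print([e.ref for e in causes])

status_before_fix = verify(reg, current_delta_cores=118.0)
status_after_fix = verify(reg, current_delta_cores=0.2)
```

Filtering by service and time window only narrows the list of suspects; confirming the actual cause would still rely on the function-level diff and on re-measuring after the fix ships.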

Future directions include exploring additional optimization techniques such as Profile‑Guided Optimization (PGO), leveraging PMU metrics for large‑page memory usage, and expanding the metric set to cover wall‑time, CPU cache, memory binding, and more, thereby broadening the coverage of performance analysis scenarios.
