
Scaling CAT Monitoring at Ctrip: Thread Model, Client Computation & Memory Tweaks

This article details how Ctrip optimized the CAT monitoring system—covering its large‑scale deployment, thread‑model redesign, offloading calculations to clients, double‑buffered reporting, and string handling improvements—to dramatically cut CPU usage, GC pressure, and memory consumption while handling billions of messages daily.

dbaplus Community

Background and Deployment

CAT, an open‑source trace‑based monitoring platform originally from Dianping, was introduced at Ctrip in 2015. It now supports over 70,000 client instances, processing more than 8,000 billion messages and 900 TB of monitoring traffic per day.

The system aggregates data into MessageTree structures, distributes them to multiple real‑time report analyzers, and stores results in in‑memory reports.

Case 1 – Thread‑Model Optimization

Initially each report analyzer owned its own thread, leading to massive thread counts, high context‑switch rates, and CPU saturation as traffic grew. The team decoupled queues from threads by introducing a selector‑based NIO‑style model:

Queues are listened to by a lightweight selector rather than a dedicated thread.

A single thread pool, sized to the number of CPU cores, processes all analyzers.

Priority scheduling ensures high‑importance reports receive more resources under load.

After the change, thread count dropped dramatically, CPU usage fell from >90 % to ~70 %, and data loss decreased to 5 %.
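
The decoupling of queues from threads can be sketched roughly as follows. This is a minimal illustration, not CAT's actual code: the names AnalyzerSlot and QueueSelector, the String message type, the batch size of 64, and the 1 ms idle back-off are all assumptions made for the example, and the "selector" here is approximated by workers scanning the queues in priority order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;
import java.util.function.Consumer;

/** Hypothetical analyzer slot: a bounded queue plus a handler, with no thread of its own. */
final class AnalyzerSlot {
    final String name;
    final int priority;                                   // higher value = more important report
    final BlockingQueue<String> queue;
    final Consumer<String> handler;
    final AtomicBoolean busy = new AtomicBoolean(false);  // at most one worker per slot at a time
    final AtomicLong dropped = new AtomicLong();          // messages lost because the queue was full

    AnalyzerSlot(String name, int priority, int capacity, Consumer<String> handler) {
        this.name = name;
        this.priority = priority;
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.handler = handler;
    }

    /** Non-blocking enqueue; a full queue drops the message and counts the loss. */
    boolean offer(String message) {
        boolean accepted = queue.offer(message);
        if (!accepted) dropped.incrementAndGet();
        return accepted;
    }
}

/** Hypothetical selector: one CPU-sized pool serves every analyzer queue. */
final class QueueSelector {
    private static final int BATCH = 64;                  // assumed batch size
    private final List<AnalyzerSlot> slots;               // sorted, highest priority first
    private final ExecutorService pool;
    private volatile boolean running = true;

    QueueSelector(List<AnalyzerSlot> analyzers) {
        this.slots = new ArrayList<>(analyzers);
        this.slots.sort(Comparator.comparingInt((AnalyzerSlot s) -> s.priority).reversed());
        int workers = Runtime.getRuntime().availableProcessors();
        this.pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(this::workerLoop);
        }
    }

    private void workerLoop() {
        List<String> batch = new ArrayList<>(BATCH);
        while (running) {
            if (!drainOneBatch(batch)) {
                LockSupport.parkNanos(1_000_000L);        // nothing ready: back off for about 1 ms
            }
        }
    }

    /** Scans slots in priority order and processes one batch from the first ready slot. */
    private boolean drainOneBatch(List<String> batch) {
        for (AnalyzerSlot slot : slots) {
            if (slot.queue.isEmpty() || !slot.busy.compareAndSet(false, true)) {
                continue;                                 // empty, or another worker owns it
            }
            try {
                batch.clear();
                slot.queue.drainTo(batch, BATCH);
                for (String message : batch) {
                    slot.handler.accept(message);
                }
            } finally {
                slot.busy.set(false);
            }
            if (!batch.isEmpty()) return true;
        }
        return false;
    }

    void shutdown() {
        running = false;
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because workers always scan from the highest priority downward, important reports are served first when the pool is saturated. The busy flag keeps each analyzer effectively single-threaded, so handlers in this sketch need no extra locking.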

Case 2 – Client‑Side Computation

To further relieve the server, aggregation of Transaction and Event reports was moved to the client. Clients pre‑aggregate metrics and send compact summaries instead of raw MessageTree data.

Results:

Server‑side Transaction report threads now consume only 0.02 CPU cores each (down from 0.8‑0.9).

Event report threads dropped to 0.01 CPU cores each.

Overall CPU consumption for these reports fell from ~7.5 cores to a fraction of a core.

Client impact is minimal: memory usage stays under 10 MB, and CPU overhead is negligible because each client aggregates only a few seconds of data before transmitting it.
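
As a rough sketch of what client-side pre-aggregation might look like, the class below keeps one small summary per (type, name) pair and emits compact lines on flush. The class name ClientAggregator, the tab-separated line format, and the choice of fields (count, failures, total duration) are assumptions for illustration; the source does not describe CAT's actual wire format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical client-side aggregator: keeps summaries instead of raw messages. */
final class ClientAggregator {
    /** Running totals for one (type, name) pair within the current window. */
    private static final class Stat {
        long count;
        long failures;
        long totalMicros;
    }

    // TreeMap only so that flush() output is in a stable order for the example.
    private Map<String, Stat> window = new TreeMap<>();

    /** Called for every completed transaction; cost is one map lookup and three additions. */
    synchronized void record(String type, String name, long durationMicros, boolean success) {
        Stat stat = window.computeIfAbsent(type + "\t" + name, key -> new Stat());
        stat.count++;
        if (!success) stat.failures++;
        stat.totalMicros += durationMicros;
    }

    /**
     * Called every few seconds by a reporting timer (not shown).
     * Returns one line per (type, name): type, name, count, failures, totalMicros.
     */
    synchronized List<String> flush() {
        List<String> lines = new ArrayList<>(window.size());
        for (Map.Entry<String, Stat> entry : window.entrySet()) {
            Stat stat = entry.getValue();
            lines.add(entry.getKey() + "\t" + stat.count + "\t" + stat.failures + "\t" + stat.totalMicros);
        }
        window = new TreeMap<>();   // start a fresh window
        return lines;
    }
}
```

The server then only has to merge a handful of summary lines per client per window, rather than walking every raw message, which is consistent with the CPU drop reported above.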

Case 3 – Double‑Buffer Report Strategy

Hourly report objects caused frequent Young‑GC and occasional Old‑GC spikes. The fix was to create two permanent report buffers that alternate each hour, so they stay in the Old generation, and to add a timed cleanup for stale entries.

Effect:

Young‑GC frequency fell by ~40 %.

Full GC dropped from ~20 times/day to ~3 times/day.
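
The double-buffer idea can be illustrated with a small sketch. Everything here is assumed for the example: the class name DoubleBufferedReport, a report reduced to a per-name counter, and the eviction rule based on idle hours. Java code cannot place objects in the Old generation directly; the point is only that objects which are never replaced get promoted once and then stay there.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical hourly report backed by two long-lived buffers that alternate. */
final class DoubleBufferedReport {
    static final long HOUR_MILLIS = 3_600_000L;

    /** Counter cell that is reset in place each time its buffer is reused. */
    private static final class Cell {
        long count;
        long lastUsedHour;
    }

    private static final class Buffer {
        final Map<String, Cell> cells = new HashMap<>();
        long hour = -1;                       // which hour this buffer currently holds
    }

    // Allocated once; even hours use buffers[0], odd hours use buffers[1].
    private final Buffer[] buffers = { new Buffer(), new Buffer() };

    synchronized void add(String name, long timestampMillis) {
        long hour = timestampMillis / HOUR_MILLIS;
        Buffer buffer = buffers[(int) (hour & 1)];
        if (hour < buffer.hour) return;       // too late: the slot already holds a newer hour
        if (hour > buffer.hour) {
            for (Cell cell : buffer.cells.values()) {
                cell.count = 0;               // reuse existing cells instead of reallocating
            }
            buffer.hour = hour;
        }
        Cell cell = buffer.cells.computeIfAbsent(name, key -> new Cell());
        cell.count++;
        cell.lastUsedHour = hour;
    }

    synchronized long count(String name, long timestampMillis) {
        long hour = timestampMillis / HOUR_MILLIS;
        Buffer buffer = buffers[(int) (hour & 1)];
        if (buffer.hour != hour) return 0;    // that hour is no longer (or not yet) held
        Cell cell = buffer.cells.get(name);
        return cell == null ? 0 : cell.count;
    }

    /** Timed cleanup: removes names not seen for more than maxIdleHours. */
    synchronized int evictStale(long nowMillis, long maxIdleHours) {
        long nowHour = nowMillis / HOUR_MILLIS;
        int removed = 0;
        for (Buffer buffer : buffers) {
            Iterator<Cell> it = buffer.cells.values().iterator();
            while (it.hasNext()) {
                if (nowHour - it.next().lastUsedHour > maxIdleHours) {
                    it.remove();
                    removed++;
                }
            }
        }
        return removed;
    }
}
```

Without the cleanup step, names that appeared once and never again would accumulate in the permanent buffers, which is presumably why the timed eviction accompanies the design.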

Case 4 – String‑Handling Optimizations

Deserialization of the four string fields (type, name, status, data) in each Transaction/Event incurred heavy allocation and charset decoding costs. The team:

Decoded data and status lazily, and transmitted a success/failure flag.

Replaced per‑field string creation with a BytesWrapper that references the original byte array, allowing direct byte‑level comparison.

Implemented a custom BytesHashMap that uses offset/length keys, eliminating temporary string objects entirely.

These changes cut Young‑GC by another 40 % and reduced memory pressure from string allocations.
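
A byte-range wrapper of the kind described above might look like the following sketch. Only the name BytesWrapper comes from the text; the fields, the hashing scheme, and the lazy UTF-8 decoding in toString() are assumptions, and the custom BytesHashMap, which avoids even the wrapper allocation by taking offset and length directly, is not shown.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of a view over a slice of a received byte array; no String is created up front. */
final class BytesWrapper {
    private final byte[] buffer;   // shared with the network frame, never copied
    private final int offset;
    private final int length;
    private final int hash;        // computed once over the slice
    private String decoded;        // filled in only if someone asks for text

    BytesWrapper(byte[] buffer, int offset, int length) {
        if (offset < 0 || length < 0 || offset > buffer.length - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
        int h = 1;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + buffer[i];
        }
        this.hash = h;
    }

    /** Byte-level comparison: no charset decoding and no temporary strings. */
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof BytesWrapper)) return false;
        BytesWrapper that = (BytesWrapper) other;
        if (length != that.length || hash != that.hash) return false;
        for (int i = 0; i < length; i++) {
            if (buffer[offset + i] != that.buffer[that.offset + i]) return false;
        }
        return true;
    }

    @Override
    public int hashCode() {
        return hash;
    }

    /** Lazy decode, assuming UTF-8; the cost is paid only when the text is needed. */
    @Override
    public String toString() {
        if (decoded == null) {
            decoded = new String(buffer, offset, length, StandardCharsets.UTF_8);
        }
        return decoded;
    }
}
```

One caveat of this approach is aliasing: the wrapper is valid only while the underlying array is left unmodified, so a receive buffer must not be reused while wrappers that point into it are still held as map keys.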

Overall Reflections

The optimizations illustrate a systematic approach to performance problems: eliminate unnecessary work (thread‑model redesign, client offloading), reuse long‑lived data structures to avoid frequent allocations (double buffering), and reduce object creation (string handling). The result is a more stable, low‑latency monitoring platform capable of handling massive real‑time traffic.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
