How WeChat Reduced Query Latency from 1000ms to 100ms in Its Multi‑Dimensional Monitoring Platform

This article explains how the WeChat multi‑dimensional monitoring platform, which processes billions of data points daily, identified performance bottlenecks in its Druid‑based data layer and applied sub‑query splitting, Redis caching, and sub‑dimension tables to achieve over 85% cache hit rate and bring average query time down to around 100 ms.

ITPUB

Jul 16, 2023

Background

WeChat's multi-dimensional monitoring platform provides flexible data reporting and real-time cross-dimensional analysis for millions of users. It aggregates up to 45 billion events per minute and handles 40 million queries per minute. Under this load, average query latency exceeded 1000 ms and the query failure rate was high.

Optimization Analysis

User Query Behavior

Analysis showed that more than 99% of queries are time‑series requests, with 90% targeting data older than one day. This pattern creates many redundant scans of historical data.

Data‑Layer Architecture

The platform uses Apache Druid for storage and OLAP queries. The deployment consists of Master (Overlord and Coordinator), Real-time (MiddleManager and Peon), and Storage (Historical, Deep Storage, and Metadata Storage) nodes. Broker nodes receive each query and fan it out to the nodes that hold the relevant data.

Root Causes of Slowness

Segments store 2‑4 hour data slices; queries spanning many slices trigger excessive I/O.

Large time ranges cause MiddleManager and Historical nodes to time out and increase broker memory usage.

High‑cardinality dimensions (>1 million combinations) lead to heavy segment scans.
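To give a rough sense of the first root cause, the sketch below estimates how many time slices a single query can overlap. It is only back-of-the-envelope arithmetic under the 2 to 4 hour slice size stated above. The function name `slices_touched` is hypothetical, and the real segment count per slice depends on sharding, which the article does not describe.

```python
import math


def slices_touched(range_hours: float, slice_hours: float) -> int:
    """Upper bound on the time slices overlapped by a query of this length.

    A range that is not aligned to slice boundaries can straddle one extra
    slice, hence the +1.
    """
    return math.ceil(range_hours / slice_hours) + 1


# A 7-day query over 2-hour and 4-hour slices.
week_hours = 7 * 24
print(slices_touched(week_hours, 2))  # 85
print(slices_touched(week_hours, 4))  # 43
```

Each of those slices has to be located and scanned, so the I/O cost grows with the queried time range even when the result is a single time series.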

Optimization Design

1. Split Sub‑Queries

Each large query is decomposed into finer‑grained sub‑queries (e.g., one‑day or one‑hour intervals) and dispatched to multiple brokers for parallel execution, reducing per‑broker load.
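The splitting step can be sketched as follows. This is a minimal illustration and not the platform's actual code: it assumes day-aligned splitting and the `YYYY-MM-DD HH:MM` timestamp format used in the sample request in this article. The names `split_by_day`, `run_parallel`, and `run_sub_query` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"


def split_by_day(start: str, end: str) -> list[tuple[str, str]]:
    """Split [start, end) into sub-ranges that break at midnight."""
    cur = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    parts = []
    while cur < stop:
        next_midnight = (cur + timedelta(days=1)).replace(hour=0, minute=0)
        nxt = min(next_midnight, stop)
        parts.append((cur.strftime(FMT), nxt.strftime(FMT)))
        cur = nxt
    return parts


def run_parallel(parts, run_sub_query, max_workers=4):
    """Run one sub-query per range concurrently; results keep the input order.

    run_sub_query is a caller-supplied placeholder for the call to a broker.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_sub_query, parts))


for part in split_by_day("2020-04-15 13:23", "2020-04-17 12:00"):
    print(part)
# ('2020-04-15 13:23', '2020-04-16 00:00')
# ('2020-04-16 00:00', '2020-04-17 00:00')
# ('2020-04-17 00:00', '2020-04-17 12:00')
```

Breaking at fixed boundaries such as midnight means that two different queries covering the same day produce the same middle sub-query, which is what makes the results reusable.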

2. Add Redis Cache to Sub‑Queries

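Results of sub-queries are cached in Redis. Subsequent identical sub-queries hit the cache and avoid Druid I/O. A threshold time (≈10 min) is used to judge data freshness: data more recent than the threshold may still be changing, so it is not treated the same way as older, settled data.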

{
  "biz_id": 1,
  "formula": "avg_cost_time",
  "keys": [{"field": "xxx_id", "relation": "eq", "value": "3"}],
  "start_time": "2020-04-15 13:23",
  "end_time": "2020-04-17 12:00"
}
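A sub-query cache along these lines could look like the sketch below. Several things here are assumptions that the article does not specify: the cache key is a hash of the canonicalized request, only sub-queries whose `end_time` is older than the threshold are stored, and a plain `dict` stands in for Redis so the example runs without a server. The names `cache_key`, `is_stable`, `fetch`, and `query_druid` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"
THRESHOLD = timedelta(minutes=10)  # the article's threshold of roughly 10 minutes


def cache_key(sub_query: dict) -> str:
    """Identical sub-queries serialize to the same canonical string and key."""
    canonical = json.dumps(sub_query, sort_keys=True, separators=(",", ":"))
    return "subq:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stable(sub_query: dict, now: datetime) -> bool:
    """Assumption: a range ending before now - THRESHOLD no longer changes."""
    end = datetime.strptime(sub_query["end_time"], FMT)
    return end <= now - THRESHOLD


def fetch(sub_query: dict, cache: dict, query_druid, now: datetime):
    """Serve from cache when possible; otherwise query Druid and maybe store."""
    key = cache_key(sub_query)
    if key in cache:
        return cache[key]
    result = query_druid(sub_query)  # caller-supplied placeholder
    if is_stable(sub_query, now):
        cache[key] = result  # recent, possibly changing data is not stored
    return result
```

With a real Redis client, the dictionary lookup and assignment would become get and set calls on the client, and the result would need to be serialized (for example as JSON) because Redis stores strings or bytes.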

3. Sub‑Dimension Tables

When a table contains high-cardinality dimensions, its low-cardinality dimensions are extracted into separate sub-dimension tables that are refreshed in real time. Queries that only need those dimensions can target the smaller tables, whose segments are smaller and cheaper to scan.
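One way to picture the routing is the sketch below, which picks the narrowest table that still contains every dimension a query references. The table names, the dimension names other than `xxx_id`, and the rule of using dimension count as a proxy for table size are all assumptions made for illustration. The article does not describe how the platform chooses a table.

```python
def pick_table(query_dims: set[str], tables: dict[str, set[str]]) -> str:
    """Return the table with the fewest dimensions that still covers the query."""
    candidates = [name for name, dims in tables.items() if query_dims <= dims]
    if not candidates:
        raise ValueError(f"no table covers dimensions {sorted(query_dims)}")
    return min(candidates, key=lambda name: len(tables[name]))


# Hypothetical layout: one wide main table and one sub-dimension table.
TABLES = {
    "main": {"xxx_id", "client_version", "city", "device_model"},
    "sub_by_id": {"xxx_id"},
}

print(pick_table({"xxx_id"}, TABLES))          # sub_by_id
print(pick_table({"xxx_id", "city"}, TABLES))  # main
```

A query that filters only on `xxx_id` never touches the wide table, so it avoids scanning the high-cardinality combinations entirely.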

Optimization Results

Cache hit rate > 85% (full hit 86%, partial hit 98.8%).

Average query latency reduced from > 1000 ms to ~140 ms; P95 reduced from > 5000 ms to ~220 ms.

Requests hitting Druid dropped to about 10% of the original volume.

Conclusion

By analyzing query patterns, splitting queries into sub-queries, caching sub-query results in Redis, and adding sub-dimension tables, the WeChat monitoring platform cut average query latency from over 1000 ms to roughly 140 ms and reduced the load on Druid to about 10% of the original request volume.
