WeChat’s 10× Query Speedup: From 1000ms to 100ms with Druid & Redis
WeChat’s multi‑dimensional monitoring platform suffered from severe query latency and I/O bottlenecks. After analyzing user query behavior and Druid’s architecture, the team introduced sub‑query splitting, Redis caching, and smaller segments. The result was a cache hit rate above 85% and an average query time of roughly 140 ms, down from more than 1000 ms.
Background
WeChat’s multi‑dimensional metric monitoring platform aggregates up to 45 billion events per minute and 4 trillion per day, serving thousands of custom dimensions and metrics. Query traffic peaked at 400,000 requests per minute (about 300 million per day). Average query latency exceeded 1000 ms, and failure rates were high.
Platform APIs
Dimension enumeration query: Returns combinations of dimensions and their metric values, supporting aggregation, expansion, or drill‑down.
Time‑series query: Returns metric value sequences for specified dimension filters over a time range.
Performance Problem
More than 99% of queries are time‑series queries, and about 90% target data older than one day, which causes heavy I/O on Druid’s segments. The data layer is Apache Druid, where segment storage and broker memory become bottlenecks, especially for high‑cardinality dimension combinations.
Optimization Analysis
The team first collected detailed usage statistics: query types, dimensions, metrics, volume, failures, and latency. Findings included the dominance of time‑series queries and heavy historical data access.
Optimization Design
1. Split Sub‑queries
Each original request is broken into finer‑grained sub‑queries and distributed across multiple brokers, which reduces per‑broker load. For example, a time‑series request that covers about two days but touches three calendar days becomes three per‑day sub‑queries.
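The day-level split can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the half-open ranges, the function names, and the round-robin broker assignment are assumptions, and the broker names are hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format for request bounds


def split_by_day(start_time: str, end_time: str) -> list[tuple[str, str]]:
    """Split [start_time, end_time) into pieces that never cross midnight."""
    start = datetime.strptime(start_time, FMT)
    end = datetime.strptime(end_time, FMT)
    pieces = []
    cursor = start
    while cursor < end:
        # Midnight that closes the calendar day containing `cursor`.
        next_midnight = datetime.combine(cursor.date(), datetime.min.time()) + timedelta(days=1)
        piece_end = min(next_midnight, end)
        pieces.append((cursor.strftime(FMT), piece_end.strftime(FMT)))
        cursor = piece_end
    return pieces


def assign_round_robin(pieces, brokers):
    """Pair each piece with a broker in rotation (assumed distribution policy)."""
    return [(brokers[i % len(brokers)], piece) for i, piece in enumerate(pieces)]


sub_ranges = split_by_day("2020-04-15 13:23", "2020-04-17 12:00")
plan = assign_round_robin(sub_ranges, ["broker-a", "broker-b"])  # hypothetical broker names
for broker, (piece_start, piece_end) in plan:
    print(broker, piece_start, piece_end)
```

Because each piece is bounded by midnight, the middle piece of a long request always covers one full calendar day, so identical pieces recur across requests that overlap in time.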
2. Add Redis Cache
Results of sub‑queries are cached in Redis. For historical queries (older than one day) the cache can satisfy the request entirely, cutting Druid I/O from hundreds of segment reads to a few.
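A cache-aside lookup for a single sub-query might look like the sketch below. The key scheme, the `fetch` helper, and the `InMemoryCache` stand-in are assumptions made so the example runs without a Redis server. The source does not describe key layout, expiry, or invalidation, so those are left out.

```python
import json


class InMemoryCache:
    """Stand-in exposing only get/set, for illustration without a Redis server."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def cache_key(sub_query: dict) -> str:
    # Sorted keys make equal sub-queries serialize to the same cache key.
    return "subq:" + json.dumps(sub_query, sort_keys=True)


def fetch(sub_query: dict, cache, query_druid):
    """Return the cached result if present, otherwise query Druid and cache it."""
    key = cache_key(sub_query)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = query_druid(sub_query)
    cache.set(key, json.dumps(result))
    return result


druid_calls = []


def fake_query_druid(sub_query):
    # Placeholder for the real Druid call; the returned values are made up.
    druid_calls.append(sub_query)
    return {"avg_cost_time": [12.5, 13.0]}


cache = InMemoryCache()
sub_query = {
    "biz_id": 1,
    "formula": "avg_cost_time",
    "start_time": "2020-04-16 00:00",
    "end_time": "2020-04-17 00:00",
}
first = fetch(sub_query, cache, fake_query_druid)   # miss: goes to Druid
second = fetch(sub_query, cache, fake_query_druid)  # hit: served from cache
```

A redis-py client offers `get` and `set` methods with compatible signatures, so it could be passed in place of `InMemoryCache`. `json.loads` also accepts the bytes that such a client returns.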
An example request with a dimension filter and a time range:
{
  "biz_id": 1,
  "formula": "avg_cost_time",
  "keys": [{"field": "xxx_id", "relation": "eq", "value": "3"}],
  "start_time": "2020-04-15 13:23",
  "end_time": "2020-04-17 12:00"
}
3. Dimension‑Combination Sub‑queries
For dimension‑enumeration queries, a multi‑level cache stores data at day, 4‑hour, and hour granularity. Sub‑queries are routed to the smallest applicable cache slice; only the first and last hour‑level slices may hit Druid.
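One way to map a time range onto day, 4-hour, and hour slices is a greedy pass that always takes the largest aligned slice that still fits. The sketch below is only an assumption about how such a decomposition could work; the source does not specify the algorithm. It expects both bounds to fall on whole hours. The unaligned head and tail of a request, which the text says may go to Druid, would be trimmed off by the caller beforehand.

```python
from datetime import datetime, timedelta

# Cache levels as (name, length in hours), largest first.
LEVELS = [("day", 24), ("4h", 4), ("hour", 1)]


def decompose(start: datetime, end: datetime) -> list[tuple[str, datetime]]:
    """Cover [start, end) with aligned day, 4-hour, and hour slices."""
    for bound in (start, end):
        if bound.minute or bound.second or bound.microsecond:
            raise ValueError("bounds must fall on whole hours")
    slices = []
    cursor = start
    while cursor < end:
        for name, hours in LEVELS:
            length = timedelta(hours=hours)
            # A slice is usable only if it starts on its own boundary and fits.
            if cursor.hour % hours == 0 and cursor + length <= end:
                slices.append((name, cursor))
                cursor += length
                break
    return slices


slices = decompose(datetime(2020, 4, 15, 14), datetime(2020, 4, 17, 12))
for level, slice_start in slices:
    print(level, slice_start)
```

For this range the greedy pass yields eight cache lookups (two hour slices, five 4-hour slices, and one day slice) instead of 46 hour-level lookups.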
4. Sub‑dimension Tables
High‑cardinality dimensions are extracted into separate “sub‑dimension” tables that are kept up‑to‑date in real time, allowing queries to target smaller tables and further reduce segment size.
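Routing a query to a smaller table could be as simple as choosing the table with the fewest dimensions that still covers every dimension the query uses. The sketch below assumes that reading of "sub-dimension table"; the table names, dimension names, and selection rule are hypothetical and do not come from the source.

```python
# Hypothetical tables mapped to the dimensions each one stores.
TABLES = {
    "main": {"region", "device", "version", "user_id"},
    "by_region_device": {"region", "device"},
    "by_version": {"version"},
}


def pick_table(query_dims: set[str], tables: dict[str, set[str]]) -> str:
    """Return the table with the fewest dimensions that covers the query."""
    candidates = [
        (len(dims), name) for name, dims in tables.items() if query_dims <= dims
    ]
    if not candidates:
        raise LookupError(f"no table covers dimensions {sorted(query_dims)}")
    # Tuples sort by dimension count first, then by name to break ties.
    return min(candidates)[1]


narrow = pick_table({"region"}, TABLES)           # fits the two-dimension table
wide = pick_table({"region", "user_id"}, TABLES)  # only the main table covers it
```

Queries that use only a few dimensions land on tables with fewer distinct dimension combinations, and so scan less data.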
Results
Cache hit rate > 85% (full hit 86%, partial hit 98.8%).
Requests reaching Druid dropped to ~10% of original volume.
Average query latency improved from > 1000 ms to ~140 ms; P95 reduced from > 5000 ms to ~220 ms.
These optimizations collectively brought query latency down to the 100 ms range.
Conclusion
The combination of sub‑query decomposition, Redis caching, segment size reduction, and sub‑dimension tables transformed the platform’s performance, demonstrating a practical approach for large‑scale, real‑time analytics systems built on Druid.
This article has been distilled and summarized from source material, then republished for learning and reference.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.