
WeChat’s 10× Query Speedup: From 1000ms to 100ms with Druid & Redis

WeChat’s multi‑dimensional monitoring platform faced severe query latency and I/O bottlenecks, so the team analyzed user behavior and Druid architecture, then introduced sub‑query splitting, Redis caching, and segment size reductions, achieving over 85% cache hit rate and reducing average query time to around 100 ms.

dbaplus Community

Jun 25, 2023

Background

WeChat’s multi‑dimensional metric monitoring platform ingests up to 4.5 billion events per minute and about 4 trillion per day, across thousands of custom dimensions and metrics. Query traffic peaked at 400,000 requests per minute (about 300 million per day), with average query latency above 1000 ms and a high failure rate.

Platform APIs

Dimension enumeration query: Returns combinations of dimensions and their metric values, supporting aggregation, expansion, or drill‑down.

Time‑series query: Returns metric value sequences for specified dimension filters over a time range.

Performance Problem

Most queries (> 99%) are time‑series queries, and about 90% of them target data older than one day, which causes heavy I/O on Druid segments. The data layer is Apache Druid, where segment storage and broker memory become bottlenecks, especially for high‑cardinality dimension combinations.

Optimization Analysis

The team first collected detailed usage statistics: query types, dimensions, metrics, volume, failures, and latency. Findings included the dominance of time‑series queries and heavy historical data access.

Optimization Design

1. Split Sub‑queries

Each original request is broken into finer‑grained sub‑queries, which are distributed across multiple brokers to reduce per‑broker load. For example, a time‑series request that covers roughly two days but touches three calendar dates becomes three daily sub‑queries.
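
The splitting step can be sketched as follows. This is a minimal illustration, not the team's implementation: the function name `split_by_day` is hypothetical, and it assumes sub‑queries are cut at midnight boundaries and that timestamps use the `YYYY-MM-DD HH:MM` format.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format


def split_by_day(start_time: str, end_time: str) -> list[tuple[str, str]]:
    """Split [start_time, end_time) into sub-ranges that never cross midnight."""
    start = datetime.strptime(start_time, FMT)
    end = datetime.strptime(end_time, FMT)
    pieces = []
    cursor = start
    while cursor < end:
        # Midnight that begins the calendar day after `cursor`.
        next_midnight = datetime.combine(cursor.date(), datetime.min.time()) + timedelta(days=1)
        piece_end = min(next_midnight, end)
        pieces.append((cursor.strftime(FMT), piece_end.strftime(FMT)))
        cursor = piece_end
    return pieces


# A range that touches three calendar dates yields three daily sub-queries.
sub_ranges = split_by_day("2020-04-15 13:23", "2020-04-17 12:00")
```

Only the first and last pieces are partial days; the middle piece is a full day, which makes it a natural unit to cache or to send to a separate broker.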

2. Add Redis Cache

Results of sub‑queries are cached in Redis. For historical queries (older than one day) the cache can satisfy the request entirely, cutting Druid I/O from hundreds of segment reads to a few.

An example time‑series request, whose time range touches three calendar dates:

{
    "biz_id": 1,
    "formula": "avg_cost_time",
    "keys": [{"field": "xxx_id", "relation": "eq", "value": "3"}],
    "start_time": "2020-04-15 13:23",
    "end_time": "2020-04-17 12:00"
}
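
A cache‑aside lookup for one sub‑query might look like the sketch below. Everything here is illustrative: `DictCache` is an in‑memory stand‑in for a Redis client (only `get` and `set` are modelled), and `cache_key`, `fetch_sub_query`, the key layout, and the `query_druid` callback are hypothetical names, not the platform's actual code.

```python
import hashlib
import json


class DictCache:
    """In-memory stand-in for a Redis client; only get/set are modelled."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def cache_key(request: dict, day: str) -> str:
    # The key covers what identifies the sub-query: business, metric, filters, and day.
    identity = {k: request[k] for k in ("biz_id", "formula", "keys")}
    digest = hashlib.sha1(json.dumps(identity, sort_keys=True).encode()).hexdigest()
    return f"subq:{day}:{digest}"


def fetch_sub_query(cache, query_druid, request: dict, day: str):
    key = cache_key(request, day)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)       # cache hit: Druid is not touched
    result = query_druid(request, day)  # cache miss: fall through to Druid
    cache.set(key, json.dumps(result))
    return result


request = {
    "biz_id": 1,
    "formula": "avg_cost_time",
    "keys": [{"field": "xxx_id", "relation": "eq", "value": "3"}],
}
druid_calls = []


def fake_druid(req, day):
    # Placeholder for a real Druid query; records that it was called.
    druid_calls.append(day)
    return [1.0, 2.0]


cache = DictCache()
first = fetch_sub_query(cache, fake_druid, request, "2020-04-16")
second = fetch_sub_query(cache, fake_druid, request, "2020-04-16")
```

The second call is served from the cache, so the placeholder Druid function runs only once. A real deployment would also need an expiry policy and a rule for not caching days that are still receiving data.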

3. Dimension‑Combination Sub‑queries

For dimension‑enumeration queries, a multi‑level cache stores data at day, 4‑hour, and hour granularity. Sub‑queries are routed to the smallest applicable cache slice; only the first and last hour‑level slices may hit Druid.
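
One way to picture the multi‑level slicing is a greedy cover that always takes the coarsest aligned slice that still fits. The sketch below is an assumption about how such routing could work, not the platform's algorithm; `cover_with_slices` is a hypothetical name, and it only handles ranges aligned to whole hours.

```python
from datetime import datetime, timedelta

# Coarsest level first, so each step takes the largest slice that fits.
LEVELS = [("day", timedelta(days=1)), ("4h", timedelta(hours=4)), ("1h", timedelta(hours=1))]
EPOCH = datetime(1970, 1, 1)  # midnight reference for alignment checks


def cover_with_slices(start: datetime, end: datetime):
    """Cover the hour-aligned range [start, end) with day, 4-hour, and hour slices."""
    slices = []
    cursor = start
    while cursor < end:
        for name, width in LEVELS:
            aligned = (cursor - EPOCH) % width == timedelta(0)
            if aligned and cursor + width <= end:
                slices.append((name, cursor))
                cursor += width
                break
        else:
            raise ValueError("range must be aligned to whole hours")
    return slices


slices = cover_with_slices(datetime(2020, 4, 15, 13), datetime(2020, 4, 17, 12))
levels_used = [name for name, _ in slices]
```

For this range the cover uses three hour slices, two 4‑hour slices, one day slice, and three more 4‑hour slices, so nine cache lookups replace a scan over 47 hours of data.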

4. Sub‑dimension Tables

High‑cardinality dimensions are extracted into separate “sub‑dimension” tables that are kept up‑to‑date in real time, allowing queries to target smaller tables and further reduce segment size.
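
Routing a query to the smallest suitable table could be sketched like this. It rests on an assumption the summary does not spell out: that each sub‑dimension table carries a subset of the full table's dimensions, and a query can use any table that contains every dimension it references. The table names and all dimension names other than `xxx_id` are hypothetical.

```python
def pick_table(tables: dict[str, set[str]], query_dims: set[str]) -> str:
    """Return the table with the fewest dimensions that still covers query_dims."""
    candidates = [name for name, dims in tables.items() if query_dims <= dims]
    if not candidates:
        raise LookupError(f"no table covers dimensions {sorted(query_dims)}")
    # Fewest dimensions first; the name breaks ties deterministically.
    return min(candidates, key=lambda name: (len(tables[name]), name))


# Hypothetical catalogue: one full table and two sub-dimension tables.
TABLES = {
    "full_table": {"xxx_id", "client_version", "region", "device_model"},
    "sub_xxx_region": {"xxx_id", "region"},
    "sub_xxx": {"xxx_id"},
}

narrow = pick_table(TABLES, {"xxx_id"})
medium = pick_table(TABLES, {"xxx_id", "region"})
wide = pick_table(TABLES, {"device_model"})
```

Queries that filter on a single dimension land on the narrowest table, while queries that need a dimension found only in the full table fall back to it.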

Results

Cache hit rate > 85% (full hit 86%, partial hit 98.8%).

Requests reaching Druid dropped to ~10% of original volume.

Average query latency improved from > 1000 ms to ~140 ms; P95 reduced from > 5000 ms to ~220 ms.

These optimizations collectively brought query latency down to the 100 ms range.

Conclusion

The combination of sub‑query decomposition, Redis caching, segment size reduction, and sub‑dimension tables transformed the platform’s performance, demonstrating a practical approach for large‑scale, real‑time analytics systems built on Druid.
