Big Data · 13 min read

Optimizing Query Performance in WeChat's Multi‑Dimensional Monitoring Platform

This article details how the WeChat multi‑dimensional monitoring platform cut average query latency from over 1,000 ms to around 140 ms by analyzing user query patterns, redesigning the Druid data layer, splitting large queries into sub‑queries, introducing Redis caching, and employing sub‑dimension tables, achieving cache hit rates above 85%.


WeChat's multi‑dimensional monitoring platform processes massive volumes of data (up to 45 billion events per minute and 4 trillion per day) and originally suffered from average query latency exceeding 1,000 ms and a high failure rate.

Analysis of user behavior showed that more than 99% of requests are time‑series queries, many targeting data older than one day, which caused large‑scale scans across Druid segments and severe performance bottlenecks.

The platform's data layer is built on Apache Druid, comprising Master (Overlord, Coordinator), Real‑time (MiddleManager, Peon) and Storage (Historical, DeepStorage, MetaDataStorage, Zookeeper) nodes, which together handle real‑time ingestion and OLAP queries.

Root causes of slow queries included segment sizes spanning 2‑4 hours, excessive segment I/O, and high‑cardinality dimension combinations that exploded the search space.
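A quick back-of-the-envelope calculation shows why high-cardinality dimension combinations explode the search space. The dimension names and cardinalities below are purely illustrative, not WeChat's actual figures:

```python
from math import prod

# Illustrative cardinalities for a monitoring table's dimensions.
# With no pre-aggregation, the worst-case number of distinct
# dimension-value combinations is the product of the cardinalities.
cardinalities = {"host": 10_000, "api": 2_000, "region": 50, "ret_code": 100}
worst_case_combinations = prod(cardinalities.values())  # 10^11 combinations
```

Even if only a small fraction of combinations actually occur, a query constrained on a low-selectivity dimension can still force scans over a huge slice of this space.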

To address these issues, the team implemented three major optimizations:

Split large queries into finer‑grained sub‑queries (e.g., one‑day or one‑hour slices) and distribute them across multiple brokers.
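The splitting step can be sketched as follows. `split_time_range` and its default slice size are hypothetical stand-ins for the platform's actual splitter, which presumably aligns slices to day or hour boundaries:

```python
from datetime import datetime, timedelta

def split_time_range(start, end, slice_hours=24):
    """Split [start, end) into fixed-size slices, each of which becomes
    an independent sub-query that can be routed to a different broker."""
    slices = []
    cursor = start
    step = timedelta(hours=slice_hours)
    while cursor < end:
        nxt = min(cursor + step, end)
        slices.append((cursor, nxt))
        cursor = nxt
    return slices

# A ~2-day query (matching the example protocol below) becomes
# two day-sized sub-queries.
subs = split_time_range(datetime(2020, 4, 15, 13, 23),
                        datetime(2020, 4, 17, 12, 0))
```

Because each slice is independent, the sub-queries can be executed in parallel and their results merged afterward.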

Introduce a Redis cache layer for sub‑query results, using a threshold time to determine cache trustworthiness and supporting both partial and full cache hits.
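A minimal sketch of the threshold-time caching logic, using an in-memory dict in place of Redis; the key scheme and function names are hypothetical. The idea is that a slice is only cached, and only trusted from cache, once it ends before the threshold time, i.e. once its data is old enough to be considered immutable:

```python
from datetime import datetime

CACHE = {}  # stands in for Redis; a real deployment would use GET/SETEX

def fetch_with_cache(slices, threshold, query_druid):
    """Serve each time slice from cache when its end lies before
    `threshold`; otherwise fall through to Druid. Returns the results
    plus the slices that were cache hits, so callers can distinguish
    full hits (all slices cached) from partial hits."""
    results, hits = [], []
    for s in slices:
        key = ("avg_cost_time", s)          # hypothetical cache key scheme
        if s[1] <= threshold and key in CACHE:
            results.append(CACHE[key])
            hits.append(s)
        else:
            val = query_druid(s)
            if s[1] <= threshold:           # only cache finalized slices
                CACHE[key] = val
            results.append(val)
    return results, hits
```

On a partial hit, only the uncached (typically the most recent) slices reach Druid, which is what drives the Druid request share down.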

Create sub‑dimension tables for high‑cardinality dimensions, allowing the query engine to match the smallest possible table and reduce segment size.
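The "smallest matching table" routing can be sketched like this; the table names, dimension sets, and row counts are invented for illustration, since the actual routing logic is not published:

```python
def pick_table(query_dims, tables):
    """Pick the smallest pre-built sub-dimension table whose dimension
    set covers the query. `tables` maps a table name to a tuple of
    (dimension set, approximate row count)."""
    candidates = [(rows, name) for name, (dims, rows) in tables.items()
                  if query_dims <= dims]
    if not candidates:
        return "full_table"                  # fall back to the full roll-up
    return min(candidates)[1]

TABLES = {
    "t_host":     ({"host"}, 10_000),
    "t_host_api": ({"host", "api"}, 2_000_000),
    "full_table": ({"host", "api", "region", "ret_code"}, 50_000_000),
}
```

A query filtering only on `host` scans the 10,000-row table instead of the 50-million-row roll-up, which both shrinks segment size and avoids the cardinality explosion of unused dimensions.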

Example sub‑query in the platform's JSON query protocol (the `//` annotations are explanatory and not part of the payload):

{
    "biz_id": 1, // query protocol table ID
    "formula": "avg_cost_time", // aggregation formula
    "keys": [{
        "field": "xxx_id",
        "relation": "eq",
        "value": "3"
    }],
    "start_time": "2020-04-15 13:23",
    "end_time": "2020-04-17 12:00"
}
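Once the split sub-queries return, whether from cache or from Druid, their per-slice series must be merged back into a single response. A minimal, hypothetical reducer for time-series points:

```python
def merge_series(parts):
    """Concatenate per-slice time-series results, each a list of
    (timestamp, value) pairs, into one chronologically ordered series."""
    parts = [p for p in parts if p]          # drop empty slices
    parts.sort(key=lambda p: p[0][0])        # order slices by first timestamp
    return [point for part in parts for point in part]
```

This works because the slices partition the query's time range, so ordering by each slice's first timestamp yields a globally ordered series.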

After the redesign, cache hit rates exceeded 85% (full‑hit rate 86%, partial‑hit rate 98.8%), average query latency dropped from over 1,000 ms to roughly 140 ms, and P95 latency fell from over 5,000 ms to roughly 220 ms. Only about 10% of requests still reached Druid.

In conclusion, by analyzing query patterns, restructuring the Druid architecture, and adding multi‑level caching with sub‑dimension tables, the platform achieved significant performance gains and a more scalable monitoring solution.

Tags: monitoring, performance, big data, caching, query optimization, WeChat, Druid
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
