How Zhihu Scaled Its Real-Time Analytics with Druid and Smart Redis Caching
Zhihu built a self‑service analytics platform on Druid, introduced a multi‑level Redis caching strategy, split long‑duration queries across multiple brokers, and added automatic cache invalidation to dramatically improve query latency and resource usage for massive daily request volumes.
Data Analysis Platform Overview
Zhihu’s rapid product growth required a flexible, self‑service analytics platform. The team built a platform on the open‑source OLAP engine Druid, supporting both offline Hive tables and real‑time Kafka streams. Core capabilities include unified data source management, configurable multi‑dimensional reports, sub‑second query response, dashboard creation, data‑service APIs, and unified permission control.
495 dashboards with 2,399 reports
30,000+ daily queries
Data APIs for A/B testing, channel management, APM, and data‑mail systems
Technical Choice – Druid
Druid provides sub‑second query latency for both historical and real‑time data, low‑latency ingestion, flexible exploration, high‑performance aggregation, and easy horizontal scaling, making it suitable for large‑scale analytical workloads.
Druid Data Structures
Data Source: a logical table containing time, dimension, and metric columns.
Segment: time-partitioned index files; granularity is configurable via segmentGranularity.
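As a rough illustration of where segmentGranularity sits, the fragment below builds the granularitySpec portion of a Druid ingestion spec as a Python dictionary. The values (DAY, HOUR, rollup enabled) are assumptions chosen for the example, not Zhihu's actual settings.

```python
import json

# Illustrative granularitySpec fragment; the values are assumptions, not Zhihu's config.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # each segment covers one day of data
    "queryGranularity": "HOUR",    # timestamps are truncated to the hour at ingestion
    "rollup": True,                # pre-aggregate rows sharing the same dimensions
}

# Serialized form, as it would appear inside an ingestion spec.
spec_json = json.dumps({"granularitySpec": granularity_spec}, indent=2)
```

With daily segments, a query over one week touches roughly seven time chunks, which is why the time span of a query matters so much in the optimizations described later.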
Query Service Components
Internal
Historical: loads and serves segment files.
Broker: routes queries to the appropriate Historical nodes, merges their results, and returns them.
Router: used at TB-scale to distribute queries among Brokers.
External
Deep Storage: stores segment files; can be local disk or distributed HDFS.
Metastore Storage: stores metadata, typically in MySQL.
Platform Evolution and Optimizations
1. High Query Volume
Increasing query load caused slow response times.
2. Simple Redis Cache
The first version used the entire request body as the cache key and the response body as the value. A hit therefore required an identical query, so requests that differed only in their time range always missed.
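A minimal sketch of this whole-request caching, assuming the key is a hash of the canonicalized request body. A plain dictionary stands in for Redis, and cache_key, cached_query, and the "druid:" prefix are hypothetical names introduced for illustration.

```python
import hashlib
import json

cache = {}  # in-memory stand-in for Redis (assumption for this sketch)

def cache_key(request_body: dict) -> str:
    """Hash the canonical JSON form of the whole request body."""
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return "druid:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cached_query(request_body: dict, run_query):
    """Return the cached response, or run the query and cache its response."""
    key = cache_key(request_body)
    if key in cache:
        return cache[key]
    result = run_query(request_body)
    cache[key] = result
    return result

# Two requests that differ only in their interval produce different keys,
# so the second one cannot reuse anything cached for the first.
last_week = {"dataSource": "events", "intervals": ["2024-01-01/2024-01-08"]}
shifted = {"dataSource": "events", "intervals": ["2024-01-02/2024-01-09"]}
keys_differ = cache_key(last_week) != cache_key(shifted)
```

The two example requests share six of their seven days, yet the second is a complete miss, which is the weakness the next step addresses.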
3. Partial‑Span Cache Reuse
The team then cached results for each unit time span. When a query misses the exact key, Redis is scanned for cached spans that overlap the requested time range, and any hits are merged into the response.
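The idea can be sketched as follows, assuming the unit time span is one day and results are cached per (query, day) pair. The dictionary stands in for Redis, and query_id, run_day, and query_with_span_cache are hypothetical names; the article does not describe the actual key layout.

```python
from datetime import date, timedelta

span_cache = {}  # (query_id, day) -> result; stand-in for Redis

def days_in(start: date, end: date):
    """Yield each day in [start, end); the end date is exclusive."""
    day = start
    while day < end:
        yield day
        day += timedelta(days=1)

def query_with_span_cache(query_id: str, start: date, end: date, run_day):
    """Serve cached days from the cache and query only the missing days."""
    results = {}
    for day in days_in(start, end):
        key = (query_id, day.isoformat())
        if key not in span_cache:
            span_cache[key] = run_day(day)  # only uncached days reach Druid
        results[day.isoformat()] = span_cache[key]
    return results
```

A request for 1 to 3 January followed by one for 2 to 5 January would query the backend for only two additional days. Note that this naive version performs one cache lookup per day, which is the Redis I/O cost the next step reduces.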
4. Reducing Redis I/O
To cut round trips, the team designed a single-read approach: after storing a query's results, the platform also stores the set of timestamps they cover under an “Interval-Excluded Request” key (by its name, the request with its time interval left out). An MGET retrieves all timestamps at once, and the corresponding cached results are then fetched in bulk.
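One plausible reading of this design is sketched below: an index entry keyed by the interval-excluded request records which timestamps are cached, and a single bulk read then fetches the matching results. FakeRedis is a tiny in-memory stand-in that only borrows the method names set, sadd, smembers, and mget from redis-py; the key layout and the store and load helpers are assumptions, not the platform's actual schema.

```python
class FakeRedis:
    """In-memory stand-in that counts round trips (assumption for this sketch)."""
    def __init__(self):
        self.kv, self.sets, self.round_trips = {}, {}, 0

    def set(self, key, value):
        self.round_trips += 1
        self.kv[key] = value

    def sadd(self, key, *members):
        self.round_trips += 1
        self.sets.setdefault(key, set()).update(members)

    def smembers(self, key):
        self.round_trips += 1
        return set(self.sets.get(key, set()))

    def mget(self, keys):
        self.round_trips += 1
        return [self.kv.get(k) for k in keys]

def store(r, base_key, day, result):
    """Cache one day's result and record the day in the index set."""
    r.set(f"{base_key}:{day}", result)
    r.sadd(f"{base_key}:index", day)

def load(r, base_key, wanted_days):
    """Read the index once, then fetch every cached day in one bulk read."""
    cached_days = r.smembers(f"{base_key}:index")
    hits = sorted(d for d in wanted_days if d in cached_days)
    values = r.mget([f"{base_key}:{d}" for d in hits]) if hits else []
    # A value may be None if an entry vanished after the index was written.
    found = {d: v for d, v in zip(hits, values) if v is not None}
    missing = [d for d in wanted_days if d not in found]
    return found, missing
```

Here base_key plays the role of the interval-excluded request. However many days a query covers, the lookup costs two round trips rather than one per day.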
5. Long‑Duration Queries
Queries spanning more than two weeks consumed excessive memory on Broker nodes, slowing Druid down and sometimes blocking other requests.
6. Single‑Broker Bottleneck
Each query is routed to a single Broker, which then contacts Historical nodes. Longer time spans increase memory pressure on that Broker.
7. Multi‑Broker Parallelism
Split an N-day query into N one-day queries, dispatch them asynchronously across Brokers, and aggregate the results. This reduces per-Broker memory usage and, according to the team's benchmark tests, cuts overall latency.
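The splitting step can be sketched as below, under several assumptions: sub-queries are spread round-robin over a hypothetical list of Broker URLs, a thread pool provides the asynchronous dispatch, and the metric is additive so partial results can simply be summed. The send callable, which would issue the real HTTP request, is injected and is not part of any Druid client API.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from itertools import cycle

# Hypothetical Broker addresses, for illustration only.
BROKERS = ["http://broker-1:8082", "http://broker-2:8082"]

def split_by_day(start: date, end: date):
    """Split [start, end) into one-day ISO 8601 intervals."""
    intervals = []
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        intervals.append(f"{day.isoformat()}/{next_day.isoformat()}")
        day = next_day
    return intervals

def run_split(start: date, end: date, send):
    """Dispatch one-day sub-queries concurrently and sum the partial results."""
    jobs = list(zip(cycle(BROKERS), split_by_day(start, end)))
    with ThreadPoolExecutor(max_workers=len(BROKERS)) as pool:
        partials = list(pool.map(lambda job: send(*job), jobs))
    return sum(partials)
```

Summing works for counts and sums; metrics such as exact distinct counts cannot be merged this way and would need a different aggregation step. One-day sub-queries also line up with the per-day cache entries described earlier.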
8. Cache Expiration
When Druid’s underlying data is refreshed, cached results become stale. The platform tags each data source with a version timestamp (from MySQL metadata). On each request, the cache timestamp is compared to the metadata version; if older, the cache entry is deleted and the query is re‑executed.
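The comparison described above might look like the following sketch. The dictionary stands in for Redis, timestamps are plain numbers, and get_fresh and its parameters are hypothetical names; in the real platform the version would come from the MySQL metadata.

```python
cache = {}  # key -> (cached_at, result); stand-in for Redis

def get_fresh(key, datasource_version, run_query, now):
    """Return a cached result only if it is at least as new as the data source."""
    entry = cache.get(key)
    if entry is not None:
        cached_at, result = entry
        if cached_at >= datasource_version:
            return result      # cache was written after the last data refresh
        del cache[key]         # stale: the data source was refreshed since then
    result = run_query()
    cache[key] = (now, result)
    return result
```

Because staleness is checked on read, no separate job has to scan Redis when a data source is refreshed; stale entries are dropped the next time they are requested.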
Conclusion
The Zhihu analytics platform demonstrates how a combination of Druid, Redis caching, query splitting, and automatic cache invalidation can sustain high query throughput, reduce latency, and lower resource consumption while supporting both fixed and ad‑hoc analytical needs.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
