
How Zhihu Scaled Its Real-Time Analytics with Druid and Smart Redis Caching

Zhihu built a self‑service analytics platform on Druid, introduced a multi‑level Redis caching strategy, split long‑duration queries across multiple brokers, and added automatic cache invalidation to dramatically improve query latency and resource usage for massive daily request volumes.

dbaplus Community

Data Analysis Platform Overview

Zhihu’s rapid product growth required a flexible, self‑service analytics platform. The team built a platform on the open‑source OLAP engine Druid, supporting both offline Hive tables and real‑time Kafka streams. Core capabilities include unified data source management, configurable multi‑dimensional reports, sub‑second query response, dashboard creation, data‑service APIs, and unified permission control.

495 dashboards with 2,399 reports

30,000+ daily queries

Data APIs for A/B testing, channel management, APM, and data‑mail systems

Technical Choice – Druid

Druid provides sub‑second query latency for both historical and real‑time data, low‑latency ingestion, flexible exploration, high‑performance aggregation, and easy horizontal scaling, making it suitable for large‑scale analytical workloads.

Druid Data Structures

Data Source: a logical table containing time, dimension, and metric columns.

Segment: a time-partitioned index file; the partition granularity is configured via segmentGranularity.
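As a rough illustration of where segmentGranularity lives, the sketch below builds the granularitySpec fragment of a Druid ingestion spec as a Python dictionary. The DAY and HOUR values are assumptions chosen for the example, not Zhihu's actual settings.

```python
import json

# Illustrative granularitySpec fragment of a Druid ingestion spec.
# The values below are assumptions, not Zhihu's production settings.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",  # data is partitioned into one-day segments
    "queryGranularity": "HOUR",   # timestamps are truncated to the hour at ingestion
    "rollup": True,               # pre-aggregate rows that share time and dimensions
}

print(json.dumps(granularity_spec, indent=2))
```

A coarser segmentGranularity yields fewer, larger segment files; a finer one yields more, smaller files.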

Query Service Components

Internal

Historical: loads segment files and serves queries against them.

Broker: routes each query to the appropriate Historical nodes, merges their results, and returns the merged result to the caller.

Router: an optional component, used at TB scale to distribute queries among Brokers.

External

Deep Storage: stores segment files; can be local disk or a distributed store such as HDFS.

Metastore Storage: stores metadata, typically in MySQL.
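To make the Broker's role concrete, here is a minimal sketch of a client building a Druid native timeseries query and posting it to a Broker's `/druid/v2/` endpoint. The data source name `page_views`, the metric `pv`, and the helper names are hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

def build_timeseries_query(datasource, start, end, metric):
    """Build a Druid native timeseries query for [start, end) (ISO-8601 dates)."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "day",
        "intervals": [f"{start}/{end}"],
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric}
        ],
    }

def post_to_broker(broker_url, query):
    """POST a native query to a Broker (or Router) and decode the JSON reply."""
    req = urllib.request.Request(
        f"{broker_url}/druid/v2/",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical data source and metric names, for illustration only.
query = build_timeseries_query("page_views", "2024-01-01", "2024-01-08", "pv")
print(json.dumps(query, indent=2))
```

Calling `post_to_broker("http://<broker-host>:8082", query)` would send it, assuming a Broker is listening on the default port at that address.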

Platform Evolution and Optimizations

1. High Query Volume

Increasing query load caused slow response times.

2. Simple Redis Cache

Initially, the platform used the entire request body as the cache key and the response body as the value. The cache hit only when a query was exactly identical to an earlier one, so queries that differed only in their time range always missed.
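The limitation is easy to see in a small sketch. The code below assumes the key is a hash of the serialized request body (the source only says the body itself is the key) and uses an in-memory `DictCache` as a stand-in for Redis GET/SET so that it runs without a server. All names are hypothetical.

```python
import hashlib
import json

class DictCache:
    """In-memory stand-in for Redis GET/SET so the sketch runs without a server."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

def request_key(request_body):
    # Serialize with sorted keys so equal requests always map to the same key.
    canonical = json.dumps(request_body, sort_keys=True)
    return "query:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cached_query(cache, request_body, run_query):
    """Return (result, was_hit); run_query is called only on a miss."""
    key = request_key(request_body)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit), True
    result = run_query(request_body)
    cache.set(key, json.dumps(result))
    return result, False

cache = DictCache()
calls = []

def fake_druid(body):
    # Stand-in for a real Druid call; records each invocation.
    calls.append(body)
    return [{"rows": 1}]

week = {"dataSource": "page_views", "intervals": ["2024-01-01/2024-01-08"]}
shifted = {"dataSource": "page_views", "intervals": ["2024-01-02/2024-01-09"]}
_, hit1 = cached_query(cache, week, fake_druid)     # miss: first time seen
_, hit2 = cached_query(cache, week, fake_druid)     # hit: identical body
_, hit3 = cached_query(cache, shifted, fake_druid)  # miss: range shifted by one day
```

The third call misses even though six of its seven days were already computed, which is the waste the next step targets.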

3. Partial‑Span Cache Reuse

The team introduced a mechanism that caches results per unit time span. When a query misses the exact key, Redis is scanned for cached spans that overlap the requested range; any hits are reused, and only the remaining spans are queried from Druid and merged in.
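A minimal sketch of per-span reuse follows, assuming a one-day unit span and a plain dict in place of Redis. The `query_id` argument is a hypothetical identifier for the query with its time range removed. Note that this naive version performs one cache lookup per day.

```python
from datetime import date, timedelta

def days_in(start, end):
    """Yield each day in the half-open range [start, end)."""
    d = start
    while d < end:
        yield d
        d += timedelta(days=1)

def span_key(query_id, day):
    return f"{query_id}:{day.isoformat()}"

def query_with_span_cache(cache, query_id, start, end, run_query):
    """cache: dict-like store; run_query(day_start, day_end) -> result for that day.
    Returns (results in chronological order, list of days that had to be queried)."""
    results, missing = {}, []
    for day in days_in(start, end):
        hit = cache.get(span_key(query_id, day))  # one lookup per day
        if hit is not None:
            results[day] = hit
        else:
            missing.append(day)
    for day in missing:
        value = run_query(day, day + timedelta(days=1))
        cache[span_key(query_id, day)] = value
        results[day] = value
    return [results[d] for d in days_in(start, end)], missing

cache = {}

def fake_druid(day_start, day_end):
    # Stand-in for a real one-day Druid query.
    return {"day": day_start.isoformat(), "pv": 100}

# First query covers Jan 1-3; the second overlaps on Jan 2-3.
query_with_span_cache(cache, "pv_by_day", date(2024, 1, 1), date(2024, 1, 4), fake_druid)
rows, missing = query_with_span_cache(
    cache, "pv_by_day", date(2024, 1, 2), date(2024, 1, 6), fake_druid
)
```

In the second call only January 4 and 5 reach `fake_druid`; the other two days come from the cache.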

4. Reducing Redis I/O

To avoid scanning Redis span by span, the team designed a single-read approach: after storing a query result, the platform also stores the set of timestamps that result covers, under an “Interval‑Excluded Request” key. A single MGET retrieves all timestamps at once, and the corresponding cached results are then fetched in bulk.
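One possible reading of this design is sketched below: an index entry, keyed by the request without its interval, lists the days already cached, so a lookup needs one read for the index and one MGET for the values. `FakeRedis` is an in-memory stand-in that mimics GET, SET, and MGET and counts read calls; the key layout is an assumption, not the platform's documented schema.

```python
import json

class FakeRedis:
    """Tiny in-memory stand-in mimicking Redis GET/SET/MGET; counts read calls."""
    def __init__(self):
        self.data, self.reads = {}, 0
    def set(self, key, value):
        self.data[key] = value
    def get(self, key):
        self.reads += 1
        return self.data.get(key)
    def mget(self, keys):
        self.reads += 1
        return [self.data.get(k) for k in keys]

def store(r, base_key, per_day):
    """per_day: {"2024-01-01": result, ...}; base_key is the request minus its interval."""
    index = r.get(base_key)
    known = set(json.loads(index)) if index else set()
    for day, result in per_day.items():
        r.set(f"{base_key}:{day}", json.dumps(result))
        known.add(day)
    r.set(base_key, json.dumps(sorted(known)))  # index of cached days

def load(r, base_key, wanted_days):
    """Return (cached results by day, days still missing) using at most two reads."""
    index = r.get(base_key)                      # read 1: which days are cached
    known = set(json.loads(index)) if index else set()
    hits = [d for d in wanted_days if d in known]
    values = r.mget([f"{base_key}:{d}" for d in hits]) if hits else []  # read 2
    cached = {d: json.loads(v) for d, v in zip(hits, values) if v is not None}
    missing = [d for d in wanted_days if d not in cached]
    return cached, missing

r = FakeRedis()
store(r, "pv_by_day", {"2024-01-01": 10, "2024-01-02": 20, "2024-01-03": 30})
before = r.reads
cached, missing = load(r, "pv_by_day", ["2024-01-02", "2024-01-03", "2024-01-04"])
reads_used = r.reads - before
```

The number of reads stays at two regardless of how many days the query spans, whereas a per-day lookup grows linearly with the range.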

5. Long‑Duration Queries

Queries spanning more than two weeks slowed Druid down, consumed excessive memory on Broker nodes, and sometimes blocked other requests.

6. Single‑Broker Bottleneck

Each query is routed to a single Broker, which then contacts Historical nodes. Longer time spans increase memory pressure on that Broker.

7. Multi‑Broker Parallelism

The platform splits an N‑day query into N one‑day queries, dispatches them asynchronously across Brokers, and aggregates the results. According to the team's benchmark tests, this reduces per‑Broker memory usage and cuts overall latency.
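The splitting step can be sketched with a thread pool, as below. This is an illustration under assumptions: Brokers are assigned round-robin, `run_query` is a hypothetical callable that queries one Broker for one day, and results are merged by simple concatenation.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from itertools import cycle

def split_by_day(start, end):
    """Split [start, end) into consecutive one-day (start, end) pairs."""
    spans = []
    d = start
    while d < end:
        spans.append((d, d + timedelta(days=1)))
        d += timedelta(days=1)
    return spans

def parallel_query(brokers, start, end, run_query, max_workers=8):
    """run_query(broker, day_start, day_end) -> list of rows for that day."""
    spans = split_by_day(start, end)
    # Round-robin assignment so no single Broker handles the whole range.
    jobs = list(zip(cycle(brokers), spans))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in submission order, so rows stay chronological.
        parts = pool.map(lambda job: run_query(job[0], *job[1]), jobs)
        return [row for part in parts for row in part]

def fake_broker_query(broker, day_start, day_end):
    # Stand-in for an HTTP call to one Broker covering a single day.
    return [{"day": day_start.isoformat(), "broker": broker}]

rows = parallel_query(
    ["broker-a", "broker-b"], date(2024, 1, 1), date(2024, 1, 5), fake_broker_query
)
```

Concatenation is enough when results are already broken out by day. Aggregates that are not additive across days, such as distinct counts, would need a different merge step.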

8. Cache Expiration

When Druid’s underlying data is refreshed, cached results become stale. The platform tags each data source with a version timestamp (from MySQL metadata). On each request, the cache entry's timestamp is compared with the metadata version; if the entry is older, it is deleted and the query is re‑executed.
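The comparison can be sketched as follows, assuming each cache entry records when it was written and that the data source's version timestamp has already been read from the metadata store. The function name, the entry layout, and the injectable `now` clock are hypothetical.

```python
import time

def get_with_version_check(cache, key, datasource_version, run_query, now=time.time):
    """cache: dict of key -> {"cached_at": ts, "result": ...}.
    datasource_version: timestamp of the data source's latest refresh.
    Returns (result, status) where status is "hit", "miss", or "stale"."""
    entry = cache.get(key)
    if entry is not None:
        if entry["cached_at"] >= datasource_version:
            return entry["result"], "hit"
        del cache[key]  # data was refreshed after this entry was written
        status = "stale"
    else:
        status = "miss"
    result = run_query()
    cache[key] = {"cached_at": now(), "result": result}
    return result, status

cache = {}
# Fixed clock values keep the example deterministic.
_, s1 = get_with_version_check(cache, "k", 100, lambda: "v1", now=lambda: 150)
_, s2 = get_with_version_check(cache, "k", 100, lambda: "v2", now=lambda: 160)
r3, s3 = get_with_version_check(cache, "k", 200, lambda: "v3", now=lambda: 210)
```

The third call sees a data source version (200) newer than the entry written at 150, so the entry is dropped and the query runs again.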

Conclusion

The Zhihu analytics platform demonstrates how a combination of Druid, Redis caching, query splitting, and automatic cache invalidation can sustain high query throughput, reduce latency, and lower resource consumption while supporting both fixed and ad‑hoc analytical needs.
