How Transparent Multilevel Cache (TMC) Boosts Performance with Hotspot Detection and Local Caching
The Transparent Multilevel Cache (TMC) solution adds application‑level hotspot detection, local caching, and hit‑rate statistics to a standard distributed cache stack, enabling automatic hotspot discovery, reducing load on backend cache clusters, and improving system stability and latency during traffic spikes.
What Is TMC?
TMC (Transparent Multilevel Cache) is a comprehensive caching solution provided by Youzan PaaS for internal applications.
Built on a generic distributed cache (e.g., CodisProxy + Redis or Youzan's own zanKV), TMC adds:
Application‑level hotspot detection
Application‑level local cache
Application‑level cache‑hit statistics
These features help solve hotspot access problems at the application layer.
Why Build TMC?
E‑commerce merchants on Youzan run unpredictable promotional activities (flash sales, product pushes, order processing) that create sudden cache‑hotspot traffic, overwhelming distributed cache systems and affecting application stability.
TMC automatically discovers hotspot keys and serves requests for them from a local cache inside the application, reducing pressure on the downstream distributed cache.
Pain Points of Multilevel Cache Solutions
Hotspot detection: How to quickly and accurately find hotspot keys?
Data consistency: How to ensure consistency between the local cache and the distributed cache?
Effect verification: How can applications view local‑cache hit rates and hotspot keys to verify effectiveness?
Transparent integration: How to minimize intrusion and achieve smooth, fast adoption?
TMC focuses on these issues, providing hotspot detection and local caching to reduce impact on downstream cache services.
TMC Overall Architecture
The architecture consists of three layers:
Storage layer: Provides basic KV storage, using different services (Codis, zanKV, Aerospike) per business scenario.
Proxy layer: Offers a unified cache entry and protocol for applications, handling routing after horizontal sharding.
Application layer: Supplies a unified client with built‑in hotspot detection and local caching, transparent to business logic.
This article focuses on the application‑layer client’s hotspot detection and local caching.
Application‑Layer Local Cache
Transparent Integration
Java services can use either the spring.data.redis package with RedisTemplate or the youzan.framework.redis package with RedisClient. Both ultimately create a Jedis object via JedisPool that talks to the proxy layer.
TMC modifies the native JedisPool and Jedis classes to embed hotspot discovery and local caching logic from the Hermes‑SDK during initialization. Using a specific version of the jedis‑jar package, applications gain hotspot detection and local caching without code changes.
Overall Structure
Module Breakdown
Jedis‑Client: Direct entry for Java applications to communicate with the cache service.
Hermes‑SDK: Encapsulates hotspot discovery and local caching.
Hermes Server Cluster: Receives access events, detects hotspots, and pushes hotspot keys to the SDK.
Cache Cluster: Consists of proxy and storage layers, providing a unified distributed cache endpoint.
Base Components: Etcd cluster and Apollo configuration center for cluster push and unified configuration.
Basic Workflow
Key Retrieval
When a Java app requests a key via Jedis‑Client, the SDK checks if the key is a hotspot.
Hotspot keys are served from the local cache, bypassing the cache cluster.
Non‑hotspot keys are fetched from the cache cluster via a callable callback.
Each request is asynchronously reported to the Hermes server for hotspot analysis.
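The read path above can be sketched roughly as follows. This is a minimal illustration, not the Hermes‑SDK API: the class HotspotAwareReader, its markHot() method, and the Consumer used as the reporting sink are all hypothetical names, and the sketch assumes a hot key is loaded into the local cache on its first access after being marked hot.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical stand-in for the SDK read path; not the real Hermes-SDK API.
class HotspotAwareReader {
    private final Set<String> hotKeys = ConcurrentHashMap.newKeySet();        // keys pushed by the server
    private final Map<String, String> localCache = new ConcurrentHashMap<>(); // holds hot keys only
    private final Consumer<String> reporter;                                  // access-event sink

    HotspotAwareReader(Consumer<String> reporter) {
        this.reporter = reporter;
    }

    // Called when the server pushes a hotspot key to this node.
    void markHot(String key) {
        hotKeys.add(key);
    }

    String get(String key, Callable<String> loadFromCluster) {
        reporter.accept(key); // a real client would hand this off asynchronously
        if (!hotKeys.contains(key)) {
            return call(loadFromCluster); // non-hotspot: always read the cache cluster
        }
        String cached = localCache.get(key);
        if (cached != null) {
            return cached; // hotspot hit: the cache cluster is bypassed
        }
        String value = call(loadFromCluster); // first read after the key became hot
        if (value != null) {
            localCache.put(key, value);
        }
        return value;
    }

    // Wraps the checked exception of Callable so callers need no throws clause.
    private static String call(Callable<String> loader) {
        try {
            return loader.call();
        } catch (Exception e) {
            throw new IllegalStateException("cache cluster read failed", e);
        }
    }
}
```

In this sketch every read is reported, whether or not it hits the local cache, so the server can keep tracking heat for keys that are already being served locally.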
Key Expiration
Calls to set(), del(), expire() trigger an invalid() call in the SDK.
For hotspot keys, the local cache entry on the node handling the call is invalidated immediately, so that node stays strongly consistent with the cache cluster.
The event is broadcast via Etcd to other SDK nodes, which also invalidate their local copies, achieving eventual consistency.
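A minimal sketch of this two-stage invalidation, under stated assumptions: InvalidationBus is a hypothetical in-process stand-in for the Etcd broadcast, LocalNode stands in for one SDK instance, and the sketch broadcasts on every invalid() call, whereas a real client would likely broadcast only for keys in the hot set.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical in-process replacement for the Etcd broadcast channel.
class InvalidationBus {
    private final List<LocalNode> nodes = new CopyOnWriteArrayList<>();

    void register(LocalNode node) {
        nodes.add(node);
    }

    // Deliver the invalidation to every node except the one that originated it.
    void broadcast(LocalNode origin, String key) {
        for (LocalNode node : nodes) {
            if (node != origin) {
                node.onRemoteInvalidate(key);
            }
        }
    }
}

// Hypothetical stand-in for one application instance running the SDK.
class LocalNode {
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    private final InvalidationBus bus;

    LocalNode(InvalidationBus bus) {
        this.bus = bus;
    }

    void cacheLocally(String key, String value) {
        localCache.put(key, value);
    }

    String peek(String key) {
        return localCache.get(key);
    }

    // Mirrors the invalid() hook: invoked on set(), del(), and expire().
    void invalid(String key) {
        localCache.remove(key);   // local copy is dropped before the call returns
        bus.broadcast(this, key); // other nodes catch up afterwards
    }

    void onRemoteInvalidate(String key) {
        localCache.remove(key);
    }
}
```

With an in-process bus the remote removal happens synchronously; over a real broadcast channel there is a short window in which other nodes can still serve the old value, which is why the cross-node guarantee is only eventual.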
Hotspot Discovery
The Hermes server continuously collects access events and, every 3 seconds, computes a sliding‑window heat for each key.
Keys whose heat exceeds a threshold are ranked, and the Top N are pushed to SDK nodes as hotspot keys.
Configuration Reading
Both SDK and server nodes read runtime configuration (e.g., enable/disable flags, blacklists and whitelists, Etcd addresses) from Apollo.
Stability
Asynchronous data reporting: Hermes‑SDK uses rsyslog to report events without blocking business threads.
Thread‑isolated communication module: Separate thread pool with bounded queue isolates I/O from business execution.
Cache size control: Local cache size is limited to 64 MB (LRU) to prevent JVM heap overflow.
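The thread-isolation idea can be illustrated with a standard ThreadPoolExecutor. This is a sketch only: the class name BoundedReporter, the single worker thread, and the choice to drop and count events when the queue is full are assumptions, since the article does not say what the SDK does on overflow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical reporter: one dedicated worker thread fed by a bounded queue.
class BoundedReporter {
    private final AtomicLong dropped = new AtomicLong();
    private final ThreadPoolExecutor pool;

    BoundedReporter(int queueCapacity) {
        pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                r -> {
                    Thread t = new Thread(r, "hotspot-reporter");
                    t.setDaemon(true); // never keeps the JVM alive on its own
                    return t;
                },
                // Assumed overflow policy: count and discard instead of blocking the caller.
                (task, executor) -> dropped.incrementAndGet());
    }

    // Returns immediately; the business thread never waits on I/O.
    void report(Runnable sendEvent) {
        pool.execute(sendEvent);
    }

    long droppedCount() {
        return dropped.get();
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Because the queue is bounded, a slow or unavailable reporting channel costs at most queueCapacity pending tasks of memory, and business threads only ever pay for an enqueue attempt.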
Consistency
Only hotspot keys are cached locally; the majority of data resides in the cache cluster.
When a hotspot key changes, the SDK on the instance that made the change invalidates its local entry as part of the same call, guaranteeing strong consistency on that instance.
Invalidations are broadcast via Etcd, ensuring eventual consistency across all application instances.
Hotspot Discovery Process
Overall Flow
The process consists of four steps:
Data collection: SDK reports key access events to Kafka.
Heat sliding window: A time wheel records access counts for each key over a 30‑second window.
Heat aggregation: Aggregated heat values are stored in Redis as sorted sets.
Hotspot detection: The server selects the Top N keys exceeding the heat threshold and pushes them to SDK nodes.
Data Collection
The SDK sends each access event (appName, uniqueKey, sendTime, weight) through rsyslog to Kafka; server nodes consume the events in real time.
Heat Sliding Window
Each key maintains a wheel of 10 slots, each representing a 3‑second interval; the sum gives the total accesses in the last 30 seconds.
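A per-key wheel of this shape might look like the sketch below. The slot count and slot width come from the text; the class name HeatWheel, the use of millisecond timestamps, and the handling of late events (folded into the newest slot) are assumptions made for illustration.

```java
// One wheel per key: 10 slots x 3 s = a 30 s window (values taken from the text).
class HeatWheel {
    static final int SLOTS = 10;
    static final long SLOT_MILLIS = 3_000L;

    private final long[] counts = new long[SLOTS];
    private long currentTick = -1; // index of the newest 3 s interval seen so far

    // Record `weight` accesses that happened at the given timestamp.
    synchronized void record(long timestampMillis, long weight) {
        advanceTo(timestampMillis / SLOT_MILLIS);
        counts[(int) (currentTick % SLOTS)] += weight;
    }

    // Total accesses across all slots still inside the window at `nowMillis`.
    synchronized long heat(long nowMillis) {
        advanceTo(nowMillis / SLOT_MILLIS);
        long sum = 0;
        for (long c : counts) {
            sum += c;
        }
        return sum;
    }

    // Move the wheel forward, zeroing every slot that has fallen out of the window.
    private void advanceTo(long tick) {
        if (currentTick < 0) {
            currentTick = tick;
            return;
        }
        if (tick <= currentTick) {
            return; // late events are folded into the newest slot
        }
        long steps = Math.min(tick - currentTick, SLOTS);
        for (long i = 1; i <= steps; i++) {
            counts[(int) ((currentTick + i) % SLOTS)] = 0;
        }
        currentTick = tick;
    }
}
```

Since slots are aligned to 3-second boundaries, the sum covers between 27 and 30 seconds of history depending on how far the current slot has progressed.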
Heat Aggregation
After sliding‑window calculation, the server aggregates heat per key and stores <key, totalHeat> in Redis.
Hotspot Detection
Periodically, the server extracts keys whose heat exceeds the configured threshold, selects the Top N, and pushes the list to SDK nodes.
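The selection step amounts to a filter followed by a ranked cut. The sketch below assumes the aggregated heat is available as an in-memory map rather than read from Redis, treats "exceeds" as strictly greater than the threshold, and breaks ties by key name; HotspotSelector and topHotKeys are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for the detection step; heat values are passed in as a map.
class HotspotSelector {
    static List<String> topHotKeys(Map<String, Long> heatByKey, long threshold, int topN) {
        // Step 1: keep only keys whose heat is strictly above the threshold.
        List<Map.Entry<String, Long>> candidates = new ArrayList<>();
        for (Map.Entry<String, Long> e : heatByKey.entrySet()) {
            if (e.getValue() > threshold) {
                candidates.add(e);
            }
        }
        // Step 2: hottest first; ties are broken by key so the result is deterministic.
        candidates.sort((a, b) -> {
            int byHeat = Long.compare(b.getValue(), a.getValue());
            return byHeat != 0 ? byHeat : a.getKey().compareTo(b.getKey());
        });
        // Step 3: cut the ranking at N.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < candidates.size() && i < topN; i++) {
            result.add(candidates.get(i).getKey());
        }
        return result;
    }
}
```

Applying the threshold before the Top N cut means a quiet application pushes no hotspot keys at all, instead of always promoting its N busiest keys.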
Feature Summary
Real‑time
Events are reported via rsyslog and Kafka; the sliding‑window and aggregation tasks run every 3 seconds, so a hotspot is detected within at most 3 seconds of emerging.
Accuracy
The time‑wheel sliding window provides a precise view of recent access distribution.
Scalability
Server nodes are stateless and can scale horizontally based on Kafka partitions; the sliding‑window and aggregation are multithreaded per app.
Real‑World Impact
Fast‑Shop Merchant Promotion
During a short‑term promotion, cache request volume and local‑cache hit rate both rose sharply, with local‑cache hit rate reaching ~80%.
The cache request and hit curves, together with the local‑cache hit‑rate curve, show the increase.
QPS and Latency Improvements
During the event, request QPS grew while response time (RT) decreased thanks to local caching.
Future Outlook
TMC already serves product, logistics, inventory, marketing, user, gateway, and messaging modules, with more applications being onboarded. Users can tune hotspot thresholds, detection counts, and black‑/white‑list settings to optimize performance.
