How Transparent Multilevel Cache (TMC) Eliminates Hotspot Bottlenecks in High‑Traffic E‑Commerce
The article explains Youzan’s Transparent Multilevel Cache (TMC), detailing its architecture, hotspot detection, local caching, consistency mechanisms, and real‑world performance gains during flash‑sale events, showing how it reduces cache pressure and improves latency for Java‑based services.
What Is TMC?
Transparent Multilevel Cache (TMC) is a cache‑as‑a‑service solution built by Youzan’s PaaS team to provide a unified, multi‑level caching layer for internal applications.
Why Build TMC?
E‑commerce merchants frequently run flash‑sale or promotion campaigns that create sudden "cache hotspot" traffic: a small number of keys receive massive request bursts, overwhelming the distributed cache cluster, consuming bandwidth, and destabilising services.
Hotspot events are unpredictable in timing, type, and product.
During a hotspot, a few hot keys generate a flood of cache requests that can saturate the network and degrade application stability.
TMC was created to discover hotspots automatically and to serve hotspot requests from an application‑level local cache before they reach the distributed cache.
Pain Points of Traditional Multilevel Caches
Fast and accurate hotspot detection.
Ensuring data consistency between the local cache and the downstream distributed cache.
Providing visibility into local‑cache hit rates and hotspot keys for validation.
Achieving transparent integration with minimal intrusion to existing applications.
Overall Architecture
The architecture consists of three layers:
Storage Layer : Provides basic KV storage using different back‑ends (Codis, Zankv, Aerospike) according to business needs.
Proxy Layer : Offers a unified cache entry point and routing for horizontally sharded data.
Application Layer : Supplies a unified client with built‑in hotspot detection and local caching, transparent to business logic.
The article focuses on the application‑layer client’s hotspot detection and local caching features.
Transparent Local Cache Integration
Java services can use either the standard spring.data.redis RedisTemplate or the Youzan‑provided youzan.framework.redis RedisClient. In both cases the client ultimately creates a JedisPool and a Jedis instance that communicates with the proxy layer.
TMC modifies the native JedisPool and Jedis classes so that during pool initialization the Hermes‑SDK (which implements hotspot detection and local caching) is also initialized.
When a key is requested, the client first asks Hermes‑SDK whether the key is a hotspot. If it is, the value is returned from the local cache without contacting the cache cluster; otherwise the request is forwarded to the cluster. All key‑access events are asynchronously reported to the Hermes server cluster for hotspot analysis.
For Java services, depending on the specific TMC‑modified version of the Jedis jar is enough to enable hotspot detection and local caching; no application code changes are needed.
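The lookup path described in this section can be sketched roughly as follows. This is an illustration only, not the actual Hermes‑SDK code: the class name HotspotAwareClient, its fields, and the use of a Function as a stand‑in for the Jedis call to the proxy layer are all assumptions made for the example, and event reporting is left as a no‑op placeholder.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of the hotspot-aware lookup path; not the real Hermes-SDK API.
final class HotspotAwareClient {
    private final Set<String> hotKeys = ConcurrentHashMap.newKeySet();
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    // Stands in for the Jedis call that goes to the proxy layer (assumption).
    private final Function<String, String> remoteGet;

    HotspotAwareClient(Function<String, String> remoteGet) {
        this.remoteGet = remoteGet;
    }

    // Called when the server side pushes a new hotspot key list.
    void updateHotKeys(Set<String> latest) {
        hotKeys.retainAll(latest);
        hotKeys.addAll(latest);
        // Drop cached values for keys that are no longer hot.
        localCache.keySet().retainAll(latest);
    }

    String get(String key) {
        reportAccess(key);
        if (!hotKeys.contains(key)) {
            // Not a hotspot: go straight to the cache cluster.
            return remoteGet.apply(key);
        }
        // Hotspot: serve from the local cache, loading once on a miss.
        // computeIfAbsent stores nothing when the loader returns null.
        return localCache.computeIfAbsent(key, remoteGet);
    }

    // Placeholder: a real client would report the access event asynchronously.
    private void reportAccess(String key) {
        // intentionally empty in this sketch
    }
}
```

In this sketch, once a key appears in the pushed hotspot list, repeated get() calls for it reach the remote loader only on the first miss.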
Module Breakdown
Jedis‑Client : Direct entry point for Java applications; API identical to native Jedis.
Hermes‑SDK : Encapsulates hotspot detection and local caching logic.
Hermes Server Cluster : Receives access events, performs hotspot analysis, and pushes hotspot key lists to SDK instances.
Cache Cluster : Consists of proxy and storage layers, providing the distributed cache service.
Infrastructure : Etcd cluster and Apollo configuration center supply cluster‑wide configuration and push capabilities.
Basic Workflow
Key Retrieval : The client asks Hermes‑SDK if the key is a hotspot. Hot keys are served from the local cache; non‑hot keys are fetched from the cache cluster via a callback.
Key Expiration : When set(), del() or expire() is called, the client notifies Hermes‑SDK via invalid(). For hotspot keys, the local cache entry is invalidated immediately, and the invalidation event is broadcast through etcd to other SDK nodes for eventual consistency.
Hotspot Discovery : Hermes servers collect access events, run a sliding‑window aggregation every 3 seconds, and push the top‑N hotspot key list to SDKs via etcd.
Configuration Loading : Both SDK and server read runtime parameters (e.g., thresholds, black/white lists, etcd addresses) from Apollo.
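The key‑expiration step of the workflow above can be sketched as below. The article names only the invalid() call; everything else here is hypothetical, including the class name LocalCacheInvalidator, the onRemoteInvalidation() callback, and the Consumer that stands in for the etcd broadcast.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of local-cache invalidation; the etcd broadcast is
// replaced by a plain Consumer so the example stays self-contained.
final class LocalCacheInvalidator {
    private final Set<String> hotKeys;
    private final Map<String, String> localCache;
    private final Consumer<String> broadcaster; // stand-in for the etcd publish (assumption)

    LocalCacheInvalidator(Set<String> hotKeys,
                          Map<String, String> localCache,
                          Consumer<String> broadcaster) {
        this.hotKeys = hotKeys;
        this.localCache = localCache;
        this.broadcaster = broadcaster;
    }

    // Invoked by the client after set(), del() or expire() on a key.
    void invalid(String key) {
        if (!hotKeys.contains(key)) {
            return; // non-hot keys are never cached locally, nothing to do
        }
        localCache.remove(key);  // this node is consistent immediately
        broadcaster.accept(key); // other nodes catch up asynchronously
    }

    // Invoked on other nodes when a broadcast invalidation arrives.
    void onRemoteInvalidation(String key) {
        localCache.remove(key);
    }
}
```

The writing node drops its entry synchronously, while other nodes drop theirs only when the broadcast reaches them, which is where the eventual consistency comes from.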
Stability Measures
Asynchronous reporting of access events using rsyslog to avoid blocking business threads.
Dedicated thread pool with bounded queue for the communication module, isolating I/O from business execution.
Local cache size limited to 64 MB (LRU) to prevent JVM heap overflow.
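A size‑capped LRU cache of the kind mentioned in the last item could look roughly like this. It is a minimal sketch under stated assumptions, not TMC's implementation: the class name ByteBoundedLruCache is hypothetical, and entry size is approximated as UTF‑8 key bytes plus value bytes, ignoring JVM object overhead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical byte-budgeted LRU cache built on an access-ordered LinkedHashMap.
final class ByteBoundedLruCache {
    private final long maxBytes;
    private long usedBytes;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, byte[]> entries =
            new LinkedHashMap<>(16, 0.75f, true);

    ByteBoundedLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    synchronized byte[] get(String key) {
        return entries.get(key); // also marks the entry as recently used
    }

    synchronized void put(String key, byte[] value) {
        byte[] old = entries.put(key, value);
        if (old != null) {
            usedBytes -= sizeOf(key, old);
        }
        usedBytes += sizeOf(key, value);
        // Evict least recently used entries until the budget is respected.
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            usedBytes -= sizeOf(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    synchronized void remove(String key) {
        byte[] old = entries.remove(key);
        if (old != null) {
            usedBytes -= sizeOf(key, old);
        }
    }

    synchronized long usedBytes() {
        return usedBytes;
    }

    // Approximate footprint: key bytes plus value bytes (object overhead ignored).
    private static long sizeOf(String key, byte[] value) {
        return key.getBytes(StandardCharsets.UTF_8).length + value.length;
    }
}
```

A 64 MB cap would correspond to new ByteBoundedLruCache(64L * 1024 * 1024) in this sketch.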
Consistency Guarantees
Only hotspot keys are cached locally; the majority of keys remain in the distributed cache.
When a hotspot key changes, Hermes‑SDK invalidates the local entry at once and broadcasts the event via etcd. The node that made the change is therefore strongly consistent for that key, while the other nodes in the cluster become consistent eventually.
Hotspot Discovery Process
Data Collection
Hermes‑SDK writes key‑access events to rsyslog, which forwards them to Kafka. Each Hermes server node consumes the Kafka stream in real time.
Event fields: appName, uniqueKey, sendTime, weight.
Sliding Window (Hotness Window)
For each app and each key, the server maintains a time wheel with 10 slots. Each slot records the number of accesses during one 3‑second interval, so the wheel as a whole covers a 30‑second sliding window.
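A per‑key time wheel along these lines might be sketched as follows. Only the 10 slots and the 3‑second slot width come from the article; the class name HotnessWindow, the method names, and the policy of dropping events that are too old for their slot are assumptions of this example.

```java
// Hypothetical 10-slot time wheel for one (app, key) pair.
final class HotnessWindow {
    private static final int SLOTS = 10;
    private static final long SLOT_MILLIS = 3_000L;

    private final long[] counts = new long[SLOTS];
    // Index of the 3-second interval each slot currently holds.
    private final long[] slotInterval = new long[SLOTS];

    // Records an access; timestamps are assumed to be non-negative epoch millis.
    synchronized void record(long timestampMillis, long weight) {
        long interval = timestampMillis / SLOT_MILLIS;
        int idx = (int) (interval % SLOTS);
        if (interval < slotInterval[idx]) {
            return; // event is older than what the slot now holds; drop it
        }
        if (interval > slotInterval[idx]) {
            // The wheel has wrapped around: reuse the slot for the new interval.
            slotInterval[idx] = interval;
            counts[idx] = 0;
        }
        counts[idx] += weight;
    }

    // Sums the slots that still fall inside the 30-second window ending at nowMillis.
    synchronized long hotness(long nowMillis) {
        long nowInterval = nowMillis / SLOT_MILLIS;
        long sum = 0;
        for (int i = 0; i < SLOTS; i++) {
            long age = nowInterval - slotInterval[i];
            if (age >= 0 && age < SLOTS) {
                sum += counts[i];
            }
        }
        return sum;
    }
}
```

Stale slots are skipped at read time rather than cleared by a timer, which keeps the sketch free of background threads.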
Aggregation
Every 3 seconds a mapping task aggregates the per‑slot counts into a total hotness value for each key and stores the result in Redis as a sorted set.
Hotspot Detection
The detection node reads the latest aggregation, selects the top‑N keys exceeding the hotness threshold, and pushes the hotspot list to SDK instances.
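The selection step can be illustrated with a small in‑memory sketch. The article stores aggregated hotness in a Redis sorted set; here a plain Map stands in for it, and the class name HotspotSelector, the parameter names, and the alphabetical tie‑break are assumptions of this example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical top-N selection over aggregated hotness values.
final class HotspotSelector {
    private HotspotSelector() {
    }

    // Returns at most topN keys whose hotness exceeds the threshold, hottest first.
    static List<String> topHotKeys(Map<String, Long> hotnessByKey, long threshold, int topN) {
        return hotnessByKey.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                // Highest hotness first; ties broken by key so the order is deterministic.
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Applying both a threshold and a top‑N cap means a quiet period yields an empty hotspot list, while a busy period cannot push an unbounded number of keys into local caches.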
Feature Summary
Real‑Time
Events are reported in real time via rsyslog + Kafka, and because the mapping task runs every 3 seconds, a newly emerging hotspot is detected within about 3 seconds of its access events reaching the Hermes servers.
Accuracy
The sliding‑window aggregation accurately reflects recent access distribution, providing reliable hotness scores.
Scalability
Hermes server nodes are stateless; horizontal scaling is achieved by adding Kafka partitions. The sliding‑window and aggregation logic are multithreaded and scale with the number of apps.
Practical Results
Kuaishou Merchant Campaign
During a short‑lived product promotion, cache request volume and local‑cache hit volume both rose sharply, with a local‑cache hit rate approaching 80 %.
Double‑11 (Singles’ Day) Sample Applications
Graphs show increased request QPS and reduced response time (RT) for core services, demonstrating that TMC’s local cache offloads pressure from the distributed cache and improves latency.
Future Outlook
TMC already serves product, logistics, inventory, marketing, user, gateway, and messaging modules, with more applications being onboarded. Configuration options such as hotspot thresholds, hotspot key count, and black/white lists allow fine‑tuning for different business scenarios.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn