Building a Multi‑Level Cache Consistency Framework for Live‑Streaming Platforms
This article describes how a social live‑streaming platform designed and implemented a custom multi‑level cache consistency framework, detailing the background challenges, the architecture of a cache pipeline with Zookeeper‑based node discovery and retry‑enabled execution, and the integration SDKs that enable transparent cache clearing across services.
Background
Starting in 2021, the social live-streaming activity platform faced performance bottlenecks as the central memcache became a hotspot across many product scenarios. To reduce network I/O and improve latency, the team introduced a two-level cache strategy: the original central memcache became the secondary cache, while Guava and a local memcache served as the primary cache.
As the platform grew to support multiple products, the unified operations backend could no longer meet divergent business needs. The decision was made to split the operations backend into independent applications, which introduced the problem of clearing caches across different services.
Typical solutions rely on broadcasting data-change events, for example by subscribing to the database binlog or using a service registry. The team initially adopted a binlog-based broadcast but ran into three major issues: ① The execution chain was hidden, making it difficult to pinpoint which instance held stale cache when inconsistencies arose. ② The code entry point was obscure; only the core development team understood the cache-clearing flow, which raised maintenance costs. ③ After the operations backend was split, developers were reluctant to spend extra effort writing central-cache-clearing code.
After evaluating open-source options, the team found them unsatisfactory: most supported only Redis, relying on its unreliable pub/sub mechanism, while the platform used memcache; the frameworks were dated and inelegant; and they still suffered from hidden execution paths.
Consequently, the team decided to build a custom framework that could handle multi‑level cache consistency scenarios.
Framework Goals
① Provide clear visibility of all cache-clearing routes and multi-level cache usage, while remaining simple and easy to adopt. ② Ensure robustness with exception retry mechanisms, comprehensive logging of cache-clearing steps for troubleshooting, and support for complex scenarios beyond two-level caches. ③ Fit seamlessly into the existing technology stack.
Detailed Design
1. Model Abstraction
Using a simple two-level cache clear as an example, the process can be modeled as a singly linked list: first clear the secondary (central) cache, then the primary (local) cache. In more complex cases, the list generalizes into a tree, allowing cache-clear steps of arbitrary depth: the cache-clear event becomes the root node, descendant nodes represent individual cache-clear actions, and each root-to-leaf path is a singly linked list.
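A minimal sketch of this model, with all class and code names illustrative rather than the framework's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// One node in the cache-clear model: the event is the root, every descendant
// is a cache-clear action, and each root-to-leaf path is a singly linked list.
public class CacheClearNode {
    private final String code;  // event code at the root, node codes below it
    private final List<CacheClearNode> children = new ArrayList<>();

    public CacheClearNode(String code) {
        this.code = code;
    }

    public CacheClearNode addChild(CacheClearNode child) {
        children.add(child);
        return child;
    }

    // Pre-order walk below the root: a parent action (e.g. clearing the
    // central cache) always runs before its children (e.g. local caches).
    public void execute(CacheClearAction action) {
        for (CacheClearNode child : children) {
            action.clear(child.code);
            child.execute(action);
        }
    }

    public interface CacheClearAction {
        void clear(String nodeCode);
    }

    public static void main(String[] args) {
        // Hypothetical event and node codes, mirroring the two-level example.
        CacheClearNode event = new CacheClearNode("ACTIVITY_CONFIG_CHANGED");
        CacheClearNode central = event.addChild(new CacheClearNode("CLEAR_CENTRAL_MEMCACHE"));
        central.addChild(new CacheClearNode("CLEAR_LOCAL_CACHE"));
        event.execute(nodeCode -> System.out.println("clearing: " + nodeCode));
    }
}
```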
2. Cache Pipeline Architecture
The pipeline consists of two main modules: ① Node Discovery Module: listens to a fixed Zookeeper path for changes, maintains up-to-date node information for cache-clear execution, and periodically performs a full comparison to ensure consistency. ② Event Execution Module: upon receiving a cache-clear event, it retrieves the developer-defined node chain for that event code, executes the cache-clear operations over long-lived connections, retries up to three times on failure, and records any permanent failures for later retry by a scheduler.
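The retry behavior of the event execution module might look like the following sketch, assuming each node invocation is a synchronous call that throws on failure (class and method names are illustrative):

```java
// Executes one cache-clear step against one node, with bounded retries.
public class EventExecutor {
    private static final int MAX_ATTEMPTS = 3;

    private final FailureStore failureStore;  // assumed store for failed clears

    public EventExecutor(FailureStore failureStore) {
        this.failureStore = failureStore;
    }

    public void execute(String eventCode, String nodeAddress, Runnable clearCall) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                clearCall.run();  // cache-clear RPC over the long-lived connection
                return;           // success: nothing more to do
            } catch (RuntimeException e) {
                if (attempt == MAX_ATTEMPTS) {
                    // Record the permanent failure so a compensating
                    // scheduler can retry it later.
                    failureStore.record(eventCode, nodeAddress, e);
                }
            }
        }
    }

    public interface FailureStore {
        void record(String eventCode, String nodeAddress, Throwable cause);
    }
}
```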
3. Execution Nodes
① Node Subscription & Discovery: services embed the pipeline's SDK; during startup they write node metadata (event code, IP, node identifier) to a predetermined Zookeeper location. The pipeline monitors this path and updates its discovery center accordingly. ② Node Invocation & Execution: the pipeline communicates with execution nodes via long-lived connections. Two strategies are supported: (a) for central-cache clearing, select a single node that provides the central-cache clear capability; (b) for local-cache clearing, invoke all nodes that expose local-cache clear functions.
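Registration on startup could be as simple as the sketch below, using the standard Zookeeper client. The base path and the metadata encoding are assumptions (the article only says the location is predetermined), and the ephemeral node mode is one common choice, not something the article specifies:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NodeRegistrar {
    // Assumed base path watched by the pipeline's discovery module.
    private static final String BASE_PATH = "/cache-pipeline/nodes";

    // Writes this instance's metadata under the watched path. An ephemeral
    // node vanishes when the session dies, so crashed instances drop out of
    // the discovery center automatically.
    public void register(ZooKeeper zk, String eventCode, String ip, String nodeId) throws Exception {
        String metadata = eventCode + "|" + ip + "|" + nodeId;
        zk.create(BASE_PATH + "/" + nodeId,
                  metadata.getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }
}
```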
4. Integration Approach
The pipeline offers two SDKs to simplify adoption: ① Event-Sending SDK: used by the operations backend (or any cache-clear initiator). When a configuration change occurs, the SDK is called to emit a cache-clear event, which the pipeline then processes. ② Cache-Clear Execution SDK: integrated into domain services that hold cached data. When those services detect a configuration change, they invoke the SDK to participate in the cache-clear workflow.
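From the caller's side, integration might look like this hypothetical sketch; the class and method names (CacheEventSender, CacheClearRegistry, and so on) are illustrative, not the framework's actual API:

```java
public class IntegrationExample {

    // Event-sending side: the operations backend emits a cache-clear event
    // after persisting a configuration change.
    public void onConfigSaved(CacheEventSender sender, String activityId) {
        sender.send("ACTIVITY_CONFIG_CHANGED", activityId);  // event code + cache key
    }

    // Execution side: a domain service registers the caches it can clear
    // when it starts up.
    public void onServiceStartup(CacheClearRegistry registry) {
        registry.register("ACTIVITY_CONFIG_CHANGED", key -> {
            // clear local Guava / local memcache entries for this key
        });
    }

    public interface CacheEventSender {
        void send(String eventCode, String key);
    }

    public interface CacheClearRegistry {
        void register(String eventCode, CacheClearHandler handler);
    }

    public interface CacheClearHandler {
        void clear(String key);
    }
}
```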
5. Cache Event Node Configuration
Each cache‑clear event is identified by a unique event code. Under this code, a hierarchy of node codes defines the execution chain. An operations UI allows developers to visually compose and adjust this chain, providing full transparency of multi‑level cache‑clear routes without embedding the logic in source code.
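Once composed in the UI, a stored chain might reduce to a mapping like the one below; the event and node codes are hypothetical, and the ordering reflects the central-first convention described earlier:

```java
import java.util.List;
import java.util.Map;

public class EventChainExample {
    // One event code mapped to its ordered chain of node codes.
    static final Map<String, List<String>> EVENT_CHAINS = Map.of(
        "ACTIVITY_CONFIG_CHANGED",
        List.of("CLEAR_CENTRAL_MEMCACHE",  // secondary (central) cache first
                "CLEAR_LOCAL_MEMCACHE",    // then the primary (local) caches
                "CLEAR_LOCAL_GUAVA"));
}
```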
Summary
The cache pipeline has been in production for nearly a year. Starting from a small pilot in a “party room” service, it now powers large‑scale social live‑streaming activities, supporting revenue‑generating components and transaction platforms. The system has also been adapted for overseas deployments, consolidating multiple product clusters into a single data‑center cluster. Future plans include extending the framework to additional business domains to further enhance scalability and reliability.