Designing a Near Real-Time Video Recommendation Engine with ES and Adaptive Rate Limiting
This article details the design of a low‑latency video recommendation system, covering requirements, overall architecture, data pipelines, consistency handling, write smoothing, and performance optimizations such as multi‑level caching and Elasticsearch tuning.
Background
The system must deliver newly enabled videos to users quickly and evaluate their quality by combining distribution traffic with posterior data (exposure, click, play). This requires low‑latency access to both prior data (metadata such as tags and author ID attached at video creation) and posterior data (user‑behavior feedback).
Overall Video Recommendation Architecture
Content center pushes videos via MQ; after processing they are stored, indexed and generate forward and inverted index data, resulting in about 10 million items in the storage layer.
Recall layer uses user profile, click history and other features to retrieve thousands of candidates for the coarse‑ranking stage.
Coarse ranking scores the candidates and passes a few hundred to the fine‑ranking stage.
Fine ranking scores again and passes to the re‑ranking stage.
Re‑ranking applies business rules and interventions, finally selecting 10+ videos for the user.
After exposure, events (exposure, click, play, like, comment) are logged, processed in real‑time or batch, and fed back as features to the storage layer. Therefore the recall/inverted‑index design must support real‑time or near‑real‑time latency.
Design Proposal
The previous solution rebuilt the index every half hour, which cannot meet near‑real‑time requirements. The new design must satisfy:
Eventual consistency is sufficient, but final consistency must be guaranteed.
Posterior data write volume is huge.
Recall system requires high concurrency, low latency and high availability.
Solution Research
Redis was discarded due to limited flexibility and heavy custom development. Two alternatives were considered:
Fully custom index – high development cost.
Simple custom solution – may not meet business needs; a mature open‑source solution is preferred.
The final choice is an Elasticsearch‑based indexing service.
Data Flow
Prior Data Pipeline
Prior data originates from the content center and is written to CDB via a parsing service. Two sub‑pipelines exist:
Full‑load pipeline : triggered only when rebuilding the index; dumps data from DB to Kafka, then a write service consumes Kafka and writes to ES.
Incremental pipeline : listens to MySQL binlog, sends messages to Kafka, and the write service consumes Kafka and writes to ES.
Posterior data cannot be written directly to ES because of volume. A Flink job aggregates user‑behavior streams in 1‑minute tumbling windows, outputs incremental posterior data, stores recent 7‑day data in Redis, and then writes aggregated results to ES via a write module.
Data Consistency
Redis write module (read‑modify‑write) : To guarantee atomicity, either Redis locking is used or an MQ queue partitions by rowkey so that the same rowkey is processed by a single consumer/thread.
Kafka‑Redis atomicity : The write module writes to Redis with a timestamp version and commits the Kafka offset only if the version matches, preventing message loss or duplicate consumption.
ES write concurrency : Multiple processes may write to the same index. The solution records a timestamp t1 before a full dump, then replays incremental data up to t1 after the dump. The inconsistency window is limited to about one minute, which is acceptable.
Zero‑downtime online index upgrade flow:
Write Smoothing
Flink‑aggregated data exhibits spikes that overload the ES write service. A fixed threshold is unsuitable, so a self‑adaptive token‑bucket rate limiter is introduced.
A statistics thread counts incoming messages per minute and updates a token bucket representing the allowed write rate.
A read thread stores messages in a queue.
A write thread consumes tokens; if none are available it waits, otherwise it writes to ES.
Recall Performance Optimization
High‑Concurrency Scenario
Recall traffic can reach 500 k QPS; direct writes to ES would overwhelm it. A multi‑level cache (local BigCache → distributed Redis → ES) raises cache hit rate to >95 % and reduces ES traffic to ~5 %.
Multi‑Level Cache Strategy
Local cache is periodically dumped to disk and reloaded on restart.
Cache values contain result and expiration; downstream failures extend expiration and return stale data.
Hot‑key lock prevents thundering‑herd on cache miss.
A fallback rate limiter caps total request volume when overall latency or error rate spikes.
Elasticsearch Tuning
Primary Shard Count
Based on an index size of ~10 GB and official guidance (shard size 1–50 GB), testing settled on 6 primary shards.
Filter Unnecessary Fields
Using _source filtering and filter_path removes about 80 % of the payload size.
Routing Field
Setting the author account ID as the routing field directs queries to a single shard, improving throughput six‑fold and overall ES performance by ~30 %.
Disable Unused Index/Sort Fields
Index templates mark irrelevant fields with "index": false and "doc_values": false to avoid unnecessary indexing and sorting overhead.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
