
Design of a Bloom Filter‑Based Video Recommendation Deduplication Service for Short Video Platforms

This article describes a Bloom-filter-based deduplication service for short-video recommendation that moves three-month playback histories into disk-backed Bloom filters while keeping the latest 100 served IDs in Redis, using write batching, sharding, expiration policies, and an incremental migration strategy to replace memory-intensive Redis ZSets and sharply reduce storage costs.

vivo Internet Technology

vivo short‑video recommendation requires filtering out videos that users have already watched to avoid duplicate recommendations. In a typical request, 2,000–10,000 videos are recalled based on user interests, then de‑duplicated against the user's watch history before ranking.

The current deduplication implementation uses Redis ZSets: playback events and served-video records are stored under separate ZSet keys, and the recommendation algorithm reads the entire ZSet to perform set-based deduplication (see Figure 1).

Because playback events are reported with some latency, the server also keeps the most recent 100 served video IDs so that even not-yet-reported playback events do not lead to duplicate recommendations. However, storing raw video IDs in a Redis ZSet consumes a large amount of memory (e.g., 5 × 10⁷ users × 10,000 IDs × 25 B ≈ 12.5 TB), and the resulting cap of 10,000 IDs per user degrades the experience for heavy users.
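The memory estimate above can be checked with simple arithmetic; the class and method names below are illustrative, and the 25 bytes per ID figure is the article's assumption (ID string plus ZSet overhead):

```java
// Back-of-the-envelope check of the Redis ZSet memory estimate.
public class ZSetMemoryEstimate {
    static final long USERS = 50_000_000L;    // 5 × 10^7 daily active users
    static final long IDS_PER_USER = 10_000L; // per-user history cap
    static final long BYTES_PER_ID = 25L;     // assumed ID + ZSet overhead

    // Total bytes across all users
    public static long totalBytes() {
        return USERS * IDS_PER_USER * BYTES_PER_ID;
    }

    // Convert to terabytes (decimal, 10^12 bytes)
    public static double totalTerabytes() {
        return totalBytes() / 1e12;
    }
}
```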

To reduce memory usage, the team investigated mainstream solutions:

Storage format: Use Bloom filters to store multiple hash values of video IDs, drastically reducing space.

Storage medium: Persist Bloom filters on disk‑based KV stores (typically RocksDB on SSD) rather than in‑memory Redis, accepting slightly lower read performance for much larger capacity.
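To make the storage-format idea concrete, a minimal Bloom filter can be sketched in a few lines of Java. This is an illustration only; the class name, sizing, and double-hashing scheme are assumptions, not the production design. The key properties it demonstrates: false positives are possible, false negatives are not, and entries cannot be deleted.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes over an m-bit array.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash probes per element

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k probe positions from two base hashes (Kirsch-Mitzenmacher style)
    private int probe(String id, int i) {
        int h1 = id.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // cheap second hash for illustration
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(String videoId) {
        for (int i = 0; i < k; i++) bits.set(probe(videoId, i));
    }

    public boolean mightContain(String videoId) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(videoId, i))) return false; // definitely not seen
        }
        return true; // possibly seen (may be a false positive)
    }
}
```

Each video ID costs only k bits rather than a 25-byte entry, which is where the order-of-magnitude space saving comes from.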

Technical selection:

Playback records: Store the three-month playback history in Bloom filters persisted on disk KV.

Served records: Keep only the latest 100 served video IDs in Redis for quick access.

The proposed unified deduplication service accepts playback events, writes them into per-user Bloom filters, and exposes RPC (gRPC) interfaces for the recommendation algorithm to query. The write path consists of three steps (Figure 2): read and deserialize the user's Bloom filter (creating one if absent), add the video ID, then serialize the filter and write it back to disk KV.
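The three-step write path can be sketched as follows, with an in-memory `Map` standing in for the disk KV store (e.g., RocksDB). The key format, sizing, and `BitSet`-based serialization are assumptions for illustration:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch of the deserialize -> add -> serialize write path.
public class BloomWritePath {
    private final Map<String, byte[]> diskKv = new HashMap<>(); // stub for disk KV
    private final int m = 1 << 20; // bits per filter (illustrative sizing)
    private final int k = 5;       // hash probes per video ID

    public void recordPlayback(String userId, String videoId) {
        String key = "bloom:" + userId; // hypothetical key format
        // Step 1: read and deserialize the user's filter, or create a new one
        byte[] stored = diskKv.get(key);
        BitSet filter = (stored == null) ? new BitSet(m) : BitSet.valueOf(stored);
        // Step 2: set the k probe bits for this video ID
        for (int i = 0; i < k; i++) filter.set(probe(videoId, i));
        // Step 3: serialize the filter and write it back
        diskKv.put(key, filter.toByteArray());
    }

    public boolean mightHavePlayed(String userId, String videoId) {
        byte[] stored = diskKv.get("bloom:" + userId);
        if (stored == null) return false;
        BitSet filter = BitSet.valueOf(stored);
        for (int i = 0; i < k; i++) {
            if (!filter.get(probe(videoId, i))) return false;
        }
        return true;
    }

    private int probe(String id, int i) {
        int h1 = id.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, m);
    }
}
```

Because every write is a read-modify-write of a serialized blob, batching many video IDs per round trip matters, which motivates the traffic-aggregation design below.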

Four key challenges were identified:

Write QPS is higher than read QPS; disk KV write performance is lower than Redis, so efficient batching is required.

Bloom filters cannot delete entries; an expiration/eviction strategy is needed.

Cross‑language service integration (Java backend, C++ recommendation engine) via gRPC and Consul.

Migration of existing Redis ZSet data to the new Bloom‑filter based system.

Five design aspects were detailed:

3.1 Overall Process

The service receives served video IDs via Dubbo and stores them in Redis, while playback events are batched and written to disk KV as Bloom filters. Recommendation algorithms call the service to filter recalled videos (Figure 3).

3.2 Traffic Aggregation

Playback IDs are first cached in Redis and written to disk KV in intervals (near‑real‑time) or via periodic batch jobs. Near‑real‑time uses a Redis key as a lock to limit writes to once per N minutes (Figure 4). Batch writing groups data by hour and uses a time‑ring with modulo‑based compensation to ensure reliability (Figure 5).
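The "write at most once per N minutes" gate of the near-real-time path can be sketched as below. A `ConcurrentHashMap` of expiry timestamps stands in for the Redis lock key (the equivalent of `SET key value NX EX`); the class and key names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-user rate gate: flush buffered playback IDs at most once per interval.
public class WriteRateGate {
    private final Map<String, Long> lockExpiry = new ConcurrentHashMap<>();
    private final long intervalMillis;

    public WriteRateGate(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns true if the caller acquired the per-user lock and should flush
    // this user's buffered IDs from Redis to disk KV now; false means another
    // flush happened within the interval, so the IDs stay buffered.
    public boolean tryAcquire(String userId, long nowMillis) {
        String key = "bloomlock:" + userId; // hypothetical lock-key format
        Long expiry = lockExpiry.get(key);
        if (expiry != null && expiry > nowMillis) return false; // still locked
        lockExpiry.put(key, nowMillis + intervalMillis);        // NX+EX stand-in
        return true;
    }
}
```

In production the check-and-set would be a single atomic Redis command rather than the two-step map access shown here.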

3.3 Data Sharding

To handle 50 million daily active users, playback data is sharded into 5,000 sets (≈10,000 users per set). Each time-ring node stores user IDs under keys like played:user:{timeSlot}:{userHash}. The distributed timed tasks are further sharded into 50 groups, each handling a contiguous range of user shards.
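The shard and key computation might look like the following; the hash function is an assumption, while the shard counts and key pattern follow the article:

```java
// Routing users to shards, time-ring keys, and timed-task groups.
public class ShardRouter {
    static final int USER_SHARDS = 5_000; // ~10,000 users per shard at 50M DAU
    static final int TASK_GROUPS = 50;    // distributed timed-task groups

    // Stable user shard in [0, 5000)
    public static int userShard(String userId) {
        return Math.floorMod(userId.hashCode(), USER_SHARDS);
    }

    // Redis key for one time-ring node and user shard,
    // following the pattern played:user:{timeSlot}:{userHash}
    public static String timeRingKey(int timeSlot, String userId) {
        return "played:user:" + timeSlot + ":" + userShard(userId);
    }

    // Which timed-task group owns this user's shard (100 shards per group)
    public static int taskGroup(String userId) {
        return userShard(userId) / (USER_SHARDS / TASK_GROUPS);
    }
}
```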

3.4 Data Expiration

Playback history is kept for three months; Bloom filters are stored per month with a six‑month TTL. During reads, the most recent four months are consulted to guarantee the three‑month no‑repeat rule (Figure 7).
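The monthly read window can be sketched as below: because the current month's filter is only partially filled, a read must consult the current month plus the three before it to cover a full three months of history. The key format is an assumption:

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Enumerate the per-month Bloom filter keys a read must consult.
public class MonthlyFilterKeys {
    static final int READ_MONTHS = 4; // current partial month + 3 full months

    public static List<String> keysToQuery(String userId, YearMonth current) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < READ_MONTHS; i++) {
            // hypothetical key format: bloom:{userId}:{yyyy-MM}
            keys.add("bloom:" + userId + ":" + current.minusMonths(i));
        }
        return keys;
    }
}
```

With a six-month TTL on each monthly filter, old filters expire naturally without needing element deletion, sidestepping the Bloom filter's inability to remove entries.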

3.5 Summary of Design

The combined traffic aggregation, sharding, and expiration design yields the overall flow shown in Figure 8, where Kafka streams playback events to Redis, then to disk KV after batch processing.

4 Data Migration

Two migration strategies were evaluated to move existing Redis ZSet data to the new Bloom‑filter system. The final approach (Figure 11) scans old Redis keys, exports them to Kafka, and lets the distributed batch jobs ingest the data into the appropriate time‑ring shards, generating Bloom filters on‑the‑fly. This incremental, request‑driven migration avoids large one‑off data loads and ensures smooth transition.
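The export side of the migration can be sketched as a scan-and-publish loop; an `ArrayDeque` stands in for the Kafka topic, and all names here are illustrative rather than the production interfaces:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Incremental migration: scan old ZSet histories and publish per-video events
// for the existing batch jobs to ingest into time-ring shards.
public class ZSetMigration {
    private final Queue<String> topic = new ArrayDeque<>(); // Kafka stand-in

    // One scan batch: export each user's played IDs as "userId:videoId" events.
    // The input map models a snapshot of old ZSet keys and their members.
    public int exportBatch(Map<String, List<String>> zsetSnapshot) {
        int published = 0;
        for (Map.Entry<String, List<String>> e : zsetSnapshot.entrySet()) {
            for (String videoId : e.getValue()) {
                topic.add(e.getKey() + ":" + videoId);
                published++;
            }
        }
        return published;
    }

    public Queue<String> pendingEvents() {
        return topic;
    }
}
```

Reusing the normal ingestion path for migrated events is what keeps the cutover incremental: old data flows through the same batching, sharding, and filter-building machinery as live traffic.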

In conclusion, the article presents a Bloom-filter-based deduplication service for short-video recommendation, covering problem analysis, solution design, performance considerations, and migration tactics, and offers a reference for large-scale recommendation infrastructure.

Tags: data migration, Redis, scalable architecture, deduplication, Bloom filter, video recommendation, disk KV
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
