How Vivo Solved Real‑Time Feature Concatenation with RocksDB and Flink
This article explains the evolution of Vivo's real‑time recommendation feature‑concatenation architecture, compares hour‑level, Redis‑streaming and RocksDB state‑backend solutions, and details the memory, performance, startup and HDFS RPC problems encountered along with the concrete fixes applied.
Background
Feature concatenation links online services with recommendation models by merging user behavior logs (exposures, clicks, likes, etc.) with the latest feature snapshots. In Vivo's real‑time recommendation pipeline, the concatenation module sits between the ranking service and the model‑training stage.
Selection of Concatenation Schemes
2.1 Hour‑level Concatenation
Both logs and feature snapshots are written to Hive with hourly partitions. A Spark job runs each hour to join the two tables. Because the job depends on the hourly Hive partitions, events that cross hour boundaries (e.g., a video shown at 19:50, clicked at 19:59, and played at 20:03) may miss the join, resulting in low real‑time coverage.
2.2 Redis‑based Streaming Concatenation
To achieve higher join rates and real‑time guarantees, a Flink pipeline writes feature snapshots to Redis and consumes exposure/click streams from Kafka. The joined records are written back to Kafka for downstream model training. This design provides fault‑tolerance and monitoring but introduces two pain points:
Redis stores dozens of terabytes, leading to high cost.
Traffic spikes require DBA‑driven Redis cluster scaling, increasing operational overhead.
2.3 RocksDB Large‑State Streaming Concatenation
Replacing Redis with Flink's RocksDB state backend eliminates the external key‑value store. RocksDB can handle tens of terabytes by using both memory and disk. The architecture merges exposure, click, and feature streams in a single Flink job, stores intermediate state in RocksDB, and outputs the joined features to downstream Kafka.
Processing steps (a sketch of the process function follows the list):
Union exposure, click, and feature streams and keyBy on a common key.
In processElement, store exposure/click events in state; when a feature arrives, query the state for matching events.
If a match is found, emit the joined record and clear the state; otherwise, store the feature and register a timer.
When the timer fires, repeat the lookup and emit if possible.
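A minimal sketch of that process function, assuming a unified Event POJO (with an isFeature flag, the join key, and a payload) and a JoinedRecord output type; the class name, helper methods, and retry interval are illustrative, not vivo's actual implementation:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class FeatureJoinFunction extends KeyedProcessFunction<String, Event, JoinedRecord> {

    // How long to wait for a matching behavior after a feature arrives (illustrative).
    private static final long RETRY_INTERVAL_MS = 10 * 60 * 1000L;

    private transient ListState<Event> behaviorState; // buffered exposure/click events
    private transient ValueState<Event> featureState; // buffered feature snapshot

    @Override
    public void open(Configuration parameters) {
        behaviorState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("behaviors", Event.class));
        featureState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("feature", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<JoinedRecord> out) throws Exception {
        if (event.isFeature()) {
            // Steps 2/3: a feature arrived, look up buffered behaviors for this key.
            boolean matched = false;
            for (Event behavior : behaviorState.get()) {
                out.collect(JoinedRecord.of(behavior, event));
                matched = true;
            }
            if (matched) {
                behaviorState.clear();
            } else {
                // No behavior yet: keep the feature and retry when the timer fires.
                featureState.update(event);
                ctx.timerService().registerProcessingTimeTimer(
                        ctx.timerService().currentProcessingTime() + RETRY_INTERVAL_MS);
            }
        } else {
            // Exposure/click: join right away if the feature is already buffered, otherwise buffer it.
            Event feature = featureState.value();
            if (feature != null) {
                out.collect(JoinedRecord.of(event, feature));
            } else {
                behaviorState.add(event);
            }
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<JoinedRecord> out) throws Exception {
        // Step 4: retry the lookup once, then drop the state so RocksDB does not grow unbounded.
        Event feature = featureState.value();
        if (feature != null) {
            for (Event behavior : behaviorState.get()) {
                out.collect(JoinedRecord.of(behavior, feature));
            }
        }
        behaviorState.clear();
        featureState.clear();
    }
}

The unioned stream is keyed by the common join key (step 1) and then run through this function; a production job would also expire behaviors that never receive a feature.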
Using RocksDB eases memory pressure, removes the Redis scaling burden, and, after extensive tuning, achieves 99.99% task stability and a join rate above 99%.
Problems and Solutions
3.1 TM Lost Issue
Early deployments frequently hit "TM was Lost" errors because TaskManager memory exceeded YARN limits. Investigation revealed that Flink splits the total process memory into JVM metaspace, JVM overhead, framework heap/off‑heap, task heap/off‑heap, managed memory and network buffers. Mis‑configuration of these components caused the overflow.
totalProcessMemorySize = totalFlinkMemorySize + JvmMetaspaceSize + JvmOverheadSize
totalFlinkMemorySize = frameworkHeapMemorySize + taskHeapMemorySize + frameworkOffHeapMemorySize + taskOffHeapMemorySize + managedMemorySize + networkMemorySize
Solution: increase taskmanager.memory.process.size and tune the JvmOverhead and managedMemory components. Switching to the jemalloc allocator (default in Flink 1.12+ on Kubernetes) also prevented the 64 MiB allocation failures.
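A flink-conf.yaml sketch of that kind of sizing; the numbers are illustrative placeholders, not vivo's production values:

taskmanager.memory.process.size: 16g            # total memory requested from YARN per TM
taskmanager.memory.jvm-overhead.fraction: 0.15  # headroom for native allocations (RocksDB, allocator)
taskmanager.memory.jvm-overhead.max: 2g         # raise the cap so the fraction can take effect
taskmanager.memory.managed.fraction: 0.4        # managed memory handed to the RocksDB state backend
taskmanager.memory.jvm-metaspace.size: 256m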
3.2 RocksDB Performance Monitoring
Tasks exhibited latency spikes without an obvious bottleneck. Enabling RocksDB's native metrics (the state.backend.rocksdb.metrics.* options) revealed low block‑cache hit rates and high write‑buffer pressure.
Key tuning parameters (collected into a config sketch after the list):
Keep state.backend.rocksdb.memory.managed=true (the default) so RocksDB is sized from Flink's managed memory, and give managed memory at least 20% of TM memory via taskmanager.memory.managed.fraction.
Use predefined options:
state.backend.rocksdb.predefined-options=SPINNING_DISK_OPTIMIZED_HIGH_MEM.
Reduce state.backend.rocksdb.memory.write-buffer-ratio (default 0.5) to free more of the shared cache for reads.
Enable partitioned index filters with state.backend.rocksdb.memory.partitioned-index-filters=true and adjust state.backend.rocksdb.memory.high-prio-pool-ratio.
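Taken together, a flink-conf.yaml sketch of these knobs (values illustrative):

state.backend.rocksdb.predefined-options: SPINNING_DISK_OPTIMIZED_HIGH_MEM
state.backend.rocksdb.memory.managed: true                   # default: size RocksDB from managed memory
state.backend.rocksdb.memory.write-buffer-ratio: 0.4         # default 0.5; lower leaves more cache for reads
state.backend.rocksdb.memory.partitioned-index-filters: true
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.2       # default 0.1; reserves cache for index/filter blocks
state.backend.rocksdb.metrics.block-cache-usage: true        # a few native metrics for the dashboards
state.backend.rocksdb.metrics.actual-delayed-write-rate: true
state.backend.rocksdb.metrics.num-running-compactions: true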
3.3 Task Delay
When memtables waiting to be flushed or the compaction queue grew, write throughput stalled and the task fell behind. Monitoring showed high disk I/O; increasing operator parallelism or raising state.backend.rocksdb.thread.num (default 1) relieved the bottleneck.
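In config terms, a sketch of those two levers (values illustrative):

state.backend.rocksdb.thread.num: 4   # more background flush/compaction threads per stateful operator
parallelism.default: 128              # or raise the join operator's parallelism directly in the job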
3.4 Slow Task Startup
During state restoration, TaskManagers read large checkpoint files from remote HDFS. Concentrated TM placement saturated a single machine's disk I/O, leaving some subtasks in INITIALIZING for minutes.
Fixes:
Set YARN yarn.scheduler.fair.assignmultiple=false to limit containers per scheduling round.
Implement a custom ResourceManager filter to cap the number of TM containers per host.
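A yarn-site.xml sketch of the first fix (the property belongs to the YARN Fair Scheduler; the per‑host container cap in the second fix required the custom ResourceManager filter and has no stock switch):

<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <!-- assign at most one container per NodeManager heartbeat -->
  <value>false</value>
</property>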
3.5 Disk Saturation
Large‑state jobs (tens to hundreds of terabytes) frequently pushed SSD usage above 90%.
Solution: store checkpoints of massive jobs on a separate offline cluster with ample disk, while keeping small‑state jobs on the real‑time cluster.
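Expressed as checkpoint storage configuration, assuming HDFS; the cluster name and paths below are illustrative:

state.checkpoints.dir: hdfs://offline-cluster-ns/flink/checkpoints/feature-join   # large-state job → offline cluster
state.savepoints.dir: hdfs://offline-cluster-ns/flink/savepoints/feature-join

Small‑state jobs keep pointing state.checkpoints.dir at the real‑time cluster's HDFS.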
3.6 HDFS RPC Spike
After deploying new jobs, HDFS RPC traffic surged due to frequent checkpoint file creation, modification, and deletion.
By increasing state.backend.rocksdb.compaction.level.target-file-size-base from 64 MiB to 256 MiB, the number of small files dropped, and RPC rates returned to normal.
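In flink-conf.yaml terms, the change is a single key (the values are the ones cited above):

state.backend.rocksdb.compaction.level.target-file-size-base: 256mb   # default 64mb; larger SST files mean fewer files per checkpoint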
Conclusion
4.1 Open Issues
RocksDB tuning remains complex; dozens of parameters must be balanced for peak performance.
State recovery can take tens of minutes for terabyte‑scale checkpoints, which is unacceptable for feature‑concatenation jobs.
Intensive writes accelerate SSD wear‑out.
4.2 Future Planning
The industry is moving toward cloud‑native, containerized deployments. Upcoming Flink releases will support remote‑storage state backends, addressing container disk limits, RocksDB compaction spikes, and rapid scaling of massive state.
Vivo plans to evaluate remote‑storage state backends for feature concatenation once they mature.
4.3 Outlook
By evolving from hour‑level batch joins through Redis streaming to RocksDB large‑state streaming, Vivo achieves low latency, massive scale, and seamless batch‑stream integration, while keeping the architecture extensible through Paimon's dynamic columns.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.