How Vivo Solved Real‑Time Feature Concatenation with RocksDB and Flink
This article explains the evolution of Vivo's real‑time recommendation feature‑concatenation architecture, compares hour‑level, Redis‑streaming and RocksDB state‑backend solutions, and details the memory, performance, startup and HDFS RPC problems encountered along with the concrete fixes applied.
Background
Feature concatenation links online services with recommendation models by merging user behavior logs (exposures, clicks, likes, etc.) with the latest feature snapshots. In Vivo's real‑time recommendation pipeline, the concatenation module sits between the ranking service and the model‑training stage.
Selection of Concatenation Schemes
2.1 Hour‑level Concatenation
Both logs and feature snapshots are written to Hive with hourly partitions. A Spark job runs each hour to join the two tables. Because the job depends on the hourly Hive partitions, events that cross hour boundaries (e.g., a video shown at 19:50, clicked at 19:59, and played at 20:03) may miss the join, resulting in low real‑time coverage.
2.2 Redis‑based Streaming Concatenation
To achieve higher join rates and real‑time guarantees, a Flink pipeline writes feature snapshots to Redis and consumes exposure/click streams from Kafka. The joined records are written back to Kafka for downstream model training. This design provides fault‑tolerance and monitoring but introduces two pain points:
Redis stores dozens of terabytes, leading to high cost.
Traffic spikes require DBA‑driven Redis cluster scaling, increasing operational overhead.
2.3 RocksDB Large‑State Streaming Concatenation
Replacing Redis with Flink's RocksDB state backend eliminates the external key‑value store. RocksDB can handle tens of terabytes by using both memory and disk. The architecture merges exposure, click, and feature streams in a single Flink job, stores intermediate state in RocksDB, and outputs the joined features to downstream Kafka.
Processing steps (a sketch of the process function follows the list):
Union exposure, click, and feature streams and keyBy on a common key.
In processElement, store exposure/click events in state; when a feature arrives, query the state for matching events.
If a match is found, emit the joined record and clear the state; otherwise, store the feature and register a timer.
When the timer fires, repeat the lookup and emit if possible.
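A minimal sketch of that process function, assuming a unified Event POJO (with an isFeature flag, the join key, and a payload) and a JoinedRecord output type; the class name, helper methods, and retry interval are illustrative, not vivo's actual implementation:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class FeatureJoinFunction extends KeyedProcessFunction<String, Event, JoinedRecord> {

    // How long to wait for a matching behavior after a feature arrives (illustrative).
    private static final long RETRY_INTERVAL_MS = 10 * 60 * 1000L;

    private transient ListState<Event> behaviorState; // buffered exposure/click events
    private transient ValueState<Event> featureState; // buffered feature snapshot

    @Override
    public void open(Configuration parameters) {
        behaviorState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("behaviors", Event.class));
        featureState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("feature", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<JoinedRecord> out) throws Exception {
        if (event.isFeature()) {
            // Steps 2/3: a feature arrived, look up buffered behaviors for this key.
            boolean matched = false;
            for (Event behavior : behaviorState.get()) {
                out.collect(JoinedRecord.of(behavior, event));
                matched = true;
            }
            if (matched) {
                behaviorState.clear();
            } else {
                // No behavior yet: keep the feature and retry when the timer fires.
                featureState.update(event);
                ctx.timerService().registerProcessingTimeTimer(
                        ctx.timerService().currentProcessingTime() + RETRY_INTERVAL_MS);
            }
        } else {
            // Exposure/click: join right away if the feature is already buffered, otherwise buffer it.
            Event feature = featureState.value();
            if (feature != null) {
                out.collect(JoinedRecord.of(event, feature));
            } else {
                behaviorState.add(event);
            }
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<JoinedRecord> out) throws Exception {
        // Step 4: retry the lookup once, then drop the state so RocksDB does not grow unbounded.
        Event feature = featureState.value();
        if (feature != null) {
            for (Event behavior : behaviorState.get()) {
                out.collect(JoinedRecord.of(behavior, feature));
            }
        }
        behaviorState.clear();
        featureState.clear();
    }
}

The unioned stream is keyed by the common join key (step 1) and then run through this function; a production job would also expire behaviors that never receive a feature.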
Using RocksDB eases memory pressure, removes the Redis scaling burden, and, after extensive tuning, achieves 99.99% task stability and a join rate above 99%.
Problems and Solutions
3.1 TM Lost Issue
Early deployments frequently hit "TM was Lost" errors because TaskManager memory exceeded YARN limits. Investigation revealed that Flink splits the total process memory into JVM metaspace, JVM overhead, framework heap/off‑heap, task heap/off‑heap, managed memory and network buffers. Mis‑configuration of these components caused the overflow.
totalProcessMemorySize = totalFlinkMemorySize + JvmMetaspaceSize + JvmOverheadSize
totalFlinkMemorySize = frameworkHeapMemorySize + taskHeapMemorySize + frameworkOffHeapMemorySize + taskOffHeapMemorySize + managedMemorySize + networkMemorySize
Solution: increase taskmanager.memory.process.size and tune the JvmOverhead and managedMemory components. Switching to the jemalloc allocator (default in Flink 1.12+ on Kubernetes) also prevented the 64 MiB allocation failures.
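A flink-conf.yaml sketch of that kind of sizing; the numbers are illustrative placeholders, not vivo's production values:

taskmanager.memory.process.size: 16g            # total memory requested from YARN per TM
taskmanager.memory.jvm-overhead.fraction: 0.15  # headroom for native allocations (RocksDB, allocator)
taskmanager.memory.jvm-overhead.max: 2g         # raise the cap so the fraction can take effect
taskmanager.memory.managed.fraction: 0.4        # managed memory handed to the RocksDB state backend
taskmanager.memory.jvm-metaspace.size: 256m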
3.2 RocksDB Performance Monitoring
Tasks exhibited latency spikes without an obvious bottleneck. Enabling RocksDB's native metrics (the state.backend.rocksdb.metrics.* options) revealed low block‑cache hit rates and high write‑buffer pressure.
Key tuning parameters (collected into a config sketch after the list):
Keep state.backend.rocksdb.memory.managed=true (the default) so RocksDB is sized from Flink's managed memory, and give managed memory at least 20% of TM memory via taskmanager.memory.managed.fraction.
Use predefined options:
state.backend.rocksdb.predefined-options=SPINNING_DISK_OPTIMIZED_HIGH_MEM.
Reduce state.backend.rocksdb.memory.write-buffer-ratio (default 0.5) to free more of the shared cache for reads.
Enable partitioned index filters with state.backend.rocksdb.memory.partitioned-index-filters=true and adjust state.backend.rocksdb.memory.high-prio-pool-ratio.
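Taken together, a flink-conf.yaml sketch of these knobs (values illustrative):

state.backend.rocksdb.predefined-options: SPINNING_DISK_OPTIMIZED_HIGH_MEM
state.backend.rocksdb.memory.managed: true                   # default: size RocksDB from managed memory
state.backend.rocksdb.memory.write-buffer-ratio: 0.4         # default 0.5; lower leaves more cache for reads
state.backend.rocksdb.memory.partitioned-index-filters: true
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.2       # default 0.1; reserves cache for index/filter blocks
state.backend.rocksdb.metrics.block-cache-usage: true        # a few native metrics for the dashboards
state.backend.rocksdb.metrics.actual-delayed-write-rate: true
state.backend.rocksdb.metrics.num-running-compactions: true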
3.3 Task Delay
When memtables waiting to be flushed or the compaction queue grew, write throughput stalled and the task fell behind. Monitoring showed high disk I/O; increasing operator parallelism or raising state.backend.rocksdb.thread.num (default 1) relieved the bottleneck.
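In config terms, a sketch of those two levers (values illustrative):

state.backend.rocksdb.thread.num: 4   # more background flush/compaction threads per stateful operator
parallelism.default: 128              # or raise the join operator's parallelism directly in the job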
3.4 Slow Task Startup
During state restoration, TaskManagers read large checkpoint files from remote HDFS. Concentrated TM placement saturated a single machine's disk I/O, leaving some subtasks in INITIALIZING for minutes.
Fixes:
Set YARN yarn.scheduler.fair.assignmultiple=false to limit containers per scheduling round.
Implement a custom ResourceManager filter to cap the number of TM containers per host.
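A yarn-site.xml sketch of the first fix (the property belongs to the YARN Fair Scheduler; the per‑host container cap in the second fix required the custom ResourceManager filter and has no stock switch):

<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <!-- assign at most one container per NodeManager heartbeat -->
  <value>false</value>
</property>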
3.5 Disk Saturation
Large‑state jobs (tens to hundreds of terabytes) frequently pushed SSD usage above 90%.
Solution: store checkpoints of massive jobs on a separate offline cluster with ample disk, while keeping small‑state jobs on the real‑time cluster.
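Expressed as checkpoint storage configuration, assuming HDFS; the cluster name and paths below are illustrative:

state.checkpoints.dir: hdfs://offline-cluster-ns/flink/checkpoints/feature-join   # large-state job → offline cluster
state.savepoints.dir: hdfs://offline-cluster-ns/flink/savepoints/feature-join

Small‑state jobs keep pointing state.checkpoints.dir at the real‑time cluster's HDFS.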
3.6 HDFS RPC Spike
After deploying new jobs, HDFS RPC traffic surged due to frequent checkpoint file creation, modification, and deletion.
By increasing state.backend.rocksdb.compaction.level.target-file-size-base from 64 MiB to 256 MiB, the number of small files dropped, and RPC rates returned to normal.
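In flink-conf.yaml terms, the change is a single key (the values are the ones cited above):

state.backend.rocksdb.compaction.level.target-file-size-base: 256mb   # default 64mb; larger SST files mean fewer files per checkpoint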
Conclusion
4.1 Open Issues
RocksDB tuning remains complex; dozens of parameters must be balanced for peak performance.
State recovery can take tens of minutes for terabyte‑scale checkpoints, which is unacceptable for feature‑concatenation jobs.
Intensive writes accelerate SSD wear‑out.
4.2 Future Planning
The industry is moving toward cloud‑native, containerized deployments. Upcoming Flink releases will support remote‑storage state backends, addressing container disk limits, RocksDB compaction spikes, and rapid scaling of massive state.
Vivo plans to evaluate remote‑storage state backends for feature concatenation once they mature.
4.3 Outlook
By evolving from hour‑level batch joins through Redis streaming to RocksDB large‑state streaming, Vivo achieves low latency, massive scale, and seamless batch‑stream integration, while keeping the architecture extensible through Paimon's dynamic columns.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.