Optimizing Redis Latency for an Online Feature Store: A Batch Query Case Study
This article describes how Tubi improved the latency of its Redis‑backed online feature store for machine‑learning inference by analyzing query patterns, measuring client‑side bottlenecks, and applying optimizations such as binary Avro encoding, MGET usage, virtual partitioning, and parallel deserialization to meet a sub‑10 ms SLA.
Background: Tubi's movie recommendation system relies on machine‑learning models that consume high‑quality features stored in an online feature store (OFS) backed by Redis.
Feature families are grouped as Entity, Context, and Candidate, each with a distinct query pattern: point lookups for Entity, context-driven batch lookups for Context, and per-context candidate lookups for Candidate.
To keep latency low, different Redis data structures are used; Entity and Context families are stored as simple key‑value pairs, while Candidate families are stored as hash maps.
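The three access patterns can be modeled with a tiny in-memory sketch (the key names and values below are illustrative, not Tubi's actual naming):

```scala
// Entity and Context families: plain key-value pairs
val kv = Map(
  "entity:user:1"   -> "entityRow",
  "context:genre:7" -> "contextRowA",
  "context:genre:8" -> "contextRowB")

// Candidate families: one hash per key, fields keyed by candidate id
val hashes = Map("candidates:ctx:9" -> Map("movie:42" -> "candidateRow"))

// Entity: point lookup (GET)
val entityRow = kv("entity:user:1")

// Context: batch lookup over many keys derived from the context (MGET)
val contextRows = List("context:genre:7", "context:genre:8").flatMap(kv.get)

// Candidate: field lookup inside a hash (HGET / HGETALL)
val candidateRow = hashes("candidates:ctx:9")("movie:42")
```

The Context pattern is the troublesome one: a single request can expand into a batch over many keys.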
Challenge: Context‑family queries become batch operations that can request thousands of rows per request, causing a fan‑out effect and P99 latency above the 10 ms SLA.
Initial approach: store each feature row as a Redis hash and fetch batches with a Lettuce pipeline of HGETALL commands. Scala example:

```scala
import java.util.concurrent.TimeUnit
import io.lettuce.core.{LettuceFutures, RedisURI}
import io.lettuce.core.cluster.RedisClusterClient
import io.lettuce.core.cluster.api.async.RedisClusterAsyncCommands

// Create the RedisClusterClient and an async connection
val redisUri = RedisURI.create("redis://localhost:6379")
val clusterClient = RedisClusterClient.create(redisUri)
val connection = clusterClient.connect()
val asyncCommands: RedisClusterAsyncCommands[String, String] = connection.async()

val keys = List("key1", "key2", "key3", "key4", "key5")

// Disable auto-flush so all HGETALLs are sent as one pipelined batch
asyncCommands.setAutoFlushCommands(false)
val futures = keys.map(key => asyncCommands.hgetall(key))
asyncCommands.flushCommands()
asyncCommands.setAutoFlushCommands(true)

// Wait for all replies to arrive
LettuceFutures.awaitAll(5, TimeUnit.SECONDS, futures: _*)
```

This yielded a P99 latency of 20-30 ms, still above the target.
First optimization: encode rows as Avro binary and store them as plain key‑value pairs, then replace pipeline with a single MGET.
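A minimal sketch of the Avro binary round trip, assuming a hypothetical FeatureRow schema (the field names and metric values here are illustrative, not Tubi's actual schema):

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical schema for one feature row
val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"FeatureRow","fields":[
    |  {"name":"userId","type":"string"},
    |  {"name":"score","type":"double"}
    |]}""".stripMargin)

// Encode a row to compact Avro binary for storage as a plain Redis value
def encodeRow(record: GenericRecord): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
  encoder.flush()
  out.toByteArray
}

// Decode a value fetched from Redis back into a record
def decodeRow(bytes: Array[Byte]): GenericRecord = {
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  new GenericDatumReader[GenericRecord](schema).read(null, decoder)
}
```

With rows stored this way (using Lettuce's `ByteArrayCodec`), a single `mget(keys: _*)` replaces the whole HGETALL pipeline with one round trip.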
Result: latency dropped to 3‑4 ms, but later feature families with much higher fan‑out (e.g., 855 rows per request) pushed P99 back to 15 ms.
Second attempt: virtual partitioning – split 800 rows into 10 partitions and issue concurrent MGETs to multiple Redis shards. This did not improve latency.
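Virtual partitioning splits one large MGET into several smaller ones issued concurrently. A sketch of the key-splitting step (the partition count and key names are illustrative):

```scala
// Split a large key batch into N virtual partitions so each MGET
// can be issued concurrently against a different shard/connection.
val numPartitions = 10
val keys: Seq[String] = (1 to 800).map(i => s"row:$i")
val partitionSize = math.ceil(keys.size.toDouble / numPartitions).toInt
val partitions: Seq[Seq[String]] = keys.grouped(partitionSize).toSeq

// Each partition would then be fetched with its own concurrent MGET, e.g.:
//   val replies = partitions.map(p => asyncCommands.mget(p: _*))
```

Because the bottleneck turned out not to be the network round trip, splitting the batch this way bought nothing.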
Root‑cause analysis: separate Lettuce “first‑byte latency” (network) from “completion latency” (client processing). The first‑byte latency was low; the bottleneck was client‑side deserialization, which took up to 10 ms.
Verification: added metrics to measure deserialization time, confirming the hypothesis.
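Such instrumentation can be as simple as timing the deserialization step separately and emitting it as its own metric. A hypothetical helper (the metric name and wiring are assumptions, not Tubi's actual metrics code):

```scala
// Run a block and report its duration in milliseconds,
// so deserialization time shows up as a distinct metric.
def timed[T](recordMillis: Double => Unit)(body: => T): T = {
  val start = System.nanoTime()
  try body
  finally recordMillis((System.nanoTime() - start) / 1e6)
}

// Usage sketch: hand the measurement to whatever metrics client is in use,
// e.g. statsd.histogram("ofs.deserialize_ms", ms) -- name is hypothetical.
var lastMs = 0.0
val result = timed(ms => lastMs = ms) {
  Thread.sleep(5) // stand-in for the deserialization work
  "decoded-rows"
}
```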
Final solution: parallelize deserialization using Scala Parallel Collections.
```scala
// In Scala 2.13+, .par requires the scala-parallel-collections module:
// import scala.collection.parallel.CollectionConverters._
if (rows.size > settings.parallelDecodeThreshold) {
  rows.par.map { row => deserializeAvroRow(schema, row.getValue, featureFamily, features) }.seq
} else {
  rows.map { row => deserializeAvroRow(schema, row.getValue, featureFamily, features) }
}
```

After this change, P99 latency fell to 10 ms, meeting the SLA.
Key takeaways: do not optimize blindly; decompose metrics to locate true bottlenecks, adopt an end‑to‑end perspective, validate assumptions with instrumentation, and collaborate with experts.
Bitu Technology
Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.