
Performance Testing and Optimization of Tubi's Real-Time Recommendation Service

This article describes how Tubi's engineering team built and optimized a real-time recommendation backend: ScalaMeter microbenchmarks and wrk2 load tests measure latency, throughput, and error rate, and custom scripts demonstrate scaling the service across multiple machines.

Bitu Technology

Tubi, a free streaming service, relies on a backend Predictor Service to recommend content in real time. This article walks through the performance testing and optimization work done to meet the service's latency and throughput goals.

Microbenchmark – Using ScalaMeter, a benchmark generates 1,000‑10,000 rows of synthetic data and measures the time to run the prediction model. Results show ~23 ms for 1,000 rows and ~222 ms for 10,000 rows, confirming the model is fast enough to proceed.

import org.scalameter.api._

// Benchmark harness; genRows and predictor come from the service under test
object PredictorBenchmark extends Bench.LocalTime {
  val sizes = Gen.range("size")(1000, 10000, 1000)
  val input = for { size <- sizes } yield genRows(size)
  performance of "RealTimeModelServing" in {
    measure method "predict" in {
      using(input) in { rows => rows.map(predictor.predict) }
    }
  }
}
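A quick sanity check on those benchmark numbers (a back-of-the-envelope sketch, not part of the article): at ~23 ms per 1,000 rows and ~222 ms per 10,000 rows, the per-row prediction cost is roughly constant, which is what you want to see before moving on to load testing.

```python
# Back-of-the-envelope check that prediction cost scales linearly with row count.
measurements_ms = {1_000: 23.0, 10_000: 222.0}  # rows -> measured runtime from the benchmark

per_row_us = {rows: ms * 1000 / rows for rows, ms in measurements_ms.items()}
print(per_row_us)  # cost per row in microseconds

# A ratio near 1.0 means no super-linear blow-up as input size grows 10x.
ratio = per_row_us[10_000] / per_row_us[1_000]
print(round(ratio, 2))
```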

Load testing – The team uses wrk2 (a fork of wrk that sustains a constant request rate and records latency accurately, avoiding coordinated omission) to simulate concurrent users and record latency percentiles, throughput (requests per second), and error rate. Tests at 100 req/s, 150 req/s, and 200 req/s show p99 latency rising from ~108 ms to over 180 ms; CPU saturation is identified via the USE method and vmstat.

$ wrk --rate 100 --duration 5m --latency --u_latency --threads 8 --connections 16 --script wrk.lua http://predictor-02.node.tubi:8080/predictor/v1/get-ranking
...
Latency Distribution (HdrHistogram - Recorded Latency)
50.000%   74.43ms
75.000%   87.42ms
90.000%   95.36ms
99.000%  108.42ms
...
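To act on these numbers programmatically, for example failing a CI gate when p99 exceeds the target, the HdrHistogram section of wrk2's output can be parsed with a few lines of Python. This is an illustrative sketch, not part of the article's tooling, and the 150 ms threshold mirrors the SLO mentioned later:

```python
import re

# Sample of wrk2's latency-distribution output, as shown above.
WRK_OUTPUT = """
50.000%   74.43ms
75.000%   87.42ms
90.000%   95.36ms
99.000%  108.42ms
"""

def parse_percentiles(text: str) -> dict[float, float]:
    """Return {percentile: latency_ms} parsed from wrk2 output lines."""
    pattern = re.compile(r"(\d+\.\d+)%\s+([\d.]+)ms")
    return {float(p): float(ms) for p, ms in pattern.findall(text)}

percentiles = parse_percentiles(WRK_OUTPUT)
p99 = percentiles[99.0]
print(p99, "ms:", "OK" if p99 <= 150.0 else "SLO violated")
```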

To verify horizontal scalability, the service was deployed on four identical EC2 instances. A custom wrk2 script distributes traffic evenly across hosts, and the multi‑machine test handles 400 req/s with p99 latency under 150 ms.

-- usage `wrk --latency -c 8 -t 4 -d 1m -R 300 -s wrk.lua http://predictor.service.tubi:8080/predictor/v1/get-ranking`
-- Resolve all addresses behind the hostname so each thread can target a different host
local addrs = wrk.lookup(wrk.host, wrk.port or "http")
...
-- Log which backend address each thread ended up pinned to
function init(args)
   local msg = "thread addr: %s"
   print(msg:format(wrk.thread.addr))
end
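The core idea of the script, pinning each wrk thread to a different resolved address so traffic spreads evenly across hosts, can be sketched in Python (the host list here is hypothetical; in the real script `wrk.lookup` returns the resolved addresses):

```python
# Round-robin assignment of load-generator threads to resolved backend addresses.
# Hypothetical addresses for illustration; wrk.lookup() provides the real list.
addrs = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080", "10.0.0.4:8080"]

def assign_threads(num_threads: int, addrs: list[str]) -> list[str]:
    """Pin thread i to addrs[i % len(addrs)] so each host gets an even share."""
    return [addrs[i % len(addrs)] for i in range(num_threads)]

assignment = assign_threads(8, addrs)
print(assignment)  # each of the 4 hosts serves exactly 2 of the 8 threads
```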

Autobench2 – A Python wrapper that automates the entire load-testing workflow: warm-up, gradual ramp-up, metric collection, and result visualization.

$ autobench --verbose --connection 8 --thread 4 --duration 1m \
            --script wrk.lua --warmup_duration 1m --low_rate 10 \
            --high_rate 20 --rate_step 10 http://predictor.service.tubi:8080/predictor/v1/get-ranking
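A minimal sketch of the ramp-up logic such a wrapper might implement, building one wrk2 invocation per rate step from `low_rate` up to `high_rate`. The argv layout is an assumption for illustration, not Autobench2's actual implementation:

```python
def build_ramp(low_rate: int, high_rate: int, rate_step: int,
               duration: str, script: str, url: str) -> list[list[str]]:
    """Return one wrk2 argv per rate step, from low_rate up to high_rate inclusive."""
    cmds = []
    for rate in range(low_rate, high_rate + 1, rate_step):
        cmds.append(["wrk", "--latency", "-c", "8", "-t", "4",
                     "-d", duration, "-R", str(rate), "-s", script, url])
    return cmds

plan = build_ramp(10, 20, 10, "1m", "wrk.lua",
                  "http://predictor.service.tubi:8080/predictor/v1/get-ranking")
for cmd in plan:
    print(" ".join(cmd))  # two steps: 10 req/s, then 20 req/s
```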

In summary, the article shows why performance testing is essential for real-time recommendation services: it walks through microbenchmarking, load testing, and horizontal scaling, and introduces the open-source Autobench2 tool to simplify the workflow.

Tags: Backend, scalability, performance testing, load testing, microbenchmark, real-time recommendation
Written by

Bitu Technology

Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.
