Refactoring Playback Error Reporting, Metrics, and Recovery in Tubi Web/OTT Player

The article details how Tubi's Web/OTT team restructured player error reporting, statistical metrics, and unified handling, introduced precise error‑tracking enums, defined new recovery strategies for device decoding, network, and cache issues, and validated their impact through extensive experiments that improved user experience and key business KPIs.

Bitu Technology

As a video streaming platform, Tubi aims to deliver smooth playback to improve user satisfaction and retention. Over the past year, the Tubi Web/OTT team rebuilt the player's error reporting, metrics, and handling, and explored new retry and recovery strategies for a range of playback errors. The work produced significant improvements in business and performance metrics.

Overview

During playback, errors caused by network fluctuations or hardware performance can interrupt the stream, affecting user experience. Before the refactor, error handling suffered from three main pain points: complex and opaque error reporting, scattered handling logic across components, and lack of retry strategies for frequent errors.

Pre‑refactor Architecture

The error‑propagation hierarchy consisted of multiple playback cores for different OTT platforms, a control component layer offering slots for monitoring, session management, and telemetry, and a UI layer responsible for error dialog display.

To address the pain points, the team reorganized logic around three axes: refined error telemetry, clearly defined statistical metrics, and unified error handling.

Playback Error Telemetry

Errors are now reported based on user perception: if the core auto‑retries successfully, the error is not reported externally but logged internally; if auto‑retry fails, the error is escalated to the control component, which may attempt further retries, switch to backup resources, and finally report the aggregated session data. When the control layer exhausts its retries, the UI shows an error dialog and records a playback failure.
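
The escalation path described above can be sketched roughly as follows. This is only an illustration under assumed names: `SessionLog`, `Attempt`, and `handlePlaybackError` are hypothetical and do not come from Tubi's codebase, and each retry is modeled as a synchronous function that returns whether playback recovered.

```typescript
// Hypothetical types for illustration; not taken from Tubi's player.
type Attempt = () => boolean; // returns true when playback recovered

interface SessionLog {
  internalLogs: string[];   // errors the core recovered from on its own
  reportedErrors: string[]; // errors escalated to the control component
  failed: boolean;          // true once the UI had to show an error dialog
}

function handlePlaybackError(
  errorCode: string,
  coreRetry: Attempt,
  controlRetries: Attempt[],
  session: SessionLog,
): "recovered-by-core" | "recovered-by-control" | "failed" {
  // Layer 1: the playback core retries silently; the user never notices.
  if (coreRetry()) {
    session.internalLogs.push(errorCode);
    return "recovered-by-core";
  }

  // Layer 2: escalate to the control component and record the error.
  session.reportedErrors.push(errorCode);
  for (const attempt of controlRetries) {
    if (attempt()) return "recovered-by-control";
  }

  // Layer 3: every option is exhausted, so the UI shows the error dialog.
  session.failed = true;
  return "failed";
}
```

In this sketch, only errors that reach the control component show up in `reportedErrors`, and only sessions that fall through to the last step are marked as failed.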

Statistical Metrics

All errors are represented by a unified enumeration, which simplifies dashboard analysis and makes it clear which error types already have a recovery strategy and which still lack one. Two key metrics are introduced:

Playback Error Rate – proportion of sessions where the control layer handled errors (potentially recovered).

Playback Failure Rate – proportion of sessions where the UI displayed an error dialog, indicating unrecoverable failures.
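
As a rough illustration of how the two rates relate, the sketch below computes them from per-session summaries. The `SessionSummary` shape and the `playbackRates` function are assumptions made for this example; the article does not describe how Tubi stores or aggregates session data.

```typescript
// Hypothetical per-session summary; field names are illustrative only.
interface SessionSummary {
  controlHandledError: boolean; // the control layer handled at least one error
  errorDialogShown: boolean;    // the UI showed an error dialog
}

function playbackRates(sessions: SessionSummary[]): { errorRate: number; failureRate: number } {
  // Avoid dividing by zero when there are no sessions in the window.
  if (sessions.length === 0) return { errorRate: 0, failureRate: 0 };

  const withError = sessions.filter((s) => s.controlHandledError).length;
  const withFailure = sessions.filter((s) => s.errorDialogShown).length;

  return {
    errorRate: withError / sessions.length,
    failureRate: withFailure / sessions.length,
  };
}

// Example: four sessions, two hit errors, one of those ended in a dialog.
const exampleRates = playbackRates([
  { controlHandledError: false, errorDialogShown: false },
  { controlHandledError: true, errorDialogShown: false },
  { controlHandledError: true, errorDialogShown: true },
  { controlHandledError: false, errorDialogShown: false },
]);
// exampleRates.errorRate is 0.5 and exampleRates.failureRate is 0.25
```

The gap between the two numbers is the share of sessions where an error occurred but recovery kept the dialog from appearing.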

Unified Error Handling

All disparate error‑handling code was consolidated into a single control component, improving clarity and reusability. Ineffective legacy logic was refactored based on experimental validation.

Effective Recovery Strategies

Experiments identified several successful strategies:

Device‑Level Decoding Errors – switch to alternative video resources and reload the stream.

Network Request Errors – retry a limited number of times, fall back to backup CDNs, or downgrade video quality.

Cache Gaps – use hls.js’s recoverMediaError method to clear cached data and restart loading, with a cap on invocation frequency to avoid stutter.

Autonomous Recoverable Errors – suppress error dialogs temporarily and wait for the player to self‑recover, which significantly improved metrics.
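
The frequency cap on the cache-gap strategy could look something like the sketch below. `recoverMediaError()` is a real hls.js method, but the wrapper class, the interval, and the call limit are assumptions for illustration; the article does not give the values Tubi uses. The player is typed through a minimal interface so the example runs without importing hls.js.

```typescript
// Minimal shape of the single hls.js method used here; an Hls instance satisfies it.
interface MediaErrorRecoverable {
  recoverMediaError(): void;
}

class ThrottledMediaRecovery {
  private lastCallAt = -Infinity;
  private calls = 0;
  private readonly player: MediaErrorRecoverable;
  private readonly minIntervalMs: number;
  private readonly maxCalls: number;
  private readonly now: () => number;

  constructor(
    player: MediaErrorRecoverable,
    minIntervalMs = 5000, // assumed value, not from the article
    maxCalls = 3,         // assumed value, not from the article
    now: () => number = () => Date.now(),
  ) {
    this.player = player;
    this.minIntervalMs = minIntervalMs;
    this.maxCalls = maxCalls;
    this.now = now;
  }

  /** Returns true if recovery was triggered, false if it was throttled or the cap is used up. */
  tryRecover(): boolean {
    const t = this.now();
    if (this.calls >= this.maxCalls) return false;
    if (t - this.lastCallAt < this.minIntervalMs) return false;

    this.lastCallAt = t;
    this.calls += 1;
    this.player.recoverMediaError();
    return true;
  }
}
```

With hls.js, `tryRecover()` would typically be called from an `Hls.Events.ERROR` handler when the error type is `Hls.ErrorTypes.MEDIA_ERROR`. A `false` result is the signal to escalate instead of retrying again.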

Below is the post‑refactor error‑handling flow (pseudo‑code):

receivedError = (error: ErrorData) => {
  // 1. Record error in the playback session
  PlaybackSession.getInstance().recordError(error);

  // 2. Apply retry strategy based on error type
  let recoverTimesReachLimit: boolean = false;
  switch (error.code) {
    case ErrorCode.DRM_ERROR:
      recoverTimesReachLimit = this.recoverDRMError(error);
      break;
    case ErrorCode.CODEC_ERROR:
      recoverTimesReachLimit = this.recoverCodecError(error);
      break;
    case ErrorCode.NETWORK_ERROR:
      recoverTimesReachLimit = this.recoverNetworkError(error);
      break;
    // ...
    default:
      recoverTimesReachLimit = true;
      break;
  }
  // If a recovery attempt was started and the retry limit is not yet reached, stop here
  if (!recoverTimesReachLimit) return;

  // 3. Recovery limit reached – mark session as failure
  PlaybackSession.getInstance().recordFailure(error);

  // 4. Show error modal to user
  this.showErrorModal(error);
};
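
One way a helper such as `recoverNetworkError` might decide when the limit is reached is sketched below. Everything here is hypothetical: the `NetworkRecoveryActions` interface, the class name, the default of three retries, and the order of the steps (retry, then backup CDN, then lower quality) are assumptions, since the article lists the options without saying how they are sequenced.

```typescript
// Hypothetical hooks into the player; names are illustrative only.
interface NetworkRecoveryActions {
  retryRequest(): void;
  switchToBackupCdn(): boolean; // false when no backup CDN is left
  lowerQuality(): boolean;      // false when already at the lowest quality
}

class NetworkErrorRecovery {
  private retriesOnCurrentSource = 0;
  private readonly actions: NetworkRecoveryActions;
  private readonly maxRetries: number;

  constructor(actions: NetworkRecoveryActions, maxRetries = 3) {
    this.actions = actions;
    this.maxRetries = maxRetries;
  }

  /** Returns true once every option is exhausted, matching recoverTimesReachLimit above. */
  recover(): boolean {
    // Step 1: plain retries against the current source, up to the limit.
    if (this.retriesOnCurrentSource < this.maxRetries) {
      this.retriesOnCurrentSource += 1;
      this.actions.retryRequest();
      return false;
    }
    // Step 2: move to a backup CDN and start a fresh retry budget.
    if (this.actions.switchToBackupCdn()) {
      this.retriesOnCurrentSource = 0;
      return false;
    }
    // Step 3: drop to a lower quality level and start a fresh retry budget.
    if (this.actions.lowerQuality()) {
      this.retriesOnCurrentSource = 0;
      return false;
    }
    // Nothing left to try: the caller records a failure and shows the dialog.
    return true;
  }
}
```

Returning `true` only when no option remains keeps the caller's logic simple: any `false` means a recovery attempt is under way and no dialog should be shown yet.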

Experimental Results

Across Web and OTT platforms, the implemented strategies increased user watch time, retention, ad exposure, and revenue, and reduced both start-up failure and playback failure rates. Visualizations show the comparative improvements between control and experiment groups.

Practical Experience

Finding effective strategies involves measuring how often each error occurs, correlating those frequencies with business metrics, understanding the root causes, and running A/B experiments on appropriate device cohorts. Because OTT devices differ in DRM types and hardware, experiments need to be tailored to each device group.

Summary

By refactoring Tubi’s OTT playback error recovery, the team improved telemetry accuracy, streamlined analysis, and enabled precise metric‑driven debugging. Targeted retry strategies ensured stable playback, delivering a smoother experience for Tubi users.

Written by

Bitu Technology

Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.
