How Daft and Ray Supercharge Million‑Hour Video Processing for AI‑Powered Robotics

This article details a scalable, distributed pipeline that uses LAS AI Data Lake, Daft on Ray, and advanced video‑processing techniques—scene detection, splitting, frame sampling, filtering, and caption generation—to transform tens of millions of hours of robot‑captured video into high‑quality, searchable semantic data while dramatically boosting CPU and GPU utilization.


Background

Industrial robot dogs continuously record video, generating tens of millions of hours of raw footage that must be processed for inspection, safety, and AI training. Traditional single‑machine, script‑based pipelines cannot handle this scale.

Challenges of Traditional Pipelines

Dispersed processing steps lead to poor manageability.

Heavy I/O and storage pressure from intermediate files.

Low resource utilization and difficulty scaling.

Unstable execution; a failure in one step can halt the whole chain.

Re‑engineering required when data volume jumps from hundreds to thousands of hours.

Solution Overview

By leveraging the LAS AI Data Lake product (Lake Storage + Lake Compute) and Daft’s distributed DataFrame engine on Ray, we built a unified, high‑throughput pipeline that converts raw video into structured, high‑quality clips ready for AI models.

Step 1: Scene Detection

We first split long videos into semantic scenes using PySceneDetect, which analyzes brightness histograms and frame‑wise content differences to locate cut points.

from scenedetect import ContentDetector, SceneManager, open_video

def detect_scenes(self, video_path):
    video = open_video(video_path)
    # A content-based detector flags cut points from frame-to-frame differences
    scene_manager = SceneManager()
    scene_manager.add_detector(ContentDetector())
    scene_manager.detect_scenes(video)
    scenes = []
    for start, end in scene_manager.get_scene_list():
        scenes.append((start.get_seconds(), end.get_seconds()))
    return scenes

Step 2: Video Splitting

Each detected scene is extracted with FFmpeg using its start and end timestamps (‑ss / ‑to) and the -c copy flag, which copies the streams without re‑encoding for fast, lossless cuts (stream copy snaps cut points to the nearest keyframes).

def _split_and_save_scene(self, scene, video_path, output_dir):
    start_sec, end_sec = scene
    clip_path = f"{output_dir}/clip_{start_sec:.2f}_{end_sec:.2f}.mp4"
    cmd = ["ffmpeg", "-loglevel", "error", "-ss", str(start_sec), "-to", str(end_sec), "-i", video_path, "-c", "copy", clip_path]
    subprocess.run(cmd, check=True)  # requires: import subprocess
    return clip_path

Step 3: Frame Sampling

From each clip we sample a limited number of frames (default 8) and store them as image files, providing the input format required by downstream vision models.

def _sample_frames(self, clip_path):
    # implementation uses ffmpeg or OpenCV to extract evenly spaced frames
    return frame_paths
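
Filling in that placeholder with one possible implementation: a minimal OpenCV‑based sketch that decodes the clip, keeps eight evenly spaced frames, and writes them out as JPEGs (the output naming and the eight‑frame default are assumptions for the sketch).

import os
import cv2
import numpy as np

def sample_frames(clip_path, output_dir, num_frames=8):
    # Pick num_frames evenly spaced frame indices across the clip and dump them as JPEGs
    cap = cv2.VideoCapture(clip_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num=num_frames, dtype=int)
    frame_paths = []
    for i, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frame_path = os.path.join(output_dir, f"frame_{i:02d}.jpg")
        cv2.imwrite(frame_path, frame)
        frame_paths.append(frame_path)
    cap.release()
    return frame_paths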

Step 4: Frame Filtering

A lightweight model evaluates each sampled frame for blur, exposure, semantic completeness, and overall quality. Clips failing the thresholds (e.g., duration < 4 s, low sharpness) are discarded to avoid wasting GPU cycles.

def _score_predict(self, frames_data):
    # compute blur, brightness, etc., and return pass/fail flag with scores
    return {"passed": True, "scores": scores}

Step 5: Caption Generation

High‑quality clips that pass filtering are fed to a vision‑language model (e.g., Qwen‑VL) with a detailed prompt that asks for concise, objective descriptions of the inspection scene.

def _generate_caption(self, frames_data):
    # call VLM with prompt and return generated text
    return caption
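
The article does not show the model‑serving code; as one possible shape, here is a sketch that assumes the VLM is exposed through an OpenAI‑compatible endpoint (the URL, model name, and prompt are placeholders, not the production values):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://vlm-service:8000/v1", api_key="EMPTY")  # placeholder endpoint

def generate_caption(frame_paths):
    # Send the sampled frames plus an instruction prompt to the vision-language model
    content = [{"type": "text",
                "text": "Describe the inspection scene concisely and objectively."}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="qwen-vl",  # placeholder model name
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content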

Daft Pipeline and Streaming Execution

All steps are expressed as Daft UDFs with explicit num_cpus and num_gpus allocations. Daft’s explode operator expands video‑level rows into scene‑level rows, enabling massive parallelism. The pipeline runs on a Ray cluster, allowing dynamic scaling, fault tolerance, and checkpointing.
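
For illustration, here is a stripped‑down sketch of how one of these UDFs might be declared with an explicit resource request; the return type, core count, and batching details are assumptions, and detect_scenes stands for the Step 1 logic wrapped as a plain function. The driver below then chains the UDFs together.

import daft
from daft import DataType

# Illustrative: reserve 2 CPU cores per scene-detection task (real allocations may differ)
@daft.udf(return_dtype=DataType.list(DataType.list(DataType.float64())), num_cpus=2)
def scene_detect_udf(video_paths):
    # Receives a Daft Series of video paths; returns one [start, end] scene list per video
    return [detect_scenes(path) for path in video_paths.to_pylist()]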

import daft
from daft import col

def main():
    daft.context.set_runner_ray()  # run the DataFrame on the Ray cluster
    df = daft.from_glob_path(s3_path, io_config=io_config).select('path').with_column_renamed('path', 'video_path')
    df = df.with_column('scene_list', scene_detect_udf(col('video_path')))
    df = df.explode(col('scene_list'))  # one row per detected scene
    df = df.with_column('clip_path', video_split_udf(col('video_path'), col('scene_list')))
    df = df.with_column('frames', frame_sampler_udf(col('clip_path')))
    df = df.with_column('filtered', frame_filter_udf(col('frames')))
    df = df.with_column('caption', caption_udf(col('frames')))
    df.write_parquet(output_s3_path, io_config=io_config)

Performance Optimizations

CPU thread isolation: Setting OMP_NUM_THREADS per Daft actor prevents oversubscription and improves per‑actor throughput (a minimal environment sketch follows this list).

Zero‑copy Arrow tensors: Converting video frames to Arrow‑compatible tensors eliminates Python object copying for small media.

Ordered vs. unordered execution: Write‑heavy stages use unordered dispatch for maximum parallel I/O, while debugging stages keep order.

Streaming decoupling: Decoding/sampling runs in a separate UDF, feeding frames to GPU inference asynchronously, raising GPU utilization from ~20 % to > 90 %.

Checkpointing with append‑only Parquet: After each stage we anti‑join with previously processed keys, enabling safe pause, resume, and cluster scaling without re‑processing (see the anti‑join sketch after this list).
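
For the thread‑isolation point above, a minimal sketch of the idea: cap the native thread pools before the numerically heavy libraries start their workers inside each actor process (the core count is illustrative and would normally match the UDF's num_cpus request).

import os

# Illustrative: keep each actor within its CPU reservation; set before importing cv2/numpy-heavy code
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["OPENBLAS_NUM_THREADS"] = "2"
os.environ["MKL_NUM_THREADS"] = "2"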
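
And for the checkpointing point, a sketch of the resume logic as an anti‑join against the keys already written to the checkpoint location, assuming Daft's anti‑join support (the paths and key column are illustrative):

import daft

def resume_pending(df, checkpoint_path, key="video_path", io_config=None):
    # Drop rows whose key already appears in the checkpoint Parquet files,
    # so a restarted run only processes what is still missing
    done = daft.read_parquet(checkpoint_path, io_config=io_config).select(key)
    return df.join(done, on=key, how="anti")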

Results

CPU utilization rose from a fluctuating 40 %–60 % to a stable 100 % after Daft integration. GPU utilization increased from 20 %–40 % (I/O‑bound) to a steady > 90 % thanks to the asynchronous pipeline. The end‑to‑end throughput scaled linearly with added CPU cores.

Conclusion

The collaboration demonstrates that a world‑model‑centric robot platform can rely on a cloud‑native, AI‑optimized data lake and a Daft‑on‑Ray execution engine to turn massive raw video streams into clean, captioned, and searchable data. The approach delivers both engineering robustness (checkpointing, scaling) and concrete resource gains, enabling AI‑driven inspection, safety monitoring, and downstream model training at the "million‑hour" scale.

Tags: Video processing, Distributed Computing, Ray, AI pipeline, Daft
Written by ByteDance Data Platform

The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.
