Goku: How HKU and ByteDance’s New Model Sets New Benchmarks in Commercial Image and Video Generation
The paper presents Goku, a rectified‑flow transformer that jointly generates high‑quality images and videos at commercial scale, detailing its novel architecture, massive high‑quality data pipeline, efficient large‑scale training tricks, and state‑of‑the‑art results on GenEval, DPG‑Bench, VBench and UCF‑101.
Key Highlights
Industry‑leading text‑to‑image and text‑to‑video generation quality, achieving new records on multiple benchmarks.
Introduces the Rectified Flow Transformer to improve joint image‑video generation.
Builds a 36 M video‑text and 160 M image‑text high‑quality dataset using MLLM‑generated captions and LLM correction.
Optimizes compute efficiency and training stability for large‑scale distributed training.
Problem Statement
Existing image and video generators lag in quality, consistency, and computational efficiency.
Large‑scale, high‑quality data is needed to train high‑performance generative models.
Current architectures do not unify image and video representations, limiting cross‑modal generation.
Training such models is computationally expensive and requires better parallelism and fault tolerance.
Proposed Solution – Goku Model
Goku is built on a Rectified Flow Transformer that enables joint image‑video generation.
Constructs a massive dataset (36 M video‑text pairs, 160 M image‑text pairs) filtered with OCR analysis and aesthetic scoring.
Uses a 3D Variational Auto‑Encoder (VAE) to create a shared latent space for images and videos.
Adopts a full‑attention Transformer to improve cross‑modal consistency.
Employs ByteCheckpoint and MegaScale for efficient parallelism and fault‑tolerant training.
Technical Components
Rectified Flow : a flow‑based formulation that linearly interpolates between a prior (e.g., standard normal) and the data distribution, providing faster convergence and clearer theoretical properties.
3D VAE : encodes raw video frames (and images as a special case) into a latent space, then organizes latent tokens into mini‑batches containing both video and image tokens.
Full‑Attention Transformer : replaces the usual temporal‑plus‑spatial attention with a single attention mechanism that processes all multimodal tokens together.
FlashAttention and Sequence Parallelism reduce GPU memory usage and improve throughput for long sequences (more than 220 K tokens).
Patch n’ Pack (NaViT‑style) packs image and video tokens of varying aspect ratios and lengths into the same batch, eliminating the need for data buckets.
3D RoPE positional encoding accelerates convergence compared with sinusoidal embeddings and supports variable‑resolution inputs.
Query‑Key Normalization (Q‑K Norm) with RMSNorm stabilizes training and prevents loss spikes that cause visual artifacts.
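To make the Rectified Flow item above concrete, here is a minimal pure‑Python sketch of the training target and a simple Euler sampler. It assumes the convention that t = 0 is noise and t = 1 is data; the function names are illustrative and not taken from the paper's code, and a real model would replace the oracle velocity with a transformer prediction.

```python
import random


def rectified_flow_pair(x0, x1, t):
    """Return (x_t, target_velocity) for one training example.

    x0: noise sample from the prior, x1: data sample (lists of floats).
    Assumed convention: t=0 is pure noise, t=1 is data.
    """
    # Linear interpolation between the noise sample and the data sample.
    x_t = [t * b + (1.0 - t) * a for a, b in zip(x0, x1)]
    # Along a straight path the velocity is constant: data minus noise.
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    n = len(target_velocity)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)) / n


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0, -2.0, 0.5]
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    x_t, target = rectified_flow_pair(noise, data, t=0.25)
    # With the exact (oracle) velocity, Euler integration lands on the data.
    print(euler_sample(lambda x, t: target, noise, num_steps=4))
```

Because the path is a straight line, the oracle velocity is constant, so a few Euler steps are enough in this toy case. A learned velocity field is only approximately straight, and the number of sampling steps becomes a quality/cost trade‑off.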
Training Strategy
Stage 1 – Text‑Semantic Pairing : pre‑train on text‑to‑image to learn robust visual concepts.
Stage 2 – Joint Image‑Video Learning : integrate image and video tokens in a unified sequence, leveraging high‑quality image data to boost video generation.
Stage 3 – Modality‑Specific Fine‑Tuning : adjust image generation for visual appeal and video generation for temporal smoothness and motion consistency.
Multi‑stage resolution scaling (288×512 → 480×864 → 720×1280) gradually refines detail while keeping compute cost low.
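The cost argument behind resolution scaling can be illustrated by estimating sequence length per stage. The sketch below is a rough calculator only: the VAE strides (8× spatial, 4× temporal), the patch size of 2, the causal first‑frame handling, and the 97‑frame example are all illustrative assumptions, not values stated in this summary.

```python
import math

# Resolution stages listed in the text above (height, width).
STAGES = [(288, 512), (480, 864), (720, 1280)]


def latent_token_count(height, width, num_frames=1,
                       spatial_stride=8, temporal_stride=4, patch_size=2):
    """Estimate transformer sequence length for one image or clip.

    All stride and patch values are hypothetical defaults chosen for
    illustration; substitute the real model's values if known.
    """
    latent_h = math.ceil(height / spatial_stride)
    latent_w = math.ceil(width / spatial_stride)
    # Assumed causal 3D VAE: the first frame is kept, the rest are
    # compressed in time. An image is the num_frames=1 special case.
    latent_t = 1 + math.ceil((num_frames - 1) / temporal_stride)
    tokens_per_frame = (math.ceil(latent_h / patch_size)
                        * math.ceil(latent_w / patch_size))
    return latent_t * tokens_per_frame


if __name__ == "__main__":
    for h, w in STAGES:
        image_tokens = latent_token_count(h, w)
        clip_tokens = latent_token_count(h, w, num_frames=97)
        print(f"{h}x{w}: image={image_tokens} tokens, 97-frame clip={clip_tokens} tokens")
```

Under these assumptions a single frame grows from 576 tokens at 288×512 to 3,600 tokens at 720×1280, a 6.25× increase. Full attention scales quadratically with sequence length, so spending most training steps at the lower resolutions is considerably cheaper.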
Infrastructure Optimizations
3D Parallelism across sequence, data, and model dimensions.
Sequence Parallel (SP) with all‑to‑all communication for query/key/value shards.
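Fully Sharded Data Parallel (FSDP) with the HYBRID_SHARD strategy, which shards parameters within a group and replicates them across groups, to reduce communication cost and overlap communication with computation.
Fine‑grained Activation Checkpointing stores activations only for selected layers and recomputes the rest during the backward pass, saving memory while keeping GPU utilization high.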
MegaScale provides automatic fault detection, multi‑level monitoring, and fast restart to keep large‑scale training stable.
ByteCheckpoint enables sub‑second checkpoint I/O for an 8 B model on thousands of GPUs.
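The all‑to‑all step in Sequence Parallel can be hard to picture, so here is a single‑process simulation of the data movement. It is a sketch under an assumed layout (each rank starts with a contiguous slice of the sequence and all attention heads, and ends with the full sequence for a subset of heads); it uses plain lists instead of a real collective, and the function name is hypothetical.

```python
def all_to_all_seq_to_head(shards, num_heads):
    """Simulate the all-to-all that turns sequence shards into head shards.

    shards[r][i][h] is the value for head h of the i-th local token on rank r.
    Returns out[r][j][k]: on rank r, the j-th token of the full sequence and
    the k-th of the heads owned by rank r.
    """
    world = len(shards)
    assert num_heads % world == 0, "heads must divide evenly across ranks"
    heads_per_rank = num_heads // world
    out = []
    for dst in range(world):
        first_head = dst * heads_per_rank
        full_sequence = []
        # Chunks are gathered in rank order, so token order is preserved.
        for src in range(world):
            for token in shards[src]:
                full_sequence.append(token[first_head:first_head + heads_per_rank])
        out.append(full_sequence)
    return out


if __name__ == "__main__":
    # Two ranks, four tokens, two heads; each entry is labelled by position and head.
    shards = [
        [["t0h0", "t0h1"], ["t1h0", "t1h1"]],  # rank 0 holds tokens 0-1
        [["t2h0", "t2h1"], ["t3h0", "t3h1"]],  # rank 1 holds tokens 2-3
    ]
    for rank, local in enumerate(all_to_all_seq_to_head(shards, num_heads=2)):
        print(rank, local)
```

After the exchange each rank can run ordinary attention over the whole sequence for its own heads; a second all‑to‑all in the opposite direction would restore the sequence‑sharded layout for the layers that follow.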
Data Curation Pipeline
The final training corpus contains ~160 M image‑text pairs and 36 M video‑text pairs, sourced from public datasets (LAION, Panda‑70M, InternVid, OpenVid‑1M, Pexels) and proprietary collections. The pipeline consists of five stages: collection, video clipping, filtering, annotation, and distribution balancing.
Pre‑processing & Standardization : convert all videos to H.264, filter by length (≥4 s), resolution (≥480 px), bitrate (≥500 kbps), and frame‑rate (≥24 fps).
Clip Extraction : use PySceneDetect for coarse shot detection, then DINOv2 similarity to refine clips; limit each clip to ≤10 s.
Aesthetic Filtering : discard clips with average aesthetic score below 4.3 (480 px) or 4.5 (>720 px).
OCR Filtering : remove clips where text coverage exceeds 2 % (480 px) or 1 % (>720 px).
Motion Filtering : compute average optical flow with RAFT; discard clips with motion score <0.3 or >20.0 (480 px) and <0.5 or >15.0 (>720 px).
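The filtering thresholds listed above can be expressed as a small rule table. The sketch below assumes the scores (aesthetic, text coverage, motion) have already been computed by upstream models; the `Clip` fields, the tier keys "480" and "720", and the reason strings are hypothetical names introduced for illustration. The ≥4 s check on source videos is omitted because it applies before clipping.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from the list above, keyed by resolution tier.
# The tier keys are hypothetical labels for the "480 px" and ">720 px" cases.
THRESHOLDS = {
    "480": {"min_aesthetic": 4.3, "max_text_coverage": 0.02, "motion_range": (0.3, 20.0)},
    "720": {"min_aesthetic": 4.5, "max_text_coverage": 0.01, "motion_range": (0.5, 15.0)},
}


@dataclass
class Clip:
    duration_s: float
    resolution_px: int
    bitrate_kbps: float
    fps: float
    aesthetic: float       # average aesthetic score of the clip
    text_coverage: float   # fraction of frame area covered by text, 0..1
    motion: float          # average optical-flow score


def rejection_reasons(clip: Clip, tier: str) -> List[str]:
    """Return the list of filters a clip fails; an empty list means keep it."""
    limits = THRESHOLDS[tier]
    reasons = []
    if clip.duration_s > 10.0:
        reasons.append("clip longer than 10 s")
    if clip.resolution_px < 480:
        reasons.append("resolution below 480 px")
    if clip.bitrate_kbps < 500:
        reasons.append("bitrate below 500 kbps")
    if clip.fps < 24:
        reasons.append("frame rate below 24 fps")
    if clip.aesthetic < limits["min_aesthetic"]:
        reasons.append("aesthetic score too low")
    if clip.text_coverage > limits["max_text_coverage"]:
        reasons.append("too much on-screen text")
    low, high = limits["motion_range"]
    if not (low <= clip.motion <= high):
        reasons.append("motion score out of range")
    return reasons


if __name__ == "__main__":
    clip = Clip(duration_s=6.0, resolution_px=720, bitrate_kbps=1200.0,
                fps=30.0, aesthetic=4.8, text_coverage=0.015, motion=3.0)
    print(rejection_reasons(clip, "480"))
    print(rejection_reasons(clip, "720"))
```

Returning reasons rather than a boolean makes it easy to audit how many clips each filter removes, which matters when tuning thresholds per resolution tier.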
Captions are generated with InternVL2.0 for dense image captions and Tarsier2 for video‑level captions, then merged by Qwen2. Motion scores are appended to the captions to give users control over generated motion dynamics.
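As a rough illustration of the last step, the snippet below appends a motion score to a caption. The exact text format used in training is not given here, so the "Motion score: X." suffix and the function name are assumptions.

```python
def caption_with_motion(caption: str, motion_score: float, decimals: int = 1) -> str:
    """Append a motion score to a caption (hypothetical format)."""
    caption = caption.strip()
    # Make sure the caption ends with sentence punctuation before the suffix.
    if caption and caption[-1] not in ".!?":
        caption += "."
    return f"{caption} Motion score: {motion_score:.{decimals}f}."


if __name__ == "__main__":
    print(caption_with_motion("A dog runs along the beach", 12.5))
```

At inference time a user could then raise or lower the number in the prompt to request more or less motion, provided the model saw the same format during training.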
Experiments
Text‑to‑Image Results
GenEval : Goku‑T2I scores 0.76, the highest among evaluated models.
T2I‑CompBench : outperforms PixArt‑α, SDXL, DALL‑E 2 on color, shape, and texture alignment.
DPG‑Bench : achieves 83.65 average score, surpassing PixArt‑α (71.11), DALL‑E 3 (83.50) and EMU3 (80.60).
Text‑to‑Video Results
UCF‑101 Zero‑Shot : Goku‑2B generates 13 320 videos; at 128×128 resolution it attains FVD = 217.24 and the best Inception Score, beating all baselines.
VBench : Goku‑T2V leads on overall performance and excels in human motion, dynamics, multi‑object generation, style, quality, and semantic alignment across 16 dimensions.
Image‑to‑Video (I2V) Fine‑Tuning
Using ~4.5 M image‑video‑text triples for 10 k steps, Goku‑I2V produces high‑fidelity videos that remain tightly aligned with accompanying text, demonstrating strong generalization despite limited fine‑tuning.
Ablation Studies
Model Scaling : 8 B version reduces structural distortions (e.g., broken arms, malformed wheels) compared with 2 B.
Joint Training vs. Separate Training : joint image‑video training yields noticeably higher frame quality and stability than training video generation alone.
Conclusion
Goku introduces a unified, flow‑based foundation model for commercial‑grade image and video generation. By combining a rectified‑flow transformer, 3D VAE, full‑attention, and a rigorously curated dataset, it achieves state‑of‑the‑art performance on multiple benchmarks while maintaining efficient, fault‑tolerant large‑scale training.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
