From Zero to One: Complete Architecture Design for a Billion‑Scale Short‑Video System

This article dissects the end‑to‑end architecture of a billion‑scale short‑video platform: its layered design; core services such as upload, transcoding, recommendation, interaction, and storage; and the key challenges of massive video storage, high‑concurrency streaming, low‑latency playback, and real‑time recommendation reliability.


Introduction

Short video has become one of the highest‑engagement formats on the Internet, with leading platforms serving hundreds of millions of daily active users and handling tens of thousands of uploads, transcoding jobs, and deliveries per second. Building such a system requires far more than simple CRUD and file storage.

Overall Architecture

The system follows a "layered decoupling + micro‑service" model and is divided from bottom to top into four layers: infrastructure component layer, core service layer, algorithm engine layer, and access layer, all supported by a monitoring and operations framework.

Key Components

Access Layer

API Gateway: Kong/APISIX handles request routing, JWT/OAuth2.0 authentication, token‑bucket rate limiting, and gray release based on user ID or region, while also managing CORS and parameter validation.
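
Kong and APISIX ship token‑bucket rate limiting as built‑in plugins; the snippet below is only a minimal standalone sketch of the algorithm itself, with the rate and capacity values chosen purely for illustration.

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# e.g. 100 requests/sec per user, with bursts of up to 200
limiter = TokenBucket(rate=100, capacity=200)
```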

Load Balancing: Nginx/LVS with DNS round‑robin provides region‑level traffic distribution, directing users to the nearest data center to reduce latency.

Gray Release: Critical features are rolled out to a subset of users using hash‑bucket traffic splitting, allowing real‑time feedback before full deployment.

Core Service Layer

Video Production Service: Split into upload, transcoding, and audit micro‑services. The upload service receives chunked video, validates integrity, merges the chunks in object storage, then sends a transcoding task to Kafka. Transcoding uses an FFmpeg cluster to produce multiple bitrate versions. Audit integrates with a content‑moderation API.

Video Distribution Service: Implements a push‑pull chain. Push service writes transcoded streams to CDN origin; pull service provides playback URLs with support for resumable download. A cache service uses Redis to store hot video metadata, reducing DB hits.
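
As a rough sketch of the cache‑aside pattern such a cache service follows, the snippet below reads metadata from Redis and backfills on a miss; the key format, the 300 s TTL, and the `load_metadata_from_db` helper are hypothetical.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_video_metadata(video_id: str) -> dict:
    """Cache-aside read: try Redis, fall back to the DB, then backfill."""
    key = f"video:meta:{video_id}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    meta = load_metadata_from_db(video_id)  # hypothetical DB accessor
    r.setex(key, 300, json.dumps(meta))     # short TTL bounds staleness
    return meta
```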

Interaction Service: Handles likes, comments, shares, and follows. It is further divided into interaction processing, relationship management, and message notification modules.

User Service: Manages account information, login sessions, and user rights. Data is stored in MySQL with sharding by user‑ID hash.
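
A minimal sketch of hash‑based shard routing, assuming 64 shards and a `user_db_NN` naming scheme (both illustrative):

```python
import hashlib

NUM_SHARDS = 64  # assumed shard count

def shard_for_user(user_id: int) -> str:
    """Map a user ID to its MySQL shard; MD5 spreads sequential IDs evenly."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return f"user_db_{int(digest, 16) % NUM_SHARDS:02d}"

# Every read and write for a given user is routed to the same shard.
```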

Algorithm Engine Layer

Recommendation Module: Combines a real‑time engine (Flink) with an offline engine (Spark). Offline computes long‑term interests (T+1), while real‑time updates short‑term interests every second. Results are served via a recommendation API.

User Profile Module: Stores user tags (e.g., "likes food", "follows cars") in Elasticsearch, derived from basic info, interaction behavior, and watch time.

Content Moderation Module: Uses AI to detect prohibited content and routes suspicious videos to manual review, ensuring compliance.

Infrastructure Layer

Storage Components: Object storage (S3/OSS) for video files, MySQL (master‑slave) for structured data, Elasticsearch/HDFS for unstructured data, and Redis (cluster with sharding and sentinel) for caching hot video metadata and login states.

Compute Components: Spark for offline batch jobs, Flink for real‑time stream processing.

Message Queue: Kafka (cluster mode) decouples services, e.g., uploading triggers a transcoding task, which upon completion triggers an audit task.

CDN: Integration with Alibaba Cloud or Tencent Cloud CDN to cache transcoded videos at nationwide edge nodes.

Core Function Implementations

Video Upload & Transcoding

Clients upload videos in 1 MB chunks over HTTP/2. The upload service validates each chunk, merges them in object storage, and emits a "transcoding task" to Kafka. The transcoding service consumes the message, runs FFmpeg on a transcoding cluster to generate 480P/720P/1080P streams, and updates video metadata.
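
A sketch of the hand‑off step using kafka-python; the topic name and payload fields are assumptions, not the platform's actual schema.

```python
import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_transcode_task(video_id: str, object_key: str) -> None:
    """Publish a transcoding task once all chunks are merged in object storage."""
    producer.send("video.transcode", {          # topic name is an assumption
        "video_id": video_id,
        "source": object_key,
        "profiles": ["480P", "720P", "1080P"],
    })
    producer.flush()  # block until the broker acknowledges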

Key techniques include chunked upload with resume capability, asynchronous transcoding via Celery with priority queues, and QUIC to improve upload speed on weak networks.
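
To make the Celery‑plus‑FFmpeg flow concrete, here is a minimal task sketch; the broker URL, queue name, and bitrate table are assumptions.

```python
import subprocess
from celery import Celery

app = Celery("transcode", broker="redis://localhost:6379/0")  # broker is assumed

# Assumed output heights and video bitrates per profile.
PROFILES = {"480P": ("480", "1000k"),
            "720P": ("720", "2500k"),
            "1080P": ("1080", "5000k")}

@app.task(queue="transcode_high")  # priority expressed as dedicated queues
def transcode(src: str, dst: str, profile: str) -> None:
    """Produce one rendition with FFmpeg; scale=-2 keeps the width even."""
    height, bitrate = PROFILES[profile]
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",
        "-c:v", "libx264", "-b:v", bitrate,
        "-c:a", "aac",
        dst,
    ], check=True)
```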

Recommendation & Distribution

When a user opens the app, the client calls the recommendation API. The service first fetches short‑term interest tags from Redis (e.g., videos liked in the last hour), then merges them with long‑term interests from Spark, combines device and region info, filters candidates in Elasticsearch, and returns a personalized list. User actions (play, like, comment) are streamed to Flink for real‑time interest updates.
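
The merge step might look like the following simplified sketch, assuming interest tags are kept in Redis sorted sets; the key names and the 0.7/0.3 blend weights are illustrative.

```python
import redis

r = redis.Redis(decode_responses=True)

def merged_interest_tags(user_id: int, top_k: int = 10) -> list[str]:
    """Blend real-time and offline interest tags, weighting short-term higher."""
    scores: dict[str, float] = {}
    for key, weight in [(f"interest:rt:{user_id}", 0.7),       # Flink-updated
                        (f"interest:offline:{user_id}", 0.3)]: # Spark-computed
        for tag, score in r.zrevrange(key, 0, top_k - 1, withscores=True):
            scores[tag] = scores.get(tag, 0.0) + weight * score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```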

Key techniques: collaborative‑filtering algorithm, DeepFM model that fuses user and video features, and cache pre‑warming of popular recommendation lists in Redis.

User Interaction (Like/Comment)

For a like, the client sends a request to the interaction service, which validates the login token, checks Redis to prevent duplicate likes, updates a sharded MySQL table and the Redis like set, and publishes a "like notification" message to Kafka. The message service consumes it and pushes a notification to the video author.
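
A condensed sketch of that flow, using an atomic SADD as the duplicate check (one way to sidestep the race the distributed lock guards against); the key names, topic, and `save_like_to_mysql` helper are hypothetical.

```python
import json
import redis
from kafka import KafkaProducer

r = redis.Redis()
producer = KafkaProducer(bootstrap_servers=["kafka:9092"],
                         value_serializer=lambda v: json.dumps(v).encode())

def like_video(user_id: int, video_id: str) -> bool:
    """Idempotent like: SADD is atomic and returns 0 for an existing member."""
    if r.sadd(f"video:likes:{video_id}", user_id) == 0:
        return False  # duplicate like, nothing to do
    save_like_to_mysql(user_id, video_id)  # hypothetical sharded-MySQL write
    r.incr(f"video:like_count:{video_id}")
    producer.send("like.notification", {"user": user_id, "video": video_id})
    return True
```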

Key techniques: Redis distributed lock to avoid race conditions, MySQL sharding by user‑ID hash, read‑write separation (reads from Redis, writes to MySQL master), and CDN pre‑heat for fast interaction data retrieval.

Technical Challenges & Solutions

Massive Video Storage Cost Control

Assuming 1 billion videos of 10‑50 MB each, the raw footage alone occupies 10‑50 PB; with multiple transcoded renditions and replicas, total storage can exceed 100 PB. A naïve use of standard storage would cost tens of millions of yuan annually. Moreover, video access exhibits a "hot‑cold" pattern: the last 30 days account for 80 % of plays, while older videos are rarely accessed.

Solution: tiered storage with three levels—hot (SSD object storage for videos with >1000 daily plays or in the top 10 % over the last 7 days), warm (HDD standard storage for videos played within 30 days), and cold (archival storage for videos inactive >3 months). A scheduled job (e.g., driven by cron) runs nightly to migrate videos based on play counts. A manual archive API allows creators to move videos to cold storage while preserving URLs via soft links.
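
A sketch of the nightly migration logic, with thresholds taken from the tiers above and `move_object` standing in for the real storage client:

```python
from datetime import datetime, timedelta

HOT_DAILY_PLAYS = 1000            # threshold from the hot tier above
WARM_WINDOW = timedelta(days=30)  # played within 30 days stays warm
COLD_AFTER = timedelta(days=90)   # inactive >3 months goes to archive

def migrate_tiers(videos) -> None:
    """Nightly job: move each video to the tier its recent activity implies."""
    now = datetime.utcnow()
    for v in videos:  # v: record with .daily_plays, .last_played, .tier, .key
        if v.daily_plays > HOT_DAILY_PLAYS:
            target = "hot"
        elif now - v.last_played <= WARM_WINDOW:
            target = "warm"
        elif now - v.last_played > COLD_AFTER:
            target = "cold"
        else:
            target = v.tier  # 30-90 day gap: leave the tier unchanged
        if target != v.tier:
            move_object(v.key, target)  # hypothetical storage-client call
            v.tier = target
```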

High‑Concurrency Push & Low‑Latency Playback

During peak hours (20:00‑22:00), the system may receive >1000 new push requests per second. Directly pushing to the origin would saturate bandwidth. Simultaneously, millions of concurrent pull requests can push a single node beyond 10 Gbps, causing packet loss and >3 s latency, with stall rates >5 %.

Solution: Deploy edge nodes as push‑proxy services (based on a custom Nginx‑RTMP module) that accept chunked uploads, perform verification and temporary storage, then asynchronously sync increments to the central origin. CDN employs a three‑level cache hierarchy (edge → region → origin). Hot videos are cached at edge nodes nationwide, achieving <100 ms latency; warm videos reside in regional nodes, reducing origin load.

Client‑side optimizations include a pre‑load mechanism that buffers 50 % of the next recommended video and an adaptive bitrate algorithm that switches streams based on measured network speed, keeping stall rate below 1 %.
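
A minimal adaptive‑bitrate selection sketch; the rendition bitrates and the 0.8 headroom factor are assumptions.

```python
# (bitrate_kbps, label) for the renditions produced during transcoding
RENDITIONS = [(5000, "1080P"), (2500, "720P"), (1000, "480P")]

def pick_rendition(measured_kbps: float, headroom: float = 0.8) -> str:
    """Return the highest-bitrate rendition that fits within a safety
    margin of the measured throughput; fall back to the lowest."""
    budget = measured_kbps * headroom
    for bitrate, label in RENDITIONS:
        if bitrate <= budget:
            return label
    return RENDITIONS[-1][1]

# pick_rendition(4000) -> "720P": 5000 kbps exceeds the 3200 kbps budget
```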

Real‑Time Recommendation High Availability

The recommendation pipeline relies on Flink (real‑time), Elasticsearch (search), and the recommendation API. Failure in any component can produce empty or erroneous recommendation lists, potentially dropping DAU by over 40 %.

Multi‑level downgrade strategy (a decision sketch follows the list):

Level 1 (Flink latency >5 s): Switch to "offline recommendation + short‑term cache" mode, using Spark‑computed long‑term interests and Redis‑cached recent interaction tags.

Level 2 (Elasticsearch health <90 %): Serve a hot‑video list cached in Redis (updated every 10 minutes) and mark the UI as "Popular Recommendations".

Level 3 (API QPS overload): Limit response to 20 videos instead of 50 and reject non‑core requests (e.g., similar‑video queries) to preserve homepage availability.
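
A compact sketch of the level‑selection logic, with thresholds taken from the levels above and the input signals assumed to come from monitoring:

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"
    OFFLINE_PLUS_CACHE = "level1"  # Flink lagging: offline + short-term cache
    HOT_LIST_ONLY = "level2"       # ES unhealthy: serve cached hot list
    TRUNCATED = "level3"           # API overloaded: 20 videos, core requests only

def choose_mode(flink_lag_s: float, es_health_pct: float,
                qps: float, qps_limit: float) -> Mode:
    """Map the monitored signals to the downgrade levels listed above."""
    if es_health_pct < 90:
        return Mode.HOT_LIST_ONLY
    if flink_lag_s > 5:
        return Mode.OFFLINE_PLUS_CACHE
    if qps > qps_limit:
        return Mode.TRUNCATED
    return Mode.NORMAL
```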

Data consistency is ensured by pre‑loading offline results into Redis with version tags and performing incremental updates when ES data changes. Automatic recovery is driven by Prometheus + Grafana alerts; once a component recovers, dual‑active deployment seamlessly switches back to normal mode without user impact.

Conclusion

By decomposing the vague requirement into four clear layers, selecting appropriate tools for each (edge nodes for push, Spark + Flink for recommendation, tiered storage for cost, and graceful degradation for availability), developers can construct a short‑video system that withstands billion‑scale traffic, controls storage expenses, and remains fault‑tolerant.
