
Build a Scalable Short‑Video System: Architecture, Storage, and Real‑Time Recommendations

This article dissects the architecture of a modern short‑video backend, covering layered system design, core services such as video production, distribution, interaction, storage strategies, real‑time and offline recommendation engines, high‑concurrency streaming solutions, and practical techniques for cost control, scalability, and fault tolerance.

Su San Talks Tech

Sep 1, 2025


Hello, I am Su San.

Introduction

Any backend developer who has scrolled through endless short videos on the subway, only to be asked by a product manager to "build a short‑video system", will recognize the feeling. The moment the request arrives, you start thinking about massive video storage, high‑concurrency streaming, and precise real‑time recommendation.

In today’s internet ecosystem, short video has become one of the highest‑engagement content formats. Platforms with hundreds of millions of daily active users upload, transcode, and distribute thousands of videos per second, while those users continuously refresh, watch, and interact.

Supporting such massive scale requires more than a simple CRUD + file storage architecture.

This article uses popular short‑video platforms as a reference prototype and, from a backend perspective, breaks down the system architecture into layers, core functional implementations, and key technical challenges, providing a practical roadmap for developers facing short‑video requirements.

1. Overall Architecture

Short‑video systems are typical "high‑concurrency, massive‑data, low‑latency" applications that must handle three core pipelines: video production (upload, transcode), content distribution (recommendation, CDN), and user interaction (like, comment, share). The design follows a "layered decoupling + micro‑services" approach and is divided into four layers: Infrastructure layer, Core service layer, Algorithm engine layer, and Access layer, backed by a monitoring and operations system to ensure stability.

1.1 Architecture Diagram

1.2 Key Components

(1) Access Layer

API Gateway : Uses Kong/APISIX for request routing, JWT/OAuth2.0 authentication, token‑bucket rate limiting, and gray release based on user ID / region. Handles CORS and request validation, shielding backend complexity.

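Load Balancing : Nginx/LVS with geo‑aware DNS resolution for region‑level traffic distribution, directing users to the nearest data center to reduce latency. Plain DNS round‑robin only rotates across addresses and does not by itself pick the nearest site.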

Gray Release : Configurable at the gateway to expose new features to a subset of users (e.g., hash‑based traffic split).
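The hash‑based traffic split mentioned above can be sketched in a few lines. This is a minimal illustration, not the configuration of Kong or APISIX: the function name `in_gray_release`, the `feature` key, and the 100‑bucket scheme are assumptions made for the example.

```python
import hashlib


def in_gray_release(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the rollout bucket for a feature."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    # Hash feature and user together so each feature gets its own cohort
    # (hypothetical scheme; a real gateway plugin may bucket differently).
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


if __name__ == "__main__":
    exposed = sum(in_gray_release(str(uid), "new-feed", 10) for uid in range(10_000))
    print(f"{exposed} of 10000 users see the new feature")
```

Because the bucket depends only on the user ID and the feature name, a user stays in the same cohort across requests, and raising the percentage only ever adds users to the rollout.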

(2) Core Service Layer

Video Production Service : Handles upload, transcode, audit via three micro‑services – upload (receives video chunks), transcode (calls a transcode cluster for multi‑bitrate output), and audit (integrates content‑moderation APIs).

Video Distribution Service : Implements the "push‑stream – pull‑stream" chain, including push service (writes transcoded streams to CDN origin), pull service (provides playback URLs with resume support), and cache service (Redis cache for hot video metadata).

Interaction Service : Manages likes, comments, follows, and notifications. Splits into interaction (CRUD), relationship (follow/fan management), and messaging (push notifications).

User Service : Stores user profiles, login sessions, and privileges using MySQL with sharding by user ID.

(3) Algorithm Engine Layer

Recommendation Module : Dual engine – real‑time recommendation (Flink) and offline recommendation (Spark) – updates short‑term interests every second and long‑term interests daily, returning personalized video lists via API.

User Profile Module : Aggregates user info, interaction behavior, and watch time, storing tags in Elasticsearch for recommendation and content moderation.

Content Moderation Module : Combines AI‑based image/text detection with manual review to filter violating videos in real time.

(4) Infrastructure Layer

Storage Component : Object storage (S3/OSS) for massive video files; MySQL (master‑slave) for structured data; Elasticsearch/HDFS for unstructured data; Redis cluster for hot video metadata and login sessions.

Compute Component : Offline batch processing with Spark (user profiles, model training) and real‑time stream processing with Flink.

Message Queue : Kafka for decoupling services (e.g., upload → transcode → audit).

CDN : Integration with Alibaba/Tencent CDN to cache transcoded videos at nationwide edge nodes.

2. Core Functional Implementations

2.1 Video Upload and Transcoding

Upload Process : Client performs chunked upload (1 MB per chunk) over HTTP/2 to the upload service, which validates chunks and merges them in object storage. A transcode task message is sent to Kafka; the transcode service consumes it and uses FFmpeg to produce 480p/720p/1080p streams, updating video metadata upon completion.

Key Techniques : Chunked upload with resume, asynchronous transcoding through prioritized task queues (e.g., Celery workers), QUIC protocol for better performance on weak networks.
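To make "chunked upload with resume" concrete, here is a minimal in‑memory sketch of the server‑side bookkeeping. The class `UploadSession` and its methods are hypothetical names; a real upload service would persist chunk state and write to object storage rather than hold bytes in a dict.

```python
from dataclasses import dataclass, field


@dataclass
class UploadSession:
    """Tracks which chunks of one upload have arrived (in memory for the sketch)."""

    total_chunks: int
    chunks: dict[int, bytes] = field(default_factory=dict)

    def put_chunk(self, index: int, data: bytes) -> None:
        if not 0 <= index < self.total_chunks:
            raise IndexError(f"chunk index {index} out of range")
        # Re-sending a chunk after a retry simply overwrites it (idempotent).
        self.chunks[index] = data

    def missing_chunks(self) -> list[int]:
        """Indexes the client still has to send; this is what enables resume."""
        return [i for i in range(self.total_chunks) if i not in self.chunks]

    def merge(self) -> bytes:
        missing = self.missing_chunks()
        if missing:
            raise ValueError(f"cannot merge, missing chunks: {missing}")
        # Concatenate in index order, regardless of arrival order.
        return b"".join(self.chunks[i] for i in range(self.total_chunks))
```

After a dropped connection, the client would ask for `missing_chunks()` and send only those, instead of restarting the whole upload.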

2.2 Video Recommendation and Distribution

Recommendation Flow : Client calls recommendation API; service fetches short‑term interest tags from Redis and long‑term tags from Spark results, combines with region/device info, queries Elasticsearch for candidate videos, and returns a personalized list via load balancer. User actions (play, like, comment) are streamed to Flink for real‑time interest updates.

Key Techniques : Collaborative filtering, DeepFM deep learning model, cache pre‑warming of hot recommendation lists.
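One step in the flow above, combining short‑term and long‑term interest tags before querying for candidates, might look like the sketch below. The weighted sum, the default weight of 0.7, and the function name `merge_interest_tags` are assumptions for illustration; the article does not specify how the two tag sets are blended.

```python
def merge_interest_tags(
    short_term: dict[str, float],
    long_term: dict[str, float],
    short_weight: float = 0.7,  # assumed weighting, not from the article
    top_n: int = 5,
) -> list[str]:
    """Blend two tag->score maps and return the top tags for candidate search."""
    if not 0.0 <= short_weight <= 1.0:
        raise ValueError("short_weight must be between 0 and 1")
    scores: dict[str, float] = {}
    for tag in set(short_term) | set(long_term):
        scores[tag] = (
            short_weight * short_term.get(tag, 0.0)
            + (1.0 - short_weight) * long_term.get(tag, 0.0)
        )
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda tag: (-scores[tag], tag))
    return ranked[:top_n]
```

The resulting tag list would then feed the Elasticsearch candidate query, together with region and device filters.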

2.3 User Interaction (Like / Comment)

Like Flow : Client sends like request to interaction service, which validates login, checks Redis for existing likes to prevent duplicates, updates MySQL (sharded like table) and Redis, and publishes a like notification message to Kafka for push notification.

Key Techniques : Redis distributed lock, MySQL sharding by user ID, read‑write separation, cache pre‑warming for hot interaction data.
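The like flow can be sketched with in‑memory stand‑ins for each store. Everything here is hypothetical scaffolding: the set replaces the Redis duplicate check, the per‑shard lists replace the sharded MySQL like table, the `outbox` list replaces the Kafka topic, and the shard count of 16 is an assumed value.

```python
NUM_SHARDS = 16  # assumed shard count for the sketch


def like_shard(user_id: int) -> int:
    """Pick the like-table shard from the user ID (simple modulo sharding)."""
    return user_id % NUM_SHARDS


class LikeService:
    def __init__(self) -> None:
        self._liked: set[tuple[int, int]] = set()  # stand-in for Redis
        self._rows: dict[int, list[tuple[int, int]]] = {
            shard: [] for shard in range(NUM_SHARDS)
        }  # stand-in for sharded MySQL tables
        self.outbox: list[dict] = []  # stand-in for the Kafka topic

    def like(self, user_id: int, video_id: int) -> bool:
        """Record a like once; return False if it already existed."""
        key = (user_id, video_id)
        # With real Redis, SADD returns the number of members actually added,
        # so one call performs this check-and-set atomically.
        if key in self._liked:
            return False
        self._liked.add(key)
        self._rows[like_shard(user_id)].append(key)
        self.outbox.append({"type": "like", "user_id": user_id, "video_id": video_id})
        return True
```

Sharding by user ID keeps all of one user's likes in a single shard, which makes "videos I liked" a single‑shard query.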

3. Technical Challenges and Solutions

3.1 Massive Video Storage Cost Control

Each short video is 10‑50 MB; billions of videos require >100 PB. Storing everything in standard storage would cost millions annually. Moreover, video access exhibits a "cold‑hot" pattern: about 80 % of plays go to videos from the last 30 days.
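A quick back‑of‑the‑envelope check shows where a figure above 100 PB can come from. The video count of 10 billion is an assumption chosen for illustration (the article only says "billions"), and decimal units are used (1 PB = 10^9 MB).

```python
MB_PER_PB = 1_000_000_000  # decimal units: 1 PB = 1e9 MB


def total_storage_pb(video_count: int, avg_size_mb: float) -> float:
    """Raw storage needed for video_count files of avg_size_mb each, in PB."""
    return video_count * avg_size_mb / MB_PER_PB


if __name__ == "__main__":
    assumed_videos = 10_000_000_000  # hypothetical count, not from the article
    print(total_storage_pb(assumed_videos, 10))  # lower bound of the 10-50 MB range
    print(total_storage_pb(assumed_videos, 50))  # upper bound of the 10-50 MB range
```

Under these assumptions the range is 100 to 500 PB before replication or multiple bitrate renditions are counted, which is why tiering and compression matter.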

Tiered Storage Strategy

Hot Videos (top 10 % in last 7 days or >1 000 plays/day) stored on SSD‑based object storage for millisecond‑level latency.

Warm Videos (played within 30 days but not hot) stored on HDD standard nodes, costing half of SSD.

Cold/Archive Videos (no play for >3 months) migrated to archive storage (≈1/5 cost of standard).

Scheduled cron jobs calculate play counts nightly and trigger automatic tier migration. A manual archive API allows creators to move historic videos while preserving playback URLs via soft links.
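The nightly classification step could be sketched as follows. The thresholds come from the tiers above; two details are assumptions of this sketch: "3 months" is approximated as 90 days, and videos last played between 30 and 90 days ago stay in the warm tier, since the tier definitions do not cover that window.

```python
from enum import Enum


class Tier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def classify_tier(
    days_since_last_play: int,
    plays_per_day: float,
    in_top_10_percent_last_7d: bool,
) -> Tier:
    """Decide the storage tier for one video from its nightly play statistics."""
    if in_top_10_percent_last_7d or plays_per_day > 1000:
        return Tier.HOT
    if days_since_last_play > 90:  # "3 months" approximated as 90 days
        return Tier.COLD
    # Assumption: anything not hot and not yet cold is kept warm.
    return Tier.WARM
```

A migration job would compare the result with the video's current tier and enqueue a move only when they differ, so stable videos cause no storage traffic.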

Video Compression Optimization

Transcoding defaults to H.265, reducing bitrate by ~30 % compared to H.264. Short videos (15‑60 s) are capped at 500‑1500 kbps per resolution; long videos use VBR to save an additional ~20 %.

Legacy devices that do not support H.265 receive an extra H.264 low‑bitrate version, selected by the client based on device detection.
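The effect of a bitrate cap and of the H.265 saving can be estimated with simple arithmetic: size equals bitrate times duration divided by 8. The sketch below covers the video stream only (no audio or container overhead), and the function names are hypothetical.

```python
H265_SAVING = 0.30  # ~30 % bitrate reduction versus H.264, as stated above


def estimate_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Video payload size in MB (decimal) at a given average bitrate."""
    return bitrate_kbps * duration_s / 8 / 1000  # kbit -> kB -> MB


def h265_bitrate_kbps(h264_bitrate_kbps: float) -> float:
    """Bitrate needed by H.265 for roughly the same quality as H.264."""
    return h264_bitrate_kbps * (1 - H265_SAVING)


def pick_codec(device_supports_h265: bool) -> str:
    """Serve H.265 by default and fall back to H.264 on legacy devices."""
    return "h265" if device_supports_h265 else "h264"
```

For example, a 60 s video at the 1500 kbps cap is about 11.25 MB in H.264, and roughly 7.9 MB once the 30 % H.265 saving is applied.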

3.2 High‑Concurrency Streaming and Low‑Latency Playback

Peak periods see >1 000 new streams per second; direct streaming to the origin would saturate bandwidth. Simultaneous million‑scale pull requests can exceed 10 Gbps per node, causing packet loss and >3 s latency.

Push Side – Edge Nodes

Clients resolve DNS to the nearest edge node and upload video chunks to it. The edge node validates the chunks, stores them temporarily, and then syncs them asynchronously (and incrementally) to the central origin. This relieves the origin of bursty upload traffic.

The edge node runs a custom Nginx‑RTMP proxy supporting HTTP/2 and QUIC. Chunk identifiers enable resume uploads, achieving ~99 % success on unstable networks.

Pull Side – Multi‑Level CDN Cache

Three‑tier cache: edge → regional → origin. Hot videos (>100 k plays) are cached at edge nodes nationwide (latency <100 ms). Warm videos are cached at regional nodes; if absent, the edge pulls from the region instead of the origin.
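The edge → regional → origin lookup can be sketched with plain dictionaries. The class name `TieredCache` and the fill‑on‑miss policy (copy the object into every tier it passed through) are assumptions of this sketch; a production CDN would decide placement by popularity and apply eviction.

```python
class TieredCache:
    """Three-level lookup: edge first, then regional, then origin."""

    def __init__(self, origin: dict[str, bytes]) -> None:
        self.edge: dict[str, bytes] = {}
        self.regional: dict[str, bytes] = {}
        self.origin = origin  # source of truth; raises KeyError if missing

    def get(self, video_id: str) -> tuple[bytes, str]:
        """Return the video bytes and the tier that served them."""
        if video_id in self.edge:
            return self.edge[video_id], "edge"
        if video_id in self.regional:
            data = self.regional[video_id]
            self.edge[video_id] = data  # fill the edge on a regional hit
            return data, "regional"
        data = self.origin[video_id]
        # Assumed policy: populate both cache tiers on an origin fetch.
        self.regional[video_id] = data
        self.edge[video_id] = data
        return data, "origin"
```

Only the first request for a video reaches the origin; later requests are served from the edge, which is the behaviour the three‑tier design is after.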

Client Optimizations

Clients pre‑load the next recommended video (≈50 % of content) while playing the current one for seamless transition. Real‑time network speed detection (every 2 s) switches bitrate: <1 Mbps → 480p, 1‑3 Mbps → 720p, >3 Mbps → 1080p, keeping stall rate <1 %.
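The bitrate switch reduces to a small mapping from measured throughput to resolution. The thresholds are the ones listed above; treating exactly 1 Mbps and exactly 3 Mbps as 720p is an assumption, since the ranges leave the boundaries open.

```python
def pick_resolution(measured_mbps: float) -> str:
    """Map the latest network speed sample to a playback resolution."""
    if measured_mbps < 0:
        raise ValueError("measured_mbps must be non-negative")
    if measured_mbps < 1:
        return "480p"
    if measured_mbps <= 3:  # boundary values fall into the 720p band (assumed)
        return "720p"
    return "1080p"
```

In practice a client would likely smooth several samples before switching, so that a single slow measurement does not cause visible quality flapping.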

3.3 Real‑Time Recommendation High Availability

The real‑time recommendation stack consists of Flink (seconds‑level interest updates), Elasticsearch (millisecond search), and the recommendation API. Failure in any component can render the recommendation list empty, causing >40 % DAU drop.

Multi‑Level Degradation Plan

Level 1 (Realtime engine failure) : If Flink latency >5 s, switch to "offline recommendation + short‑term cache" using Spark T+1 long‑term tags and Redis cached recent interaction tags.

Level 2 (Elasticsearch failure) : If ES health <90 %, serve a cached hot‑video list from Redis (updated every 10 min) with a "Hot Recommendations" label.

Level 3 (Recommendation service overload) : Sentinel limits API QPS (e.g., 1 000 QPS per node). When exceeded, return a simplified list (20 videos) and reject non‑core requests.
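The three levels can be read as a decision function over a few health signals. This sketch uses the thresholds from the plan above; the `Health` fields, the mode names, and the order in which the checks are evaluated (overload first, then Elasticsearch, then Flink) are assumptions, and the sketch does not use the Sentinel API.

```python
from dataclasses import dataclass


@dataclass
class Health:
    flink_lag_seconds: float   # latency of the real-time engine
    es_health_percent: float   # Elasticsearch cluster health, 0..100
    node_qps: float            # current QPS on this API node


def pick_mode(health: Health, qps_limit: float = 1000) -> str:
    """Choose which recommendation path to serve for the current health state."""
    if health.node_qps > qps_limit:
        return "simplified_list"      # Level 3: return 20 videos, shed non-core requests
    if health.es_health_percent < 90:
        return "cached_hot_list"      # Level 2: Redis hot list, refreshed every 10 min
    if health.flink_lag_seconds > 5:
        return "offline_plus_cache"   # Level 1: Spark T+1 tags + Redis recent tags
    return "normal"
```

Keeping the decision in one pure function makes each degradation level easy to test and lets the service switch back to `"normal"` as soon as the signals recover.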

Prometheus + Grafana monitoring detects failures. Once the failed component recovers, services automatically revert to normal mode; a dual‑active deployment keeps the switchover imperceptible to users.

4. Conclusion

When I first received the "build a short‑video system" requirement, I was overwhelmed by questions like "where to store billions of videos" and "how to handle million‑level concurrency". Decomposing the architecture into four layers—access, core services, algorithm engine, and infrastructure—made the problem manageable.

In summary, we achieved three things: breaking vague requirements into technical modules, selecting the right tools for each module (edge nodes for push, Spark + Flink for recommendation), and designing fallback mechanisms for storage cost, recommendation availability, and streaming performance. None of this is "black magic"; the solutions are practical and deployable, and they let developers build systems that handle traffic, control costs, and avoid failures.

If you have encountered similar challenges or have questions about any component, feel free to comment. Future posts may dive into "video transcode cluster pitfalls" or "building a recommendation system from 0 to 1".

After all, the technical journey is faster when we walk it together.


Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
