How RedProcess Evolved into DES: Optimizing Xiaohongshu’s Multimedia Task Scheduler
The article details the evolution from the first‑generation RedProcess scheduler to the Distributed Execution Scheduler (DES), explaining how architectural redesigns in storage layering, push‑based dispatch, and systematic disaster‑recovery transformed Xiaohongshu’s video‑cloud task scheduling from merely usable to highly efficient and resilient.
Background and Requirements
Video media processing on the server is fundamentally an asynchronous orchestration of encoding tasks, where different quality tiers can differ in execution time by more than threefold, far exceeding the response window of synchronous APIs. Consequently, the system adopts an async model: business services submit jobs via MQ or RPC, and workers write results back to media storage, notifying downstream consumers via callbacks.
The workflow includes probing, branching, multi‑tier concurrent transcoding, media write‑back, and business callbacks, demanding a scheduler with DAG orchestration, cross‑service coordination, and elastic scaling.
RedProcess: First‑Generation Scheduler
RedProcess was built on Netflix Conductor but heavily customized for internal needs. Conductor’s native Dynomite‑based queue and long‑poll workers could not scale horizontally when worker nodes grew, and Redis lacked MySQL‑style persistence.
Key redesigns introduced a Redis ZSet‑based queue model:
Unix timestamps as ZSet scores, with Lua scripts providing atomic pop decisions and delayed‑queue semantics.
Physical multi‑queue binding to implement priority tiers.
Sharded MySQL cluster for durable task records.
Task loss prevention uses heartbeat‑driven ZSet scores; if a worker’s heartbeat stops, the task is re‑queued after a timeout (3–6× the heartbeat interval). Elastic scaling is achieved by exposing a queue‑backlog probe integrated with Kubernetes HPA custom metrics.
Limitations of RedProcess
As business volume grew, several architectural bottlenecks emerged:
Performance: All operations persisted to MySQL, causing linear QPS growth, long expansion cycles, and pressure from large custom fields (10–50 KB per record).
Availability: Hot keys for popular task types created Redis hotspots; lack of fine‑grained rate limiting forced reliance on upstream MQ throttling.
Functionality: Multiple ZSets per task type required round‑robin polling, reducing efficiency; no workflow failure callbacks; live‑stream scenarios lacked fine‑grained scaling and isolation.
Operations: Scheduler could not detect worker versions, leading to “wild pod” task grabs; resource allocation required manual observation and adjustment.
DES: Distributed Execution Scheduler
DES retains DAG flexibility while addressing the above issues through a four‑service redesign: Gateway, Dispatcher, Worker SDK, and Console.
Gateway Service
Acts as the global entry point, consolidating all persistent reads/writes. It employs a write‑back cache: writes go directly to Redis, with index data queued for batch persistence to MySQL (metadata) and object storage (large blobs). Reads first hit Redis; on miss, a single‑flight merge fetches from the persistent layer, preventing cache‑stampede. Failure handling includes automatic TTL suspension and fallback to direct persistent reads/writes.
Event Handler generates IDs and drives DAG decisions, preserving workflow‑ID linkage. The DAG engine supports eight primitives (SIMPLE, FORK_JOIN, SWITCH, INLINE, START_WF, SUB_WF, DYNAMIC_FORK, CALLBACK), with live‑stream paths bypassing DAG to reduce latency.
Dispatcher Service
Core scheduler that receives events from Gateway, performs global enqueue rate limiting, queue re‑insertion, and priority insertion. The Poll Handler maintains a task‑poll heap and schedules tasks based on node resource surplus, supporting sparse or dense strategies and blacklisting faulty nodes.
Worker communication switched from short‑poll to an h2c long‑connection with Server‑Sent Events (SSE) push, dramatically cutting empty‑poll overhead and giving stronger control over long‑lived live‑stream tasks.
Worker SDK
Encapsulates service discovery, long‑connection management, global event aggregation, and task status reporting. A global event handler batches requests and reports node quota to the Dispatcher, enabling real‑time scheduling decisions. Retry handling relies only on retry counters, avoiding heavy database queries.
Console Service
Workflow definitions are stored in a Git repository and deployed via CI/CD, providing versioned schemas and auditability. The console offers manual task submission, query, and an Elasticsearch index limited to workflow and task dimensions. It also separates task logs from shared clusters to avoid interference.
Core Optimizations
Performance: Gateway funnels all hot‑path data to Redis; write‑back batches reduce MySQL QPS. Large unstructured fields move to object storage, leaving only OSS keys in MySQL. Write‑back queue uses List + optimistic lock for ordered writes.
Availability: Queue keys are sharded by workflow‑ID suffix and hashed across Redis nodes, eliminating hot‑key pressure. Distributed rate limiters, configurable from the console, enable fine‑grained throttling per business line and task type.
Feature Enhancements: Priority levels expanded from 5 to 9 using a bitmap to replace polling for emptiness checks. Live‑stream workflows gain SSE push; task dispatch changes from worker pull to server push, leveraging node quota for scheduling strategies. CALLBACK primitive added for error‑state callbacks.
Maintenance: Console now enforces version‑based circuit‑break rules at both enqueue and dequeue stages, preventing “wild pod” task grabs and supporting gray‑release. Clusters S1 and S2 merged into a single cluster with complementary diurnal traffic patterns, boosting utilization and cutting costs.
Disaster‑Recovery Design
DES implements automated fallback paths for real‑world failure scenarios:
Insufficient cluster capacity: workers aggregate multiple task types into a larger resource pool; the server‑side strategy engine allocates tasks to protect high‑priority jobs.
Node failures (disk unwritable/GPU driver errors): Dispatcher detects failure rates, blacklists nodes, and stops task dispatch.
High‑frequency retry avalanches: a dedicated retry‑throttling channel isolates failing tasks into a separate queue.
Redis cluster outage: Data Handler switches to direct persistent reads/writes and rate‑limits upstream traffic.
MySQL maintenance/disable‑write: write‑back queue pauses, workflow strong‑write is disabled, and changes are buffered to a backup queue; upon MySQL recovery, buffered writes are replayed.
Dual‑cloud unit architecture: Gateway registers in both clouds with identical service names; traffic is split globally, Dispatcher prefers same‑cloud access, and cross‑cloud pull occurs when idle workers are detected. Failure in one cloud triggers automatic service‑discovery‑based failover; network partition leaves each unit capable of independent operation.
Overall, the migration from RedProcess to DES demonstrates a systematic leap from “usable” to “well‑engineered,” driven by storage layering, push‑based dispatch, and comprehensive disaster‑recovery mechanisms.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
