Building a Scalable, Observable Recommendation Scheduling Engine from Scratch
This article explains how recommendation systems work, distinguishes online services from offline computation, and outlines a typical recommendation flow. It then presents a three‑stage evolution (1.0, 2.0, 3.0) guided by design principles for stability, observability, and efficiency, culminating in DAG‑based orchestration with traceable execution.
Introduction
Short videos, product clicks, and news reads are all driven by an invisible hand that decides the next piece of content within milliseconds: recall → coarse ranking → fine ranking → intervention → return. This hand is the recommendation system scheduler.
What is a Recommendation System
A recommendation system predicts and pushes items a user may be interested in by analyzing behavior and data, thereby improving conversion, retention, or purchase metrics.
Online Service vs Offline Computation
Recommendation workloads fall into two categories:
Online Service: real‑time request handling, low‑latency response, high concurrency, millisecond‑level result return.
Offline Computation: periodic user‑profile updates, large‑scale data preprocessing, candidate set pre‑computation, model training and optimization.
Typical Recommendation Flow
The flow coordinates data between online and offline components and follows a scheduling chain from recall to ranking.
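The scheduling chain can be sketched as a simple pipeline of stages. The stage functions below are illustrative stand‑ins (the candidate items, scoring rules, and the `item_b` intervention filter are all assumptions, not the article's actual logic):

```python
def recall(user_id: str) -> list[str]:
    # Pull candidates from multiple sources (static data for illustration).
    return ["item_a", "item_b", "item_c", "item_d"]

def coarse_rank(candidates: list[str], limit: int = 3) -> list[str]:
    # A cheap heuristic trims the candidate set before the expensive model runs.
    return sorted(candidates)[:limit]

def fine_rank(candidates: list[str]) -> list[str]:
    # An expensive model produces the final ordering (reverse-alphabetical here).
    return sorted(candidates, reverse=True)

def intervene(ranked: list[str]) -> list[str]:
    # Business rules: deduplication, pinning, filtering (illustrative filter).
    return [item for item in ranked if item != "item_b"]

def recommend(user_id: str) -> list[str]:
    # The full chain: recall -> coarse ranking -> fine ranking -> intervention.
    return intervene(fine_rank(coarse_rank(recall(user_id))))
```

Each stage narrows or reorders the candidate set, which is why the chain is naturally expressed as function composition and, later, as nodes in a DAG.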
System Design Perspectives
Different stakeholders have distinct concerns:
Engineering : stability, observability, performance.
Algorithm : experiment efficiency, traffic/resource allocation.
Evolution Stages
Stage 1.0 – Small Team, Limited Data
Simple deployment with independent or mixed deployment based on scenario importance; basic AB testing; focus on stability and minimal resource usage.
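At stage 1.0, basic AB testing can be as simple as a stable hash split: a user lands in the same bucket on every request, so experiment assignment is deterministic without any shared state. A minimal sketch (the bucket count, split share, and function names are illustrative assumptions):

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    # Stable hash of (experiment, user) so a user always lands in one bucket.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def variant(user_id: str, experiment: str, treatment_share: int = 50) -> str:
    # The first `treatment_share` buckets get the treatment, the rest control.
    if ab_bucket(user_id, experiment) < treatment_share:
        return "treatment"
    return "control"
```

Keying the hash on the experiment name as well as the user ID keeps bucket assignments independent across experiments, a property that matters once multiple experiments run at once.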
Stage 2.0 – Growing Team, Stable Data
Introduces monitoring dashboards, trace links, improves experiment iteration speed, supports flexible traffic allocation, and begins to externalize data storage.
Stage 3.0 – Large Scale, High Flexibility
Adopts DAG‑based orchestration, hot‑load strategy updates, external C++ engine for data, advanced AB platform with multi‑layer experiments, and emphasizes strategy reuse.
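Hot‑loading strategy updates amounts to swapping in a new strategy snapshot atomically, so in‑flight requests keep reading a consistent version while the next one is installed. A minimal Python sketch, assuming strategies arrive as JSON; the `StrategyRegistry` name and its methods are hypothetical:

```python
import json
import threading

class StrategyRegistry:
    """Hot-load sketch: the whole strategy snapshot is replaced atomically,
    so readers never observe a half-updated configuration."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._strategies = dict(initial)

    def reload(self, raw_json: str) -> None:
        # Parse outside the lock, then swap the snapshot in one step.
        new_snapshot = json.loads(raw_json)
        with self._lock:
            self._strategies = new_snapshot

    def snapshot(self) -> dict:
        # Requests grab a reference to one consistent snapshot.
        with self._lock:
            return self._strategies
```

A request takes a snapshot once at the start and uses it throughout, which is what makes the reload safe without pausing traffic.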
Key Design Mechanisms
Node state management (pending, running, success, failure, timeout).
Automatic triggering of downstream nodes after upstream completion.
Dependency control ensuring all predecessor nodes reach a terminal state before proceeding.
Parallel execution of independent nodes.
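The mechanisms above can be combined into a small DAG executor: each node carries a state, a node launches only once all its upstream dependencies have succeeded, and independent nodes run in parallel. A sketch under those assumptions (the timeout state from the article is omitted here for brevity, and downstream nodes of a failed node simply stay pending):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from enum import Enum

class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"

def run_dag(tasks, deps):
    """tasks: name -> zero-arg callable; deps: name -> upstream names.
    Returns (state map, results map)."""
    state = {name: State.PENDING for name in tasks}
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    results = {}
    with ThreadPoolExecutor() as pool:
        futures = {}

        def launch_ready():
            # Dependency control: start only nodes with no unmet predecessors.
            for name, unmet in remaining.items():
                if state[name] is State.PENDING and not unmet:
                    state[name] = State.RUNNING
                    futures[pool.submit(tasks[name])] = name

        launch_ready()  # independent roots start in parallel
        while futures:
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:
                name = futures.pop(fut)
                if fut.exception() is None:
                    state[name] = State.SUCCESS
                    results[name] = fut.result()
                    # Upstream completion automatically unblocks downstream.
                    for unmet in remaining.values():
                        unmet.discard(name)
                else:
                    state[name] = State.FAILURE
            launch_ready()
    return state, results
```

Because readiness is re-checked after every completion, the executor naturally exploits whatever parallelism the dependency graph allows, which is the core payoff of the DAG design over a fixed sequential chain.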
Conclusion and Outlook
As the system scales, two core challenges emerge: stability (data storage, fault handling, scenario isolation) and efficiency (development, testing, release, troubleshooting). Future work includes DAG pruning, CPU utilization improvement, and continuous iterative optimization.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
