Building a Scalable, Observable Recommendation Scheduling Engine from Scratch

This article explains how recommendation systems work, distinguishes online services from offline computation, outlines a typical recommendation flow, and presents a three‑stage evolution (1.0, 2.0, 3.0) with design principles for stability, observability, and efficiency, culminating in DAG‑based orchestration and traceable execution.


Introduction

Short videos, product clicks, and news reads are all driven by an invisible hand that decides the next piece of content within milliseconds: recall → coarse ranking → fine ranking → intervention → return. That hand is the recommendation system's scheduler.

What is a Recommendation System

A recommendation system predicts and pushes items a user may be interested in by analyzing behavior and data, thereby improving conversion, retention, or purchase metrics.

Online Service vs Offline Computation

Recommendation work falls into two categories:

Online Service: real‑time request handling, low‑latency response, high concurrency, millisecond‑level result return.

Offline Computation: periodic user‑profile updates, large‑scale data preprocessing, candidate‑set pre‑computation, and model training and optimization.
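The split can be pictured as an offline batch job that publishes precomputed results to a store, which the online service then answers from with a fast lookup. A minimal sketch, with an in-memory dict standing in for a real key‑value store and with deliberately trivial candidate logic (all names here are illustrative, not from the article):

```python
import time

CANDIDATE_STORE = {}  # stand-in for Redis or an external data engine


def offline_precompute(user_logs):
    """Offline: batch-build per-user candidate sets from behavior logs."""
    for user_id, clicked_items in user_logs.items():
        # Placeholder logic: in practice this would be collaborative
        # filtering, embeddings, etc.
        CANDIDATE_STORE[user_id] = sorted(set(clicked_items))[:100]


def online_serve(user_id):
    """Online: millisecond-level lookup of precomputed candidates."""
    start = time.perf_counter()
    candidates = CANDIDATE_STORE.get(user_id, [])
    latency_ms = (time.perf_counter() - start) * 1000
    return candidates, latency_ms


offline_precompute({"u1": ["a", "c", "b"]})
print(online_serve("u1")[0])  # ['a', 'b', 'c']
```

The point of the split is that the expensive work happens off the request path, so the online side is reduced to a lookup plus lightweight ranking.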

Typical Recommendation Flow

The flow coordinates data between online and offline components and follows a scheduling chain from recall to ranking.
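The scheduling chain above can be sketched as a pipeline of successively more expensive, successively narrower stages. The stage implementations below are illustrative placeholders (the field names and cutoffs are assumptions, not from the article):

```python
def recall(user_id, pool):
    # Pull a broad candidate set (thousands of items in practice).
    return list(pool)


def coarse_rank(candidates):
    # Cheap model: cut the set down quickly (here, a simple score sort).
    return sorted(candidates, key=lambda c: c["cheap_score"], reverse=True)[:100]


def fine_rank(candidates):
    # Expensive model, applied only to the coarse-ranking survivors.
    return sorted(candidates, key=lambda c: c["model_score"], reverse=True)[:10]


def intervene(candidates):
    # Business rules: dedup, boosts, compliance filtering.
    return [c for c in candidates if not c.get("blocked")]


def recommend(user_id, pool):
    return intervene(fine_rank(coarse_rank(recall(user_id, pool))))


pool = [
    {"id": 1, "cheap_score": 0.9, "model_score": 0.2},
    {"id": 2, "cheap_score": 0.8, "model_score": 0.9},
    {"id": 3, "cheap_score": 0.7, "model_score": 0.5, "blocked": True},
]
print([c["id"] for c in recommend("u1", pool)])  # [2, 1]
```

The scheduler's job is to drive exactly this chain per request, within a millisecond-level budget.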

Figure: Typical recommendation flow diagram

System Design Perspectives

Different stakeholders have distinct concerns:

Engineering : stability, observability, performance.

Algorithm : experiment efficiency, traffic/resource allocation.

Evolution Stages

Stage 1.0 – Small Team, Limited Data

Deployment is simple: services run independently or co‑located depending on scenario importance, AB testing is basic, and the focus is on stability with minimal resource usage.
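Basic AB testing at this stage can be as simple as deterministic hash bucketing, so the same user always lands in the same variant. A hedged sketch (the experiment name, bucket count, and split percentage are made up for illustration):

```python
import hashlib


def ab_bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Deterministically map a user to a bucket in [0, buckets)."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def variant(user_id: str, experiment: str, treatment_pct: int = 10) -> str:
    # Users in the first `treatment_pct` buckets get the new strategy;
    # everyone else stays on the baseline.
    if ab_bucket(user_id, experiment) < treatment_pct:
        return "treatment"
    return "control"
```

Salting the hash with the experiment name keeps assignments independent across experiments, which becomes important once the later stages add multi‑layer experiment platforms.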

Stage 2.0 – Growing Team, Stable Data

This stage introduces monitoring dashboards and trace links, speeds up experiment iteration, supports flexible traffic allocation, and begins to move data storage out of the service process.

Stage 3.0 – Large Scale, High Flexibility

This stage adopts DAG‑based orchestration, hot‑loaded strategy updates, an external C++ engine for data, and an advanced AB platform with multi‑layer experiments, with an emphasis on strategy reuse.

Key Design Mechanisms

Node state management (pending, running, success, failure, timeout).

Automatic triggering of downstream nodes after upstream completion.

Dependency control ensuring all predecessor nodes reach a terminal state before proceeding.

Parallel execution of independent nodes.
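The mechanisms above can be sketched as a minimal DAG runner: each node carries a state, a node becomes runnable only once all of its predecessors have finished, independent runnable nodes execute in parallel, and completion automatically unblocks downstream nodes on the next pass. This is an illustrative sketch, not the article's implementation; timeout handling is omitted for brevity.

```python
import concurrent.futures as cf
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"  # listed in the design; not modeled in this sketch


class Node:
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, list(deps)
        self.state = State.PENDING


def run_dag(nodes):
    """Run nodes whose predecessors have all completed successfully."""
    by_name = {n.name: n for n in nodes}
    with cf.ThreadPoolExecutor() as pool:
        while any(n.state == State.PENDING for n in nodes):
            # Dependency control: every predecessor must be terminal
            # (and successful) before a node may start.
            ready = [n for n in nodes
                     if n.state == State.PENDING
                     and all(by_name[d].state == State.SUCCESS for d in n.deps)]
            if not ready:
                break  # remaining nodes are blocked by an upstream failure
            for n in ready:
                n.state = State.RUNNING
            # Parallel execution of independent ready nodes.
            futures = {pool.submit(n.fn): n for n in ready}
            for fut, n in futures.items():
                try:
                    fut.result()
                    # Success here is what triggers downstream nodes
                    # on the next loop iteration.
                    n.state = State.SUCCESS
                except Exception:
                    n.state = State.FAILURE
    return {n.name: n.state.value for n in nodes}


order = []
nodes = [
    Node("recall_a", lambda: order.append("a")),
    Node("recall_b", lambda: order.append("b")),
    Node("rank", lambda: order.append("rank"), deps=["recall_a", "recall_b"]),
]
run_dag(nodes)
```

In the example, the two recall nodes are independent and run in parallel; the rank node starts only after both succeed.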

Figure: Trace link diagram

Conclusion and Outlook

As the system scales, two core challenges emerge: stability (data storage, fault handling, scenario isolation) and efficiency (development, testing, release, troubleshooting). Future work includes DAG pruning, CPU utilization improvement, and continuous iterative optimization.

Tags: recommendation, AI, scalability, workflow, observability, system design
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
