How DPP Evolved from Fixed Engine to DAG‑Based Orchestration for Faster Recommendation Iterations
This article explains the DPP platform’s overall architecture, its key features for rapid iteration, and the three‑stage evolution of its orchestration engine—from the fixed DPP‑Engine to the flexible BizEngine and finally the graph‑based DagEngine—detailing design decisions, protocols, challenges, and future directions.
1. DPP Overall Architecture
DPP builds on the algorithm platform’s engine services—FeatureServer, recall engine, and ranking scorer—to deliver out‑of‑the‑box recall, coarse‑ranking, and fine‑ranking. A hot‑loading mechanism allows algorithm and engineering teams to push new strategies without restarting the service, keeping the online recommendation pipeline stable while enabling rapid business‑logic iteration.
Platform Features
Rapid iteration: System decoupling lets algorithm and strategy code be updated independently.
Automated effect analysis: Integration with the data platform standardizes BI reporting.
Flexible experiments: A layered experiment framework supports multi‑layer, multi‑experiment configurations.
Easy diagnostics: Intermediate results are logged for fine‑grained analysis; built‑in monitoring, alerts, and debugging tools simplify troubleshooting.
2. DPP Engine Evolution
The orchestration engine has evolved through three stages—fixed orchestration (DPP‑Engine), flexible orchestration (BizEngine), and graph‑based DAG orchestration (DagEngine)—each improving iteration efficiency and resource utilization.
Fixed Orchestration – DPP‑Engine
DPP‑Engine abstracts the recommendation workflow into six fixed layers (INIT → Recall → Fusion → Coarse‑rank → Fine‑rank → Intervention). Each layer contains multiple components; the engine serially schedules the layers, merges component outputs, and passes the aggregated list to the next layer.
While this design supports rapid iteration, it is rigid: the six‑layer structure cannot be altered, and embedding DPP‑Engine inside the DPP system hinders independent evolution of the engine.
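The fixed six-layer flow above can be sketched as a serial scheduler. This is an illustrative sketch, not DPP's real code: the names `FixedPipeline`, `Layer`, and `Component` are assumptions, and items are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of DPP-Engine's fixed scheduling: the six layers run
// strictly in order, each layer's component outputs are merged, and the
// aggregated list is handed to the next layer.
class FixedPipeline {
    interface Component { List<String> process(List<String> items); }

    static class Layer {
        final String name;
        final List<Component> components = new ArrayList<>();
        Layer(String name) { this.name = name; }
    }

    private final List<Layer> layers = new ArrayList<>();

    FixedPipeline() {
        // The six fixed layers, in their hard-wired order.
        for (String n : new String[]{"INIT", "Recall", "Fusion", "CoarseRank", "FineRank", "Intervention"}) {
            layers.add(new Layer(n));
        }
    }

    void register(String layerName, Component c) {
        layers.stream().filter(l -> l.name.equals(layerName))
              .findFirst().orElseThrow().components.add(c);
    }

    List<String> run(List<String> request) {
        List<String> current = request;
        for (Layer layer : layers) {                 // layers are strictly serial
            if (layer.components.isEmpty()) continue;
            List<String> merged = new ArrayList<>(); // outputs within a layer are merged
            for (Component c : layer.components) {
                merged.addAll(c.process(current));
            }
            current = merged;
        }
        return current;
    }
}
```

The sketch makes the rigidity visible: the layer list is fixed in the constructor, so adding, removing, or reordering stages would require changing the engine itself.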
Flexible Orchestration – BizEngine
BizEngine executes components based on configuration supplied by strategy teams. It supports both serial and concurrent execution within a layer. Requests are processed through bucket and layer configurations, allowing each bucket to define its own layer composition (defaulting to the six layers).
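Concurrent execution inside a layer can be sketched with a shared thread pool. This is a minimal illustration of the idea, not BizEngine's actual API; `LayerScheduler` and its pool size are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: components inside one layer run concurrently on a
// shared pool, while the layers themselves remain serial.
class LayerScheduler {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Run one layer's components concurrently and merge their results.
    List<String> runLayer(List<Callable<List<String>>> components) {
        try {
            List<Future<List<String>>> futures = pool.invokeAll(components);
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) merged.addAll(f.get());
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() { pool.shutdown(); }
}
```

Note how the thread-pool conflict mentioned below arises naturally here: if a component submits its own tasks to a private pool, the engine's pool sizing no longer reflects actual resource usage.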
Key challenges identified:
Serial scheduling across layers adds latency.
Potential conflicts between BizEngine’s thread pool and custom strategy‑level scheduling.
Component granularity varies (CPU‑bound, IO‑bound, custom concurrent tasks), making resource management difficult.
Increasing migration and refactoring costs due to lack of component sharing mechanisms.
Graph‑Based Orchestration – DagEngine
DagEngine models business logic as a directed acyclic graph (DAG): nodes are operators, and edges represent data flow. This mirrors TensorFlow's operator–tensor model and enables a unified expression of recall, fusion, ranking, and filtering steps.
Benefits of graph‑based orchestration:
Unified operator interface promotes reuse across algorithms.
Flexible workflow customization through graph composition.
Parallel and asynchronous execution of operators reduces latency.
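The scheduling idea behind these benefits can be sketched with a topological sort: a node becomes runnable as soon as all of its upstream nodes have finished, which is what lets independent operators (e.g. parallel recalls) overlap. The names below are illustrative, not DagEngine's API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of DAG scheduling via Kahn's algorithm: compute an execution
// order in which every operator runs only after its dependencies.
class DagScheduler {
    // depends: node -> list of nodes it depends on
    static List<String> topoOrder(Map<String, List<String>> depends) {
        Map<String, Integer> inDegree = new HashMap<>();
        Map<String, List<String>> downstream = new HashMap<>();
        for (Map.Entry<String, List<String>> e : depends.entrySet()) {
            inDegree.put(e.getKey(), e.getValue().size());
            for (String dep : e.getValue()) {
                downstream.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            order.add(n);
            // Finishing n may unblock its downstream operators.
            for (String d : downstream.getOrDefault(n, List.of())) {
                if (inDegree.merge(d, -1, Integer::sum) == 0) ready.add(d);
            }
        }
        if (order.size() != inDegree.size()) throw new IllegalStateException("cycle detected");
        return order;
    }
}
```

In a real engine the "ready" nodes would be dispatched to a thread pool rather than a list, so independent recalls execute in parallel instead of merely in a valid serial order.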
Operator Definition
Operators implement a generic Processor&lt;O&gt; interface. The engine invokes the run method, which receives a ComputeContext and zero or more input DataFrame objects and returns a DataFrame&lt;O&gt; result.
public interface Processor<O> {
/**
* Execution logic
* @param computeContext execution context
* @param inputs upstream DataFrames
* @return execution result
*/
DataFrame<O> run(ComputeContext computeContext, DataFrame... inputs);
}
Operators are annotated with @DagProcessor to describe their type (IO or CPU), a human‑readable description, and any side‑values they emit for downstream dependency checks.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE})
public @interface DagProcessor {
/** Mark as IO or CPU, influencing DagEngine scheduling */
String type() default "IO";
/** Operator description */
String desc() default "";
/** Side‑values emitted by the operator for dependency validation */
String sideValues() default "";
}
Configuration and data dependencies are declared with @ConfigAnno and @DependsDataAnno. The engine injects configuration values at runtime based on experiment/AB‑test results.
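Putting the pieces together, a concrete operator might look like the sketch below. The operator name and logic are hypothetical, and minimal stand-ins for DataFrame, ComputeContext, and the annotations are included so the example compiles on its own; the real platform classes would be used in practice.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.List;

class RecallExample {
    // --- minimal stand-ins for the platform types described above ---
    static class ComputeContext { }
    static class DataFrame<O> {
        final List<O> rows;
        DataFrame(List<O> rows) { this.rows = rows; }
    }
    interface Processor<O> { DataFrame<O> run(ComputeContext ctx, DataFrame<?>... inputs); }

    @Documented @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE)
    @interface DagProcessor { String type() default "IO"; String desc() default ""; }

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
    @interface ConfigAnno { String key(); }

    // --- a hypothetical recall operator ---
    @DagProcessor(type = "IO", desc = "toy recall operator")
    static class FirstRecallOp implements Processor<String> {
        @ConfigAnno(key = "recallSize")   // injected at runtime from the merged experiment config
        int recallSize = 2;               // hard-coded default, overridable via YAML/experiments

        @Override
        public DataFrame<String> run(ComputeContext ctx, DataFrame<?>... inputs) {
            // A real operator would query the recall engine; here we return fixed ids.
            return new DataFrame<>(List.of("item1", "item2", "item3").subList(0, recallSize));
        }
    }
}
```

The field default shows the fallback chain: the annotation key ties the field to a runtime config entry, and the initializer supplies the value used when no YAML or experiment override exists.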
Graph and Sub‑graph Configuration
Both graphs and sub‑graphs are defined in YAML files. Sub‑graphs act as reusable templates of operators; graphs combine operators and sub‑graphs to describe a complete scenario.
# Sub‑graph definition (template)
name: 'RecallSubgraph1'
type: 'subgraph'
configs:
- name: 'configKey1'
value: 'defaultValue'
nodes:
- name: 'firstRecallOp1'
op: 'com.dag.demo.recall.FirstRecallOP'
depends: []
configs:
- name: 'configKey1'
value: 'firstRecallOp1Value'
- name: 'otherRecall1'
op: 'com.dag.demo.recall.OtherRecallOP'
depends: ['firstRecallOp1']
# Graph definition (scenario)
name: 'ScenarioGraph'
type: 'graph'
configs:
- name: 'configKey1'
value: 'defaultValue'
nodes:
- name: 'firstRecallOp1'
op: 'com.dag.demo.recall.FirstRecallOP'
depends: []
- name: 'otherRecall1'
op: 'com.dag.demo.recall.OtherRecallOP'
depends: ['firstRecallOp1']
- name: 'someRecallComplex1'
op: '$RecallSubgraph1'
configs:
- name: 'configKey1'
value: 'overrideValue'
depends: ['firstRecallOp1']
At runtime, operator configurations are resolved by merging experiment configurations. Default values can be hard‑coded in the operator or specified in the YAML. The naming convention for runtime keys is &lt;subGraphName&gt;.&lt;operatorName&gt;.&lt;keyName&gt; for operators inside a sub‑graph, and _.&lt;operatorName&gt;.&lt;keyName&gt; for top‑level operators. The @ConfigAnno(key="keyName") annotation injects the appropriate value, supporting JSON and DTO binding.
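The key-merging rule can be sketched as a small resolver. `ConfigResolver` is a hypothetical helper, not DPP code: it only illustrates that experiment values override declared defaults under the `<subGraphName>.<operatorName>.<keyName>` (or `_.` for top-level) naming scheme.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative resolver for the runtime key convention described above:
// experiment config wins over YAML/operator defaults for the same key.
class ConfigResolver {
    private final Map<String, String> defaults = new HashMap<>();
    private final Map<String, String> experiment = new HashMap<>();

    void putDefault(String key, String value) { defaults.put(key, value); }
    void putExperiment(String key, String value) { experiment.put(key, value); }

    // Build "<subGraphName>.<operatorName>.<keyName>", using "_" for
    // top-level operators, then prefer the experiment value.
    String resolve(String subGraph, String operator, String keyName) {
        String key = (subGraph == null ? "_" : subGraph) + "." + operator + "." + keyName;
        return experiment.getOrDefault(key, defaults.get(key));
    }
}
```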
3. Summary
The DPP orchestration engine has progressed from a fixed six‑layer DPP‑Engine, through a configurable BizEngine, to a graph‑based DagEngine. Each evolution step reduces scheduling overhead, improves resource utilization, and enables more flexible workflow composition. In production, the DAG approach has yielded lower latency and higher throughput, while allowing strategy engineers to focus on implementing operators rather than scheduling logic. Future work includes extending DAG orchestration to additional business scenarios, strengthening operator reuse and standardization, and further optimizing DagEngine performance (e.g., DataFrame handling) to approach an end‑to‑end “full‑graph” representation of recommendation pipelines.
DeWu Technology