Rearchitecting the Advertising AB Testing Platform: Service Decomposition, Data Modeling, DAG Scheduling, and DDD Practices
The article describes how Volcano Engine's DataTester team refactored the advertising AB testing platform by splitting services, redesigning the data model with MySQL and ClickHouse, introducing DAG‑based scheduling and a time‑wheel algorithm, and applying domain‑driven design and rigorous unit testing to improve stability, scalability, and maintainability.
When enterprises launch large‑scale ad campaigns, they need a scientific way to evaluate different advertising strategies before spending money. Volcano Engine’s AB testing product introduced an "advertising‑placement AB experiment" to measure average conversion costs across variables such as creatives, landing pages, audience packages, and budgets.
The underlying data capabilities are critical: account authorization, plan creation, and data querying must work together to ensure accurate experiment reports.
Early versions of DataTester faced five major issues: (1) support for multiple ad platforms made authorization logic tangled; (2) tight coupling of authorization, data collection, and business logic hindered troubleshooting; (3) each data‑capture task required a separate scheduled job, leading to an explosion of cron jobs; (4) an inefficient data model caused slow query performance; and (5) excessive custom features made the code hard to maintain.
To address these problems, the platform was rebuilt with the following architectural changes:
**Service decomposition** – separate Authorization Service, Data‑Capture Service, Business Backend Service, and a minimal set of scheduled tasks, each with a single responsibility.
**Data model redesign** – store metadata in MySQL and reporting data in ClickHouse, balancing write‑heavy and read‑heavy workloads.
**Domain‑Driven Design (DDD)** – introduce interface‑based programming for each ad platform, allowing platform‑specific implementations without affecting core logic.
**Strict unit‑test coverage and CI/CD pipelines** – enforce high test coverage and automated deployment to catch bugs early.
**Unified codebase for SaaS and private‑cloud deployments** – use environment variables to switch modes, reducing development effort.
Authorization Service handles the first step of connecting to various ad platforms, storing tokens or credentials, and dispatching data‑capture tasks per account.
Data‑Capture Service ensures data consistency by scheduling daily and hourly fetch jobs, refreshing OAuth2 tokens, and providing real‑time fetch APIs.
OAuth2 authorization is illustrated with a step‑by‑step flow (register a developer account, construct the auth URL, the user grants permission, the platform returns an `auth_code`, exchange it for an Access Token and Refresh Token, refresh before expiry). A template‑method pattern is suggested to avoid duplicating this flow across platforms.
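A minimal Go sketch of that template‑method idea, assuming hypothetical interface and method names (not DataTester's actual code): the shared `Authorize` flow is fixed, and only the per‑platform details vary behind an interface.

```go
package main

import "fmt"

// Token holds credentials returned by an ad platform's OAuth2 endpoint.
type Token struct {
	AccessToken  string
	RefreshToken string
	ExpiresInSec int64
}

// PlatformOAuth is the per-platform hook set; method names are illustrative.
type PlatformOAuth interface {
	BuildAuthURL(state string) string
	ExchangeCode(authCode string) (Token, error)
}

// Authorize is the shared template method: every platform follows the same
// exchange-then-persist sequence; only the HTTP details differ per platform.
func Authorize(p PlatformOAuth, authCode string, save func(Token) error) (Token, error) {
	tok, err := p.ExchangeCode(authCode)
	if err != nil {
		return Token{}, fmt.Errorf("exchange auth_code: %w", err)
	}
	if err := save(tok); err != nil {
		return Token{}, fmt.Errorf("persist token: %w", err)
	}
	return tok, nil
}

// demoPlatform is a fake implementation used only to illustrate the pattern.
type demoPlatform struct{}

func (demoPlatform) BuildAuthURL(state string) string {
	return "https://example.com/oauth/authorize?state=" + state
}

func (demoPlatform) ExchangeCode(code string) (Token, error) {
	return Token{AccessToken: "at_" + code, RefreshToken: "rt_" + code, ExpiresInSec: 86400}, nil
}
```

Adding a new ad platform then means implementing `PlatformOAuth` without touching the shared flow.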
Data‑Capture Architecture uses a DAG (Directed Acyclic Graph) to model task dependencies. The DAG definition includes fields such as `schedule_interval`, `dag_id`, and an array of tasks with IDs, upstream/downstream relationships, and operator types. Example JSON:
```json
{
  "schedule_interval": "*/60 * * * *",
  "dag_id": "${account_id}_today_insights",
  "tasks": [
    {"task_id": "dummy_task", "downstream_task_ids": ["account_meta_task", "ad_meta_task"], "is_dummy": true, "operator_name": "DummyOperator"},
    {"task_id": "account_meta_task", "operator_type": "basic", "operator_name": "ad_meta_operator"},
    {"task_id": "ad_meta_task", "downstream_task_ids": ["ad_daily_insight_task"], "operator_name": "ad_meta_operator"},
    {"task_id": "ad_daily_insight_task", "operator_name": "insight_operator"}
  ]
}
```

The scheduler parses the DAG and generates a pipeline in which tasks execute according to their dependencies.
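As a sketch of what that pipeline‑generation step might do, the following Go snippet derives an execution order from the `task_id`/`downstream_task_ids` fields using Kahn's topological sort. Type and field names are adapted from the example JSON, not from DataTester's actual scheduler:

```go
package main

import "fmt"

// Task mirrors the dependency fields in the example DAG JSON.
type Task struct {
	TaskID            string
	DownstreamTaskIDs []string
}

// TopoOrder returns an execution order that respects downstream dependencies
// (Kahn's algorithm) and rejects graphs containing a cycle.
func TopoOrder(tasks []Task) ([]string, error) {
	indeg := map[string]int{}
	adj := map[string][]string{}
	for _, t := range tasks {
		if _, ok := indeg[t.TaskID]; !ok {
			indeg[t.TaskID] = 0
		}
		adj[t.TaskID] = t.DownstreamTaskIDs
		for _, d := range t.DownstreamTaskIDs {
			indeg[d]++ // each downstream edge raises the target's in-degree
		}
	}
	var queue, order []string
	for id, d := range indeg {
		if d == 0 {
			queue = append(queue, id) // roots with no upstream tasks
		}
	}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		order = append(order, id)
		for _, d := range adj[id] {
			if indeg[d]--; indeg[d] == 0 {
				queue = append(queue, d)
			}
		}
	}
	if len(order) != len(indeg) {
		return nil, fmt.Errorf("cycle detected: not a DAG")
	}
	return order, nil
}
```

A real scheduler would dispatch each ready task to its operator rather than just collecting IDs, but the dependency handling is the same.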
To efficiently run thousands of scheduled jobs, a **time‑wheel algorithm** is employed: tasks are placed on a circular time wheel (e.g., a 7×24‑slot day‑level wheel and a 3600‑slot second‑level wheel). When the pointer reaches a slot, tasks are dispatched to the second‑level wheel for precise execution, dramatically reducing traversal overhead.
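A minimal single‑level time wheel in Go illustrates the idea: tasks are bucketed by the tick at which they fire, so each tick touches one slot instead of scanning every job. A production version would cascade expired slots into a finer‑grained wheel as the article describes; the slot count and names here are illustrative.

```go
package main

// timeWheel is a fixed-slot circular buffer of pending callbacks.
type timeWheel struct {
	slots   [][]func()
	current int
}

func newTimeWheel(n int) *timeWheel {
	return &timeWheel{slots: make([][]func(), n)}
}

// AddAfter schedules fn to fire after `delay` ticks (delay >= 1).
func (w *timeWheel) AddAfter(delay int, fn func()) {
	idx := (w.current + delay) % len(w.slots)
	w.slots[idx] = append(w.slots[idx], fn)
}

// Tick advances the pointer one slot and runs everything bucketed there.
func (w *timeWheel) Tick() {
	w.current = (w.current + 1) % len(w.slots)
	for _, fn := range w.slots[w.current] {
		fn()
	}
	w.slots[w.current] = nil // slot is consumed and reused on the next lap
}
```

With thousands of jobs, each tick costs only the work in one slot, which is the traversal saving the article attributes to the time wheel.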
On the backend side, the team applied **Domain‑Driven Design** to separate concerns into four layers: User Interface, Application, Domain, and Infrastructure. This structure promotes high cohesion, low coupling, and easier extension.
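One way the layering might look in Go, with a domain‑layer port, an application‑layer use case, and an infrastructure‑layer implementation; all names are hypothetical, and the in‑memory repo stands in for the real MySQL‑backed one:

```go
package main

import "errors"

// Domain layer: pure business concepts, no storage or transport details.
type Experiment struct{ ID, Name string }

// ExperimentRepo is a domain-layer port; infrastructure supplies the
// concrete (e.g., MySQL-backed) implementation.
type ExperimentRepo interface {
	Save(e Experiment) error
	Find(id string) (Experiment, error)
}

// Application layer: orchestrates a use case against the port.
type ExperimentService struct{ repo ExperimentRepo }

func (s ExperimentService) Create(id, name string) error {
	return s.repo.Save(Experiment{ID: id, Name: name})
}

// Infrastructure layer: an in-memory stand-in for the real repository.
type memRepo map[string]Experiment

func (m memRepo) Save(e Experiment) error { m[e.ID] = e; return nil }

func (m memRepo) Find(id string) (Experiment, error) {
	e, ok := m[id]
	if !ok {
		return Experiment{}, errors.New("experiment not found")
	}
	return e, nil
}
```

Because the domain and application layers depend only on the interface, swapping MySQL for another store (or a test double) leaves the core logic untouched, which is the low‑coupling property the article is after.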
Rigorous **unit testing** practices are emphasized: tests must be isolated from external dependencies, use mocks (e.g., GoMock), and be integrated into CI pipelines to enforce coverage thresholds.
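A hand‑written test double illustrates the isolation idea; GoMock would generate an equivalent mock from the interface. The interface and function names here are illustrative:

```go
package main

import "errors"

// TokenStore is an external dependency (e.g., a database) that unit tests
// must not touch directly.
type TokenStore interface {
	Get(account string) (string, error)
}

// NeedsReauth is the unit under test: pure logic over the interface, so it
// can be exercised with any substitute implementation.
func NeedsReauth(s TokenStore, account string) bool {
	tok, err := s.Get(account)
	return err != nil || tok == ""
}

// errStoreDown simulates an unreachable backing store in tests.
var errStoreDown = errors.New("token store unavailable")

// stubStore is a hand-written test double returning canned values;
// GoMock generates the same kind of double, plus call expectations.
type stubStore struct {
	tok string
	err error
}

func (s stubStore) Get(string) (string, error) { return s.tok, s.err }
```

Because the double is deterministic and in‑process, such tests stay fast and can run on every CI pipeline execution to enforce the coverage threshold.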
In summary, the refactor has increased the platform’s stability and scalability, enabling hundreds of enterprises to run millions of AB experiments with reliable, data‑driven decision making.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.