How Volcano Engine Rebuilt Its Ad‑Testing Platform for Scalability and Reliability
This article explains how Volcano Engine identified the tangled authorization, data‑fetching, and performance problems in its advertising AB‑testing platform, then refactored it: splitting services, redesigning the data model around MySQL and ClickHouse, and applying DAG scheduling, a time‑wheel algorithm, Domain‑Driven Design, and rigorous unit testing to deliver a more stable, extensible backend.
Overview
Volcano Engine’s AB‑testing platform for advertising needed a scientific way to compare different ad strategies, but early implementations relied on ad‑hoc testing and suffered from tangled authorization logic, excessive scheduled tasks, slow queries, and hard‑to‑maintain code.
Challenges Before the Refactor
Support for multiple ad platforms made authorization logic increasingly complex.
Authorization, data collection, and business logic were tightly coupled, making debugging difficult.
Each data‑capture type required a separate timed job, leading to an unmanageable number of tasks.
An inefficient data model caused report queries to slow down as data volume grew.
Over‑customized features resulted in fragile code.
Refactoring Solutions
Service decomposition: split into an Authorization Service, Data‑Fetch Service, Business Backend Service, and a minimal set of scheduled tasks.
Data model redesign: store metadata in MySQL for fast updates and report data in ClickHouse for high‑performance analytics.
Adopt Domain‑Driven Design (DDD) with interface‑driven programming, allowing each ad platform to implement its own adapter.
Enforce strict unit‑test coverage and CI/CD pipelines to ensure code quality and rapid bug detection.
Unified codebase for SaaS and on‑premise deployments using environment‑variable configuration.
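The interface‑driven approach above can be sketched as follows. This is a minimal illustration, not the actual implementation; the adapter names and `fetch_daily_insights` method are hypothetical stand‑ins for whatever contract each ad platform implements.

```python
from abc import ABC, abstractmethod

class AdPlatformAdapter(ABC):
    """Hypothetical adapter contract that every ad platform implements."""

    @abstractmethod
    def fetch_daily_insights(self, account_id: str, date: str) -> dict:
        ...

class MetaAdsAdapter(AdPlatformAdapter):
    def fetch_daily_insights(self, account_id: str, date: str) -> dict:
        # A real adapter would call the platform's reporting API here.
        return {"platform": "meta", "account_id": account_id, "date": date}

class TikTokAdsAdapter(AdPlatformAdapter):
    def fetch_daily_insights(self, account_id: str, date: str) -> dict:
        return {"platform": "tiktok", "account_id": account_id, "date": date}

def sync_insights(adapter: AdPlatformAdapter, account_id: str, date: str) -> dict:
    # Business code depends only on the interface, never on a concrete platform,
    # so adding a new platform means adding one adapter, not touching callers.
    return adapter.fetch_daily_insights(account_id, date)
```

Swapping a concrete adapter into `sync_insights` is the whole integration cost of a new platform under this design.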
Core Modules
Authorization Service handles granting ad‑account tokens (OAuth2 or password‑based) and stores credentials for downstream tasks.
Data‑Fetch Service synchronizes ad‑platform data at hour‑ and day‑level, supports custom token refresh intervals, and provides real‑time fetch APIs.
Business Backend Service uses authorized accounts to create campaigns, manage assets, and aggregate query results.
Data Model and Storage
Metadata (IDs, names, timestamps) is stored in MySQL, while high‑volume report metrics (clicks, impressions, spend) reside in ClickHouse, leveraging its Map type for flexible schema expansion.
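The split can be illustrated with a small sketch. Assuming a raw report row arrives as one flat record, stable identifiers would go to the MySQL side while every metric collapses into a single map, mirroring a ClickHouse `Map(String, Float64)` column; the key names here are hypothetical.

```python
def split_report_row(raw: dict) -> tuple[dict, dict]:
    """Split a raw ad-report row into MySQL-bound metadata and a
    ClickHouse-bound record whose metrics live in one map column."""
    metadata_keys = {"account_id", "ad_id", "ad_name", "stat_date"}
    metadata = {k: v for k, v in raw.items() if k in metadata_keys}
    metrics_row = {
        "account_id": raw["account_id"],
        "ad_id": raw["ad_id"],
        "stat_date": raw["stat_date"],
        # A newly reported metric is just a new map key --
        # no ALTER TABLE, which is the appeal of the Map type.
        "metrics": {k: float(v) for k, v in raw.items()
                    if k not in metadata_keys},
    }
    return metadata, metrics_row
```

Queries then aggregate `metrics['clicks']` and friends directly in ClickHouse, while name changes touch only the small MySQL rows.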
DAG Scheduling
Tasks are expressed as a Directed Acyclic Graph (DAG) to capture dependencies; the Scheduler parses the DAG and dispatches jobs to Workers. Example DAG definition:
<code>{
  "schedule_interval": "*/60 * * * *",
  "dag_id": "${account_id}_today_insights",
  "tasks": [
    {
      "task_id": "dummy_task",
      "downstream_task_ids": ["account_meta_task", "ad_meta_task"],
      "is_dummy": true,
      "operator_name": "DummyOperator"
    },
    {
      "task_id": "account_meta_task",
      "operator_type": "basic",
      "operator_name": "ad_meta_operator"
    },
    {
      "task_id": "ad_meta_task",
      "downstream_task_ids": ["ad_daily_insight_task"],
      "operator_name": "ad_meta_operator"
    },
    {
      "task_id": "ad_daily_insight_task",
      "operator_name": "insight_operator"
    }
  ]
}</code>
Time‑Wheel Algorithm
To execute millions of scheduled tasks efficiently, a hierarchical time wheel is used: an outer wheel with one slot per hour across a week (7×24 slots) cascades due tasks into an inner second‑level wheel (3,600 slots, one per second of the hour) for second‑precision firing, so each tick inspects a handful of slots rather than scanning tens of thousands of tasks.
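A simplified two‑level wheel conveys the mechanics; the real scheduler's slot counts and cascade policy may differ, so treat this as a sketch of the idea rather than the production code.

```python
class HierarchicalTimeWheel:
    """Two-level time wheel: an outer wheel with one slot per hour of the
    week cascades into an inner wheel with one slot per second of the hour,
    so a tick touches one slot instead of every pending task."""

    HOURS = 7 * 24      # outer wheel: a week of hour slots
    SECONDS = 3600      # inner wheel: an hour of second slots

    def __init__(self) -> None:
        self.hour_slots = [[] for _ in range(self.HOURS)]
        self.second_slots = [[] for _ in range(self.SECONDS)]

    def add(self, offset_seconds: int, task) -> None:
        """Schedule `task` to fire `offset_seconds` from the wheel origin."""
        hour, second = divmod(offset_seconds, self.SECONDS)
        if hour == 0:
            # Due within the current hour: goes straight to the inner wheel.
            self.second_slots[second].append(task)
        else:
            # Parked on the outer wheel with its second-of-hour remembered.
            self.hour_slots[hour % self.HOURS].append((second, task))

    def advance_hour(self, hour: int) -> None:
        """When the outer wheel reaches `hour`, cascade its tasks down."""
        for second, task in self.hour_slots[hour % self.HOURS]:
            self.second_slots[second].append(task)
        self.hour_slots[hour % self.HOURS] = []

    def fire(self, second: int) -> list:
        """Pop and return everything due at this second of the hour."""
        due, self.second_slots[second] = self.second_slots[second], []
        return due
```

The cascade is what keeps per‑tick work bounded: tasks hours away cost nothing until their hour slot rotates into view.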
Domain‑Driven Design (DDD)
Four layers structure the system:
User Interface Layer: receives requests, performs simple validation, returns results.
Application Layer: orchestrates use‑cases without embedding business rules.
Domain Layer: core business logic expressed as rich, encapsulated models, independent of external frameworks.
Infrastructure Layer: provides technical implementations such as databases, caches, and message queues.
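The layering can be made concrete with a small sketch, assuming a hypothetical `Experiment` entity: the domain owns both the model and the repository interface, the infrastructure provides a concrete store, and the application layer only wires them together.

```python
from abc import ABC, abstractmethod
from typing import Optional

# Domain layer: a rich model that enforces its own invariants,
# with no framework or database imports.
class Experiment:
    def __init__(self, exp_id: str, traffic_split: float):
        if not 0.0 < traffic_split <= 1.0:
            raise ValueError("traffic_split must be in (0, 1]")
        self.exp_id = exp_id
        self.traffic_split = traffic_split

# Domain layer: the repository interface (a "port") is owned by the domain.
class ExperimentRepository(ABC):
    @abstractmethod
    def save(self, experiment: Experiment) -> None: ...

    @abstractmethod
    def get(self, exp_id: str) -> Optional[Experiment]: ...

# Infrastructure layer: a concrete implementation -- MySQL in production,
# in-memory here.
class InMemoryExperimentRepository(ExperimentRepository):
    def __init__(self) -> None:
        self._rows = {}

    def save(self, experiment: Experiment) -> None:
        self._rows[experiment.exp_id] = experiment

    def get(self, exp_id: str) -> Optional[Experiment]:
        return self._rows.get(exp_id)

# Application layer: orchestrates the use case, no business rules of its own.
def create_experiment(repo: ExperimentRepository, exp_id: str,
                      split: float) -> Experiment:
    exp = Experiment(exp_id, split)
    repo.save(exp)
    return exp
```

Because the domain never imports infrastructure, swapping MySQL for another store or testing against the in‑memory repository requires no domain changes.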
Unit Testing Benefits
Accelerates development and refactoring by quickly locating bugs.
Enforces high cohesion and low coupling in code design.
Improves overall code quality and prevents regressions.
Encourages mock‑based isolation of external dependencies.
Integrates with CI/CD pipelines to enforce coverage thresholds.
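The mock‑based isolation mentioned above might look like this with Python's standard `unittest.mock`; `fetch_spend` and the client's `get_report` method are hypothetical, standing in for any function that wraps an external ad‑platform API.

```python
import unittest
from unittest import mock

def fetch_spend(client, account_id: str) -> float:
    """Sums spend across report rows returned by an (assumed) platform client."""
    rows = client.get_report(account_id)
    return sum(row["spend"] for row in rows)

class FetchSpendTest(unittest.TestCase):
    def test_sums_spend_without_real_network_calls(self):
        # The external dependency is replaced by a mock, so the test is
        # fast, deterministic, and needs no credentials.
        fake_client = mock.Mock()
        fake_client.get_report.return_value = [{"spend": 1.5}, {"spend": 2.5}]
        self.assertEqual(fetch_spend(fake_client, "acct-1"), 4.0)
        fake_client.get_report.assert_called_once_with("acct-1")
```

Running such suites with `python -m unittest` in the CI/CD pipeline is what lets a coverage threshold gate every merge.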
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.