
Upgrade of Dependency Model in Bilibili Data Platform

Bilibili’s data platform upgraded its dependency model by moving from project-level to task-level dependencies. The upgrade adds a root node and an end node to each project, represents external data with virtual tasks, introduces dependency offsets, and implements an abstract DependencySubject with asynchronous callbacks. The resulting service evaluates dependencies for tens of thousands of tasks per day with latency on the order of seconds, and the team plans to add lineage-based automatic dependency generation and richer rule support.

Bilibili Tech

The article discusses the evolution of the dependency model used in Bilibili's data warehouse, emphasizing the need for more precise task-level dependencies to improve data production accuracy and timeliness.

Background: Data warehouse construction relies on data models and tables that form upstream‑downstream relationships. The existing scheduling system treats projects as the unit of dependency, which cannot express cross‑project task dependencies, leading to inaccurate data lineage.

Terminology: Key terms such as job (task), DAG (dependency), project, instance, bizDate (business date), and offset are defined to provide a common vocabulary for the subsequent discussion.

Dependency Model Upgrade:

1. From project‑level to task‑level dependency – introducing a root node and an end node for each project so that all user‑created jobs depend on the root and are depended on by the end node, effectively converting project dependencies into task dependencies.

2. Bridging external dependencies – two solutions were evaluated. The chosen approach uses virtual tasks that represent external data outputs (e.g., hourly tables). Downstream jobs depend on these virtual tasks, which in turn map to the actual upstream jobs, providing accurate and fine‑grained dependency handling.

3. Dependency offset – defines how downstream jobs reference specific business dates of upstream jobs (e.g., T‑1, T‑2) and supports both set‑based and range‑based offset configurations.
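Items 1 and 3 above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the function names, the `__root__`/`__end__` node names, and the choice of whole-day offsets are all assumptions made for illustration. `wrap_project` adds the root and end nodes to a project's job graph, and `resolve_offsets` turns a set-based or range-based offset configuration into the upstream business dates a downstream instance would wait for.

```python
from datetime import date, timedelta


def wrap_project(project, job_deps):
    """Return a project's job graph with a synthetic root and end node added.

    job_deps maps each user-created job to the set of jobs it depends on.
    The node naming scheme here is hypothetical.
    """
    root, end = f"{project}.__root__", f"{project}.__end__"
    # Every user-created job additionally depends on the root node.
    graph = {job: set(ups) | {root} for job, ups in job_deps.items()}
    graph[root] = set()
    # The end node depends on every user-created job.
    graph[end] = set(job_deps)
    return graph


def resolve_offsets(biz_date, offsets=None, offset_range=None):
    """Map a downstream bizDate T to the upstream bizDates it references.

    offsets:      set-based config, e.g. {1, 2} means T-1 and T-2.
    offset_range: range-based config, inclusive, e.g. (1, 3) means T-1..T-3.
    Offsets are assumed to be whole days.
    """
    if (offsets is None) == (offset_range is None):
        raise ValueError("configure exactly one of offsets or offset_range")
    if offsets is None:
        first, last = offset_range
        offsets = range(first, last + 1)
    return sorted(biz_date - timedelta(days=n) for n in offsets)


graph = wrap_project("ads", {"extract": set(), "report": {"extract"}})
upstream_dates = resolve_offsets(date(2022, 1, 3), offsets={1, 2})
```

Under this sketch, a dependency of one project on another could be expressed as the downstream project's root node depending on the upstream project's end node, which is one plausible reading of how project dependencies become task dependencies.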

Technical Implementation:

• Abstract Dependency Model – a DependencySubject object encapsulates any upstream dependency, identified by a composite DependencySubjectId (e.g., jobId=1234&bizDate=20220101). This model supports project, job, and table‑level dependencies and can express time ranges.

• Asynchronous Dependency Callback – the core component DependencyCenter evaluates dependencies, performs inspections, and triggers callbacks when all upstream tasks are satisfied. Each DependencySubject has a dedicated detector that can use polling, messaging, or API calls.

• Performance – the dependency service handles up to 80,000 tasks and 150,000 instances per day with latency on the order of seconds (average 1.6 s, max 6.3 s) and 100% accuracy, and has run stably for three years.
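The abstract model and callback flow described in the first two bullets can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the actual DependencyCenter: the method names `register` and `mark_satisfied` are hypothetical, the composite id is assumed to be a URL-style query string (matching the `jobId=1234&bizDate=20220101` example), and callbacks run inline here, whereas a real service would presumably dispatch them asynchronously.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode


def subject_id(**fields):
    """Build a composite id such as 'jobId=1234&bizDate=20220101'."""
    return urlencode(fields)


def parse_subject_id(sid):
    """Split a composite id back into its fields (all values are strings)."""
    return dict(parse_qsl(sid))


class DependencyCenter:
    """Tracks unmet upstream subjects for each waiting instance."""

    def __init__(self):
        self._satisfied = set()            # subject ids already satisfied
        self._pending = {}                 # waiter -> (unmet ids, callback)
        self._waiters = defaultdict(set)   # subject id -> waiters blocked on it

    def register(self, waiter, upstream_ids, callback):
        """Register a waiter; fire its callback once all upstreams are met."""
        unmet = set(upstream_ids) - self._satisfied
        if not unmet:
            callback(waiter)
            return
        self._pending[waiter] = (unmet, callback)
        for sid in unmet:
            self._waiters[sid].add(waiter)

    def mark_satisfied(self, sid):
        """Entry point for a detector (polling, messaging, or API call)."""
        self._satisfied.add(sid)
        for waiter in self._waiters.pop(sid, set()):
            unmet, callback = self._pending[waiter]
            unmet.discard(sid)
            if not unmet:
                del self._pending[waiter]
                callback(waiter)


ready = []
center = DependencyCenter()
upstream = subject_id(jobId=1234, bizDate="20220101")
center.register("job=5678&bizDate=20220102", [upstream], ready.append)
center.mark_satisfied(upstream)  # the waiter's callback fires here
```

Keying everything on an opaque subject id is what would let the same bookkeeping serve project, job, and table-level dependencies: only the detector that calls `mark_satisfied` needs to know what kind of subject it is watching.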

Future Plans:

• Automate dependency generation based on accurate data lineage.

• Enrich dependency rules to include weak dependencies and flexible selection criteria.

• Extend support for complex rule evaluation in operations and baseline tools.

The article concludes with an invitation for feedback and links to previous related talks on data security, data development, and real‑time DQC.

Tags: big data, task scheduling, data platform, data dependency, Bilibili, dependency model