Introducing Bilibili's SQLScan: Architecture, Key Technologies, and Production Impact
Bilibili's SQLScan is a static‑code analysis tool that parses Hive, Spark, Presto and Flink SQL via Antlr4, builds a unified AST, applies engine‑specific metadata plugins for rule enforcement, provides field‑lineage and cost‑analysis services, and has processed hundreds of thousands of daily queries, intercepting thousands of problematic statements to improve data quality and operational efficiency.
This article is part of Bilibili's big‑data system diagnostics series and focuses on the first product, SQLScan, a static code analysis tool for SQL.
1. Platform Overview Bilibili’s data platform provides end‑to‑end capabilities such as data development, integration, services, management, security, governance, and diagnostics, forming a full‑lifecycle development environment that improves development efficiency and data analysis speed.
2. Background The platform processes hundreds of thousands of offline ETL and query tasks daily, encountering frequent data‑quality issues (e.g., loss of 2 million users due to implicit type conversion). Lack of release control and automated diagnostics leads to hidden risks. To address this, a DataOps‑based lifecycle management platform was built, with SQLScan as a core pre‑deployment diagnostic component.
3. SQLScan Principles and Key Technologies
3.1 Introduction SQLScan statically scans SQL code, detecting quality, performance, and compliance problems and enforcing rules through the development workflow.
3.2 Architecture SQLScan consists of three modules: parsing, compilation, and service.
• Parsing layer uses Antlr4 to parse Hive, Spark, Presto, and Flink dialects, generating a unified AST to decouple parsing from later stages.
• Compilation layer traverses the AST with engine‑specific metadata plugins (e.g., getTable, getPartition). It builds a metadata cache to resolve temporary tables created by CTAS statements and applies built‑in interception rules such as table/column existence, Cartesian product detection, CTE cyclic dependencies, multi‑level view nesting, ANSI compliance, and column type mismatches.
• Service layer provides four main features: custom rule interception, cost analysis, table access‑control integration, and field lineage tracking.
3.3 Field Lineage By traversing the AST, SQLScan maps output columns to their source columns across tables, enabling precise lineage graphs (e.g., col1 as b originates from mid.table_t.col1).
3.4 Cost Analysis Using external metadata (Hive Metastore, ClickHouse system tables) and compiled field profiles, SQLScan estimates query cost. Missing statistics trigger asynchronous ANALYZE statements on Presto to enrich metadata. Cost is calculated based on input data size, number of operators (GROUP BY, JOIN, SORT, etc.), and yields a monetary estimate.
4. Production Results
SQLScan has been stable for eight months, handling 360 k daily query diagnostics, intercepting ~20 k SQL statements per day with an average execution time of 300 ms, thereby ensuring high‑quality data and SQL output.
Examples of alerts include missing columns, ANSI violations, and prohibited DDL operations (e.g., TRUNCATE).
5. Future Outlook
• Building a data‑masking engine based on field‑lineage to propagate sensitivity labels.
• Guiding users to query higher‑level warehouse layers (ADS/DWS) instead of raw ODS tables to reduce resource waste.
• Leveraging user‑defined column types for pre‑execution semantic checks (e.g., mismatched timestamp granularities in UNION queries).
6. References
[1] Enabling static analysis of SQL queries at Meta (https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/)
[2] 网易数据开发SQL Scan实践 (https://www.bilibili.com/video/BV1LP411v72f)
[3] 先设计后开发,先标准后建模,网易 DataOps 实践
[4] 大数据之路阿里巴巴大数据实践
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.