Big Data 11 min read

Introducing Bilibili's SQLScan: Architecture, Key Technologies, and Production Impact

Bilibili's SQLScan is a static‑code analysis tool that parses Hive, Spark, Presto and Flink SQL via Antlr4, builds a unified AST, applies engine‑specific metadata plugins for rule enforcement, provides field‑lineage and cost‑analysis services, and has processed hundreds of thousands of daily queries, intercepting thousands of problematic statements to improve data quality and operational efficiency.

Bilibili Tech

Sep 15, 2023

Introducing Bilibili's SQLScan: Architecture, Key Technologies, and Production Impact

This article is part of Bilibili's big‑data system diagnostics series and focuses on the first product, SQLScan, a static code analysis tool for SQL.

1. Platform Overview Bilibili’s data platform provides end‑to‑end capabilities such as data development, integration, services, management, security, governance, and diagnostics, forming a full‑lifecycle development environment that improves development efficiency and data analysis speed.

2. Background The platform processes hundreds of thousands of offline ETL and query tasks daily, encountering frequent data‑quality issues (e.g., loss of 2 million users due to implicit type conversion). Lack of release control and automated diagnostics leads to hidden risks. To address this, a DataOps‑based lifecycle management platform was built, with SQLScan as a core pre‑deployment diagnostic component.

3. SQLScan Principles and Key Technologies

3.1 Introduction SQLScan statically scans SQL code, detecting quality, performance, and compliance problems and enforcing rules through the development workflow.

3.2 Architecture SQLScan consists of three modules: parsing, compilation, and service.

• Parsing layer uses Antlr4 to parse Hive, Spark, Presto, and Flink dialects, generating a unified AST to decouple parsing from later stages.

• Compilation layer traverses the AST with engine‑specific metadata plugins (e.g., getTable, getPartition). It builds a metadata cache to resolve temporary tables created by CTAS statements and applies built‑in interception rules such as table/column existence, Cartesian product detection, CTE cyclic dependencies, multi‑level view nesting, ANSI compliance, and column type mismatches.

• Service layer provides four main features: custom rule interception, cost analysis, table access‑control integration, and field lineage tracking.

3.3 Field Lineage By traversing the AST, SQLScan maps output columns to their source columns across tables, enabling precise lineage graphs (e.g., col1 as b originates from mid.table_t.col1).

3.4 Cost Analysis Using external metadata (Hive Metastore, ClickHouse system tables) and compiled field profiles, SQLScan estimates query cost. Missing statistics trigger asynchronous ANALYZE statements on Presto to enrich metadata. Cost is calculated based on input data size, number of operators (GROUP BY, JOIN, SORT, etc.), and yields a monetary estimate.

4. Production Results

SQLScan has been stable for eight months, handling 360 k daily query diagnostics, intercepting ~20 k SQL statements per day with an average execution time of 300 ms, thereby ensuring high‑quality data and SQL output.

Examples of alerts include missing columns, ANSI violations, and prohibited DDL operations (e.g., TRUNCATE).

5. Future Outlook

• Building a data‑masking engine based on field‑lineage to propagate sensitivity labels.

• Guiding users to query higher‑level warehouse layers (ADS/DWS) instead of raw ODS tables to reduce resource waste.

• Leveraging user‑defined column types for pre‑execution semantic checks (e.g., mismatched timestamp granularities in UNION queries).

6. References

[1] Enabling static analysis of SQL queries at Meta (https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/)

[2] 网易数据开发SQL Scan实践 (https://www.bilibili.com/video/BV1LP411v72f)

[3] 先设计后开发，先标准后建模，网易 DataOps 实践

[4] 大数据之路阿里巴巴大数据实践

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data data lineage Bilibili cost analysis DataOps SQL static analysis SQLScan

Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.