Customizing Spark SQL with Macro‑Based Extensions for Column Exclusion and JSON Path Support
This article explains how Tubi customizes Spark SQL with lightweight macro-based extensions that simplify column exclusion, JSON path queries, and other complex operations without modifying Spark's source code. It details the two-stage processing pipeline, example macros, and the benefits for big-data workloads.
We Need to Customize Spark SQL
When using Spark SQL, tasks such as excluding specific columns from a wide table become cumbersome; the DataFrame API handles this easily, but expressing it in pure SQL lacks readability and flexibility.
Example of column exclusion using the DataFrame API is shown (image).
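The DataFrame-API approach shown in the image boils down to filtering the table's column list and selecting what remains. A minimal sketch of that idea, with the helper name `allColumnsExcept` being illustrative (the Spark-specific part is shown only in comments):

```scala
// Column exclusion reduces to filtering the column list (illustrative sketch).
def allColumnsExcept(allColumns: Seq[String], excluded: Set[String]): Seq[String] =
  allColumns.filterNot(excluded.contains)

// With a real SparkSession this would be applied as something like:
//   val kept = allColumnsExcept(df.columns.toSeq, Set("col_to_exclude_1"))
//   df.select(kept.map(col): _*)
val kept = allColumnsExcept(
  Seq("col1", "col2", "col_to_exclude_1", "col3"),
  Set("col_to_exclude_1"))
```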
In plain SQL, listing 198 column names is impractical, and Hive‑compatible syntax requires enabling a parser option with poor readability:
set spark.sql.parser.quotedRegexColumnNames=true;
select `(col_to_exclude_1)?+.+` from tbl
To solve this, Tubi introduces custom Spark SQL syntax implemented via macros:
// select all columns excluding `col_to_exclude_1`
select all_columns_except("tbl", "col_to_exclude_1")
// select all columns excluding `col_to_exclude_1` and `col_to_exclude_2`
select all_columns_except("tbl", "col_to_exclude_1", "col_to_exclude_2")
Macros also enable enhanced functionality such as flexible JSON path queries on S3:
select * from json.`s3://bucket/dir/2020-12-12-13*`
Because Spark SQL does not support wildcards in such timestamped paths, a custom macro is used.
We Do Not Want to Modify Spark Source Code
Tubi’s approach expands macros in a first stage using UDF‑style syntax, then routes the resulting statements to appropriate commands in a second stage.
Macro expansion occurs only once per query, while a generated UDF may be executed billions of times.
In the second stage, pattern matching on the abstract syntax tree directs DROP statements to a ForbiddenDropCommand, regular Spark SQL to SparkSQLCommand, and other patterns to custom commands such as Delta Lake integrations.
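The stage-two routing described above can be sketched as a plain pattern match. The command names come from the article; everything else here is a simplification, since the real implementation matches on the ANTLR4 parse tree rather than on raw SQL strings:

```scala
// Simplified stage-two routing (illustrative: the real code inspects the AST).
sealed trait Command
case object ForbiddenDropCommand extends Command // rejects DROP statements
case object SparkSQLCommand extends Command      // plain Spark SQL, run as-is
case object TubiJSONCommand extends Command      // custom JSON-path handling

def route(sql: String): Command = {
  val s = sql.trim.toLowerCase
  if (s.startsWith("drop")) ForbiddenDropCommand
  else if (s.contains("tubi_json.")) TubiJSONCommand
  else SparkSQLCommand
}
```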
Case Study: all_columns_except
Assuming a table tbl with columns col1‑col6, the statement
select all_columns_except("tbl", "col1", "col5") from tbl
is expanded in stage one to
select col2, col3, col4, col6 from tbl
and then routed to SparkSQLCommand for execution.
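The stage-one expansion can be sketched as a textual rewrite against a schema lookup. The catalog map and regex below are hypothetical stand-ins; the real implementation resolves columns through Spark's catalog and rewrites via the ANTLR4 token stream:

```scala
// Hypothetical catalog: table name -> column list (the real source is Spark's catalog).
val catalog: Map[String, Seq[String]] =
  Map("tbl" -> Seq("col1", "col2", "col3", "col4", "col5", "col6"))

// Matches all_columns_except("tbl", "colA", ...) as plain text (illustrative).
val MacroPattern = """all_columns_except\(\s*"([^"]+)"\s*((?:,\s*"[^"]+"\s*)*)\)""".r

def expand(sql: String): String =
  MacroPattern.replaceAllIn(sql, m => {
    val table = m.group(1)
    val excluded = "\"([^\"]+)\"".r.findAllMatchIn(m.group(2)).map(_.group(1)).toSet
    // Replace the macro call with the remaining columns of the table.
    catalog(table).filterNot(excluded).mkString(", ")
  })
```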
Case Study: tubi_json
The query
select * from tubi_json.`s3://bucket/dir/2020-12-13*`
passes unchanged through macro expansion and is routed to TubiJSONCommand, which loads the matching S3 objects, creates a temporary view, and runs spark.sql("select * from temp_view").
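The TubiJSONCommand flow can be sketched as follows; only the path extraction is shown executably, while the Spark and S3 steps (which the article describes) appear as comments. All names here are illustrative:

```scala
// Sketch of the TubiJSONCommand flow (illustrative):
//   1. extract the backquoted path from the tubi_json.`...` clause
//   2. expand the wildcard against the S3 object listing
//   3. spark.read.json(paths).createOrReplaceTempView("temp_view")
//   4. spark.sql("select * from temp_view")
val PathPattern = """tubi_json\.`([^`]+)`""".r

def extractPath(sql: String): Option[String] =
  PathPattern.findFirstMatchIn(sql).map(_.group(1))
```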
Benefits of this approach include:
No need to modify Spark source code.
No custom ANTLR4 grammar; existing Spark SQL parser is reused.
Painless implementation and upgrade of custom syntax on top of Spark.
ANTLR4‑Generated Parser Code Is Complex
Macro replacement and SQL pattern recognition are the main challenges. Understanding ANTLR4's TokenStreamRewriter API simplifies macro handling, while AST analysis can be performed with Visitor, Listener, or XPath approaches.
Scala‑style collection‑like operators can be designed for AST processing, as illustrated by the following code (image).
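The idea behind such collection-like operators, roughly, is to expose a depth-first traversal of the tree as an ordinary `Seq`, so the standard `filter`/`exists`/`collect` vocabulary applies. The tiny tree type below is a stand-in for ANTLR4's ParseTree, and all names are illustrative (the actual operators appear in the image above):

```scala
// Tiny stand-in for a parse tree (the real code walks ANTLR4's ParseTree).
case class Node(label: String, children: List[Node] = Nil)

// Depth-first traversal exposed as a Seq, enabling collection-style queries.
def dfs(n: Node): Seq[Node] = n +: n.children.flatMap(dfs)

val ast = Node("select",
  List(Node("star"), Node("from", List(Node("table", List(Node("tbl")))))))

// e.g. find every node with a given label, collection-style
val tables = dfs(ast).filter(_.label == "table")
```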
Further details were presented at a Scala Meetup and will be shared at the Datafun year‑end big‑data architecture forum.
Bitu Technology
Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about using advanced technology to improve lives, and we hope this channel helps us connect with readers and grow together.