
Customizing Spark SQL with Macro‑Based Extensions for Column Exclusion and JSON Path Support

This article explains how Tubi customizes Spark SQL using lightweight macro‑based extensions to simplify column exclusion, JSON path queries, and other complex operations without modifying Spark's source code, detailing the two‑stage processing, example macros, and benefits for big‑data workloads.

Bitu Technology

Why We Need to Customize Spark SQL

When using Spark SQL, edge cases such as excluding specific columns from a very wide table become cumbersome: the DataFrame API handles this easily, but pure SQL offers neither readability nor flexibility.

An example of column exclusion using the DataFrame API (e.g. df.drop("col_to_exclude_1")) appears as an image in the original post.

In plain SQL, listing all 198 remaining column names is impractical, and the Hive‑compatible alternative requires enabling a parser option and suffers from poor readability:

set spark.sql.parser.quotedRegexColumnNames=true;
select `(col_to_exclude_1)?+.+` from tbl

To solve this, Tubi introduces custom Spark SQL syntax implemented via macros.

// select all columns excluding `col_to_exclude_1`
select all_columns_except("tbl", "col_to_exclude_1")
// select all columns excluding `col_to_exclude_1` and `col_to_exclude_2`
select all_columns_except("tbl", "col_to_exclude_1", "col_to_exclude_2")

Macros also enable enhanced functionality such as flexible JSON path queries on S3.

select * from json.`s3://bucket/dir/2020-12-12-13*`

Because Spark SQL does not support wildcard timestamps, a custom macro is used.

We Do Not Want to Modify Spark Source Code

Tubi’s approach expands macros in a first stage using UDF‑style syntax, then routes the resulting statements to appropriate commands in a second stage.

A macro is expanded only once, at parse time, whereas an equivalent UDF would be invoked per row and could be executed billions of times.

In the second stage, pattern matching on the abstract syntax tree directs DROP statements to a ForbiddenDropCommand, regular Spark SQL to SparkSQLCommand, and other patterns to custom commands such as Delta Lake integrations.
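The two stages can be sketched in plain Scala. The command names (ForbiddenDropCommand, SparkSQLCommand, TubiJSONCommand) come from the article, but the expansion and routing logic below is illustrative string handling under assumed rules, not Tubi's actual ANTLR4‑based implementation:

```scala
import scala.util.matching.Regex

object SqlPipeline {
  // Stage one: expand registered macros in the raw SQL text. The real
  // implementation rewrites the ANTLR4 token stream; a regex stands in here.
  def expandMacros(sql: String, expanders: Map[String, Seq[String] => String]): String = {
    val macroCall = """(\w+)\(([^)]*)\)""".r
    macroCall.replaceAllIn(sql, m => {
      val args = m.group(2).split(",").map(_.trim.stripPrefix("\"").stripSuffix("\"")).toSeq
      expanders.get(m.group(1))
        .map(f => Regex.quoteReplacement(f(args)))
        .getOrElse(Regex.quoteReplacement(m.matched)) // not a macro: leave untouched
    })
  }

  // Stage two: route the expanded statement to a command by pattern matching.
  sealed trait Command
  case object ForbiddenDropCommand extends Command // DROP statements are rejected
  case object TubiJSONCommand      extends Command // custom JSON path syntax
  case object SparkSQLCommand      extends Command // everything else: plain Spark SQL

  def route(sql: String): Command = sql.trim.toLowerCase match {
    case s if s.startsWith("drop")     => ForbiddenDropCommand
    case s if s.contains("tubi_json.") => TubiJSONCommand
    case _                             => SparkSQLCommand
  }
}
```

In the real system the second stage matches on the abstract syntax tree rather than on strings, which makes the routing robust to whitespace, comments, and nesting.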

Case Study: all_columns_except

Assuming a table tbl with columns col1‑col6, the statement

select all_columns_except("tbl", "col1", "col5") from tbl

is expanded in stage one to

select col2, col3, col4, col6 from tbl

and then routed to SparkSQLCommand for execution.
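The expansion itself amounts to a schema lookup plus a filter. In practice the column list would come from the Spark catalog (e.g. spark.catalog.listColumns); a plain Map stands in below so the sketch runs on its own:

```scala
object AllColumnsExcept {
  // Hypothetical schema lookup; the real implementation would consult the
  // Spark catalog for the table's columns at expansion time.
  val schemas: Map[String, Seq[String]] =
    Map("tbl" -> Seq("col1", "col2", "col3", "col4", "col5", "col6"))

  // Expand the macro into the list of surviving columns.
  def expand(table: String, excluded: String*): String =
    schemas(table).filterNot(excluded.contains).mkString(", ")
}
```

For the example above, expand("tbl", "col1", "col5") yields the column list "col2, col3, col4, col6" that is spliced back into the statement.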

Case Study: tubi_json

The query

select * from tubi_json.`s3://bucket/dir/2020-12-13*`

passes unchanged through macro expansion and is routed to TubiJSONCommand, which loads the matching S3 objects, registers them as a temporary view, and runs spark.sql("select * from temp_view").
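Recognizing the custom table reference can be sketched as extracting the backticked path from a tubi_json. reference; the real routing happens on the parse tree, but the captured (possibly wildcarded) path is what the command receives:

```scala
object TubiJson {
  // Matches tubi_json.`<path>` and captures the path, wildcards included.
  private val pathPattern = """tubi_json\.`([^`]+)`""".r

  def extractPath(sql: String): Option[String] =
    pathPattern.findFirstMatchIn(sql).map(_.group(1))
}
```

TubiJSONCommand would then resolve the glob against S3, load the matching objects, register the temporary view, and execute the rewritten query.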

Benefits of this approach include:

No need to modify Spark source code.

No custom ANTLR4 grammar; existing Spark SQL parser is reused.

Painless implementation and upgrade of custom syntax on top of Spark.

ANTLR4‑Generated Parser Code Is Complex

Macro replacement and SQL pattern recognition are the main implementation challenges; understanding ANTLR4's TokenStreamRewriter API simplifies macro handling, while AST analysis can be performed with the Visitor, Listener, or XPath approaches ANTLR4 provides.

Scala‑style, collection‑like operators can be designed for AST processing, as illustrated by code shown as an image in the original post.
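One way such collection‑style operators might look, using a toy node type in place of ANTLR4's ParseTree (the names here are illustrative, not Tubi's API):

```scala
// Toy stand-in for ANTLR4's ParseTree so the traversal logic runs alone.
final case class Node(label: String, children: List[Node] = Nil) {
  // Depth-first flatten: once the tree is a Seq, the whole Scala collection
  // library (filter, collect, exists, ...) applies directly to AST analysis.
  def toSeq: Seq[Node] = this +: children.flatMap(_.toSeq)

  // Example derived operator: labels of all nodes matching a predicate.
  def labels(p: String => Boolean): Seq[String] = toSeq.map(_.label).filter(p)
}
```

With this shape, pattern recognition over a statement becomes an ordinary collection query, e.g. tree.labels(_ == "table") to find every table reference.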

Further details were presented at a Scala Meetup and will be shared at the Datafun year‑end big‑data architecture forum.

Tags: Big Data, Spark SQL, Scala, Macros, Custom SQL
Written by

Bitu Technology

Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.
