Apache Calcite Overview: Architecture, SQL Processing Flow, and Practical Example
This article introduces Apache Calcite as a modular data‑management framework, explains its architecture and SQL processing pipeline—from parsing and validation to relational‑algebra conversion, optimization, and execution—and demonstrates a complete CSV‑based query example with code snippets.
Apache Calcite is a dynamic data‑management framework that provides SQL parsing, validation, optimization, and execution. Its modular, plug‑in architecture allows users to employ any combination of modules or extend the system for custom models.
Architecture : Calcite consists of a JDBC layer built on the Avatica framework, a SQL parser and validator that convert the input SQL string into a SqlNode abstract syntax tree, a query optimizer that transforms the SqlNode tree into a relational‑algebra RelNode tree, and an execution engine that generates physical plans.
SQL Processing Flow follows five steps: (1) Parser creates a SqlNode AST; (2) Validator checks the tree against metadata and may rewrite it; (3) Converter turns the validated SqlNode into a RelNode representing relational algebra; (4) Optimizer applies rule‑based and cost‑based transformations to produce an optimized RelNode tree; (5) Execute generates an executable plan.
Case Study : Two CSV tables, users and orders, are defined as follows:
id:string,name:string,age:int
1,Jack,28
2,John,21
3,Tom,32
4,Peter,24 id:string,user_id:string,goods:string,price:double
001,1,Cola,3.5
002,1,Hat,38.9
003,2,Shoes,199.9
004,3,Book,39.9
005,4,Phone,2499.9The query to join and aggregate the data is:
SELECT u.id, name, age, sum(price)
FROM users AS u join orders AS o ON u.id = o.user_id
WHERE age >= 20 AND age <= 30
GROUP BY u.id, name, age
ORDER BY u.idSQL Parsing uses JavaCC to produce a SqlNode tree, with the root node being a SqlOrderBy that contains a SqlSelect for the core query.
String sql = "SELECT u.id, name, age, sum(price) " +
"FROM users AS u join orders AS o ON u.id = o.user_id " +
"WHERE age >= 20 AND age <= 30 " +
"GROUP BY u.id, name, age " +
"ORDER BY u.id";
SqlParser parser = SqlParser.create(sql, SqlParser.Config.DEFAULT);
SqlNode rootSqlNode = parser.parseStmt();SQL Validation creates a schema, registers the tables, and validates the SqlNode. The validator rewrites the tree, converting the SqlOrderBy into a SqlSelect with an orderBy field.
// Build schema and validator (simplified)
SimpleTable userTable = SimpleTable.newBuilder("users")
.addField("id", SqlTypeName.VARCHAR)
.addField("name", SqlTypeName.VARCHAR)
.addField("age", SqlTypeName.INTEGER)
.withFilePath("/path/to/user.csv")
.withRowCount(10)
.build();
// ... similarly for orderTable and schema creation ...
SqlValidator validator = SqlValidatorUtil.newValidator(
SqlStdOperatorTable.instance(), catalogReader, typeFactory, validatorConfig);
SqlNode validateSqlNode = validator.validate(rootSqlNode);Conversion to RelNode uses SqlToRelConverter to obtain a logical plan represented by a tree of RelNode implementations.
VolcanoPlanner planner = new VolcanoPlanner(RelOptCostImpl.FACTORY, Contexts.of(config));
planner.addRelTraitDef(ConventionTraitDef.INSTANCE);
RelOptCluster cluster = RelOptCluster.create(planner, new RexBuilder(typeFactory));
SqlToRelConverter converter = new SqlToRelConverter(null, validator, catalogReader,
cluster, StandardConvertletTable.INSTANCE, SqlToRelConverter.config()
.withTrimUnusedFields(true).withExpand(false));
RelNode relNode = converter.convertQuery(validateSqlNode, false, true);The resulting logical plan looks like:
LogicalSort(sort0=[$0], dir0=[ASC])
LogicalAggregate(group=[{0, 1, 2}], EXPR$3=[SUM($3)])
LogicalProject(id=[$0], name=[$1], age=[$2], price=[$6])
LogicalFilter(condition=[AND(>=$2, 20), <=($2, 30)])
LogicalJoin(condition=[=($0, $4)], joinType=[inner])
LogicalTableScan(table=[[s, users]])
LogicalTableScan(table=[[s, orders]])Optimization applies rule sets (e.g., predicate push‑down, projection push‑down) using HepPlanner or VolcanoPlanner. After optimization, the plan is converted to physical Enumerable nodes.
RuleSet rules = RuleSets.ofList(
CoreRules.FILTER_TO_CALC,
CoreRules.PROJECT_TO_CALC,
CoreRules.FILTER_CALC_MERGE,
CoreRules.PROJECT_CALC_MERGE,
CoreRules.FILTER_INTO_JOIN,
EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE,
// ... other rules ...
);
Program program = Programs.of(RuleSets.ofList(rules));
RelNode optimizerRelTree = program.run(planner, relNode,
relNode.getTraitSet().plus(EnumerableConvention.INSTANCE),
Collections.emptyList(), Collections.emptyList());The optimized physical plan shows Enumerable operators, indicating that the query will be executed by interpreting the plan in Java.
EnumerableSort(sort0=[$0], dir0=[ASC])
EnumerableAggregate(group=[{0, 1, 2}], EXPR$3=[SUM($3)])
EnumerableCalc(expr#0..6=[{inputs}], proj#0..2=[{exprs}], price=[$t6])
EnumerableHashJoin(condition=[=($0, $4)], joinType=[inner])
EnumerableCalc(...)
EnumerableTableScan(table=[[s, users]])
EnumerableTableScan(table=[[s, orders]])Execution binds the physical plan to a data context and iterates over the resulting rows.
EnumerableRel enumerable = (EnumerableRel) optimizerRelTree;
Map<String, Object> internalParameters = new LinkedHashMap<>();
EnumerableRel.Prefer prefer = EnumerableRel.Prefer.ARRAY;
Bindable bindable = EnumerableInterpretable.toBindable(internalParameters, null, enumerable, prefer);
Enumerable bind = bindable.bind(new SimpleDataContext(rootSchema.plus()));
Enumerator enumerator = bind.enumerator();
while (enumerator.moveNext()) {
Object[] values = (Object[]) enumerator.current();
System.out.println(String.join(",", Arrays.stream(values).map(Object::toString).toArray(String[]::new)));
}The execution prints the aggregated results:
1,Jack,28,42.40
2,John,21,199.90
4,Peter,24,2499.90Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
