Applying ANTLR4 for Arithmetic Calculator and SQL Parsing over CSV Data
The article demonstrates how ANTLR4 can replace manual parsing by building a four‑operation calculator and a trimmed SQL parser for Presto, showing the workflow from grammar definition to generated lexer/parser and visitor code, then applying the SQL parser to query CSV data efficiently.
The article introduces the rapid growth of big data since its first appearance in a Chinese government work report in 2014 and explains why traditional databases cannot handle PB‑scale data. It motivates the use of specialized tools (Hive, Spark, Presto, Kylin, Druid, ClickHouse, Elasticsearch) that all expose SQL‑like interfaces, which in turn requires custom SQL parsers.
ANTLR, a mature parser generator created in 1989, is presented as a solution for building such parsers. Its ability to generate lexers and parsers for many languages (Java, C, Python, SQL, etc.) makes it suitable for both educational and production scenarios.
1. Manual implementation of a four‑operation calculator
A simple stack‑based calculator is shown, illustrating how operator precedence can be handled without a parser generator.
package org.example.calc;
import java.util.*;
public class CalcByHand {
public static Set
opSet1 = new HashSet<>();
public static Set
opSet2 = new HashSet<>();
static{
opSet1.add("+");
opSet1.add("-");
opSet2.add("*");
opSet2.add("/");
}
public static void main(String[] args) {
String exp="1+3*4";
String[] tokens = exp.split("((?<=\\+|\\-|\\*|\\/)|(?=\\+|\\-|\\*|\\/))");
Stack
opStack = new Stack<>();
Stack
numStack = new Stack<>();
int proi=1;
for(String token: tokens){
token = token.trim();
if(opSet1.contains(token)){
opStack.push(token);
proi=1;
}else if(opSet2.contains(token)){
proi=2;
opStack.push(token);
}else{
numStack.push(token);
if(proi==2){
calcExp(opStack,numStack);
}
}
}
while (!opStack.isEmpty()){
calcExp(opStack,numStack);
}
String finalVal = numStack.pop();
System.out.println(finalVal);
}
private static void calcExp(Stack
opStack, Stack
numStack) {
double right=Double.valueOf(numStack.pop());
double left = Double.valueOf(numStack.pop());
String op = opStack.pop();
String val;
switch (op){
case "+": val = String.valueOf(left+right); break;
case "-": val = String.valueOf(left-right); break;
case "*": val = String.valueOf(left*right); break;
case "/": val = String.valueOf(left/right); break;
default: throw new UnsupportedOperationException("unsupported");
}
numStack.push(val);
}
}While functional, this approach lacks support for parentheses and error handling.
2. ANTLR4‑based implementation
The article explains the three‑step ANTLR workflow: write a .g4 grammar, run the ANTLR tool to generate lexer/parser code, and then write a Visitor or Listener to process the parse tree.
Example grammar for a simple calculator (LabeledExpr.g4):
grammar LabeledExpr; // rename to distinguish from Expr.g4
prog: stat+ ;
stat: expr NEWLINE # printExpr
| ID '=' expr NEWLINE # assign
| NEWLINE # blank
;
expr: expr op=('*'|'/') expr # MulDiv
| expr op=('+'|'-') expr # AddSub
| INT # int
| ID # id
| '(' expr ')' # parens
;
MUL : '*'; DIV : '/'; ADD : '+'; SUB : '-';
ID : [a-zA-Z]+ ; INT : [0-9]+ ;
NEWLINE: '\r'? '\n' ; WS : [ \t]+ -> skip ;Running the tool:
antlr4 -package org.example.calc -no-listener -visitor .\LabeledExpr.g4Generated Java classes (e.g., LabeledExprLexer.java , LabeledExprParser.java ) are then used in a driver program:
ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
EvalVisitor eval = new EvalVisitor();
eval.visit(tree);Visitor implementation for addition/subtraction:
@Override
public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
int left = visit(ctx.expr(0));
int right = visit(ctx.expr(1));
if (ctx.op.getType() == LabeledExprParser.ADD) return left + right;
return left - right; // SUB
}Both Visitor and Listener modes are discussed, with a comparison of their characteristics.
3. Building a SQL parser from Presto source
The article shows how to trim Presto’s massive SqlBase.g4 to a minimal grammar that can parse simple SELECT … FROM … statements. The trimmed grammar (SelectBase.g4) is presented in full.
grammar SqlBase;
... // (trimmed rules shown in the article)
SELECT: 'SELECT';
FROM: 'FROM';
IDENTIFIER: (LETTER | '_') (LETTER | DIGIT | '_' | '@' | ':')*;
WS: [ \r\n\t]+ -> channel(HIDDEN);After generating the parser with:
antlr4 -package org.example.antlr -no-listener -visitor .\SqlBase.g4the article demonstrates a Visitor that extracts the SELECT list and FROM clause, builds a QuerySpecification object, and then uses Apache Commons CSV to read the corresponding CSV file and print results.
// Example of extracting table and fields
QuerySpecification spec = (QuerySpecification) query.getQueryBody();
Table table = (Table) spec.getFrom().get();
List
items = spec.getSelect().getSelectItems();
List
fieldNames = new ArrayList<>();
for (SelectItem item : items) {
SingleColumn col = (SingleColumn) item;
fieldNames.add(((Identifier)col.getExpression()).getValue());
}
String file = String.format("./data/%s.csv", table.getName());
Reader in = new FileReader(file);
Iterable
records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in);
// ... format and print rows ...Sample output screenshots illustrate successful queries such as SELECT City FROM cities against a CSV dataset.
4. Conclusion
The article concludes that ANTLR4 provides a powerful, reusable way to implement language front‑ends, from simple calculators to full‑featured SQL parsers, and that experimenting with trimmed grammars from real‑world projects (Presto) deepens understanding of compiler principles and big‑data query engines.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.