Big Data 20 min read

Applying ANTLR4 for Arithmetic Calculator and SQL Parsing over CSV Data

The article demonstrates how ANTLR4 can replace manual parsing by building a four‑operation calculator and a trimmed SQL parser for Presto, showing the workflow from grammar definition to generated lexer/parser and visitor code, then applying the SQL parser to query CSV data efficiently.

vivo Internet Technology
vivo Internet Technology
vivo Internet Technology
Applying ANTLR4 for Arithmetic Calculator and SQL Parsing over CSV Data

The article introduces the rapid growth of big data since its first appearance in a Chinese government work report in 2014 and explains why traditional databases cannot handle PB‑scale data. It motivates the use of specialized tools (Hive, Spark, Presto, Kylin, Druid, ClickHouse, Elasticsearch) that all expose SQL‑like interfaces, which in turn requires custom SQL parsers.

ANTLR, a mature parser generator created in 1989, is presented as a solution for building such parsers. Its ability to generate lexers and parsers for many languages (Java, C, Python, SQL, etc.) makes it suitable for both educational and production scenarios.

1. Manual implementation of a four‑operation calculator

A simple stack‑based calculator is shown, illustrating how operator precedence can be handled without a parser generator.

package org.example.calc;
import java.util.*;
public class CalcByHand {
    public static Set
opSet1 = new HashSet<>();
    public static Set
opSet2 = new HashSet<>();
    static{
        opSet1.add("+");
        opSet1.add("-");
        opSet2.add("*");
        opSet2.add("/");
    }
    public static void main(String[] args) {
        String exp="1+3*4";
        String[] tokens = exp.split("((?<=\\+|\\-|\\*|\\/)|(?=\\+|\\-|\\*|\\/))");
        Stack
opStack = new Stack<>();
        Stack
numStack = new Stack<>();
        int proi=1;
        for(String token: tokens){
            token = token.trim();
            if(opSet1.contains(token)){
                opStack.push(token);
                proi=1;
            }else if(opSet2.contains(token)){
                proi=2;
                opStack.push(token);
            }else{
                numStack.push(token);
                if(proi==2){
                    calcExp(opStack,numStack);
                }
            }
        }
        while (!opStack.isEmpty()){
            calcExp(opStack,numStack);
        }
        String finalVal = numStack.pop();
        System.out.println(finalVal);
    }
    private static void calcExp(Stack
opStack, Stack
numStack) {
        double right=Double.valueOf(numStack.pop());
        double left = Double.valueOf(numStack.pop());
        String op = opStack.pop();
        String val;
        switch (op){
            case "+": val = String.valueOf(left+right); break;
            case "-": val = String.valueOf(left-right); break;
            case "*": val = String.valueOf(left*right); break;
            case "/": val = String.valueOf(left/right); break;
            default: throw new UnsupportedOperationException("unsupported");
        }
        numStack.push(val);
    }
}

While functional, this approach lacks support for parentheses and error handling.

2. ANTLR4‑based implementation

The article explains the three‑step ANTLR workflow: write a .g4 grammar, run the ANTLR tool to generate lexer/parser code, and then write a Visitor or Listener to process the parse tree.

Example grammar for a simple calculator (LabeledExpr.g4):

grammar LabeledExpr; // rename to distinguish from Expr.g4
prog:   stat+ ;
stat:   expr NEWLINE                # printExpr
    |   ID '=' expr NEWLINE         # assign
    |   NEWLINE                     # blank
    ;
expr:   expr op=('*'|'/') expr      # MulDiv
    |   expr op=('+'|'-') expr      # AddSub
    |   INT                         # int
    |   ID                          # id
    |   '(' expr ')'                # parens
    ;
MUL :   '*'; DIV : '/'; ADD : '+'; SUB : '-';
ID  :   [a-zA-Z]+ ; INT : [0-9]+ ;
NEWLINE: '\r'? '\n' ; WS  : [ \t]+ -> skip ;

Running the tool:

antlr4 -package org.example.calc -no-listener -visitor .\LabeledExpr.g4

Generated Java classes (e.g., LabeledExprLexer.java , LabeledExprParser.java ) are then used in a driver program:

ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
EvalVisitor eval = new EvalVisitor();
eval.visit(tree);

Visitor implementation for addition/subtraction:

@Override
public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
    int left = visit(ctx.expr(0));
    int right = visit(ctx.expr(1));
    if (ctx.op.getType() == LabeledExprParser.ADD) return left + right;
    return left - right; // SUB
}

Both Visitor and Listener modes are discussed, with a comparison of their characteristics.

3. Building a SQL parser from Presto source

The article shows how to trim Presto’s massive SqlBase.g4 to a minimal grammar that can parse simple SELECT … FROM … statements. The trimmed grammar (SelectBase.g4) is presented in full.

grammar SqlBase;
... // (trimmed rules shown in the article)
SELECT: 'SELECT';
FROM: 'FROM';
IDENTIFIER: (LETTER | '_') (LETTER | DIGIT | '_' | '@' | ':')*;
WS: [ \r\n\t]+ -> channel(HIDDEN);

After generating the parser with:

antlr4 -package org.example.antlr -no-listener -visitor .\SqlBase.g4

the article demonstrates a Visitor that extracts the SELECT list and FROM clause, builds a QuerySpecification object, and then uses Apache Commons CSV to read the corresponding CSV file and print results.

// Example of extracting table and fields
QuerySpecification spec = (QuerySpecification) query.getQueryBody();
Table table = (Table) spec.getFrom().get();
List
items = spec.getSelect().getSelectItems();
List
fieldNames = new ArrayList<>();
for (SelectItem item : items) {
    SingleColumn col = (SingleColumn) item;
    fieldNames.add(((Identifier)col.getExpression()).getValue());
}
String file = String.format("./data/%s.csv", table.getName());
Reader in = new FileReader(file);
Iterable
records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in);
// ... format and print rows ...

Sample output screenshots illustrate successful queries such as SELECT City FROM cities against a CSV dataset.

4. Conclusion

The article concludes that ANTLR4 provides a powerful, reusable way to implement language front‑ends, from simple calculators to full‑featured SQL parsers, and that experimenting with trimmed grammars from real‑world projects (Presto) deepens understanding of compiler principles and big‑data query engines.

JavaBig DataCompilerANTLRSQL parsingCSVpresto
vivo Internet Technology
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.