Antlr 4: Doing it the wrong way

In my day job I work as a backend developer, most of the live-ops tasks fall onto my team, and we have to build a lot of tools to help us with that.

On of those tools is a filter engine, a chain of filters that can be used to change the behavior of the code without deploying new code. It works by parsing a filter written in JSON, it goes through the filters defined in the JSON, and then it applies them to the code. The problem with using JSON is that it is not a very expressive language, and it is not very easy to read or write. And if you are dealing with complex filters, it can become a nightmare to maintain with all the brackets and commas.

So, I decided to use Antlr 4 to build a filter engine grammar that would allow us to write filters in a more expressive way. I could very easily use the filter-engine core and just switch out the JSON for a more expressive language, one that is purpose-built for handling our filters. (a DSL if you will)

This post is about sharing what I learned while trying to use Antlr 4 to build a filter engine. It is not a tutorial but a set of lessons learned and mistakes made along the way. I hope it will help you avoid the same pitfalls I encountered.

What even is Antlr 4?

I am going to assume you know what Antlr 4 is, if you don't, you can read the official documentation or watch Terence Parr's introduction video.

The core concept of Antlr is simple, it is a parser generator that takes a grammar file and generates a parser for that grammar. The parser can then be used to parse input and produce an abstract syntax tree (AST) that represents the structure of the input. What differentiates Antlr from other parser generators such as Flex is that it is exceedingly simple to use, so simple in fact, it even has a tool for generating visitors and listeners for the AST, which makes it very easy to traverse the tree and perform actions on it.

Trials and Tribulations

What am I even doing?

Working on this little project was rough, I had no prior experience using a parser generator, and I had no idea what I was doing. What made it even worse was the lack of real documentation, the Antlr project is so old and grizzled that the documentation is sparse and outdated, and the examples are not very helpful. The docs just assume you know what you are doing, and they don't explain the concepts very well.

If you are in a similar situation, I recommend two things:

Watch Terence Parr's video, it is a great overview of the concepts and how to use Antlr. I've linked it above.
I know this might sound bad but ask AI, it is a surprisingly good resource for learning Antlr. It helped me figure out some of the more 'conventional' standard practices that comes with designing a grammar.

Lexer? Parser?

This is rather silly, but I spent a lot of time trying to figure out what the difference between a lexer and a parser is in context of Antlr. Turns out there is no real difference between them in Antlr, you can literally put everything is a single file. I just ended up putting my Lexer tokens at the end of the same grammar file.

grammar FilterScript;

// Parser rules
script      : (variableDeclaration | statement)+ EOF ;
statement   : expression ;
expression  : orExpression ;

...

// Lexer rules
VAR : 'VAR' ;
TRUE : 'true' ;
FALSE : 'false' ;
OR : 'OR' ;
AND : 'AND' ;
NOT : 'NOT' ;

...

Logical Operators and Precedence

My filter engine needed to support logical operators such as AND, OR, and NOT. I mean, how else would you build a filter engine? I had no idea how to handle operator precedence, and I spent a lot of time trying to figure it out. Turns out it is very simple and I am just going to share it here:

orExpression    : andExpression (OR andExpression)* ;
andExpression   : notExpression (AND notExpression)* ;
notExpression   : NOT notExpression | primaryExpression ;
primaryExpression : YourPrimaryExpressionRule

Wow, this is... Meaningless. Okay, here is my attempt at explaining it:

'orExpression' is the highest precedence, it can contain multiple 'andExpression's separated by 'OR'.
'andExpression' is the next highest precedence, it can contain multiple 'notExpression's separated by 'AND'.
'notExpression' is the next highest precedence, it can contain a 'NOT' operator followed by another 'notExpression' or a 'primaryExpression'.
'primaryExpression' is the lowest precedence, it can be any expression that does not contain logical operators. This means that 'NOT' has the highest precedence, followed by 'AND', and then 'OR'. This is how logical operators are usually handled in programming languages, so it makes sense to do it this way.

And here is an example of how this works:

var minLevel = 10;
var highValue = 50;

LEVEL >= $minLevel AND NOT COUNTRY == "US" OR PAYER true AND VALUE_X > $highValue;

Due to operator precedence, the first expression is interpreted as:

(LEVEL >= $minLevel AND (NOT COUNTRY == "US")) OR (PAYER true AND VALUE_X > $highValue)

We can use parentheses to change the precedence, for example:

LEVEL >= $minLevel AND (NOT COUNTRY == "US" OR PAYER true) AND VALUE_X > $highValue;

Not only does this make the expression more readable, but it also allows us to change the precedence of the operators. Turns out this is a common practice in programming languages, and it is a good idea to support it in your filter engine.

Infix vs Prefix Logical Operators

I've been showing you examples of infix logical operators, where the operator is placed between the operands. This is the most common way to write logical expressions, and it is what most programming languages use. The problem with infix operators is that they are hard to parse, and they require a lot of special handling in the grammar. What I did was to use infix operators but parse them as if they were prefix operators. Something like this:

Infix: (X AND Y) OR Z
Prefix: OR (
    AND (X, Y),
    Z
)

Using infix grammar and converting it into prefix form makes dealing with operator logic on code so much easier. It is the best of both worlds, you get the readability of infix operators and the simplicity of prefix operators.

I can say much more about this, but I will leave it at that for now. If you want to learn more about language design and operator precedence, I recommend reading Language Implementation Patterns by Terence Parr. It is a great book that covers a lot of the concepts I have mentioned in this post.

Integration with Your Application

This is where the real fun begins. Once you have your grammar and parser set up, you need to integrate it with your application. This is where you will use the generated parser to parse input and produce an abstract syntax tree (AST). You can then use the AST to evaluate the expressions and apply the filters to your code. Antlr provides a visitor pattern that makes it easy to traverse the tree and perform actions on it.

Visiting the Parse Tree

Once you have your grammar and parser set up, you can start using it to parse input and produce an abstract syntax tree (AST). Antlr provides a visitor pattern that makes it easy to traverse the tree and perform actions on it. You can generate a visitor class using the Antlr tool, and then extend it to implement your own logic for evaluating the expressions. Here is an example of how to use the visitor pattern to evaluate a simple expression:

public class FilterEvaluator extends FilterScriptBaseVisitor<Boolean> {
    private final Map<String, Object> variables;

    public FilterEvaluator(Map<String, Object> variables) {
        this.variables = variables;
    }

    @Override
    public Boolean visitOrExpression(FilterScriptParser.OrExpressionContext ctx) {
        Boolean left = visit(ctx.andExpression(0));
        for (int i = 1; i < ctx.andExpression().size(); i++) {
            left = left || visit(ctx.andExpression(i));
        }
        return left;
    }

    @Override
    public Boolean visitAndExpression(FilterScriptParser.AndExpressionContext ctx) {
        Boolean left = visit(ctx.notExpression(0));
        for (int i = 1; i < ctx.notExpression().size(); i++) {
            left = left && visit(ctx.notExpression(i));
        }
        return left;
    }

    @Override
    public Boolean visitNotExpression(FilterScriptParser.NotExpressionContext ctx) {
        if (ctx.NOT() != null) {
            return !visit(ctx.notExpression());
        } else {
            return visit(ctx.primaryExpression());
        }
    }

    @Override
    public Boolean visitPrimaryExpression(FilterScriptParser.PrimaryExpressionContext ctx) {
        // Handle primary expressions like comparisons, variable references, etc.
        // This is where you would implement your logic for evaluating primary expressions.
        return true; // Placeholder
    }
}

I know, looks like a lot of code, but it is actually quite simple. The visitor pattern allows you to define methods for each rule in your grammar, and then you can implement the logic for evaluating the expressions in those methods. I won't go into detail about this but once you make a class that extends the generated visitor class, you can override the methods for each rule in your grammar and implement your own logic for evaluating the expressions.

Conclusion

I've made a presentation about this tool at my work place, introducing my team to Antlr 4 and how to use it to build a filter engine. It was a great experience, and I learned a lot about language design and parser generators. I am attaching below a part of this presentation (I can't share the whole thing as it contains somewhat sensitive information), it is a summary of what I have learned while working on this project. Hopefully it will help you avoid some of the pitfalls I encountered.

Slides

I know writing your own language is a total overkill for most applications, but sometimes creating your own representation of a problem is the best way to solve it

Mediocre Electron Herding

Antlr 4, Doing it the wrong way