1. Overview

In this tutorial, we’ll do a quick overview of the ANTLR parser generator and show some real-world applications.

2. ANTLR

ANTLR (ANother Tool for Language Recognition) is a tool for processing structured text.

It does this by giving us access to language processing primitives like lexers, grammars, and parsers as well as the runtime to process text against them.

It’s often used to build tools and frameworks. For example, Hibernate uses ANTLR for parsing and processing HQL queries, and Elasticsearch uses it for its Painless scripting language.

And Java is just one binding. ANTLR also offers bindings for C#, Python, JavaScript, Go, C++ and Swift.

3. Configuration

First of all, let’s start by adding antlr4-runtime to our pom.xml:

<dependency>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-runtime</artifactId>
    <version>4.7.1</version>
</dependency>

And also the antlr4-maven-plugin:

<plugin>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-maven-plugin</artifactId>
    <version>4.7.1</version>
    <executions>
        <execution>
            <goals>
                <goal>antlr4</goal>
            </goals>
        </execution>
    </executions>
</plugin>

It’s the plugin’s job to generate code from the grammars we specify.

4. How Does it Work?

When we create a parser with the ANTLR Maven plugin, we follow three simple steps:

  • prepare a grammar file
  • generate sources
  • create the listener

So, let’s see these steps in action.

5. Using an Existing Grammar

Let’s first use ANTLR to analyze code for methods with bad casing:

public class SampleClass {
 
    public void DoSomethingElse() {
        //...
    }
}

Simply put, we’ll validate that all method names in our code start with a lowercase letter.

5.1. Prepare a Grammar File

What’s nice is that there are already several grammar files out there that can suit our purposes.

Let’s use the Java8.g4 grammar file which we found in ANTLR’s GitHub grammar repo.

We can create the src/main/antlr4 directory and download it there.

5.2. Generate Sources

ANTLR works by generating Java code corresponding to the grammar files that we give it, and the Maven plugin makes it easy:

mvn package

By default, this will generate several files under the target/generated-sources/antlr4 directory:

  • Java8.interp
  • Java8Listener.java
  • Java8BaseListener.java
  • Java8Lexer.java
  • Java8Lexer.interp
  • Java8Parser.java
  • Java8.tokens
  • Java8Lexer.tokens

Notice that the names of those files are based on the name of the grammar file.

We’ll need the Java8Lexer and the Java8Parser files later when we test. For now, though, we need the Java8BaseListener for creating our UppercaseMethodListener.

5.3. Creating UppercaseMethodListener

Based on the Java8 grammar that we used, Java8BaseListener has several methods that we can override, each one corresponding to a rule in the grammar file.

For example, the grammar defines the method name and parameter list like so:

methodDeclarator
	:	Identifier '(' formalParameterList? ')' dims?
	;

And so Java8BaseListener has a method enterMethodDeclarator which will be invoked each time this pattern is encountered.

So, let’s override enterMethodDeclarator, pull out the Identifier, and perform our check:

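import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.tree.TerminalNode;

public class UppercaseMethodListener extends Java8BaseListener {

    private final List<String> errors = new ArrayList<>();

    public List<String> getErrors() {
        return errors;
    }
 
    @Override
    public void enterMethodDeclarator(Java8Parser.MethodDeclaratorContext ctx) {
        // Identifier() returns the terminal node that holds the method name
        TerminalNode node = ctx.Identifier();
        String methodName = node.getText();

        // record an error when the name starts with an uppercase letter
        if (Character.isUpperCase(methodName.charAt(0))) {
            String error = String.format("Method %s is uppercased!", methodName);
            errors.add(error);
        }
    }
}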

5.4. Testing

Now, let’s do some testing. First, we construct the lexer:

String javaClassContent = "public class SampleClass { void DoSomething(){} }";
Java8Lexer lexer = new Java8Lexer(CharStreams.fromString(javaClassContent));

Then, we instantiate the parser:

CommonTokenStream tokens = new CommonTokenStream(lexer);
Java8Parser parser = new Java8Parser(tokens);
ParseTree tree = parser.compilationUnit();

And then, the walker and the listener:

ParseTreeWalker walker = new ParseTreeWalker();
UppercaseMethodListener listener = new UppercaseMethodListener();

Lastly, we tell ANTLR to walk through our sample class:

walker.walk(listener, tree);

assertThat(listener.getErrors().size(), is(1));
assertThat(listener.getErrors().get(0),
  is("Method DoSomething is uppercased!"));
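
To make the callback flow less abstract, here is a stripped-down, library-free sketch of what a depth-first walk with enter/exit callbacks might look like. The SketchNode, SketchListener, and SimpleWalker names are hypothetical and are not part of ANTLR. The real ParseTreeWalker works on ParseTree nodes and dispatches to rule-specific methods such as enterMethodDeclarator, so this only illustrates the order of the calls:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parse tree node: a rule name plus child nodes.
class SketchNode {
    final String rule;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String rule, SketchNode... kids) {
        this.rule = rule;
        for (SketchNode kid : kids) {
            children.add(kid);
        }
    }
}

// Hypothetical listener: one callback on entering a node, one on leaving it.
interface SketchListener {
    void enter(SketchNode node);
    void exit(SketchNode node);
}

class SimpleWalker {
    // Depth-first walk: enter the node, walk its children in order, then exit it.
    static void walk(SketchListener listener, SketchNode node) {
        listener.enter(node);
        for (SketchNode child : node.children) {
            walk(listener, child);
        }
        listener.exit(node);
    }

    // Records the order in which the callbacks fire for a given tree.
    static List<String> trace(SketchNode root) {
        List<String> events = new ArrayList<>();
        walk(new SketchListener() {
            @Override
            public void enter(SketchNode node) {
                events.add("enter " + node.rule);
            }

            @Override
            public void exit(SketchNode node) {
                events.add("exit " + node.rule);
            }
        }, root);
        return events;
    }
}
```

For a tree with a classDeclaration node that has a single methodDeclarator child, trace returns the events enter classDeclaration, enter methodDeclarator, exit methodDeclarator, exit classDeclaration, in that order. A listener that only cares about one rule, like ours, simply leaves the other callbacks empty.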

6. Building our Grammar

Now, let’s try something just a little bit more complex, like parsing log files:

2018-May-05 14:20:18 INFO some error occurred
2018-May-05 14:20:19 INFO yet another error
2018-May-05 14:20:20 INFO some method started
2018-May-05 14:20:21 DEBUG another method started
2018-May-05 14:20:21 DEBUG entering awesome method
2018-May-05 14:20:24 ERROR Bad thing happened

Because we have a custom log format, we’re going to first need to create our own grammar.

6.1. Prepare a Grammar File

First, let’s see if we can create a mental map of what each log line looks like in our file:

<datetime> <level> <message>

Or if we go one more level deep, we might say:

<datetime> := <year><dash><month><dash><day> …

And so on. It’s important to consider this so we can decide at what level of granularity we want to parse the text.

A grammar file is basically a set of lexer and parser rules. Simply put, lexer rules describe how characters are grouped into tokens, while parser rules describe how those tokens combine into larger structures.

Let’s start by defining fragments which are reusable building blocks for lexer rules.

fragment DIGIT : [0-9];
fragment TWODIGIT : DIGIT DIGIT;
fragment LETTER : [A-Za-z];

Next, let’s define the remaining lexer rules:

DATE : TWODIGIT TWODIGIT '-' LETTER LETTER LETTER '-' TWODIGIT;
TIME : TWODIGIT ':' TWODIGIT ':' TWODIGIT;
TEXT : LETTER+;
CRLF : '\r'? '\n' | '\r';
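
As a rough way to reason about what these rules accept, we can mirror them with plain java.util.regex patterns. This is only a sketch for experimenting outside ANTLR: the LogTokenPatterns class and its method names are made up for illustration, and ANTLR’s lexer does not work by testing whole strings like this.

```java
import java.util.regex.Pattern;

class LogTokenPatterns {
    // Fragments: reusable pieces that never become tokens on their own.
    static final String DIGIT = "[0-9]";
    static final String TWODIGIT = DIGIT + DIGIT;
    static final String LETTER = "[A-Za-z]";

    // Token rules composed from the fragments, mirroring DATE, TIME, and TEXT.
    static final Pattern DATE = Pattern.compile(
        TWODIGIT + TWODIGIT + "-" + LETTER + LETTER + LETTER + "-" + TWODIGIT);
    static final Pattern TIME = Pattern.compile(
        TWODIGIT + ":" + TWODIGIT + ":" + TWODIGIT);
    static final Pattern TEXT = Pattern.compile(LETTER + "+");

    // Each helper checks whether the whole string fits one token rule.
    static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    static boolean isTime(String s) {
        return TIME.matcher(s).matches();
    }

    static boolean isText(String s) {
        return TEXT.matcher(s).matches();
    }
}
```

With these patterns, isDate accepts 2018-May-05 but rejects 2018-05-05, since the month must be three letters. Note also that TEXT covers letters only, so a message containing digits or punctuation would need a broader rule.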

With these building blocks in place, we can build parser rules for the basic structure:

log : entry+;
entry : timestamp ' ' level ' ' message CRLF;

And then we’ll add the details for timestamp:

timestamp : DATE ' ' TIME;

For level:

level : 'ERROR' | 'INFO' | 'DEBUG';

And for message:

message : (TEXT | ' ')+;

And that’s it! Our grammar is ready to use. We’ll save it as Log.g4, with a grammar Log; declaration as its first line, under the src/main/antlr4 directory as before.
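
Before generating any code, it can help to sanity-check sample lines against the shape the parser rules describe. The following is a minimal sketch that approximates the entry rule with a single regular expression and named groups. The LogLineSketch class is hypothetical, and a regex is of course not how ANTLR parses, so treat it purely as a way to see which lines fit the structure:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineSketch {
    // Rough regex counterpart of:
    //   entry : timestamp ' ' level ' ' message CRLF;
    // Each named group stands in for one parser rule.
    private static final Pattern ENTRY = Pattern.compile(
        "(?<timestamp>[0-9]{4}-[A-Za-z]{3}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})"
        + " (?<level>ERROR|INFO|DEBUG)"
        + " (?<message>[A-Za-z ]+)"
        + "(?:\\r?\\n|\\r)");

    // Returns {timestamp, level, message}, or null if the line does not fit.
    static String[] split(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group("timestamp"), m.group("level"), m.group("message")
        };
    }
}
```

One thing this makes visible is that entry ends with CRLF, so a line without a trailing line break does not fit the rule exactly, and neither does a level other than the three we listed.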

6.2. Generate Sources

Recall that this is just a quick mvn package, and that this will create several files like LogBaseListener, LogParser, and so on, based on the name of our grammar.

6.3. Create our Log Listener

Now, we are ready to implement our listener, which we’ll ultimately use to parse a log file into Java objects.

So, let’s start with a simple model class for the log entry:

public class LogEntry {

    private LogLevel level;
    private String message;
    private LocalDateTime timestamp;
   
    // getters and setters
}
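
LogEntry refers to a LogLevel type that isn’t shown here. A minimal sketch, assuming it’s a plain enum whose constants match the literals in our level rule (the LogLevelSketch helper is hypothetical and only demonstrates the lookup):

```java
// Assumed definition: one constant per literal in the grammar's level rule.
enum LogLevel {
    DEBUG, INFO, ERROR
}

class LogLevelSketch {
    // Turns the matched text of a level back into an enum constant.
    // Enum.valueOf throws IllegalArgumentException for unknown names.
    static LogLevel fromText(String text) {
        return LogLevel.valueOf(text);
    }
}
```

Keeping the constant names identical to the grammar literals means the matched text can be passed straight to valueOf without any mapping table.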

Now, we need to subclass LogBaseListener as before:

public class LogListener extends LogBaseListener {

    private List<LogEntry> entries = new ArrayList<>();
    private LogEntry current;

current will hold onto the current log line, which we can reinitialize each time we enter an entry, again based on our grammar:

    @Override
    public void enterEntry(LogParser.EntryContext ctx) {
        this.current = new LogEntry();
    }

Next, we’ll use enterTimestamp, enterLevel, and enterMessage for setting the appropriate LogEntry properties:

    @Override
    public void enterTimestamp(LogParser.TimestampContext ctx) {
        this.current.setTimestamp(
          LocalDateTime.parse(ctx.getText(), DEFAULT_DATETIME_FORMATTER));
    }
    
    @Override
    public void enterMessage(LogParser.MessageContext ctx) {
        this.current.setMessage(ctx.getText());
    }

    @Override
    public void enterLevel(LogParser.LevelContext ctx) {
        this.current.setLevel(LogLevel.valueOf(ctx.getText()));
    }

And finally, let’s use the exitEntry method in order to create and add our new LogEntry:

    @Override
    public void exitEntry(LogParser.EntryContext ctx) {
        this.entries.add(this.current);
    }

    // ... getter for entries
}

Note, by the way, that our LogListener isn’t thread-safe, since it keeps the entry being built in a mutable field!
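
The listener relies on a DEFAULT_DATETIME_FORMATTER constant that isn’t defined in the snippets above. One plausible definition, assuming the timestamp text always has the shape of our sample lines, could look like the following. The pattern and the TimestampSketch class are assumptions for illustration:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class TimestampSketch {
    // Assumed pattern for timestamps such as "2018-May-05 14:20:24".
    // Locale.ENGLISH keeps English month abbreviations like "May" parseable
    // regardless of the default locale of the machine.
    static final DateTimeFormatter DEFAULT_DATETIME_FORMATTER =
        DateTimeFormatter.ofPattern("yyyy-MMM-dd HH:mm:ss", Locale.ENGLISH);

    // Parses the text of a timestamp node into a LocalDateTime.
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, DEFAULT_DATETIME_FORMATTER);
    }
}
```

Here, parse("2018-May-05 14:20:24") yields the same value as LocalDateTime.of(2018, 5, 5, 14, 20, 24). Without an explicit locale, parsing the month name could fail on systems whose default language isn’t English.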

6.4. Testing

And now we can test again as we did last time:

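@Test
public void whenLogContainsOneErrorLogEntry_thenOneErrorIsReturned()
  throws Exception {
 
    // each entry ends with a line break, as the grammar requires
    String logLine = "2018-May-05 14:20:24 ERROR Bad thing happened\n";

    // instantiate the lexer, the parser, and the walker
    LogLexer lexer = new LogLexer(CharStreams.fromString(logLine));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    LogParser logParser = new LogParser(tokens);
    ParseTreeWalker walker = new ParseTreeWalker();

    LogListener listener = new LogListener();
    walker.walk(listener, logParser.log());
    LogEntry entry = listener.getEntries().get(0);
 
    assertThat(entry.getLevel(), is(LogLevel.ERROR));
    assertThat(entry.getMessage(), is("Bad thing happened"));
    assertThat(entry.getTimestamp(), is(LocalDateTime.of(2018,5,5,14,20,24)));
}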

7. Conclusion

In this article, we focused on how to create a custom parser for our own language using ANTLR.

We also saw how to use existing grammar files and apply them for very simple tasks like code linting.

As always, all the code used here can be found over on GitHub.
