ANTLR Tutorial - Dependency injection language

It’s been a while since the last time I wrote something here. It turns out that I have been tinkering with ANTLR for the last nine months as part of my new job. So my idea is to write a series of posts to teach you how to create your own languages with ANTLR.

ANTLR is a pretty good top-down LL(*) lexer and parser that allows us to define languages/DSLs without the restrictions of developing it inside a host language, as Java. As a drawback, writting a language with a parser is much more complicated that hosted DSLs.

I really think that the best way of learning any technology is to practice. That is why I am going to do it in a practical way, building a small language: a dependency injection language.

We are going to implement a language to define beans and give values to its attributes via setters or contructors. The value types allowed for attributes will be:

  • References to other beans
  • Basic types: String, Integer, Long, boolean.
  • Lists
  • Maps

In addition, it will allow us to define a default prefix package for beans that will ease the declaration of the bean type. So for example, we might define a bean type as “dao.UserDAO” instead of the more verbose “com.centuryminds.tutorial.dao.UserDAO”.

# This is a comment
default package prefix com.centuryminds.tutorial

def bean userDAO1 of type dao.UserDAO as
   tableName: "test"
   cacheSize: 200
end

def bean userDAO2 of type dao.UserDAO as
   init "test", 200 # constructor params
   otherAttribute: "other"
end

def bean userManager of type manager.UserManager as
   userDAO: userDAO1
end

def bean myBean of type com.nodefaultpackage.bean.Mybean as
   listVar: ["test", "othertest"]
   mapVar: {"test": 3, "another": 4}
end

ANTLR basics

A parser constructed with a tool like ANTLR needs to be splitted into several phases. In this way each part keeps simpler. These phases generally are the following:

  • Lexer: It reads a character stream and generates a sequential list of recognized tokens.
  • Parser: It receives a list of tokens and identifies the structure of the sentences according to the grammar. Once we have identified the structure we can generate two kinds of outputs: 1) Simply generate the final compiler output or 2) Go further and generate an Abstract Syntax Tree (AST henceforth) for later validations/manipulations.
  • AST Semantic Validation Phase: Once we’ve finished validating the source structure, we have to validate it semantically. A semantic error means something is grammatically correct but completly meaningless. This phase can be invoked plenty of times and/or work in association with a pre-semantic phase, depending on the grammar complexity.
  • AST Manipulation Phases: Once we know that what the source code “says” is correct and meaningful, we can pass it through a set of AST manipulation phases that transform the tree representing the code into something more specialized, optimized, etc. For example, removing unused variables, adding extra functionality, static check for nulls, and so forth.
  • Output Generation Phase: This is the last phase of a compiler, in which we generate the output: x86 code, Java code, bytecode, xml, whatever you want.

Do we need all the phases? Of course not, it depends totally on the complexity of the grammar. The mandatory ones are lexer and parser.

As for our small language, we’ll implement the following phases:

  • Lexer/Parser: With ANTLR you can define both in the same grammar file.
  • Pre-semantic phase: We are going to identify all the beans present in a file and its associated class types. This will permit us to not depend on the order the beans are defined (without this phase we wouldn’t be able to reference another bean that is defined below in the file, since we haven’t parsed that bean yet).
  • Semantic validation: Validation of all the types of the code. Has UserManager a field “dao” that is of type UserDAO? Does UserDAO has a field named “table” of type string that we can assign the literal “myTable”?
  • Output generation phase: In our case it is going to be a Java object structure ready to be accessed by our DI framework. Other options comprise the generation of a Spring XML file, Guice code, etc.

Basic ANTLR Examples

To develop with ANTLR I recommend ANTLRWorks. This tool is a specialized IDE that supports the definition of ANTLR based languages. Furthermore, with it (and with its built-in debugger!) we’ll be able to test our lexer/parsers.

Let’s start off with a basic example:

grammar Injector;
options {
 output=AST;
}
tokens{
 BEANS;
 BEAN;
 PACKAGE_LIST;
}
@header {
package com.centuryminds.injector;
}
@lexer::header
{
package com.centuryminds.injector;
}

beans	:	beanDef+ -> ^(BEANS beanDef+)
;

beanDef: 'def' Identifier 'of' 'type' packageName -> ^(BEAN Identifier packageName)
;

packageName
	: Identifier ('.' Identifier)*	-> ^(PACKAGE_LIST Identifier+)
;

Identifier    : LETTER ( LETTER | '0'..'9' | '_' )*
;

fragment
LETTER: ('a'..'z' | 'A'..'Z')
;

WS  :  (' '|'\t') {$channel=HIDDEN;}
    ;
NEWLINE : ('\n'|'\r' ) {$channel=HIDDEN;}
    ;

LINE_COMMENT
    : '#' ~('\n'|'\r')* '\r'? NEWLINE {$channel=HIDDEN;}
    ;

This grammar recognizes this kind of source code:

def bean1 of type one.example.Example
def bean2 of type another.example
def bean3 of type NoPackage

This is the AST generated one the code above has been parsed using our grammar:
simplebeans.png

But, let’s take a look at each statement in the grammar file:

First of all, we define the grammar name:

grammar Injector;

It’s vital that the grammar file is named accordingly to the grammar name. In this case the file would be named “Injector.g”

In the next line we define that the output of the parser will be an AST:

options {
 output=AST;
}

These lines:

tokens{
 BEANS;
 BEAN;
 PACKAGE_LIST;
}

describe tokens that have not been read from the input -aka virtual tokens, but generated by the parser. In the example, we create three virtual tokens, one for the Beans list, one for each bean definition and one for a Java package definition.

The following lines set the Java package for the generated Lexer and Parser:

@header {
package com.centuryminds.injector;
}
@lexer::header
{
package com.centuryminds.injector;
}

After this header code, we have two kinds of statements: token definitions and rules definitions. Tokens are recognized by the lexer. Rules are applied by the parser to identify the structure of the source code.

Tokens can be defined in two ways:

  • As a named token:
    WORD: ('a'..'z' | 'A'..'Z')+

    It might be referenced by other Tokens/Rules as to avoid repeated code. The name of the tokens must start with an upper case letter. E.g. “Word” is a valid token identifier, “word” is not ( It will be identified as an action).

  • As an anonymous token inside a rule:
    rule: 'define' 'alias' WORD 'as' WORD

    where ‘define’, ‘alias’ and ‘as’ are anonymous tokens.

We can also define fragment tokens. These are not really tokens by themselves, but fragments to avoid repetition. For example, LETTER fragment in the following code:

Identifier    : LETTER ( LETTER | '0'..'9' | '_' )*
;

fragment
LETTER: ('a'..'z' | 'A'..'Z')
;

Oftentimes, we don’t want some tokens to be visible for the parser, so that it doesn’t have to deal with them. In this case, we can ignore them or locate them inside a different token channel. There is a special channel named HIDDEN that maintains the information but hides it to the parser. In our example we hide the whitespaces, new lines and comments.

WS  :  (' '|'\t') {$channel=HIDDEN;}
    ;
NEWLINE : ('\n'|'\r' ) {$channel=HIDDEN;}
    ;

LINE_COMMENT
    : '#' ~('\n'|'\r')* '\r'? NEWLINE {$channel=HIDDEN;}
    ;

Finally, we define the rules:

beans	:	beanDef+ -> ^(BEANS beanDef+)
;

beanDef: 'def' Identifier 'of' 'type' packageName -> ^(BEAN Identifier packageName)
;

packageName
	: Identifier ('.' Identifier)*	-> ^(PACKAGE_LIST Identifier+)
;

A rule has the following syntax for AST generation:

ruleName : ruleAction  -> ^(ROOT_ELEMENT child1 child2 ...);

For example:

beanDef: 'def' Identifier 'of' 'type' packageName -> ^(BEAN Identifier packageName);

Here you are a rule named beanDef that recognizes “def myBean of type a.b.c”-like strings. In order to recognize the package name it invokes a sub-rule called packageName that in turn returns a sub-AST. Once we parse the structure, we build its corresponding AST with the virtual token BEAN as root, Identifier as the first child and packageName sub tree as the second one.

Of course a rule can have more options: rule parameters, return types, init blocks, etc. I recommend to take a look at the documentation, or await the next tutorial :).

Arbitrary Java code can be attached at any moment of the parsing:

beanDef: 'def' {System.out.println("after def");}
              Identifier  {System.out.println("Identifier: "+ $Identifier.text);}
              'of' 'type' packageName  {System.out.println("end of the rule");}
               -> ^(BEAN Identifier packageName);

The ANTLR tokens and rules can be accessed from this Java code: 1) The token text can be retrieved in the form $Identifier.text. 2) The AST corresponding to the sub-rule using $packageName.start

I’m wrapping it up. In the second part of this tutorial we will see the complete grammar file and the pre-semantic phase. Meanwhile, you can try to complete the grammar as an exercise.

Links:

2 Comments

  1. kanthi swaroop:

    Now thats interesting. Typical applications of ANTLR would be in parsing domain. My first interaction with ANTLR was for a data generation application. We had this system which consume huge xml streams. The schema types were defined via xsi and testing the code for all conditions was inhumane! Even coming up with data for functional flows was painful. So the solution was to comeup with a template of the xml (using schema) and another template for possible datatypes. Parse the template, create a inmemory tree and walk the tree several times (using custom java code). We picked up a random domain value for substitution for each walk and ended up having a comprehensive data sets. There were certain correlation problems though and it was a different story. I really wish, you wrote the article back then ;)

  2. Good ANTLR Resource - Adam Jordens@littlesquare:~/:

    [...] of others, if you’re looking to get started with ANTLR, here’s a useful introduction blog post. A quick search on dzone would have saved some trial and [...]

Leave a comment