Basic lexer and parser tests for your Xtext grammar

If you spend some time in the TMF forum, you will notice that every once in a while questions of the following type come up: ‘I want to define a terminal rule for SomeMoreOrLessComplicatedStuff’, ‘I have defined a terminal rule, but I get the error mismatched input …’, ‘I want keyword X to be usable as a name for an entity’, and so on.
This post is not about how to choose between a terminal rule and a datatype rule, or how to come up with good terminal rules (these are non-trivial topics). It is about (unit-)testing whether the grammar reflects your intentions, that is, checking whether a given input actually makes the terminal rule you intended fire (which often enough is not the case) and whether your datatype rules successfully parse the strings you wrote them for. This post will not deal with value conversion.
In particular, if your language is evolving and you introduce new keywords or terminal rules, it is important to have a simple and quick way of validating that the changes did not break anything. Simply having one big sample model file that you open in the generated editor to check for syntax errors is hardly sufficient.
Xtext 1.0.0 comes with a nice entry point for writing your tests: AbstractXtextTests. If you check out the Xtext source code and look at the type hierarchy of this class, you will find plenty of code that can serve as inspiration for tests covering much of the language and editor infrastructure of an Xtext-based project.

Our sample grammar looks as follows:

grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals 
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"

Model: (entities+=Entity)*;

Entity: 'entity' name=ID (extends=[Entity|QualifiedName])? '{'
    (properties+=Property)*
'}';

Property: 'property' name=SPECIAL_ID; 

QualifiedName: ID ('.' ID)*;
terminal SPECIAL_ID: ('A'..'Z')+ ('_' INT)?;

As a side note for those not seeing it right away: the terminal rules SPECIAL_ID and ID overlap. ABC fits both the ID and the SPECIAL_ID pattern, but SPECIAL_ID wins during tokenisation (lexing) because it has higher priority than ID (which is only imported). As a consequence, you cannot name an entity ABC; the Entity rule requires an ID, so ABC would be a syntax error. You cannot name it "entity" either, as that is a keyword in the grammar. If you wanted to allow that, you would have to write name=(ID|SPECIAL_ID|'entity') or introduce a corresponding datatype rule, as sketched below.
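
A minimal sketch of that datatype-rule variant could look like this (the rule name EntityName is made up for illustration and is not part of the sample grammar):

//hypothetical datatype rule accepting IDs, SPECIAL_IDs and the keyword 'entity'
EntityName: ID | SPECIAL_ID | 'entity';

//the Entity rule then uses it for the name assignment
Entity: 'entity' name=EntityName (extends=[Entity|QualifiedName])? '{'
    (properties+=Property)*
'}';

The lexer still turns ABC into a SPECIAL_ID token and entity into a keyword token; the datatype rule merely tells the parser to accept any of the three as an entity name.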

The following class could serve as the basic infrastructure for testing terminal and datatype rules:

public abstract class AbstractBasicLexerAndParserTest extends
    AbstractXtextTests {

  private Lexer lexer;
  private ITokenDefProvider tokenDefProvider;
  private IAntlrParser parser;

  protected Lexer getLexer() {
    return lexer;
  }

  protected ITokenDefProvider getTokenDefProvider() {
    return tokenDefProvider;
  }

  protected IAntlrParser getAntlrParser() {
    return parser;
  }

  @SuppressWarnings("rawtypes")
  abstract Class getStandaloneSetupClass();

  @SuppressWarnings("unchecked")
  @Override
  protected void setUp() throws Exception {
    super.setUp();
    with(getStandaloneSetupClass());
    lexer = get(Lexer.class);
    tokenDefProvider = get(ITokenDefProvider.class);
    parser = get(IAntlrParser.class);
  }

  /**
   * return the list of tokens created by the lexer from the given input
   * */
  protected List<Token> getTokens(String input) {
    CharStream stream = new ANTLRStringStream(input);
    getLexer().setCharStream(stream);
    XtextTokenStream tokenStream = new XtextTokenStream(getLexer(),
        getTokenDefProvider());
    @SuppressWarnings("unchecked")
    List<Token> tokens = tokenStream.getTokens();
    return tokens;
  }

  /**
   * return the name of the terminal rule for a given token
   * */
  protected String getTokenType(Token token) {
    return getTokenDefProvider().getTokenDefMap().get(token.getType());
  }

  /**
   * check whether an input is chopped into a list of expected token types
   * */
  protected void checkTokenisation(String input, String... expectedTokenTypes) {
    List<Token> tokens = getTokens(input);
    assertEquals(input, expectedTokenTypes.length, tokens.size());
    for (int i = 0; i < tokens.size(); i++) {
      Token token = tokens.get(i);
      assertEquals(input, expectedTokenTypes[i], getTokenType(token));
    }
  }

  /**
   * check that an input is not tokenised using a particular terminal rule
   * */
  protected void failTokenisation(String input, String unExpectedTokenType) {
    List<Token> tokens = getTokens(input);
    assertEquals(input, 1, tokens.size());
    Token token = tokens.get(0);
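    //note: assertNotSame only checks object identity, not string content;
    //a content-based check such as assertFalse(unExpectedTokenType.equals(getTokenType(token)))
    //would be more robust (see also the reader comments below the post)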
    assertNotSame(input, unExpectedTokenType, getTokenType(token));
  }

  /**
   * return the parse result for an input given a specific entry rule of the
   * grammar
   * */
  protected IParseResult getParseResult(String input, String entryRule) {
    return getAntlrParser().parse(entryRule, new StringReader(input));
  }

  /**
   * check that the input can be successfully parsed given a specific entry
   * rule of the grammar
   * */
  protected void checkParsing(String input, String entryRule) {
    IParseResult la = getParseResult(input, entryRule);
    assertEquals(input, 0, la.getParseErrors().size());
  }

  /**
   * check that the input cannot be successfully parsed given a specific entry
   * rule of the grammar
   * */
  protected void failParsing(String input, String entryRule) {
    IParseResult la = getParseResult(input, entryRule);
    assertNotSame(input, 0, la.getParseErrors().size());
  }

  /**
   * check that input is treated as a keyword by the grammar
   * */
  protected void checkKeyword(String input) {
    // the rule name for a keyword is usually
    // the keyword enclosed in single quotes
    String rule = new StringBuilder("'").append(input).append("'")
        .toString();
    checkTokenisation(input, rule);
  }

  /**
   * check that input is not treated as a keyword by the grammar
   * */
  protected void failKeyword(String keyword) {
    List<Token> tokens = getTokens(keyword);
    assertEquals(keyword, 1, tokens.size());
    String type = getTokenType(tokens.get(0));
    assertFalse(keyword, type.charAt(0) == '\'');
  }
}

And the following is an actual test class for the above grammar. It should give you an idea of how to use the abstract test class.

public class MyDslLexerAndParserTest extends AbstractBasicLexerAndParserTest {

  @SuppressWarnings("rawtypes")
  @Override
  Class getStandaloneSetupClass() {
    //here you should return the StandaloneSetup class
    //of the language you want to test
    return MyDslStandaloneSetup.class;
  }
  
  //for convenience, define constants for the
  //rule names in your grammar
  //the names of terminal rules are capitalised
  //and prefixed with "RULE_"
  private static final String ID="RULE_ID";
  private static final String SPECIAL_ID="RULE_SPECIAL_ID";
  private static final String INT="RULE_INT";
  private static final String WS="RULE_WS";
  private static final String SL_COMMENT="RULE_SL_COMMENT";
  
  private static final String FQN="QualifiedName";

  
  public void testID(){
    checkTokenisation("a", ID);
    checkTokenisation("abc", ID);
    checkTokenisation("abc123", ID);
    checkTokenisation("abc_123", ID);
    checkTokenisation("^entity", ID);
    
    //fail as entity is a keyword
    failTokenisation("entity", ID);
    //fail as A is a SPECIAL_ID
    failTokenisation("A", ID);
  }
  
  public void testSpecialID(){
    checkTokenisation("A", SPECIAL_ID);
    checkTokenisation("ABC", SPECIAL_ID);
    checkTokenisation("ABC_123", SPECIAL_ID);
    
    //fail as underscore is missing
    failTokenisation("ABC123", SPECIAL_ID);
  }

  public void testSLCOMMENT(){
    checkTokenisation("//comment", SL_COMMENT);
    checkTokenisation("//comment\n", SL_COMMENT);
    checkTokenisation("// comment \t\t comment\r\n", SL_COMMENT);
  }
  
  public void testKeywords(){
    checkKeyword("entity");
    checkKeyword("property");
    checkKeyword(".");
    
    //Entity is not a keyword
    failKeyword("Entity");
  }
  
  public void testTokenSequences(){
    checkTokenisation("123 abc", INT, WS, ID);
    checkTokenisation("123 \t//comment\n abc", INT, WS, SL_COMMENT,WS,ID);
    
    //note that no white space is necessary!
    checkTokenisation("123abc", INT, ID);
  }
  
  public void testQualifiedName(){
    checkParsing("abc.d", FQN);
    //note that white spaces and comments are hidden
    //so they are allowed within qualified names
    //if you don't want that, you have to start the rule
    //definition with QualifiedName hidden():
    //thereby making all tokens visible to the rule
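    //i.e. roughly: QualifiedName hidden(): ID ('.' ID)*;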
    checkParsing("abc   .   d", FQN);
    checkParsing("abc /*comment*/  .\t\n//comment\n  d", FQN);
    
    //fail as ABC is a SPECIAL_ID
    failParsing("ABC.d", FQN);
  }
  
  //this test has nothing to do with the blog post topic
  //but it illustrates a simple way for unit testing your
  //language, querying the instantiated model
  public void testModel(){
    Model m;
    try {
      //missing names
      m=(Model) getModelAndExpect("entity{}entity{}", EXPECT_ERRORS);
      int entityCount=m.getEntities().size();
      assertEquals(2, entityCount);
    
      m=(Model)getModelAndExpect("entity name{}", 0);
      String name=m.getEntities().get(0).getName();
      assertEquals("name", name);

      //errors expected: the entity name must be an ID, not a SPECIAL_ID
      m=(Model)getModelAndExpect("entity NAME{property p:PNAME}", EXPECT_ERRORS);
      int propertyCount=m.getEntities().get(0).getProperties().size();
      assertEquals(1, propertyCount);
      
    } catch (Exception e) {
      //an unexpected exception should fail the test rather than be swallowed
      fail(e.getMessage());
    }
  }
}

Now, what do you think the token types created for 123ABC_123ab //comment will be? If you think "nothing simpler than that", add the corresponding test; and if you were wrong, improve the fail messages in the abstract class so that they are more helpful for finding your mistake.
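
If the default assertion messages turn out not to be very helpful for this exercise, one option (just a sketch, assuming you simply replace the existing method in the abstract class) is to let checkTokenisation report the actual tokenisation in its failure message:

  protected void checkTokenisation(String input, String... expectedTokenTypes) {
    List<Token> tokens = getTokens(input);
    //build a readable summary of the actual tokenisation for the failure message
    StringBuilder actual = new StringBuilder();
    for (Token token : tokens) {
      actual.append(getTokenType(token)).append("(").append(token.getText()).append(") ");
    }
    String message = "input '" + input + "' was tokenised as: " + actual;
    assertEquals(message, expectedTokenTypes.length, tokens.size());
    for (int i = 0; i < tokens.size(); i++) {
      assertEquals(message, expectedTokenTypes[i], getTokenType(tokens.get(i)));
    }
  }

A failing test then tells you directly which terminal rules fired for the input, instead of only reporting a mismatch in the number or type of tokens.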

3 Responses to “Basic lexer and parser tests for your Xtext grammar”

  1. Loek Cleophas Says:

    It seems your example code has an error in failTokenisation; the line assertNotSame(input, unExpectedTokenType, getTokenType(token));
    does not take into account that two different string objects can have the same string content…

  2. Alexander Nittka Says:

    Back when I wrote the post, the method worked for me as posted, but it may well be that a different assert method should be used.

  3. Barrie Treloar Says:

    An updated version for Xtext 2 is available at

    http://baerrach.blogspot.com.au/2012/12/lexer-and-parsers-tests-for-xtext.html
