Parser on stacks and lambda expressions (Axolotl)

In September, I published an article describing the theory behind a parser based on the shunting-yard algorithm. This article is a hands-on sequel that walks through the implementation of that parser. In upcoming articles I plan to describe other implementations built on the same theory. This one uses lambda expressions and is, in my opinion, the simplest but least efficient.

Code examples are provided in Java.

Lexical analyzer

Within the scope of this article, the lexical analyzer will not be implemented. To understand the parser, a few interfaces and an enum are enough.

Token

At the parsing stage, we will only need the token type. If the language has no `;` sign, you can also add access to the line number where the token is located.
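
A minimal sketch of such a token interface (the getType() and getLine() accessor names are my own choice, not fixed by the article):

public interface Token {

    // The only thing the parser itself needs.
    @NonNull TokenType getType();

    // Optional: the line where the token occurs, useful for diagnostics.
    int getLine();
}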

Token type

Tokens should be broken down by type: various operators, literals, keywords, and so on. Token types are in turn divided into groups.

Token Type Group

It is convenient to divide token types into the following groups (an enum sketch follows the list):

  • delimiter — separators, for example parentheses, commas, semicolons, etc.

  • operator — operators, for example the assignment operator, plus, minus, etc.

  • keyword — key and reserved words that partially define the syntax of the language. They cannot be used as names of variables, functions, etc.

  • identifier — names of variables and functions.

  • literal — numbers, strings, characters and, in general, any constant values.
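
As a rough sketch of how the groups might be attached to token types — the specific constants here are illustrative; only the ones used later in this article (IF, ELSE, LEFT_PARENT, RIGHT_PARENT) come from the text:

public enum TokenTypeGroup {
    DELIMITER, OPERATOR, KEYWORD, IDENTIFIER, LITERAL
}

public enum TokenType {
    // delimiters
    LEFT_PARENT(TokenTypeGroup.DELIMITER),
    RIGHT_PARENT(TokenTypeGroup.DELIMITER),
    COMMA(TokenTypeGroup.DELIMITER),
    SEMICOLON(TokenTypeGroup.DELIMITER),
    // operators
    ASSIGN(TokenTypeGroup.OPERATOR),
    PLUS(TokenTypeGroup.OPERATOR),
    MINUS(TokenTypeGroup.OPERATOR),
    // keywords
    IF(TokenTypeGroup.KEYWORD),
    ELSE(TokenTypeGroup.KEYWORD),
    // identifiers and literals
    IDENTIFIER(TokenTypeGroup.IDENTIFIER),
    NUMBER(TokenTypeGroup.LITERAL),
    STRING(TokenTypeGroup.LITERAL);

    private final TokenTypeGroup group;

    TokenType(TokenTypeGroup group) {
        this.group = group;
    }

    public TokenTypeGroup getGroup() {
        return group;
    }
}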

Interface for interacting with the tokenizer

public interface TokenStream {

    @NonNull Token next();

    Token get();

    boolean hasNext();
}
  • Token next() — returns the token at the current position and advances to the next one.

  • Token get() — returns the token at the current position without moving the pointer.

  • boolean hasNext() — returns true if there is a next token.
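
For experimentation, a minimal TokenStream over a prepared list of tokens might look like this (a sketch, assuming the lexer has already produced the full token list):

import java.util.List;

public class ListTokenStream implements TokenStream {

    private final List<Token> tokens;
    private int position = 0;

    public ListTokenStream(List<Token> tokens) {
        this.tokens = tokens;
    }

    @Override
    public @NonNull Token next() {
        return tokens.get(position++); // return the current token and advance
    }

    @Override
    public Token get() {
        return tokens.get(position);   // peek without advancing
    }

    @Override
    public boolean hasNext() {
        return position < tokens.size();
    }
}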

This completes the description of the lexical analyzer. Next, let's move on to syntactic analysis.

Parser architecture

The previous article looked at three stacks: the statement stack, the state stack, and the results stack. An implementation with lambda expressions will only use the state stack.

Basic elements of a parser

A state in this analyzer is a handler of some syntax structure or part of it.

The parser includes:

  1. The control part, which runs the states and ensures they do not repeat cyclically.

  2. The states, each containing the basic method for analyzing part of a structure.

  3. The state controller, which stores patterns and frequently used states.

Basic implementation rules

  1. The control part only adds the initial state to the stack.

  2. The same state must not appear at the top of the stack twice in a row.

  3. The state must either remove itself from the stack or add new ones to the top.

Implementation of the control part

private final TokenStream stream;

private final Stack<State> states = new Stack<>();

public @NonNull FileNode analyze() {
    FileNode file = new FileNode();
    states.push(new FileState(states));

    while (!states.isEmpty()) {
        /*
         * Check that the state at the top of the stack has not
         * repeated and that states have not looped among themselves.
         */
        states.peek().analyze();
    }

    return file;
}

The checks described in the comment are necessary only if you cannot guarantee that all states are implemented correctly (that is, that they always follow the rules above; for the purposes of this article we will omit these checks).
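
One possible form of that check, as a sketch: it assumes a hypothetical position() accessor on the token stream exposing the current token index, which the TokenStream interface above does not define. It only catches immediate repetition; detecting longer cycles between states would need more bookkeeping.

private State lastState = null;
private int lastPosition = -1;

private void analyzeGuarded() {
    State current = states.peek();
    // If the same state is on top and no token has been consumed,
    // the parser has stopped making progress and would loop forever.
    if (current == lastState && stream.position() == lastPosition)
        throw new IllegalStateException("Parser is stuck in state " + current);
    lastState = current;
    lastPosition = stream.position();
    current.analyze();
}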

In addition, this implementation lends itself to error recovery. For example, you can wrap the call to a state's analyze() in a try-catch and remove the state from the stack when an error occurs (in an implementation with error recovery, rule No. 1 has to be amended). Before resorting to this, the state itself should recover from errors internally where possible.
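
A minimal sketch of that recovery loop (error collection and reporting elided):

while (!states.isEmpty()) {
    try {
        states.peek().analyze();
    } catch (RuntimeException e) {
        // The state could not recover on its own:
        // drop it and let the rest of the stack continue.
        states.pop();
        // report or collect the error here
    }
}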

Implementing states and a state controller

Let's define the State interface for further work. Since it declares a single method, it can be implemented with a lambda expression:

@FunctionalInterface
public interface State {

    void analyze();
}

In the implementation of the state controller, we will write several static methods:

  • void custom(State state) — pushes the given state onto the stack; after executing, the state removes itself.

  • void body(Consumer<List<Node>> result) — adds a state for processing a function body (for example, variable declarations and conditionals) and passes the result through result.

  • void expression(Consumer<Expression> result) — adds a state for processing expressions and passes the result through result.

The body() and expression() methods and their corresponding states involve many subtleties; a separate article will be devoted to them later.
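
To make custom() concrete, here is a sketch of one way the controller could implement it; the static wiring to the parser's state stack is an assumption of this sketch, and body() and expression() are deliberately left out, as noted above.

import java.util.Stack;

public final class StateController {

    // Assumed: the parser hands its state stack to the controller once.
    private static Stack<State> states;

    public static void init(Stack<State> stack) {
        states = stack;
    }

    public static void custom(State state) {
        // Wrap the state so it removes itself before executing,
        // satisfying rule No. 3.
        states.push(() -> {
            states.pop();
            state.analyze();
        });
    }
}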

Example implementation of if-else

Let's look at the implementation of the simplest if-else, assuming we already have states for parsing expressions and code blocks. To satisfy rules 2 and 3, if-else breaks down into several states. Because it's a stack, we add the states in reverse order:

  1. Process TokenType.IF and TokenType.LEFT_PARENT.

  2. Push a state for expression, directing the result into the conditional-statement object.

  3. Add a state for body, directing the result into the conditional-statement object.

  4. Check for TokenType.ELSE. If else is found, add body, directing the result into the conditional-statement object.

  5. Pass the finished object on through a lambda expression.

Writing the actions in reverse order, we get the following implementation of the state:

public static void analyze(Consumer<IfStatement> result) {
    IfStatement ifStatement = new IfStatement();

    // Pushed in reverse: the state added last runs first.
    StateController.custom(() -> result.accept(ifStatement));           // 5. pass the finished object on
    StateController.custom(() -> {
        if (analyzer.boolEat(TokenType.ELSE))                           // 4. optional else branch
            StateController.body(ifStatement::setElseBody);
    });
    StateController.body(ifStatement::setBody);                         // 3. if body
    StateController.custom(() -> analyzer.eat(TokenType.RIGHT_PARENT)); // closing ')'
    StateController.custom(() -> {
        analyzer.eat(TokenType.IF);                                     // 1. 'if' and '('
        analyzer.eat(TokenType.LEFT_PARENT);
        StateController.expression(ifStatement::setExpression);         // 2. condition expression
    });
}
  • @NonNull Token eat(TokenType type) — returns the current token and advances to the next one if its type matches. If the type does not match, it throws an error.

  • boolean boolEat(TokenType type) — returns true and advances to the next token if the type matches. If the type does not match, it returns false.
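
Neither helper appears in the article's listings; a minimal sketch of both on top of TokenStream (the stream field and the getType() accessor are assumptions from the sketches above) could be:

public @NonNull Token eat(TokenType type) {
    if (stream.get().getType() != type)
        throw new IllegalStateException(
                "Expected " + type + " but found " + stream.get().getType());
    return stream.next(); // the type matches: consume and return the token
}

public boolean boolEat(TokenType type) {
    if (stream.hasNext() && stream.get().getType() == type) {
        stream.next(); // consume the matching token
        return true;
    }
    return false;
}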

Results

We end up with a simple, if somewhat inefficient, implementation that scales easily and can be upgraded with error recovery. I would like to name the implementations not just by their features but to give each one a unique name. Since error recovery is so easy to add, this implementation can be compared to the axolotl, which is known for its excellent regeneration.

