Parser on stacks and lambda expressions (Axolotl)
In September, I published an article describing the theory behind a parser based on the Shunting Yard algorithm. This article is a hands-on sequel that walks through one implementation of that parser. In the near future, I plan to publish several more articles covering other implementations built on the same theory. The implementation presented here uses lambda expressions; in my opinion it is the simplest one, but also the least efficient.
The code examples are written in Java.
Lexical analyzer
Within the scope of this article, the lexical analyzer will not be implemented. To understand the parser, a few interfaces and an enum are enough.
Token
At the parsing stage, we will only need the token type. If the language has no ; sign, you can also add access to the line number where the token is located.
Token type
Tokens should be broken down by type: various operators, literals, keywords, and so on. Token types, in turn, are divided into groups.
Token Type Group
It is convenient to divide token types into the following groups:
delimiter – separators: for example, parentheses, commas, semicolons, etc.
operator – operators: for example, the assignment operator, plus, minus, etc.
keyword – key and reserved words that partially define the syntax of the language; they cannot be used as names of variables, functions, etc.
identifier – names of variables and functions.
literal – numbers, strings, characters and, in general, any constant values.
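To make the grouping concrete, here is a minimal sketch of how the groups could be attached to an enum of token types. All names here (TokenTypeGroup, group(), and the specific constants) are illustrative assumptions, not the article's actual code:

```java
// Illustrative sketch: each token type carries the group it belongs to.
enum TokenTypeGroup { DELIMITER, OPERATOR, KEYWORD, IDENTIFIER, LITERAL }

enum TokenType {
    IF(TokenTypeGroup.KEYWORD),
    ELSE(TokenTypeGroup.KEYWORD),
    LEFT_PARENT(TokenTypeGroup.DELIMITER),
    RIGHT_PARENT(TokenTypeGroup.DELIMITER),
    ASSIGN(TokenTypeGroup.OPERATOR),
    NAME(TokenTypeGroup.IDENTIFIER),
    NUMBER(TokenTypeGroup.LITERAL);

    private final TokenTypeGroup group;

    TokenType(TokenTypeGroup group) { this.group = group; }

    public TokenTypeGroup group() { return group; }
}
```

Storing the group on the type itself lets the parser ask questions like "is this token an operator?" without enumerating every operator type.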
Interface for interacting with the tokenizer
public interface TokenStream {
    @NonNull Token next();
    Token get();
    boolean hasNext();
}
Token next() — returns the token at the current position and advances to the next one.
Token get() — returns the token at the current position without moving the pointer.
boolean hasNext() — returns true if there is a next token.
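As a sanity check, a list-backed implementation of this interface might look like the following. ListTokenStream and the Token record are illustrative assumptions, and the @NonNull annotation is omitted to keep the sketch self-contained:

```java
import java.util.List;

// Minimal stubs so the sketch compiles on its own.
enum TokenType { IF, ELSE }
record Token(TokenType type) {}

interface TokenStream {
    Token next();
    Token get();
    boolean hasNext();
}

// A list-backed TokenStream: next() consumes a token, get() only peeks.
class ListTokenStream implements TokenStream {
    private final List<Token> tokens;
    private int position = 0;

    ListTokenStream(List<Token> tokens) { this.tokens = tokens; }

    @Override public Token next() { return tokens.get(position++); }   // return current, then advance
    @Override public Token get() { return tokens.get(position); }      // peek without advancing
    @Override public boolean hasNext() { return position < tokens.size(); }
}
```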
This completes the description of the lexical analyzer. Next, let's move on to syntactic analysis.
Parser architecture
The previous article looked at three stacks: the statement stack, the state stack, and the results stack. An implementation with lambda expressions will only use the state stack.
Basic elements of a parser
A state in this analyzer is a handler of some syntax structure or part of it.
The parser includes:
The control part, which invokes the states and makes sure that states do not repeat themselves cyclically.
The states, each containing the basic method for analyzing part of a structure.
The state controller, which stores patterns and frequently used states.
Basic implementation rules
1. The control part only adds the initial state to the stack.
2. The same state must not appear at the top of the stack twice in a row.
3. A state must either remove itself from the stack or push new states onto the top.
Implementation of the control part
private final TokenStream stream;
private final Stack<State> states = new Stack<>();

public @NonNull FileNode analyze() {
    FileNode file = new FileNode();
    states.push(new FileState(states));
    while (!states.isEmpty()) {
        /*
         * Check whether the state has repeated itself and
         * whether the states have looped among themselves.
         */
        states.peek().analyze();
    }
    return file;
}
The checks described in the comment are needed only if you cannot guarantee that all states are implemented correctly (that is, always; but for the purposes of this article we will omit this point).
In addition, this implementation allows error recovery. For example, you can wrap the call to analyze() in a try-catch and remove the failing state when an error occurs (in an implementation with error recovery, rule No. 1 must be adjusted). Before doing this, the state itself should try to recover from the error, if possible.
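A sketch of such a recovering driver loop, under the assumption of a ParseException type and an error list (both names are mine, not the article's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

interface State { void analyze(); }

class ParseException extends RuntimeException {
    ParseException(String message) { super(message); }
}

// Driver loop with error recovery: a failing state is recorded and dropped,
// and parsing continues with the remaining states on the stack.
class RecoveringDriver {
    final Stack<State> states = new Stack<>();
    final List<ParseException> errors = new ArrayList<>();

    void run() {
        while (!states.isEmpty()) {
            try {
                states.peek().analyze();
            } catch (ParseException e) {
                errors.add(e);  // remember the error for later reporting
                states.pop();   // drop the failing state and keep going
            }
        }
    }
}
```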
Implementing states and a state controller
Let's define the State interface for further work:
public interface State {
void analyze();
}
In the implementation of the state controller, we will write several static methods:
void custom(State state) — adds the given state to the stack; the state then removes itself.
void body(Consumer<List<Node>> result) — adds a state for processing a function body (for example, variable declarations and conditions) and passes the result through result.
void expression(Consumer<Expression> result) — adds a state for processing expressions and passes the result through result.
The body() and expression() methods, and their corresponding states, involve many subtleties; a separate article will be devoted to them later.
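To show how custom() can satisfy rule No. 3 mechanically, here is a sketch of the controller. The shared static stack and the stubbed bodies of body() and expression() are assumptions made to keep the example self-contained; the real body and expression states are out of scope here:

```java
import java.util.List;
import java.util.Stack;
import java.util.function.Consumer;

interface State { void analyze(); }
class Node {}
class Expression {}

final class StateController {
    static final Stack<State> states = new Stack<>();

    // Pushes a wrapper that removes itself before running the given state.
    // This satisfies rule No. 3: the state leaves the stack, and analyze()
    // is still free to push new states on top.
    static void custom(State state) {
        states.push(() -> {
            states.pop();
            state.analyze();
        });
    }

    // In the real parser these would push dedicated body/expression states;
    // here they are stubbed to show only the wiring of the result callbacks.
    static void body(Consumer<List<Node>> result) {
        custom(() -> result.accept(List.of()));
    }

    static void expression(Consumer<Expression> result) {
        custom(() -> result.accept(new Expression()));
    }
}
```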
Example implementation of if-else
Let's look at the implementation of the simplest if-else, assuming we already have states for parsing expressions and code blocks. To satisfy rules No. 2 and No. 3, if-else is broken down into several states. We add the states in reverse order, taking the stack into account:
1. Process TokenType.IF and TokenType.LEFT_PARENT.
2. Push a state for expression onto the stack, directing the result into the conditional-construct object.
3. Push a state for body, directing the result into the conditional-construct object.
4. Check for TokenType.ELSE. If else is found, push body, directing the result into the conditional-construct object.
5. Pass the finished object out through a lambda expression.
Writing the actions in reverse order, we get the following implementation of the state:
public static void analyze(Consumer<IfStatement> result) {
    IfStatement ifStatement = new IfStatement();
    StateController.custom(() -> result.accept(ifStatement));
    StateController.custom(() -> {
        if (analyzer.boolEat(TokenType.ELSE))
            StateController.body(ifStatement::setElseBody);
    });
    StateController.body(ifStatement::setBody);
    StateController.custom(() -> analyzer.eat(TokenType.RIGHT_PARENT));
    StateController.custom(() -> {
        analyzer.eat(TokenType.IF);
        analyzer.eat(TokenType.LEFT_PARENT);
        StateController.expression(ifStatement::setExpression);
    });
}
@NonNull Token eat(TokenType type) — returns the current token and advances to the next one if the type matches; if the type does not match, it throws an error.
boolean boolEat(TokenType type) — returns true and advances to the next token if the type matches; otherwise it returns false.
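A possible implementation of these two helpers on top of the TokenStream interface might look like this. The Analyzer class name, the stub types, and the exception type are assumptions for illustration:

```java
import java.util.List;

// Minimal stubs so the sketch compiles on its own.
enum TokenType { IF, ELSE, LEFT_PARENT, RIGHT_PARENT }
record Token(TokenType type) {}

interface TokenStream {
    Token next();
    Token get();
    boolean hasNext();
}

class ListTokenStream implements TokenStream {
    private final List<Token> tokens;
    private int position = 0;
    ListTokenStream(List<Token> tokens) { this.tokens = tokens; }
    @Override public Token next() { return tokens.get(position++); }
    @Override public Token get() { return tokens.get(position); }
    @Override public boolean hasNext() { return position < tokens.size(); }
}

class Analyzer {
    private final TokenStream stream;
    Analyzer(TokenStream stream) { this.stream = stream; }

    // Consume and return the current token; fail loudly on a type mismatch.
    Token eat(TokenType type) {
        if (!stream.hasNext() || stream.get().type() != type)
            throw new IllegalStateException("Expected " + type);
        return stream.next();
    }

    // Consume the current token only if its type matches; report success.
    boolean boolEat(TokenType type) {
        if (stream.hasNext() && stream.get().type() == type) {
            stream.next();
            return true;
        }
        return false;
    }
}
```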
Results
We have ended up with a simple, somewhat inefficient implementation that is easy to scale and to extend with error recovery. Rather than naming the implementations only by their features, I would like to give each one a distinctive name. Because error recovery is so easy to add, this implementation can be compared to the axolotl, an animal famous for its regeneration.