ANTLR Difficulties: Writing a Ruby Grammar

At Rostelecom-Solar, we are developing a static code analyzer for vulnerabilities and undocumented features (NDV), which works, among other things, on parse trees. To build them, we use an optimized version of ANTLR4, a tool for developing compilers, interpreters, and translators.

In the official repository of ANTLR grammars you can find grammars for many programming languages. However, it lacks a Ruby grammar, which apparently no one has implemented. There is only a grammar of a similar home-grown language that parses only the simplest cases. This is not surprising: a Ruby grammar is difficult to implement, since the language has a non-trivial syntax. But such a grammar would be very useful to anyone who, for example, wants to write their own language and is thinking about how to do it, or who needs to solve the technical difficulties discussed in this article. Well, we'll have to write a new grammar, which is what we'll do right here.

ANTLR proposes splitting language analysis into a lexer and a parser.

The lexer generates tokens from specified sequences of characters in the language's alphabet. If a sequence of characters matches the definitions of several tokens, the longest match is chosen; among matches of equal length, the first by priority (which is given by the order of definition) wins.
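
A minimal illustration (not part of our Ruby grammar; the rule names are ours): on the input do, both rules below match two characters, so Do wins as the first declared; on the input dot, Identifier wins as the longer match.

Do:         'do';
Identifier: [a-z]+;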

The parser forms sentences (complete commands) of the language from the tokens (also called terminal symbols) produced by the lexer.

In general, writing a grammar for Ruby is no different from writing one for other languages. You can learn how from the author's book, the documentation, or one of the numerous tutorials.

In this article, the basic implementation is omitted; only the technical difficulties of writing the grammar are considered.

Lexer problems

In general, the only real problem in a lexer is emitting a valid token for a given sequence of characters, along with the obstacles that come with it.

Interpolation

Some strings in Ruby allow interpolation: inserting arbitrary code inside them with the #{code} syntax. For example:

a = 3
"Hello #{if a%2==1 then "Habr!" else "World!" end}"

We will cope with the help of modes: dedicated "mini-lexers" designed for a specific task, such as parsing a string:

DoubleQuote: '"' {++nestedStringLevel;}    -> pushMode(InterpolationString);

At each nesting level, you need to track the number of open braces, because of situations like the following (here the interpolation must be exited at the second closing curly brace):

"Kappa #{
    buf=""
    [1, 2, 3].each { |x| buf += x.to_s }
    buf
}"

To do this, let's create a stack openedCurlyBracesInsideString. In total, we have this token inside the mode:

StartInterpolation: '#{' {openedCurlyBracesInsideString.push(0);}    -> pushMode(DEFAULT_MODE);
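
The actions above rely on lexer members that the article leaves implicit; a minimal sketch of what they could look like (the names match the actions, the declarations themselves are our assumption):

@lexer::members {
    // Depth of nesting into string-like modes (strings, %-literals, heredocs).
    public int nestedStringLevel = 0;
    // One counter per interpolation level: the number of currently
    // unbalanced '{' braces inside it.
    public final java.util.Stack<Integer> openedCurlyBracesInsideString = new java.util.Stack<>();
}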

Now we need to exit the normal mode (DEFAULT_MODE) at the right moment; for this, we add code to the brace tokens:

OpenCurlyBracket:             '{' {
    // Inside an interpolation, count one more unbalanced open brace.
    if (nestedStringLevel > 0 && !openedCurlyBracesInsideString.empty()) {
        final int buf = openedCurlyBracesInsideString.pop();
        openedCurlyBracesInsideString.push(buf + 1);
    }
};
CloseCurlyBracket:            '}' {
    // No unbalanced braces left: this '}' closes the interpolation itself.
    if (nestedStringLevel > 0 && !openedCurlyBracesInsideString.empty()
            && openedCurlyBracesInsideString.peek() <= 0) {
        popMode();
        openedCurlyBracesInsideString.pop();
    } else {
        // Otherwise it merely closes an ordinary block or hash literal.
        if (!openedCurlyBracesInsideString.empty()) {
            final int buf = openedCurlyBracesInsideString.pop();
            openedCurlyBracesInsideString.push(buf - 1);
        }
    }
};
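
On the Kappa example above, the counter evolves as follows (a walkthrough in comments):

// "Kappa #{  -> DoubleQuote pushes the string mode, '#{' pushes a counter of 0
// .each {    -> OpenCurlyBracket: the counter goes 0 -> 1
// to_s }     -> CloseCurlyBracket: the counter is 1 > 0, so it goes 1 -> 0
// }"         -> CloseCurlyBracket: the counter is 0, popMode() returns to the string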

%-notations

Ruby has additional, Perl-inspired syntax for writing strings, arrays of strings and symbols (which in Ruby are not characters in the usual sense), regular expressions, and shell commands. The syntax of these commands is % followed by an optional type identifier and a delimiter character. For example: %w|a b c| is an array of three strings. However, paired brackets can also be used as delimiters: {} [] () <>. Simply listing all possible identifiers won't work; then, for example, the sequence

q = 3
5%q

would not be recognized correctly. The lexer would simply eat the longest string of characters, producing the token %q.

By creating a check for the absence of an obviously invalid delimiter (such as a whitespace character, a digit, or a letter) and adding it to the token as a predicate (a condition that must hold at a certain point for the construction of the token to continue),

StringArrayConstructorwToken: '%w' {canBeDelimiter()}?;

we get protection that works almost always (more on that later). We also add an expectation of the corresponding delimiter, depending on the alternative:

StartArrayConstructorNotInterpolated
    : StringArrayConstructorwToken ( Brackets {setPairDelimiter();} | ~[(<{[] {setCharDelimiter();} ) {startStringMode(ArrayConstructorw);}
    ;

fragment Brackets: '(' | '[' | '{' | '<';

where startStringMode is a utility function for switching modes and maintaining nesting:

public void startStringMode(final int mode) {
    pushMode(mode);
    ++nestedStringLevel;
}
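
The delimiter helpers used above (canBeDelimiter, setPairDelimiter, setCharDelimiter, isDelimiter) are not shown in the article; a minimal sketch of how they might work, assuming a single currentDelimiter field:

private char currentDelimiter;

// A delimiter must not be a whitespace character, a digit, or a letter.
public boolean canBeDelimiter() {
    final char c = (char) _input.LA(1);
    return !Character.isLetterOrDigit(c) && !Character.isWhitespace(c);
}

// For a non-bracket delimiter the closing character equals the opening one,
// which is the last character matched into the token so far.
public void setCharDelimiter() {
    final String txt = getText();
    currentDelimiter = txt.charAt(txt.length() - 1);
}

// For brackets the closing delimiter is the paired one.
public void setPairDelimiter() {
    switch (getText().charAt(getText().length() - 1)) {
        case '(': currentDelimiter = ')'; break;
        case '[': currentDelimiter = ']'; break;
        case '{': currentDelimiter = '}'; break;
        case '<': currentDelimiter = '>'; break;
    }
}

public boolean isDelimiter() {
    return (char) _input.LA(1) == currentDelimiter;
}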

A counterexample: 5%q|1 is parsed correctly only with the parser's context, when it is known that no string can follow the 5 in the expression.

One might think it would be enough to look ahead for the paired delimiter, but there is a counterexample for that case too: 5%q|1|1. From this it becomes clear that, with the split into a lexer and a parser, such a case cannot be parsed.

However, this happens very rarely, so ¯\_(ツ)_/¯ will do. Inside the mode, we just wait for the delimiter:

ArraywWhitespace: WhitespaceAll                           -> skip;
ArraywText:       ({!isDelimiter()}? ArraywTextFragment)+ -> type(StringPart);
ArraywEnd:        . {nestedStringLevel--;}                -> type(HereDocEnd), popMode;

where type changes the type of the produced tokens, for convenience.

Division or start of regex

The syntax of regular expressions is /regexp/ (as well as the aforementioned percent notation). The same parser-context problem arises as in the previous section; to mitigate it, we create a check

public boolean canBeRegex() {
    // A regex may start after a whitespace (but must not be followed by one),
    // after a newline, after an operator, or right after '#{'.
    return isPrevWS && " \t\r\u000B\f\b\n".indexOf((char) _input.LA(1)) == -1 || isPrevNL || isOp || prevNonWsType == StartInterpolation;
}

and add it to the token:

Divide:                       '/' {
    if (canBeRegex()) {
        startStringMode(RegExp);
    }
};

To recompute the variables isOp, isPrevNL and isPrevWS, we also override the emit function, where the token is finally created:

@Override
public void emit(final Token token) {
    final String txt = token.getText();
    final int type = token.getType();
    isPrevNL = type == NL;
    isPrevWS = type == WS;
    if (!isPrevWS && !isPrevNL && type != SingleLineComment && type != MultiLineComment) {
        isOp = OPERATORS.contains(type);
        prevNonWsChar = txt.charAt(txt.length() - 1);
        prevNonWsType = type;
    }
    super.emit(token);
}

where OPERATORS is a hash set of all operator token types.
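
For completeness, the fields used by canBeRegex and emit, and the operator set itself, might be declared like this (an assumed sketch; the operator token names here are hypothetical):

private boolean isPrevWS = false;
private boolean isPrevNL = true;   // the start of the file behaves like "after a newline"
private boolean isOp = false;
private char prevNonWsChar;
private int prevNonWsType;

private static final java.util.Set<Integer> OPERATORS = new java.util.HashSet<>(
        java.util.Arrays.asList(Plus, Minus, Mul, Divide /* ... one entry per operator token */));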

Parser problems

Whitespace characters

Quite an unpleasant surprise was the effect of whitespace on parsing. Usually it does not affect the grammar at all and is simply removed from the stream with -> skip or sent to another channel. However, a number of languages still distinguish certain constructs with its help.

For example, a+3 and a + 3 cannot be a function call without parentheses, while a +3 can. Therefore, all parser rules look like this (NL is a newline, WS a whitespace):

    | expression WS* op=('+' | '-') (NL | WS)* expression

To mitigate the problem, we write a listener that cleans our parse tree of such debris:

public class RemovingTokensListener implements ParseTreeListener {
        private List<Integer> unwantedTokens;

        ...

        @Override
        public void visitTerminal(final TerminalNode node) {
            if (this.unwantedTokens.contains(node.getSymbol().getType())) {
                ((ParserRuleContext) node.getParent().getRuleContext()).removeLastChild();
            }
        }
}
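
The listener is attached before parsing, so tokens are removed while the tree is still being built and removeLastChild always sees the just-added terminal. A sketch of the wiring (the listener constructor and the program entry rule are our assumptions):

final RubyLexer lexer = new RubyLexer(CharStreams.fromString(source));
final RubyParser parser = new RubyParser(new CommonTokenStream(lexer));
parser.addParseListener(new RemovingTokensListener(Arrays.asList(RubyParser.WS, RubyParser.NL)));
final ParseTree tree = parser.program();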

Heredoc: lexer and parser

A special syntax for defining multi-line strings. It can look like this:

<<-HEREDOC
Hello world!
HEREDOC

or even like this (amusingly, markdown does not recognize this syntax correctly):

value = 123
print <<-HEREDOC
Value: #{value}
HEREDOC

The problem is that you need to understand where the string ends, and it would also be convenient to know which content belongs to which string. To do this, we first create a list of heredocs pending parsing (the parser, depending on the context and mode, may load an arbitrary number of tokens ahead):

public final HeredocsHolder HEREDOCS = new HeredocsHolder();

public static final class HereDocEntry {
    public final String name;
    public final HereDocType type;
    public final boolean isInterpolated;
    public int partsNumber;

    public HereDocEntry(final String name, final HereDocType type, final boolean isInterpolated) {
        this.name = name;
        this.type = type;
        this.isInterpolated = isInterpolated;
        this.partsNumber = 0;
    }
}

public static final class HeredocsHolder {
    public final List<HereDocEntry> entries;
    public int toProcess;

    HeredocsHolder() {
        this.entries = new ArrayList<>();
        this.toProcess = 0;
    }
}

and add entries to it as they appear:

StartHereDoc
    : HereDocToken HereDocName {
        heredocIdentifier = getHeredocIdentifier('\'');
        setText(getText().trim());
        keepHereDoc(HereDoc, false);
    }
    ;

public void keepHereDoc(final int mode, final boolean isInterpolated) {
    HEREDOCS.entries.add(new HereDocEntry(heredocIdentifier, getHereDocType(), isInterpolated));
    ++HEREDOCS.toProcess;
    isFirstNL = true;
}
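
The actions above and below use lexer fields that the article never declares; roughly (an assumed sketch):

private String heredocIdentifier;                    // identifier of the heredoc being opened
private boolean isFirstNL = false;                   // first newline after a heredoc start
private boolean insideHeredoc = false;
private int heredocTokensCount = 0;                  // tokens seen in the current heredoc body
private HereDocEntry currentHeredoc;
private java.util.ListIterator<HereDocEntry> currentHeredocIt;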

Then, having seen the end of a line while heredocs are pending, we call the processing function:

NL:                           '\n' {
    final int next = _input.LA(1);
    if (HEREDOCS.toProcess > 0 && isFirstNL) {
        startHereDocRoutine();
        isFirstNL = false;
        insideHeredoc = true;
    }
};

The processing function itself is shown below. Here we handle only the most recent heredocs (the lexer may have run ahead, but the parser has not yet consumed those tokens, so we save the information):

public void startHereDocRoutine() {
    final int stopIdx = HEREDOCS.entries.size() - HEREDOCS.toProcess;
    // Push modes in reverse order, so that the mode of the first pending
    // heredoc ends up on top of the mode stack and is handled first.
    for (int i = HEREDOCS.entries.size() - 1; i >= stopIdx; --i) {
        if (HEREDOCS.entries.get(i).isInterpolated) {
            pushMode(HereDocInterpolated);
        } else {
            pushMode(HereDoc);
        }
    }
    nestedStringLevel += HEREDOCS.toProcess;
    currentHeredocIt = HEREDOCS.entries.listIterator(HEREDOCS.entries.size() - HEREDOCS.toProcess);
    currentHeredoc = currentHeredocIt.next();
}

We need to override nextToken to exit the mode and to count the tokens of each heredoc:

@Override
public Token nextToken()
{
    final CommonToken token = (CommonToken)super.nextToken();
    final int ttype = token.getType();
    if (insideHeredoc && ttype == StartInterpolation) {
        ++heredocTokensCount;
    }
    if (_mode == HereDoc || _mode == HereDocInterpolated) {
        if (ttype == VarName) {
            ++heredocTokensCount;
        } else if (ttype == StringPart) {
            ++heredocTokensCount;
            final String txt = token.getText();
            if (CheckHeredocEnd(txt)) {
                token.setText(txt.trim());
                token.setType(HereDocEnd);
                nestedStringLevel--;
                popMode();
                currentHeredoc.partsNumber = heredocTokensCount;
                if (currentHeredocIt.hasNext()) {
                    currentHeredoc = currentHeredocIt.next();
                }
                heredocTokensCount = 0;
                --HEREDOCS.toProcess;
                if (_mode == DEFAULT_MODE) {
                    insideHeredoc = false;
                }
            }
        }
    }
    return token;
}
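
CheckHeredocEnd is not shown in the article; a plausible sketch is that it compares the trimmed text of the part against the saved identifier:

private boolean CheckHeredocEnd(final String txt) {
    return txt.trim().equals(currentHeredoc.name);
}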

Now let's move on to the parser.

Through @parser::members we add to the parser: a reference to the lexer; the string nodes into which we will later move their content; the number of interpolation nodes (by analogy with the lexer's heredocTokensCount); and a stack of statements with a flag indicating the need for processing.

    private final RubyLexer lexer = (RubyLexer)_input.getTokenSource();
    private final List<ParserRuleContext> CONTEXTS = new ArrayList<>();
    private final List<Integer> heredocRulesCount = new ArrayList<>();
    private final Stack<StatementEntry> statements = new Stack<>();

    private static final class StatementEntry {
        public boolean needProcess;
        public int currentHeredoc;

        StatementEntry() {
            this.needProcess = false;
            this.currentHeredoc = 0;
        }
    }

Let's modify the statement rule itself:

statement
@init {
    statements.push(new StatementEntry());
}
@after {
    if (statements.peek().needProcess) {
        processHeredocs($ctx);
    }
    statements.pop();
}
    : statementWithoutHeredocTail ({statements.peek().needProcess}? interpolatedStringPart* HereDocEnd {++statements.peek().currentHeredoc;})*
    ;

@init is the code executed when the parser enters the rule, and @after when it leaves.

The stack is needed to assign heredocs to the right statement, since new ones may appear inside an interpolation.

To correctly recognize the cases where a heredoc can be an ordinary expression and where consecutive tokens can be treated as a string, as well as the complex cases where the end of the string lies beyond the current statement, we write the following parser code:

string:
...
    | StartInterpolatedHereDoc (memberAccess* WS* NL interpolatedStringPart* HereDocEnd)? {
        if ($ctx.HereDocEnd() == null) {
            CONTEXTS.add($ctx);
            heredocRulesCount.add(0);
            statements.peek().needProcess = true;
        } else {
             lexer.HEREDOCS.entries.remove(0);
        }
    }
...

To actually count the interpolation nodes, we modify the rule for the string content (the + 2 here accounts for the tokens that open and close the interpolation):

interpolatedStringPart
locals[int nodesCount = 0]
    : StringPart
    | VarName
    | StartInterpolation (WS* statement {++$nodesCount;})* (WS* rawStatement {++$nodesCount;})? WS* '}' {
        final int cur = statements.peek().currentHeredoc;
        if (cur < heredocRulesCount.size()) {
            heredocRulesCount.set(cur, heredocRulesCount.get(cur) + $nodesCount + 2);
        }
    }
    ;

where locals is a list of local variables (referenced via $). Whitespace tokens are not counted, since they are removed during tree construction by our listener.

Now let's write the processHeredocs function itself. First, count how many nodes need to be picked up:

int heredocNodesCount = 0;
for (int i = 0; i < CONTEXTS.size(); ++i) {
    heredocNodesCount += lexer.HEREDOCS.entries.get(i).partsNumber;
    heredocNodesCount += heredocRulesCount.get(i);
}

Then determine starting from which child we begin moving the string content into place:

int currentChild = ctx.getChildCount() - heredocNodesCount;

And hook the content onto the corresponding node:

for (int i = 0; i < CONTEXTS.size(); ++i) {
    final RubyLexer.HereDocEntry entry = lexer.HEREDOCS.entries.remove(0);
    final ParserRuleContext currentContext = CONTEXTS.get(i);
    final int currentNodesCount = entry.partsNumber + heredocRulesCount.get(i);
    for (int j = 0; j < currentNodesCount; ++j, ++currentChild) {
        final ParseTree child = ctx.getChild(currentChild);
        if (child instanceof TerminalNode) {
            ((TerminalNodeImpl) child).setParent(currentContext);
            currentContext.addChild((TerminalNodeImpl) child);
        } else if (child instanceof ParserRuleContext) {
            ((ParserRuleContext) child).setParent(currentContext);
            currentContext.addChild((ParserRuleContext) child);
        } else {
            // parser failed
            clear();
            return;
        }
    }
}

We clean up the node itself, the heredoc contexts, and the list of interpolation node counts:

for (int i = 0; i < heredocNodesCount; ++i) {
    ctx.removeLastChild();
}
clear();

private void clear() {
    CONTEXTS.clear();
    heredocRulesCount.clear();
}

The final touch is to remove the unnecessary intermediate rule used for heredoc processing, statementWithoutHeredocTail, by re-attaching the node's children to its ancestor with the same kind of listener:

public class RemovingRulesListener implements ParseTreeListener {
    private List<Integer> unwantedRules;

    ...

    @Override
    public void exitEveryRule(final ParserRuleContext ctx) {
        if (this.unwantedRules.contains(ctx.getRuleIndex())) {
            final ParserRuleContext parentCtx =
                    (ParserRuleContext) ctx.getParent().getRuleContext();
            parentCtx.children.remove(ctx);
            ctx.children.forEach(
                    child -> {
                        if (child instanceof RuleContext) {
                            ((RuleContext) child).setParent(parentCtx);
                            parentCtx.addChild((RuleContext) child);
                        } else if (child instanceof TerminalNode) {
                            ((TerminalNodeImpl) child).setParent(parentCtx);
                            parentCtx.addChild((TerminalNodeImpl) child);
                        }
                    }
            );
        }
    }
}
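
It is attached in the same way as the token-removing listener, for example (the constructor is assumed as before; RULE_statementWithoutHeredocTail is the rule index constant that ANTLR generates):

parser.addParseListener(new RemovingRulesListener(
        Arrays.asList(RubyParser.RULE_statementWithoutHeredocTail)));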

Ambiguity

A problem that remained unsolved is distinguishing function calls from, for example, ordinary addition (and likewise any symbol that is simultaneously a unary and a binary operation); it can only be solved by using symbol tables at runtime.

The bottom line is that on an input like a +a +a +a, at each step there can be either an ordinary addition or a call of a function without arguments (although in that case Ruby requires that there be no space between the sign and the first argument), which apparently leads to exponential growth of walks along the prediction graph.

However, simply forbidding ANTLR a space after a unary operator in this context will not work: we would have to rewrite the left-recursive expression rule by hand, whereas ANTLR otherwise resolves left recursion for us automatically. Relying on the fact that "nobody" writes more than thirty terms in a row without good reason, this problem disappears.

Conclusion

Experimentally, this grammar can parse 99% of files.

For example, aws-sdk-ruby, containing 3024 Ruby files, fails on only seven of them; fastlane, containing 1028, on two; and Ruby on Rails, with 2081, on 19.

However, there are still fundamentally sore points, like heredocs included inside an expression:

option(:sts_regional_endpoints,
  default: 'legacy',
  doc_type: String,
  docstring: <<-DOCS) do |cfg|
Passing in 'regional' to enable regional endpoint for STS for all supported
regions (except 'aws-global'), defaults to 'legacy' mode, using global endpoint
for legacy regions.
  DOCS
  resolve_sts_regional_endpoints(cfg)
end

or blocks of any type used as expressions:

def test_group_by_with_order_by_virtual_count_attribute
    expected = { "SpecialPost" => 1, "StiPost" => 2 }
    actual = Post.group(:type).order(:count).limit(2).maximum(:comments_count)
    assert_equal expected, actual
end if current_adapter?(:PostgreSQLAdapter)

I hope the techniques described in this article will help you cope with the difficulties of your own grammar. If you think any of these problems can be solved better, welcome to the comments.

Author: Fedor Usov, developer of Solar appScreener
