Tasks about PEG parsers

Once I wanted to make a parsing contest for Codeforces. Created two types of assignments:

  1. An informal description of the language in which the grammar is to be created is given (for example, “a language with correct bracket sequences”);

  2. Examples of strings in the language for which the grammar needs to be restored are given.

The problem of the first type of task: different people will understand an informal description in different ways, so it is not the ability to compose grammars that is tested, but the ability to understand task descriptions. A formal description will not work, because it suggests the correct grammar.

The problem of the second type of assignment is the limited number of examples. You can “not catch” what the structure of the language looks like. It is also possible to create a grammar that simply contains all the given examples. How to appeal is not clear.

I ended up doing game program CrateGramin which you can solve tasks of the second type, while checking the strings for belonging to a guessed language.

We introduce three definitions that will come in handy later:

L language – many lines.
Grammar of the L language – a function of type (input: string) -> bool, which returns true if the input string is included in the L language, false otherwise.
AST (Abstract syntax tree or abstract syntax tree) is a hierarchical structure into which the string is translated after parsing. The nodes of the tree correspond to the rules of the grammar, and the leaves correspond to the parts of the parsed string.

caterpillar logic

Caterpillar logic is a game where you have to guess the rule that the snakes match (I was inspired by this game). “Correct” snakes are placed in the valid column, and incorrect snakes are placed in the invalid column. Snakes consist of segments of 4 colors. The game is that we make up new snakes from segments, and the program says whether they fit the rule or not. Thus, the pattern can be understood. To pass the level, you need to distribute 15 snakes into columns (these 15 snakes are generated by the game when we click “I know the rule and ready for the test!”).

caterpillar logic

caterpillar logic

In essence, caterpillar logic is the same as black box. What I like about this approach is that the player can check combinations of segments against the rule.

However, there is a small chance of guessing the rule (\frac{1}{2^{15}}). I don’t like this, so in order to pass the level in CrateGram, you don’t need to specify the correctness of 15 sequences, but formally describe the rule through PEG. This rule will be tested on several tests.

What is Parsing Expression Grammar?

Parsing Expression Grammar (or PEG) is a grammar “format” that uses regular expression operators. List of operators:

"abc" - точное совпадение строки abc
[abc] - совпадение одного из символов в скобках (или промежутка символов, если записаны через `-`)
e* - от 0 до бесконечности совпадений e
e+ - от 1 до бесконечности совпадений e
&e - e должно находиться далее
!e - e не должно находиться далее

e1 e2 - выражения e1 и e2 должны идти друг за другом
e1 / e2 - если выражение e1 не парсится, тогда парсим e2
r = ... - создание нового правила с названием r

Simple example:

// в каждой грамматике должно быть стартовое правило, с которого начинается парсинг
// здесь оно названо `root`
root = "2" digit digit digit
digit = [0-9]

This is an example of a grammar for the language of all numbers from 2000 to 2999. To match the root rule, the string must begin with a two. Three repetitions of the rule follow. digitwhich describes a single digit.

Amendment

In general, for the grammar to be correct, you need to add !. at the end of the rule root. This addition means that there should be no characters after the fourth digit (. – means any character).

There is a more well-known way of specifying grammars: EBNF or Extended Backus-Naur form (there is also just BNF). Its difference from PEG is that one line can be processed in different ways (give several ASTs). There is a well-known dangling else problem. The bottom line is that for nested if-then-else, the subsequent else can refer to both the outer if and the inner one.

Case study for EBNF

Let the expression `if A then if B then C else D` be given. There are two possible AST options for it:

"root": [
  "if ",
  "A",
  " then ",
  [
	  "if ",
	  "B",
	  " then ",
	  "C",
	  " else ",
	  "D"
  ]
]

or

"root": [
  "if ",
  "A",
  " then ",
  [
	  "if ",
	  "B",
	  " then ",
	  "C",
  ],
  " else ",
  "D"
]

PEG does not have this, grammar and expression parsing is unambiguous. And the PEG parser is easier to write.

Parsing for PEG

Let

root = expr !.
expr = "if " ([A-Z] / expr) " then " ([A-Z] / expr) (" else " ([A-Z] / expr))?
// AST для `if A then if B then C else D`
{
  "root": {
    "expr": [
      "if ",
      "A",
      " then ",
      {
        "expr": [
          "if ",
          "B",
          " then ",
          "C",
          " else ",
          "D"
        ]
      },
      ""
    ]
  }
}

Tasks in the image and likeness of snakes

This is how the CrateGram interface looks like

This is how the CrateGram interface looks like

IN CrateGram at each level, you need to create a grammar that will only accept valid strings. As in Caterpillar logic, the player can check the validity of strings (add test button, below you can write a test).

This is done simply: there is a hidden grammar that validates the user’s input. To check the answer (the player’s grammar), there are several test strings that check the user’s grammar. Test strings are made like this: a string is randomly generated, checked by a hidden grammar, and then put into valid or invalid depending on the check.

Initially, I wanted to generate valid strings, but for this I need to get rid of the operators ! And &. Is there an algorithm for doing this? (section 4), but I did not master while tests are a fairly reliable way to check.

Other features of CrateGram

There is also a Playground where you can simply create and check grammars, as in PEG.js. Advantage over PEG.js in detailed AST tree. It will contain the names of the rules.

Lack of primitive error messages (and most likely performance). If you create a grammar root = rootthen PEG.js will write Line 1, column 8: Possible infinite loop when parsing (left recursion: root -> root). CrateGram will crash already when writing a test.

In the future, for the AST, it will be possible to add expanded/ignored rules with an underscore as in Lark or unary minus to remove from AST as in better-parse.

conclusions

When I used PEG parsers, it seemed to me that it was a complex program that I could not understand. When I wrote my PEG parser, this feeling disappeared, along with the blind spot in understanding parsing algorithms.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *