# Parser combinators in Python

## DISCLAIMER

This article uses Python generators, so the reader is expected to be familiar with them a little more than just using `yield`. It should also be noted that some examples are simplified for the sake of compactness of the narrative.

## TLDR

The article offers a look at the experience of developing a combinator parser for Python, which resulted in a library PGPC for developing parsers in Python. The library was inspired Parsec.
Emulation is of particular interest. `do`-notations via Python generators, hence the name of the library: Python Generator-based Parser Combinator library.

An example of using the library to parse a phrase `"Hello, World!"`:

``````@topology
def parse_word(word: str):
parsed = []
for c in word:
letter = yield char(c)
parsed.append(letter)

return parsed

text = "Hello, World!"
parser = parse_word(text)
print(parser(Scanner(text, Position())))
``````

## Motivation

The task of parsing text appeared at the dawn of programming languages, and during this time a rigorous mathematical apparatus was created for it. The classic approach to analyzing the syntax and semantics of an input sequence involves developing:

• lexer – a program that breaks a stream into tokens or lexemes;

• parser – a program that tries to form some kind of language construction based on the input sequence of tokens.

This approach is described in detail in the book with a dragon “Compilers. Principles, approaches and tools.” by A.V. Aho, M.S. Lam, R. Seti, D.D. Ulman.

On the other hand, the parser can be viewed as a function that:

• the input string associates a certain object,

• returns the rest of the string that was not parsed:

For Haskell, the parser type looks like this: `String -> (T, String)`.

Based on the idea that a parser is a function, and functions are usually combined into compositions, parser combinators appeared. The point is to build more complex ones from small parsers using the composition of parsers.

## Parser combinators

The combinator parser is a powerful and flexible approach to text parsing that has its advantages over classical approaches:

• expressive and readable code: parser combinators allow you to compose parsers through a combination of simple parsers, resulting in code that is close to BNF form (Backus-Naur form);

• compositionality: parser combinators strive for composition by nature, in the sense that one can easily assemble a larger, more complex parser from smaller parsers. Composition allows you to develop parsers incrementally, repeatedly reusing existing parsers to parse different structures of the target language;

• streaming parsing: parser combinators can work in conditions where the input sequence is formed in real time;

• independence from the programming language: the combinator parser can be implemented in any language that supports higher-order functions;

• rapid prototyping: parser combinators are great for prototyping, allowing you to quickly start working on a parser without the need for complicated workspace setup;

• testing: each parser consists of simpler parsers, which means that each parser can be easily tested.

Parser combinators offer a flexible and intuitive approach that makes parser development easier and improves code readability. They are most convenient for working with DSL languages ​​and various data formats.

## Development

The development of the generator parser will be carried out in Python with the active use of Type Hints.

### Basic abstractions

The general idea is that a parser is a function from an input sequence into an object of an arbitrary type. So, after reading the input string, you can form some object.
The main abstractions are:

• `Position` – current position of the parser in the input sequence;

• `Scanner` – a class parameterized by the type of the element of the input sequence, for abstraction by the input sequence;

• `Parser` – a class parameterized by the type of the object being parsed, representing the parser itself. To use objects `Parser` in the context of functions, in the class overridden `__call__` method.

#### position

Class `Position` allows you to abstract the position in the input sequence where the previous parser stopped. It is expected that the position can take many forms, for example, when parsing a file, it is necessary to remember the line and column in addition to the absolute offset relative to the beginning of the file.

``````class Position:
def __init__(self, offset: int = 0):
self.__offset = offset

def increase(self, offset: int):
self.__offset += offset
``````

#### Scanner

Class `Scanner`, parameterized by the element type of the input sequence, allows you to abstract the input sequence. The standard class interface includes methods:

• `mark` – remember the current position to return to it later;

• `reset` – return to the last memorized position;

• `advance_if` – move the current position forward if the current element of the input sequence passes the test according to the predicate passed to the method. As a result, either the current element itself, which passed the test, is returned, or `None`

``````class Scanner(Generic[ELEMENT]):
def __init__(self, content: Sequence[ELEMENT], pos: Position):
self.__pos: Position = pos
self.__content: Sequence[ELEMENT] = content
self.__marks: List[Position] = []

def mark(self) -> None:
self.__marks.append(self.__pos)

def reset(self) -> None:
if self.__marks:
self.__pos = self.__marks.pop()

@property
def current(self) -> ELEMENT:
if self.__pos.offset >= len(self.content):
raise EOFError(f"No value at {self.__pos.offset}")
result = self.content[self.pos.offset]
return result

def advance_if(self, predicate: Callable[[ELEMENT], bool]) -> ELEMENT | None:
element = self.current
if predicate(element):
self.__pos.increase(1)
return element
else:
return None
``````

#### parser

Class `Parser`, parameterized by the type of the object after parsing, allows you to abstract away the parsing logic. The parser must be used in the context of functions, and therefore it is redefined in it `__call__` method that takes `Scanner` and returns an object of the specified type.

Examples of parser types:

• `Parser[datetime.date]` – a parser that can extract the date from the input sequence;

• `Parser[int]` – a parser that can extract a number from the input sequence.

When creating an object `Parser[V]` a function is passed to the constructor `Callable[[Scanner], V]`The that will be used to extract data from the input string.

The standard parser interface includes a method `and_then`, which allows you to glue two parsers into one. Per input method `and_then` takes a function `consumer` type `Callable[[V], Parser[U]]` and implements the following logic:

• an object `Scanner` is passed as input to the current parser (`self.__parser`), which returns an object of type `V`;

• resulting object of type `V` current parser gets to the input of the function `consumer`;

• function `consumer` returns a new parser which is the result of the method `and_then`.

``````def and_then(self, consumer: Callable[[V], Parser[U]]) -> Parser[U]:
def __with_scanner(scanner: Scanner) -> U:
v = self.__parser(scanner)
return consumer(v)(scanner)

return Parser(__with_scanner)
``````

Method `and_then` is called a monadic binder in functional programming languages, and in `Haskell` And `Idris` is based on it `do`– a notation that allows you to write functional code with callbacks in the form of “flat” code, reminiscent of imperative code.

#### Error processing

If the input sequence cannot be parsed by the parser, then a `Parser.ParserError` an exception that can be handled by standard Python tools.

### Primitive Parsers

#### Parser satisfy

Parser `satisfy` is the main primitive on the basis of which the rest of the parsers are built. It takes the following predicate as input: `Callable[[str], bool]`and the output is a parser `Parser[str]`.

Parsing of the input string succeeds for every element of the input string for which the predicate returns `True`. The element itself is the result of parsing, and `Scanner` moves to the next element in the input sequence.

``````def satisfy(predicate: Callable[[str], bool]) -> Parser[str]:
def __with_scanner(scanner: Scanner) -> str:
if not result:
raise Parser.ParserError(f"Unexpected '{scanner.current}' at {scanner.pos}")
return result

return Parser(__with_scanner)
``````

#### char parser

Parser `char` accepts a character as input and checks whether the current element of the input sequence is equal to the received character.

``````def char(c: str) -> Parser[str]:
return satisfy(lambda x: c == x)
``````

#### ws parser

Parser `ws` checks if the current character is whitespace.

``````def ws() -> Parser[str]:
return satisfy(lambda x: x.isspace())
``````

#### opt parser

Parser `opt` is a parser that takes a parser as input and returns a parser. The peculiarity is that if the passed parser failed to parse the input sequence, then this is not an error. This parser is used to parse optional data that may not be present in the input sequence.

``````def opt(parser: Parser[V]) -> Parser[V | None]:
def __with_scanner(scanner: Scanner) -> V | None:
try:
scanner.mark()
return parser(scanner)
except ValueError:
scanner.reset()
return None

return Parser(__with_scanner)
``````

In case of an error, it is necessary to cancel everything that the passed parser managed to do, for this purpose, before starting the parser, a flag is set to save the state, and in case of an error, the scanner returns to the saved state.

#### parser many

Parser `many` is a parser that takes a parser as input and returns a parser. The peculiarity is that the passed parser is applied to the input sequence until an error is encountered. The result of the parser is a list of successfully parsed objects.

``````def many(parser: Parser[V]) -> Parser[List[V]]:
def __with_scanner(scanner: Scanner) -> List[V]:
result = []
try:
while True:
element = parser(scanner)
result.append(element)
except ValueError:
return result

return Parser(__with_scanner)
``````

#### alt parser

Parser `alt` is a parser that tries to apply two parsers in turn to the input sequence and returns the result of the first successful parser. The parser is implemented as a method `alt` class `Parser`:

``````class Parser(Generic[V]):

...

def alt(self, parser: 'Parser[U]') -> 'Parser[V | U]':
def __with_scanner(scanner: Scanner) -> V:
try:
scanner.mark()
return self.__parser(scanner)
except ValueError:
scanner.reset()
return parser(scanner)

return Parser(__with_scanner)
``````

### An example of using parsers

Develop a phone number parser. Phone number format requirements:

1. starts at `79` or `89`,

2. may optionally be present at the beginning `+`,

3. code (`912`, `964`etc) can be inside brackets,

4. there can be hyphens between numeric groups,

5. there are no gaps.

#### Solution

The solution will be implemented taking into account the following ideas:

• method `and_then` allows you to “glue” parsers into one big parser: apply the first parser, followed by the second, third, etc.;

• method `and_then` takes a lambda function that takes the result of the previous parser as the first argument;

• the result of the lambda function will also be a parser, but this new parser can also call the method `and_then`;

• keep calling `and_then` can be unlimited many times;

• thanks to the closure the most recent call `and_then` will have access to all the results of previous parsers.

A nice bonus is that error handling happens inside the combinator parser itself, which leads to more or less clean business logic code.

``````def phone_number(phone_number: str) -> str:
phone_parser: Parser[List[str]] = opt(char('+')) \
.and_then(lambda _plus: char('7').alt('8')
.and_then(lambda _seven_or_eight: opt(char('('))
.and_then(lambda _: char('9')
.and_then(lambda _nine: digit()
.and_then(lambda code1: digit()
.and_then(lambda code2: opt(char(')'))
.and_then(lambda _: digit()
.and_then(lambda d1: digit()
.and_then(lambda d2: digit()
.and_then(lambda d3: opt(char('-'))
.and_then(lambda _: digit()
.and_then(lambda d4: digit()
.and_then(lambda d5: opt(char('-'))
.and_then(lambda _: digit()
.and_then(lambda d6: digit()
.map(lambda d7: ['+', '7', '(', '9', code1, code2, ')', d1, d2, d3, '-', d4, d5, '-', d6, d7])
)))))))))))))))

result = phone_parser(Scanner(phone_number, Position()))
return "".join(result)

print(phone_number("+7(964)111-11-11"))
print(phone_number("8964222-2222"))
print(phone_number("7(964)333-3333"))
``````

### Generators

#### Idea

The code from the example above, although it solves the problem, has a lot of syntactic noise associated with the peculiarities of the call. `and_then`. In functional programming languages, the method `and_then` is called a monadic binder, and, for example, Haskell or Idris implements `do`– a notation that allows you to turn code with a large number of callbacks into flat code, similar in form to imperative.

Python does not have the concept of a monadic binder, nor does it `do`-notations, but there are generators. Generators offer two-way communication between client and generator through `send` method:

• call result `send` is the value passed to `yield` inside the generator

• input value `send` is passed inside the generator and is the result `yield`.

Thus, it is possible to implement work with parsers through `yield`:

• passed value – parser,

• the return value is the result of parsing the parser.

#### Implementation

For the convenience of the client, the functions that will build parsers based on generators will be wrapped with a decorator `@topology`.

The decorator implements a wrapper around the parser, in which:

• the original parser function is called to get the generator object,

• a function is called to run the generator `next`which will return the first passed in `yield` parser,

• the current object is passed to the parser `Scanner`and the output is the result of parsing the parser,

• in an infinite loop, the result of parsing the previous parser is passed to the generator,

• exception `StopIteration` appears when the original function calls `return`,

• the result of the original function lies in the field `value` exceptions `StopIteration`,

• field value `value` exceptions `StopIteration` will be the result of the final parser.

``````def topology(parser_builder) -> Callable[..., Parser[V]]:
def wrapper(*args: Any, **kwargs: Any) -> Parser[V]:
gen = parser_builder(*args, **kwargs)

def __with_scanner(scanner: Scanner) -> V:
parser = next(gen)
try:
while True:
result = parser(scanner)
parser = gen.send(result)
except StopIteration as e:
return e.value

return Parser(__with_scanner)

return wrapper
``````

The disadvantage of the decorator is the fact that the return type will change from `Generator[Any, Parser[Any], V]` on `Parser[V]`which can lead to false positives `mypy`.

#### Example

Implement the task with phone numbers from the previous section using `@topology` decorator as follows:

``````@topology
def phone_number_gen():
yield opt(char('+'))

yield char('7').alt(char('8'))

yield opt(char('('))
yield char('9')
code1 = yield digit()
code2 = yield digit()
yield opt(char(')'))

d1 = yield digit()
d2 = yield digit()
d3 = yield digit()

yield opt(char('-'))

d4 = yield digit()
d5 = yield digit()

yield opt(char('-'))

d6 = yield digit()
d7 = yield digit()

return "".join(['+', '7', '(', '9', code1, code2, ')', d1, d2, d3, '-', d4, d5, '-', d6, d7])

def phone_number_gen_parser(phone_number: str) -> str:
parser: Parser[str] = phone_number_gen()
return parser(Scanner(phone_number, Position()))

print(phone_number_gen_parser("+7(964)111-11-11"))
print(phone_number_gen_parser("8964222-2222"))
print(phone_number_gen_parser("7(964)333-3333"))
``````

### Accessing the current position

During parsing, it may be necessary to refer to the current position. The current position is stored in an object `Scanner`and access to the object `Scanner` executed during parsing. This means that getting the current position can also be made an object of type `Parser[Position]`in other words getting the position is the parser:

``````def position() -> Parser[Position]:
def __with_scanner(scanner: Scanner) -> Position:
return copy(scanner.pos)

return Parser(__with_scanner)
``````

#### Example

Use parser `position()` you can do it just like any other parser: through `and_then` or `@topology`.

The example below remembers the current position at the beginning of the parser process and at the end, and both of these values ​​are eventually returned to the client.

``````@topology
def parse_text(text: str):
start = yield position()

for c in text:
yield char(c)

end = yield position()
return start.offset, end.offset

value = "Hello, World!"
parser_hello_world = parse_text(value)
print(parser_hello_world(Scanner(value, Position())))
``````

## Conclusion

The article demonstrated an approach to implementing a combinator parser based on the capabilities of the Python language, which is expressive enough for developing a combinator parser. When combined with generators, a high degree of abstraction can be achieved while maintaining code readability. The experience of developing a combinator parser led to the creation of a library PGPCPython Generator-based Parser Combinator library. The library is only at the very beginning of its development, and it lacks most of the traditional parser combinators, so the addition of new parsers from third-party developers is welcome. Look at the library for inspiration Parsec.

## Sources

1. https://stepik.org/lesson/42245/step/1 – the chapter on parsers in the Haskell course by D.N. Moskvina

2. https://github.com/haskell/parsec – combinator parser library `Parsec`