Parser combinators in Python

DISCLAIMER

This article uses Python generators, so the reader is expected to be familiar with them a little more than just using yield. It should also be noted that some examples are simplified for the sake of compactness of the narrative.

TLDR

The article offers a look at the experience of developing a combinator parser for Python, which resulted in a library PGPC for developing parsers in Python. The library was inspired Parsec.
Emulation is of particular interest. do-notations via Python generators, hence the name of the library: Python Generator-based Parser Combinator library.

An example of using the library to parse a phrase "Hello, World!":

@topology
def parse_word(word: str):
  parsed = []
  for c in word:
    letter = yield char(c)
    parsed.append(letter)

  return parsed

text = "Hello, World!"
parser = parse_word(text)
print(parser(Scanner(text, Position())))

Motivation

The task of parsing text appeared at the dawn of programming languages, and during this time a rigorous mathematical apparatus was created for it. The classic approach to analyzing the syntax and semantics of an input sequence involves developing:

  • lexer – a program that breaks a stream into tokens or lexemes;

  • parser – a program that tries to form some kind of language construction based on the input sequence of tokens.

This approach is described in detail in the book with a dragon “Compilers. Principles, approaches and tools.” by A.V. Aho, M.S. Lam, R. Seti, D.D. Ulman.

On the other hand, the parser can be viewed as a function that:

  • the input string associates a certain object,

  • returns the rest of the string that was not parsed:

For Haskell, the parser type looks like this: String -> (T, String).

Based on the idea that a parser is a function, and functions are usually combined into compositions, parser combinators appeared. The point is to build more complex ones from small parsers using the composition of parsers.

Parser combinators

The combinator parser is a powerful and flexible approach to text parsing that has its advantages over classical approaches:

  • expressive and readable code: parser combinators allow you to compose parsers through a combination of simple parsers, resulting in code that is close to BNF form (Backus-Naur form);

  • compositionality: parser combinators strive for composition by nature, in the sense that one can easily assemble a larger, more complex parser from smaller parsers. Composition allows you to develop parsers incrementally, repeatedly reusing existing parsers to parse different structures of the target language;

  • streaming parsing: parser combinators can work in conditions where the input sequence is formed in real time;

  • independence from the programming language: the combinator parser can be implemented in any language that supports higher-order functions;

  • rapid prototyping: parser combinators are great for prototyping, allowing you to quickly start working on a parser without the need for complicated workspace setup;

  • testing: each parser consists of simpler parsers, which means that each parser can be easily tested.

Parser combinators offer a flexible and intuitive approach that makes parser development easier and improves code readability. They are most convenient for working with DSL languages ​​and various data formats.

Development

The development of the generator parser will be carried out in Python with the active use of Type Hints.

Basic abstractions

The general idea is that a parser is a function from an input sequence into an object of an arbitrary type. So, after reading the input string, you can form some object.
The main abstractions are:

  • Position – current position of the parser in the input sequence;

  • Scanner – a class parameterized by the type of the element of the input sequence, for abstraction by the input sequence;

  • Parser – a class parameterized by the type of the object being parsed, representing the parser itself. To use objects Parser in the context of functions, in the class overridden __call__ method.

position

Class Position allows you to abstract the position in the input sequence where the previous parser stopped. It is expected that the position can take many forms, for example, when parsing a file, it is necessary to remember the line and column in addition to the absolute offset relative to the beginning of the file.

class Position:
    def __init__(self, offset: int = 0):
        self.__offset = offset

    def increase(self, offset: int):
        self.__offset += offset

Scanner

Class Scanner, parameterized by the element type of the input sequence, allows you to abstract the input sequence. The standard class interface includes methods:

  • mark – remember the current position to return to it later;

  • reset – return to the last memorized position;

  • advance_if – move the current position forward if the current element of the input sequence passes the test according to the predicate passed to the method. As a result, either the current element itself, which passed the test, is returned, or None

class Scanner(Generic[ELEMENT]):
    def __init__(self, content: Sequence[ELEMENT], pos: Position):
        self.__pos: Position = pos
        self.__content: Sequence[ELEMENT] = content
        self.__marks: List[Position] = []

    def mark(self) -> None:
        self.__marks.append(self.__pos)

    def reset(self) -> None:
        if self.__marks:
            self.__pos = self.__marks.pop()

    @property
    def current(self) -> ELEMENT:
        if self.__pos.offset >= len(self.content):
            raise EOFError(f"No value at {self.__pos.offset}")
        result = self.content[self.pos.offset]
        return result

    def advance_if(self, predicate: Callable[[ELEMENT], bool]) -> ELEMENT | None:
        element = self.current
        if predicate(element):
            self.__pos.increase(1)
            return element
        else:
            return None

parser

Class Parser, parameterized by the type of the object after parsing, allows you to abstract away the parsing logic. The parser must be used in the context of functions, and therefore it is redefined in it __call__ method that takes Scanner and returns an object of the specified type.

Examples of parser types:

  • Parser[datetime.date] – a parser that can extract the date from the input sequence;

  • Parser[int] – a parser that can extract a number from the input sequence.

When creating an object Parser[V] a function is passed to the constructor Callable[[Scanner], V]The that will be used to extract data from the input string.

The standard parser interface includes a method and_then, which allows you to glue two parsers into one. Per input method and_then takes a function consumer type Callable[[V], Parser[U]] and implements the following logic:

  • an object Scanner is passed as input to the current parser (self.__parser), which returns an object of type V;

  • resulting object of type V current parser gets to the input of the function consumer;

  • function consumer returns a new parser which is the result of the method and_then.

def and_then(self, consumer: Callable[[V], Parser[U]]) -> Parser[U]:
    def __with_scanner(scanner: Scanner) -> U:
        v = self.__parser(scanner)
        return consumer(v)(scanner)

    return Parser(__with_scanner)

Method and_then is called a monadic binder in functional programming languages, and in Haskell And Idris is based on it do– a notation that allows you to write functional code with callbacks in the form of “flat” code, reminiscent of imperative code.

Error processing

If the input sequence cannot be parsed by the parser, then a Parser.ParserError an exception that can be handled by standard Python tools.

Primitive Parsers

Parser satisfy

Parser satisfy is the main primitive on the basis of which the rest of the parsers are built. It takes the following predicate as input: Callable[[str], bool]and the output is a parser Parser[str].

Parsing of the input string succeeds for every element of the input string for which the predicate returns True. The element itself is the result of parsing, and Scanner moves to the next element in the input sequence.

def satisfy(predicate: Callable[[str], bool]) -> Parser[str]:
    def __with_scanner(scanner: Scanner) -> str:
        result = scanner.advance_if(predicate)
        if not result:
            raise Parser.ParserError(f"Unexpected '{scanner.current}' at {scanner.pos}")
        return result

    return Parser(__with_scanner)

char parser

Parser char accepts a character as input and checks whether the current element of the input sequence is equal to the received character.

def char(c: str) -> Parser[str]:
    return satisfy(lambda x: c == x)

ws parser

Parser ws checks if the current character is whitespace.

def ws() -> Parser[str]:
    return satisfy(lambda x: x.isspace())

opt parser

Parser opt is a parser that takes a parser as input and returns a parser. The peculiarity is that if the passed parser failed to parse the input sequence, then this is not an error. This parser is used to parse optional data that may not be present in the input sequence.

def opt(parser: Parser[V]) -> Parser[V | None]:
    def __with_scanner(scanner: Scanner) -> V | None:
        try:
            scanner.mark()
            return parser(scanner)
        except ValueError:
            scanner.reset()
            return None

    return Parser(__with_scanner)

In case of an error, it is necessary to cancel everything that the passed parser managed to do, for this purpose, before starting the parser, a flag is set to save the state, and in case of an error, the scanner returns to the saved state.

parser many

Parser many is a parser that takes a parser as input and returns a parser. The peculiarity is that the passed parser is applied to the input sequence until an error is encountered. The result of the parser is a list of successfully parsed objects.

def many(parser: Parser[V]) -> Parser[List[V]]:
    def __with_scanner(scanner: Scanner) -> List[V]:
        result = []
        try:
            while True:
                element = parser(scanner)
                result.append(element)
        except ValueError:
            return result

    return Parser(__with_scanner)

alt parser

Parser alt is a parser that tries to apply two parsers in turn to the input sequence and returns the result of the first successful parser. The parser is implemented as a method alt class Parser:

class Parser(Generic[V]):

    ...

    def alt(self, parser: 'Parser[U]') -> 'Parser[V | U]':
        def __with_scanner(scanner: Scanner) -> V:
            try:
                scanner.mark()
                return self.__parser(scanner)
            except ValueError:
                scanner.reset()
                return parser(scanner)

        return Parser(__with_scanner)

An example of using parsers

Task

Develop a phone number parser. Phone number format requirements:

  1. starts at 79 or 89,

  2. may optionally be present at the beginning +,

  3. code (912, 964etc) can be inside brackets,

  4. there can be hyphens between numeric groups,

  5. there are no gaps.

Solution

The solution will be implemented taking into account the following ideas:

  • method and_then allows you to “glue” parsers into one big parser: apply the first parser, followed by the second, third, etc.;

  • method and_then takes a lambda function that takes the result of the previous parser as the first argument;

  • the result of the lambda function will also be a parser, but this new parser can also call the method and_then;

  • keep calling and_then can be unlimited many times;

  • thanks to the closure the most recent call and_then will have access to all the results of previous parsers.

A nice bonus is that error handling happens inside the combinator parser itself, which leads to more or less clean business logic code.

def phone_number(phone_number: str) -> str:
    phone_parser: Parser[List[str]] = opt(char('+')) \
      .and_then(lambda _plus: char('7').alt('8')
      .and_then(lambda _seven_or_eight: opt(char('('))
      .and_then(lambda _: char('9')
      .and_then(lambda _nine: digit()
      .and_then(lambda code1: digit()
      .and_then(lambda code2: opt(char(')'))
      .and_then(lambda _: digit()
      .and_then(lambda d1: digit()
      .and_then(lambda d2: digit()
      .and_then(lambda d3: opt(char('-'))
      .and_then(lambda _: digit()
      .and_then(lambda d4: digit()
      .and_then(lambda d5: opt(char('-'))
      .and_then(lambda _: digit()
      .and_then(lambda d6: digit()
      .map(lambda d7: ['+', '7', '(', '9', code1, code2, ')', d1, d2, d3, '-', d4, d5, '-', d6, d7])
      )))))))))))))))

    result = phone_parser(Scanner(phone_number, Position()))
    return "".join(result)

print(phone_number("+7(964)111-11-11"))
print(phone_number("8964222-2222"))
print(phone_number("7(964)333-3333"))

Generators

Idea

The code from the example above, although it solves the problem, has a lot of syntactic noise associated with the peculiarities of the call. and_then. In functional programming languages, the method and_then is called a monadic binder, and, for example, Haskell or Idris implements do– a notation that allows you to turn code with a large number of callbacks into flat code, similar in form to imperative.

Python does not have the concept of a monadic binder, nor does it do-notations, but there are generators. Generators offer two-way communication between client and generator through send method:

  • call result send is the value passed to yield inside the generator

  • input value send is passed inside the generator and is the result yield.

Thus, it is possible to implement work with parsers through yield:

  • passed value – parser,

  • the return value is the result of parsing the parser.

Implementation

For the convenience of the client, the functions that will build parsers based on generators will be wrapped with a decorator @topology.

The decorator implements a wrapper around the parser, in which:

  • the original parser function is called to get the generator object,

  • a function is called to run the generator nextwhich will return the first passed in yield parser,

  • the current object is passed to the parser Scannerand the output is the result of parsing the parser,

  • in an infinite loop, the result of parsing the previous parser is passed to the generator,

  • exception StopIteration appears when the original function calls return,

  • the result of the original function lies in the field value exceptions StopIteration,

  • field value value exceptions StopIteration will be the result of the final parser.

def topology(parser_builder) -> Callable[..., Parser[V]]:
    def wrapper(*args: Any, **kwargs: Any) -> Parser[V]:
        gen = parser_builder(*args, **kwargs)

        def __with_scanner(scanner: Scanner) -> V:
            parser = next(gen)
            try:
                while True:
                    result = parser(scanner)
                    parser = gen.send(result)
            except StopIteration as e:
                return e.value

        return Parser(__with_scanner)

    return wrapper

The disadvantage of the decorator is the fact that the return type will change from Generator[Any, Parser[Any], V] on Parser[V]which can lead to false positives mypy.

Example

Implement the task with phone numbers from the previous section using @topology decorator as follows:

@topology
def phone_number_gen():
    yield opt(char('+'))

    yield char('7').alt(char('8'))

    yield opt(char('('))
    yield char('9')
    code1 = yield digit()
    code2 = yield digit()
    yield opt(char(')'))

    d1 = yield digit()
    d2 = yield digit()
    d3 = yield digit()

    yield opt(char('-'))

    d4 = yield digit()
    d5 = yield digit()

    yield opt(char('-'))

    d6 = yield digit()
    d7 = yield digit()

    return "".join(['+', '7', '(', '9', code1, code2, ')', d1, d2, d3, '-', d4, d5, '-', d6, d7])

def phone_number_gen_parser(phone_number: str) -> str:
    parser: Parser[str] = phone_number_gen()
    return parser(Scanner(phone_number, Position()))

print(phone_number_gen_parser("+7(964)111-11-11"))
print(phone_number_gen_parser("8964222-2222"))
print(phone_number_gen_parser("7(964)333-3333"))

Accessing the current position

During parsing, it may be necessary to refer to the current position. The current position is stored in an object Scannerand access to the object Scanner executed during parsing. This means that getting the current position can also be made an object of type Parser[Position]in other words getting the position is the parser:

def position() -> Parser[Position]:
    def __with_scanner(scanner: Scanner) -> Position:
        return copy(scanner.pos)

    return Parser(__with_scanner)

Example

Use parser position() you can do it just like any other parser: through and_then or @topology.

The example below remembers the current position at the beginning of the parser process and at the end, and both of these values ​​are eventually returned to the client.

@topology
def parse_text(text: str):
    start = yield position()
    
    for c in text:
        yield char(c)
    
    end = yield position()
    return start.offset, end.offset

value = "Hello, World!"
parser_hello_world = parse_text(value)
print(parser_hello_world(Scanner(value, Position())))

Conclusion

The article demonstrated an approach to implementing a combinator parser based on the capabilities of the Python language, which is expressive enough for developing a combinator parser. When combined with generators, a high degree of abstraction can be achieved while maintaining code readability. The experience of developing a combinator parser led to the creation of a library PGPCPython Generator-based Parser Combinator library. The library is only at the very beginning of its development, and it lacks most of the traditional parser combinators, so the addition of new parsers from third-party developers is welcome. Look at the library for inspiration Parsec.

Sources

  1. https://stepik.org/lesson/42245/step/1 – the chapter on parsers in the Haskell course by D.N. Moskvina

  2. https://github.com/haskell/parsec – combinator parser library Parsec

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *