Parser combinators in Python
DISCLAIMER
This article uses Python generators, so the reader is expected to be familiar with them a little more than just using
yield
. It should also be noted that some examples are simplified for the sake of compactness of the narrative.
TLDR
The article offers a look at the experience of developing a combinator parser for Python, which resulted in a library PGPC for developing parsers in Python. The library was inspired Parsec.
Emulation is of particular interest. do
-notations via Python generators, hence the name of the library: Python Generator-based Parser Combinator library.
An example of using the library to parse a phrase "Hello, World!"
:
@topology
def parse_word(word: str):
parsed = []
for c in word:
letter = yield char(c)
parsed.append(letter)
return parsed
text = "Hello, World!"
parser = parse_word(text)
print(parser(Scanner(text, Position())))
Motivation
The task of parsing text appeared at the dawn of programming languages, and during this time a rigorous mathematical apparatus was created for it. The classic approach to analyzing the syntax and semantics of an input sequence involves developing:
lexer – a program that breaks a stream into tokens or lexemes;
parser – a program that tries to form some kind of language construction based on the input sequence of tokens.
This approach is described in detail in the book with a dragon “Compilers. Principles, approaches and tools.” by A.V. Aho, M.S. Lam, R. Seti, D.D. Ulman.
On the other hand, the parser can be viewed as a function that:
the input string associates a certain object,
returns the rest of the string that was not parsed:
For Haskell, the parser type looks like this: String -> (T, String)
.
Based on the idea that a parser is a function, and functions are usually combined into compositions, parser combinators appeared. The point is to build more complex ones from small parsers using the composition of parsers.
Parser combinators
The combinator parser is a powerful and flexible approach to text parsing that has its advantages over classical approaches:
expressive and readable code: parser combinators allow you to compose parsers through a combination of simple parsers, resulting in code that is close to BNF form (Backus-Naur form);
compositionality: parser combinators strive for composition by nature, in the sense that one can easily assemble a larger, more complex parser from smaller parsers. Composition allows you to develop parsers incrementally, repeatedly reusing existing parsers to parse different structures of the target language;
streaming parsing: parser combinators can work in conditions where the input sequence is formed in real time;
independence from the programming language: the combinator parser can be implemented in any language that supports higher-order functions;
rapid prototyping: parser combinators are great for prototyping, allowing you to quickly start working on a parser without the need for complicated workspace setup;
testing: each parser consists of simpler parsers, which means that each parser can be easily tested.
Parser combinators offer a flexible and intuitive approach that makes parser development easier and improves code readability. They are most convenient for working with DSL languages and various data formats.
Development
The development of the generator parser will be carried out in Python with the active use of Type Hints.
Basic abstractions
The general idea is that a parser is a function from an input sequence into an object of an arbitrary type. So, after reading the input string, you can form some object.
The main abstractions are:
Position
– current position of the parser in the input sequence;Scanner
– a class parameterized by the type of the element of the input sequence, for abstraction by the input sequence;Parser
– a class parameterized by the type of the object being parsed, representing the parser itself. To use objectsParser
in the context of functions, in the class overridden__call__
method.
position
Class Position
allows you to abstract the position in the input sequence where the previous parser stopped. It is expected that the position can take many forms, for example, when parsing a file, it is necessary to remember the line and column in addition to the absolute offset relative to the beginning of the file.
class Position:
def __init__(self, offset: int = 0):
self.__offset = offset
def increase(self, offset: int):
self.__offset += offset
Scanner
Class Scanner
, parameterized by the element type of the input sequence, allows you to abstract the input sequence. The standard class interface includes methods:
mark
– remember the current position to return to it later;reset
– return to the last memorized position;advance_if
– move the current position forward if the current element of the input sequence passes the test according to the predicate passed to the method. As a result, either the current element itself, which passed the test, is returned, orNone
class Scanner(Generic[ELEMENT]):
def __init__(self, content: Sequence[ELEMENT], pos: Position):
self.__pos: Position = pos
self.__content: Sequence[ELEMENT] = content
self.__marks: List[Position] = []
def mark(self) -> None:
self.__marks.append(self.__pos)
def reset(self) -> None:
if self.__marks:
self.__pos = self.__marks.pop()
@property
def current(self) -> ELEMENT:
if self.__pos.offset >= len(self.content):
raise EOFError(f"No value at {self.__pos.offset}")
result = self.content[self.pos.offset]
return result
def advance_if(self, predicate: Callable[[ELEMENT], bool]) -> ELEMENT | None:
element = self.current
if predicate(element):
self.__pos.increase(1)
return element
else:
return None
parser
Class Parser
, parameterized by the type of the object after parsing, allows you to abstract away the parsing logic. The parser must be used in the context of functions, and therefore it is redefined in it __call__
method that takes Scanner
and returns an object of the specified type.
Examples of parser types:
Parser[datetime.date]
– a parser that can extract the date from the input sequence;Parser[int]
– a parser that can extract a number from the input sequence.
When creating an object Parser[V]
a function is passed to the constructor Callable[[Scanner], V]
The that will be used to extract data from the input string.
The standard parser interface includes a method and_then
, which allows you to glue two parsers into one. Per input method and_then
takes a function consumer
type Callable[[V], Parser[U]]
and implements the following logic:
an object
Scanner
is passed as input to the current parser (self.__parser
), which returns an object of typeV
;resulting object of type
V
current parser gets to the input of the functionconsumer
;function
consumer
returns a new parser which is the result of the methodand_then
.
def and_then(self, consumer: Callable[[V], Parser[U]]) -> Parser[U]:
def __with_scanner(scanner: Scanner) -> U:
v = self.__parser(scanner)
return consumer(v)(scanner)
return Parser(__with_scanner)
Method and_then
is called a monadic binder in functional programming languages, and in Haskell
And Idris
is based on it do
– a notation that allows you to write functional code with callbacks in the form of “flat” code, reminiscent of imperative code.
Error processing
If the input sequence cannot be parsed by the parser, then a Parser.ParserError
an exception that can be handled by standard Python tools.
Primitive Parsers
Parser satisfy
Parser satisfy
is the main primitive on the basis of which the rest of the parsers are built. It takes the following predicate as input: Callable[[str], bool]
and the output is a parser Parser[str]
.
Parsing of the input string succeeds for every element of the input string for which the predicate returns True
. The element itself is the result of parsing, and Scanner
moves to the next element in the input sequence.
def satisfy(predicate: Callable[[str], bool]) -> Parser[str]:
def __with_scanner(scanner: Scanner) -> str:
result = scanner.advance_if(predicate)
if not result:
raise Parser.ParserError(f"Unexpected '{scanner.current}' at {scanner.pos}")
return result
return Parser(__with_scanner)
char parser
Parser char
accepts a character as input and checks whether the current element of the input sequence is equal to the received character.
def char(c: str) -> Parser[str]:
return satisfy(lambda x: c == x)
ws parser
Parser ws
checks if the current character is whitespace.
def ws() -> Parser[str]:
return satisfy(lambda x: x.isspace())
opt parser
Parser opt
is a parser that takes a parser as input and returns a parser. The peculiarity is that if the passed parser failed to parse the input sequence, then this is not an error. This parser is used to parse optional data that may not be present in the input sequence.
def opt(parser: Parser[V]) -> Parser[V | None]:
def __with_scanner(scanner: Scanner) -> V | None:
try:
scanner.mark()
return parser(scanner)
except ValueError:
scanner.reset()
return None
return Parser(__with_scanner)
In case of an error, it is necessary to cancel everything that the passed parser managed to do, for this purpose, before starting the parser, a flag is set to save the state, and in case of an error, the scanner returns to the saved state.
parser many
Parser many
is a parser that takes a parser as input and returns a parser. The peculiarity is that the passed parser is applied to the input sequence until an error is encountered. The result of the parser is a list of successfully parsed objects.
def many(parser: Parser[V]) -> Parser[List[V]]:
def __with_scanner(scanner: Scanner) -> List[V]:
result = []
try:
while True:
element = parser(scanner)
result.append(element)
except ValueError:
return result
return Parser(__with_scanner)
alt parser
Parser alt
is a parser that tries to apply two parsers in turn to the input sequence and returns the result of the first successful parser. The parser is implemented as a method alt
class Parser
:
class Parser(Generic[V]):
...
def alt(self, parser: 'Parser[U]') -> 'Parser[V | U]':
def __with_scanner(scanner: Scanner) -> V:
try:
scanner.mark()
return self.__parser(scanner)
except ValueError:
scanner.reset()
return parser(scanner)
return Parser(__with_scanner)
An example of using parsers
Task
Develop a phone number parser. Phone number format requirements:
starts at
79
or89
,may optionally be present at the beginning
+
,code (
912
,964
etc) can be inside brackets,there can be hyphens between numeric groups,
there are no gaps.
Solution
The solution will be implemented taking into account the following ideas:
method
and_then
allows you to “glue” parsers into one big parser: apply the first parser, followed by the second, third, etc.;method
and_then
takes a lambda function that takes the result of the previous parser as the first argument;the result of the lambda function will also be a parser, but this new parser can also call the method
and_then
;keep calling
and_then
can be unlimited many times;thanks to the closure the most recent call
and_then
will have access to all the results of previous parsers.
A nice bonus is that error handling happens inside the combinator parser itself, which leads to more or less clean business logic code.
def phone_number(phone_number: str) -> str:
phone_parser: Parser[List[str]] = opt(char('+')) \
.and_then(lambda _plus: char('7').alt('8')
.and_then(lambda _seven_or_eight: opt(char('('))
.and_then(lambda _: char('9')
.and_then(lambda _nine: digit()
.and_then(lambda code1: digit()
.and_then(lambda code2: opt(char(')'))
.and_then(lambda _: digit()
.and_then(lambda d1: digit()
.and_then(lambda d2: digit()
.and_then(lambda d3: opt(char('-'))
.and_then(lambda _: digit()
.and_then(lambda d4: digit()
.and_then(lambda d5: opt(char('-'))
.and_then(lambda _: digit()
.and_then(lambda d6: digit()
.map(lambda d7: ['+', '7', '(', '9', code1, code2, ')', d1, d2, d3, '-', d4, d5, '-', d6, d7])
)))))))))))))))
result = phone_parser(Scanner(phone_number, Position()))
return "".join(result)
print(phone_number("+7(964)111-11-11"))
print(phone_number("8964222-2222"))
print(phone_number("7(964)333-3333"))
Generators
Idea
The code from the example above, although it solves the problem, has a lot of syntactic noise associated with the peculiarities of the call. and_then
. In functional programming languages, the method and_then
is called a monadic binder, and, for example, Haskell or Idris implements do
– a notation that allows you to turn code with a large number of callbacks into flat code, similar in form to imperative.
Python does not have the concept of a monadic binder, nor does it do
-notations, but there are generators. Generators offer two-way communication between client and generator through send
method:
call result
send
is the value passed toyield
inside the generatorinput value
send
is passed inside the generator and is the resultyield
.
Thus, it is possible to implement work with parsers through yield
:
passed value – parser,
the return value is the result of parsing the parser.
Implementation
For the convenience of the client, the functions that will build parsers based on generators will be wrapped with a decorator @topology
.
The decorator implements a wrapper around the parser, in which:
the original parser function is called to get the generator object,
a function is called to run the generator
next
which will return the first passed inyield
parser,the current object is passed to the parser
Scanner
and the output is the result of parsing the parser,in an infinite loop, the result of parsing the previous parser is passed to the generator,
exception
StopIteration
appears when the original function callsreturn
,the result of the original function lies in the field
value
exceptionsStopIteration
,field value
value
exceptionsStopIteration
will be the result of the final parser.
def topology(parser_builder) -> Callable[..., Parser[V]]:
def wrapper(*args: Any, **kwargs: Any) -> Parser[V]:
gen = parser_builder(*args, **kwargs)
def __with_scanner(scanner: Scanner) -> V:
parser = next(gen)
try:
while True:
result = parser(scanner)
parser = gen.send(result)
except StopIteration as e:
return e.value
return Parser(__with_scanner)
return wrapper
The disadvantage of the decorator is the fact that the return type will change from Generator[Any, Parser[Any], V]
on Parser[V]
which can lead to false positives mypy
.
Example
Implement the task with phone numbers from the previous section using @topology
decorator as follows:
@topology
def phone_number_gen():
yield opt(char('+'))
yield char('7').alt(char('8'))
yield opt(char('('))
yield char('9')
code1 = yield digit()
code2 = yield digit()
yield opt(char(')'))
d1 = yield digit()
d2 = yield digit()
d3 = yield digit()
yield opt(char('-'))
d4 = yield digit()
d5 = yield digit()
yield opt(char('-'))
d6 = yield digit()
d7 = yield digit()
return "".join(['+', '7', '(', '9', code1, code2, ')', d1, d2, d3, '-', d4, d5, '-', d6, d7])
def phone_number_gen_parser(phone_number: str) -> str:
parser: Parser[str] = phone_number_gen()
return parser(Scanner(phone_number, Position()))
print(phone_number_gen_parser("+7(964)111-11-11"))
print(phone_number_gen_parser("8964222-2222"))
print(phone_number_gen_parser("7(964)333-3333"))
Accessing the current position
During parsing, it may be necessary to refer to the current position. The current position is stored in an object Scanner
and access to the object Scanner
executed during parsing. This means that getting the current position can also be made an object of type Parser[Position]
in other words getting the position is the parser:
def position() -> Parser[Position]:
def __with_scanner(scanner: Scanner) -> Position:
return copy(scanner.pos)
return Parser(__with_scanner)
Example
Use parser position()
you can do it just like any other parser: through and_then
or @topology
.
The example below remembers the current position at the beginning of the parser process and at the end, and both of these values are eventually returned to the client.
@topology
def parse_text(text: str):
start = yield position()
for c in text:
yield char(c)
end = yield position()
return start.offset, end.offset
value = "Hello, World!"
parser_hello_world = parse_text(value)
print(parser_hello_world(Scanner(value, Position())))
Conclusion
The article demonstrated an approach to implementing a combinator parser based on the capabilities of the Python language, which is expressive enough for developing a combinator parser. When combined with generators, a high degree of abstraction can be achieved while maintaining code readability. The experience of developing a combinator parser led to the creation of a library PGPC – Python Generator-based Parser Combinator library. The library is only at the very beginning of its development, and it lacks most of the traditional parser combinators, so the addition of new parsers from third-party developers is welcome. Look at the library for inspiration Parsec.
Sources
https://stepik.org/lesson/42245/step/1 – the chapter on parsers in the Haskell course by D.N. Moskvina
https://github.com/haskell/parsec – combinator parser library
Parsec