# Parsing for natural languages. Part 1: Languages for describing languages

Historically, the first attempt to formalize a language and automate its parsing was *regular expressions*, invented by S. C. Kleene in 1951. A regular expression is composed of language characters (“terminals”) and three operations: concatenation, alternation, and closure. To parse regular expressions, a deterministic finite automaton (DFA) without memory is enough: the parser knows what state it is in now, but remembers nothing about its past states. This means that languages allowing arbitrarily nested constructs – for example, the nested-parenthesis language (^{n})^{n}, and the language of regular expressions themselves – cannot be described by regular expressions. Natural languages also allow constructions of unlimited nesting (*“Here are two roosters who wake that shepherd who quarrels with the strict cowherd, who milks the hornless cow that kicked the old tailless dog, who tugs the cat by the scruff, who scares and catches the tit that often steals the wheat that is stored in a dark closet in the house that Jack built.”*), so regular expressions are not expressive enough to describe natural languages.
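The gap can be sketched in Python (the pattern and helper function are my own illustration): a regular expression only matches parenthesis nesting up to a depth fixed in advance, while a single integer counter – memory that a DFA does not have – handles any depth.

```python
import re

# A fixed regex matches nesting only up to a hard-coded depth (3 here);
# every extra level of nesting would require rewriting the pattern.
DEPTH3 = re.compile(r"\((\((\(\))?\))?\)")

def balanced(s):
    """Recognize balanced parentheses, including (^n)^n, with one counter."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:                 # a ')' with no matching '('
            return False
    return depth == 0

print(bool(DEPTH3.fullmatch("((()))")))    # → True
print(bool(DEPTH3.fullmatch("(((())))")))  # → False: one level too deep
print(balanced("(((())))"))                # → True at any depth
```

The counter is, in effect, a pushdown stack over a one-letter alphabet – exactly the memory that separates context-free languages from regular ones.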

A more expressive way of describing languages is *formal grammars* – suggested by N. Chomsky in 1956. Sentences in English lend themselves quite well to this description:

```
S → NP VP
NP → N' | Det N'
N' → N | Adj N | N PP | N' CP
VP → V | V NP | V PP
PP → P NP
CP → VnP | C VP | C S
VnP → Vn | Adv VnP | VnP Conj Vn
N → This | rooster | morn | judge | man | maiden | cow |
horn | dog | cat | rat | malt | house | Jack
V → is | crowed | woke | married | kissed | milked |
tossed | worried | killed | ate | lay | built
Vn → shaven | shorn | tattered | torn | forlorn
Adj → crumpled
Adv → all
P → in | with
Det → the
C → that
Conj → and
```

This formal grammar – a couple of dozen abstract symbols (“nonterminals”) and a couple of dozen inference rules, not counting the lexicon – is enough to parse the nested constructions in the same sentence about a rooster, cowherd, dog and cat.
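One standard algorithm for parsing such context-free grammars is CYK, a bottom-up chart parser; here is a minimal sketch over a tiny fragment of the grammar above, hand-converted to Chomsky normal form (the sentence and the lexicon subset are my own illustration, not from the article):

```python
from itertools import product

# A hand-made Chomsky-normal-form fragment of the grammar above.
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICON = {"the": {"Det"}, "cat": {"N"}, "rat": {"N"}, "ate": {"V"}}

def cyk(words):
    """Return True iff the word sequence derives the start symbol S."""
    n = len(words)
    # chart[i][j] holds every nonterminal deriving words[i..j]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):            # widen the span bottom-up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):           # try every split point
                for a, b in product(chart[i][k], chart[k + 1][j]):
                    if (a, b) in BINARY:
                        chart[i][j].add(BINARY[(a, b)])
    return "S" in chart[0][n - 1]

print(cyk("the cat ate the rat".split()))   # → True
print(cyk("cat the ate rat the".split()))   # → False
```

The chart records every nonterminal derivable over every span, so the same table also exposes ambiguity: a span may end up with several competing analyses.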

For *context-free* grammars (CFG) – when the left side of each inference rule is exactly one nonterminal, as in this example – there are efficient parsing algorithms; that is why almost all programming languages are described by CFGs. Unlike programming languages, natural-language texts often allow several possible parses – for example, [*that woke the judge*…]_{CP} could attach to *morn*_{N'} – and the choice between them requires semantic analysis. However, for natural languages with a more flexible word order than English, CFGs are not enough: already to parse a Russian rhyme, you would need both `VP → V PP, N' → N Adj` (“*scolds the strict cowherd*”) and `VP → PP V, N' → Adj N` (“*stored in a dark closet*”); almost every inference rule in the grammar would need a “mirrored” counterpart – and then the number of possible parses of a text grows exponentially. Besides such scrambling, there is topicalization, where the leaves of the subtree corresponding to a nonterminal do not appear contiguously in the text: for example, in the Russian *«Я детям просил помочь, а не родителям»* (“It is the children I asked to help, not the parents”), the parts of the complex complement [*детям* … *помочь*]_{CP} (“the children … to help”) are separated by the verb *просил*_{V} (“asked”). And in several languages – researchers note Dutch and Swiss German – there are “parallel-nested” (cross-serial) constructions that do not arise from topicalization, e.g.:

| … das | mer | d’chind | em Hans | es huus | lönd | hälfe | aastriiche |
|---|---|---|---|---|---|---|---|
| … that | we | the children | Hans | the house | let | help | paint |

“… that we let the children help Hans paint the house”

Such a chain can be extended indefinitely, like the rhyme about the house that Jack built; but the verbs in the second half of the sentence must come in the same order as their complements in the first half. Such phenomena – scrambling with topicalization, and the “parallel-nested” constructions of Dutch and Swiss German – context-free grammars are unable to describe.

More formally, one can prove that CFGs can describe nested constructs – in particular, the language of nested brackets (^{n})^{n} is described by the trivial grammar `S → () | (S)` – but non-hierarchical dependencies defy them: in particular, no CFG can describe the language a^{n}b^{n}c^{n} or the language of doubled strings ((a|b)^{+})^{2}.

In 1987, researchers at Osaka University proposed an even more expressive way of describing languages – *multiple CFGs* (MCFG); at the same time and independently of the Osaka group, the University of Pennsylvania developed *linear context-free rewriting systems* (LCFRS), which differ from MCFG only in a simpler notation. In MCFG/LCFRS, nonterminals become multi-place predicates – for example, this LCFRS describes the language of doubled strings:

```
S(XY) ← P(X,Y)
P(a,a) ← ε
P(b,b) ← ε
P(XY,ZW) ← P(X,Z),P(Y,W)
```

The left arrow denotes a Horn clause. The left-hand side of each inference rule describes how the arguments of the nonterminal predicate are assembled from terminals and variables; the right-hand side lists the predicates that those variables must satisfy. The order of the predicates on the right-hand side does not matter. Every variable used on the left-hand side must be constrained by a predicate on the right-hand side; using the same variable twice is not allowed. If there are no variables on the left-hand side, the right-hand side can be empty – meaning no conditions are imposed beyond the terminals matching. The LCFRS notation looks very different from the one adopted for CFGs – for example, the grammar of the language of doubled strings could be written in a more transparent, though not used in practice, form:

```
S → XY ⇐ P → X,Y
P → a,a | b,b | XY,ZW ⇐ P → X,Z ∧ P → Y,W
```
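As a sanity check, one can enumerate by brute force the strings this LCFRS derives – a small Python sketch (the function name and the length bound are my own):

```python
from itertools import product

def derive(max_len):
    """Enumerate strings derived by the doubled-strings LCFRS,
    with each component of P bounded by max_len characters."""
    pairs = {("a", "a"), ("b", "b")}        # P(a,a) <- ε ;  P(b,b) <- ε
    changed = True
    while changed:                          # iterate to a fixpoint
        changed = False
        for (x, z), (y, w) in product(list(pairs), repeat=2):
            new = (x + y, z + w)            # P(XY,ZW) <- P(X,Z), P(Y,W)
            if len(new[0]) <= max_len and new not in pairs:
                pairs.add(new)
                changed = True
    return sorted(x + y for x, y in pairs)  # S(XY) <- P(X,Y)

print(derive(2))   # → ['aa', 'aaaa', 'abab', 'baba', 'bb', 'bbbb']
```

Every derived string is indeed some w repeated twice: the predicate P accumulates exactly the pairs of equal strings, and the S rule concatenates the two halves.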

Unlike CFGs, where each parse-tree node corresponds to a contiguous substring of the parsed text, in MCFG/LCFRS parsing a parse-tree node can correspond to several non-contiguous substrings of the text (in the parse tree, the width of a node shows how many substrings the nonterminal receives from its child nodes).

The practical use of MCFG/LCFRS is that they allow describing a “torn” sentence constituent, such as (*the children* …, *to help*)_{CP} above:

```
VP(X,Y) ← NP(X),V(Y)
CP(X,Y) ← VP(X,Y)
VP(X,YZW) ← NP(X),V(Z),CP(Y,W)
S(XYZ) ← NP(X),VP(Y,Z)
```

