Parsing for natural languages. Part 1: Languages for describing languages
Historically, the first attempt to formalize a language and automate its parsing was regular expressions, invented by S.K. Kleini in 1951. A regular expression is composed of language characters (“terminals”), and three operations: concatenation, alternation, and closure. To parse regular expressions, it is enough DKA without memory: the parser knows what state he is in now, but does not remember anything about his past states. This means that languages that allow nested constructs – for example, the nested parenthesis language (n)n and the language of the regular expressions themselves is impossible to describe with regular expressions. Natural languages also allow constructions of unlimited nesting (“Here are two roosters who wake up that shepherd who scolds the strict cowshed, who milks a hornless cow, kicked an old dog without a tail, who kicks a cat by the collar, who scares and catches a tit, which often steals wheat, which is stored in a dark closet in the house that Jack built. “), so regular expressions are not expressive enough to describe natural languages.
A more expressive way of describing languages is formal grammars – suggested by N. Chomsky in 1956. Sentences in English lend themselves quite well to this description:
S → NP VP
NP → N' | Det N'
N' → N | Adj N | N PP | N' CP
VP → V | V NP | V PP
PP → P NP
CP → VnP | C VP | C S
VnP → Vn | Adv VnP | VnP Conj Vn
N → This | rooster | morn | judge | man | maiden | cow |
horn | dog | cat | rat | malt | house | Jack
V → is | crowed | woke | married | kissed | milked |
tossed | worried | killed | ate | lay | built
Vn → shaven | shorn | tattered | torn | forlorn
Adj → crumpled
Adv → all
P → in | with
Det → the
C → that
Conj → and
This formal grammar of a couple of dozen abstract symbols (“nonterminals”) and a couple of dozen inference rules (not counting the lexicon) is enough to parse the nested constructions in the same sentence about a rooster, cowshed, dog and cat:
For context-free grammars (CFG) – when there is exactly one nonterminal on the left side of each inference rule, as in this example – there are efficient algorithms for parsing text; therefore almost all programming languages are described by CFG. Unlike programming languages, natural language texts often allow several possible parsing – for example, [that woke the judge…]CP could apply to mornN ‘ – the choice between which requires semantic analysis. However, for natural languages with a more flexible word order than in English, CFG is not enough: already to parse a Russian rhyme, you would need VP → V PP, N' → N Adj
(“scolds the strict cowshed“), and VP → PP V, N' → Adj N
(“stored in a dark closet“); almost for every inference rule in the grammar, it would be necessary to add its” mirroring “- and then the number of possible parsing of the text grows exponentially. topicalization leaves of a subtree corresponding to a nonterminal may not appear in a row in the text: for example, in “I asked you to help the children, [а не родителям]”complex complement members [детям помочь]CP separated by verb requestedV… In several languages - researchers note Dutch and Swiss German – there are non-topicalization “parallel-nested” constructs, eg:
… das | mer | d’chind | em Hans | es huus | lönd | hälfe | aastriiche |
… what | we | children | Hans | house | ask | to help | paint |
“… that we ask the children to help Hans paint the house” |
Such a chain can be extended indefinitely, as in the rhyme about Jack’s house; but the verbs in the second part of the sentence must go in the same order as their complements in the first part of the sentence. Such phenomena – scrambling with topicalization and “parallel-nested” constructions in Dutch and Swiss German – formal grammars are unable to describe.
More formally, one can prove that CFGs can describe nested constructs – in particular, the language of nested brackets (n)n described by trivial grammar S → () | (S)
– but non-hierarchical dependencies defy description: in particular, CFG cannot describe the language anbncn or the language of twice repeated strings ((a | b)+)2…
In 1987, Osaka University developed an even more expressive way of describing languages - multiple CFGs (MCFG); at the same time and independently of the Osakis, the University of Pennsylvania developed linear context-free rewriting systems (LCFRS), which differ from MCFG only in a simpler notation. In MCFG / LCFRS, nonterminals become multiplace predicates – for example, this LCFRS describes the language of double repeated strings:
S(XY) ← P(X,Y)
P(a,a) ← ε
P(b,b) ← ε
P(XY,ZW) ← P(X,Z),P(Y,W)
Left arrow indicates Horn disjunct… On the left side of each inference rule, the compilation of nonterminal predicate arguments from terminals and variables is described; the right one lists the predicates that the variables must satisfy. The order in which the predicates are written on the right-hand side does not matter. Each variable used on the left side must be constrained by a predicate on the right side; reuse of one variable is not allowed. If there are no variables on the left side, then the right side can be empty – this means that no additional conditions, except for the coincidence of terminals, are imposed. The way of writing LCFRS is very different from that adopted for CFG – for example, the grammar of the language of twice repeated lines could be written in a more transparent, although in practice not used form:
S → XY ⇐ P → X,Y
P → a,a | b,b | XY,ZW ⇐ P → X,Z ∧ P → Y,W
Unlike CFG, where each parse tree node corresponds to a substring of the parsed text, when parsing MCFG / LCFRS, a parse tree node can correspond to several non-consecutive substrings in the text: (the line width shows how many substrings the nonterminal receives from child nodes)
The practical use of MCFG / LCFRS is that they allow you to describe a “torn” member of a sentence (children, to help)CP:
VP(X,Y) ← NP(X),V(Y)
CP(X,Y) ← VP(X,Y)
VP(X,YZW) ← NP(X),V(Z),CP(Y,W)
S(XYZ) ← NP(X),VP(Y,Z)
Cloud servers from Macleod fast and safe.
Register using the link above and get a 10% discount for the first month of renting a server of any configuration!