Parsing for natural languages. Part 1: Languages for describing languages

Historically, the first attempt to formalize a language and automate its parsing was regular expressions, invented by S. C. Kleene in 1951. A regular expression is composed of language characters (“terminals”) and three operations: concatenation, alternation, and closure. To parse regular expressions, a deterministic finite automaton (DFA) with no memory is enough: the parser knows which state it is in now, but remembers nothing about its past states. This means that languages allowing nested constructions, such as the nested-parenthesis language (ⁿ)ⁿ or the language of regular expressions themselves, cannot be described by regular expressions. Natural languages also allow constructions of unlimited nesting (“Here are two roosters who wake the shepherd who quarrels with the strict dairymaid who milks the hornless cow that kicked the old tailless dog who tugs the cat by the scruff, who frightens and catches the titmouse that often steals the wheat that is kept in the dark closet in the house that Jack built”), so regular expressions are not expressive enough to describe natural languages.
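To make the “no memory” point concrete, here is a minimal sketch (my illustration, not from the article) of a DFA in Python; it recognizes the regular language of strings over {a, b} with an even number of a’s, and at every step it keeps nothing but its current state.

# A DFA as a plain transition table: no stack, no counters,
# only the current state survives from one character to the next.
DELTA = {
    ("even", "a"): "odd",
    ("odd",  "a"): "even",
    ("even", "b"): "even",
    ("odd",  "b"): "odd",
}
START, ACCEPT = "even", {"even"}

def accepts(text):
    state = START
    for ch in text:
        if (state, ch) not in DELTA:
            return False            # character outside the alphabet: reject
        state = DELTA[(state, ch)]  # forget everything except the new state
    return state in ACCEPT

assert accepts("abba")       # two a's: accepted
assert not accepts("ab")     # one a: rejected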

A more expressive way of describing languages is formal grammars, proposed by N. Chomsky in 1956. English sentences lend themselves quite well to this kind of description:

S  → NP VP
NP → N' | Det N'
N' → N | Adj N | N PP | N' CP
VP → V | V NP | V PP
PP → P NP
CP → VnP | C VP | C S
VnP → Vn | Adv VnP | VnP Conj Vn
N  → This | rooster | morn | judge | man | maiden | cow |
     horn | dog | cat | rat | malt | house | Jack
V  → is | crowed | woke | married | kissed | milked |
     tossed | worried | killed | ate | lay | built
Vn → shaven | shorn | tattered | torn | forlorn
Adj → crumpled 
Adv → all
P   → in | with 
Det → the
C   → that
Conj → and

This formal grammar, with a couple dozen abstract symbols (“nonterminals”) and a couple dozen inference rules (not counting the lexicon), is enough to parse the nested constructions in the same sentence about the rooster, the dairymaid, the dog, and the cat.
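As an illustration (the nltk library is my choice here, not something the article relies on), a fragment of this grammar can be handed to a chart parser that accepts arbitrary CFGs; N' is spelled Nbar below.

import nltk

# A hedged sketch: a fragment of the grammar above in nltk's CFG notation.
grammar = nltk.CFG.fromstring("""
    S    -> NP VP
    NP   -> Nbar | Det Nbar
    Nbar -> N | Nbar CP
    VP   -> V | V NP
    CP   -> C VP
    Det  -> 'the'
    N    -> 'rat' | 'malt' | 'cat'
    V    -> 'ate' | 'killed'
    C    -> 'that'
""")

parser = nltk.ChartParser(grammar)
sentence = "the cat that killed the rat ate the malt".split()
for tree in parser.parse(sentence):
    print(tree)   # (S (NP (Det the) (Nbar (Nbar (N cat)) (CP ...))) (VP ...))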

For context-free grammars (CFG), where the left-hand side of every inference rule is exactly one nonterminal, as in this example, efficient parsing algorithms exist; this is why almost all programming languages are described by CFGs. Unlike programming languages, natural-language texts often allow several possible parses: for example, [that woke the judge…]CP could also attach to mornN′, and choosing between the readings requires semantic analysis. However, for natural languages with a more flexible word order than English, CFGs are not enough. Already to parse the Russian version of the rhyme we would need both VP → V PP, N' → N Adj (“quarrels with the dairymaid strict”) and VP → PP V, N' → Adj N (“in the dark closet is kept”); nearly every inference rule of the grammar would have to be joined by its “mirrored” counterpart, and then the number of possible parses of a text grows exponentially. Moreover, because of topicalization and scrambling, the leaves of the subtree corresponding to a nonterminal may not appear contiguously in the text: for example, in the Russian equivalent of “I asked you to help the children, [and not the parents]”, the parts of the complex complement [the children to help]CP are separated by the verb askedV. In several languages (researchers point to Dutch and Swiss German) there are “parallel-nested” constructions that have nothing to do with topicalization, for example:

… das     mer    d’chind        em Hans    es huus     lönd   hälfe   aastriiche
… that    we     the children   Hans       the house   let    help    paint

“… that we let the children help Hans paint the house”

Such a chain can be extended indefinitely, just as in the rhyme about the house that Jack built; but the verbs in the second half of the sentence must come in the same order as their complements in the first half. Such phenomena, scrambling with topicalization and the “parallel-nested” constructions of Dutch and Swiss German, are beyond what context-free grammars can describe.

More formally, one can prove that CFGs can describe nested constructions, in particular the nested-parenthesis language (ⁿ)ⁿ, described by the trivial grammar S → () | (S); but non-hierarchical dependencies defy them: in particular, no CFG can describe the language aⁿbⁿcⁿ or the language of twice-repeated strings ((a|b)⁺)².
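The trivial grammar for the bracket language translates directly into a recursive recognizer; the Python sketch below (my illustration, not from the article) simply tries each of the two rules.

# Recognizer for the nested-bracket language, one case per grammar rule.
def nested_brackets(s):
    if s == "()":                              # S -> ()
        return True
    if len(s) >= 4 and s[0] == "(" and s[-1] == ")":
        return nested_brackets(s[1:-1])        # S -> (S)
    return False

assert nested_brackets("((()))")
assert not nested_brackets("(()())")   # balanced, but not of the form (ⁿ)ⁿ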

In 1987, an even more expressive way of describing languages was developed at Osaka University: multiple CFGs (MCFG); at the same time, and independently of the Osaka group, the University of Pennsylvania developed linear context-free rewriting systems (LCFRS), which differ from MCFG only in notation. In MCFG/LCFRS, nonterminals become multi-place predicates; for example, this LCFRS describes the language of twice-repeated strings:

S(XY) ← P(X,Y)
P(a,a) ← ε
P(b,b) ← ε
P(XY,ZW) ← P(X,Z),P(Y,W)

The left arrow denotes a Horn clause. The left-hand side of each inference rule describes how the arguments of the nonterminal predicate are assembled from terminals and variables; the right-hand side lists the predicates that those variables must satisfy. The order in which the predicates are written on the right-hand side does not matter. Every variable used on the left-hand side must be constrained by a predicate on the right-hand side; reusing a variable is not allowed. If there are no variables on the left-hand side, the right-hand side may be empty: this means that no conditions are imposed beyond the terminals matching. The LCFRS notation looks very unlike the one adopted for CFGs; for example, the grammar of the language of twice-repeated strings could be written in a more transparent, though not used in practice, form:

S → XY  ⇐ P → X,Y
P → a,a | b,b | XY,ZW  ⇐ P → X,Z ∧ P → Y,W
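To make the notation concrete, here is one derivation in the LCFRS above, showing that the string abab (that is, ab repeated twice) belongs to the language; the variable bindings at each step are spelled out as my own illustration.

P(a,a)                       axiom
P(b,b)                       axiom
P(ab,ab) ← P(a,a), P(b,b)    by P(XY,ZW) ← P(X,Z),P(Y,W) with X=a, Y=b, Z=a, W=b
S(abab)  ← P(ab,ab)          by S(XY) ← P(X,Y) with X=ab, Y=ab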

Unlike CFGs, where each node of the parse tree corresponds to one contiguous substring of the parsed text, when parsing with MCFG/LCFRS a parse-tree node can correspond to several non-contiguous substrings of the text, which the nonterminal receives from its child nodes.
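To show how such multi-place predicates can be checked at all, here is a naive recognizer (my own exponential brute-force sketch, not an efficient chart parser) in which each rule of the twice-repeated-strings grammar becomes one case of a Python function.

from functools import lru_cache

@lru_cache(maxsize=None)
def P(x, y):
    if (x, y) in {("a", "a"), ("b", "b")}:      # P(a,a) ← ε   and   P(b,b) ← ε
        return True
    for i in range(1, len(x)):                  # P(XY,ZW) ← P(X,Z),P(Y,W):
        for j in range(1, len(y)):              # try every split of both arguments
            if P(x[:i], y[:j]) and P(x[i:], y[j:]):
                return True
    return False

def S(s):
    # S(XY) ← P(X,Y): try every split of the input into X and Y
    return any(P(s[:i], s[i:]) for i in range(1, len(s)))

assert S("abab") and S("abbaabba")
assert not S("abba") and not S("aab")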

The practical value of MCFG/LCFRS is that they make it possible to describe a “torn apart” constituent such as (the children, to help)CP:

VP(X,Y) ← NP(X),V(Y)
CP(X,Y) ← VP(X,Y)
VP(X,YZW) ← NP(X),V(Z),CP(Y,W)
S(XYZ) ← NP(X),VP(Y,Z)

