[Up] [Next]
Go up to Context Free Grammars And How To Implement Them In Prolog
Go forward to Implementing CFGs in Prolog

Context Free Grammars

A Framework For Describing Syntax

The term CONTEXT FREE GRAMMAR (CFG) stands for a particular method of describing the syntax of languages. CFGs can be used to describe any language (natural or artificial) provided it conforms to some fairly basic constraints. The constraints relate to how clauses are nested, and say roughly that dependent clauses must be adjacent to the constituent they depend on. Most natural languages appear to obey them, with the possible exception of Swiss German (see Gazdar and Mellish Natural Language Processing in Prolog (19XX:XXX)). Most contemporary theories of natural language syntax derive from the CFG framework.

The Language of Context Free Grammars

A CFG is a formal description of the syntax of a language. Specifically, the description is given as a set of PRODUCTION RULES, which define the well-formed sentences of the language. The rules themselves are of course also written in a (formal) language. As all formal languages, this one is defined by a vocabulary and a syntax.

Vocabulary

The rules contain three types of symbol:

Non-Terminals
correspond to constituents in the language to be described. One of the non-terminals has a special status. It is called the distinguished symbol.
Terminals
correspond to the words of the language to be described.
->
(the arrow symbol) delimits the left hand side of a rule from the right hand side.

Syntax

The sentences in the language of the CFG are the production rules (syntax here refers to the format of the rules themselves, not to the language they describe). A production rule has the following properties:

Some Examples

Production rules are rewrite rules. This means that the single symbol on the LHS can be replaced by the symbol(s) on the RHS (or vice versa--this depends on how the rules are used). Consider the following example:

s -> np, vp.
It has a single symbol s on the LHS, and two symbols np and vp on the RHS. This rule says that if we are trying to find an s, that we must look for an np and a vp or alternatively, if we have a symbol np and a symbol vp adjacent to each other, then together, they can be rewritten as the symbol s.

The example avove illustrates a rule which rewrites a symbol into a set of non-terminals (rewritable symbols). But there are also rules which rewrite non-terminals into terminals, as exemplified by the following:

noun -> "dog"
So a terminal is a symbol which cannot occur on the left hand side of a rule--they represent the words of the language that the CFG describes.

How CFGs Describe Well-Formed Sentences

Languages are described by giving a set of production rules, which define all grammatical sentences. The well-formed sentences of the language are those which have the following property:

A tree may be constructed for the sentence such that
The requirement that the root node be labeled with the distinguished symbol means that only strings of certain categories are well-formed. In natural language syntax for instance, only strings that are labeled s are sentences (as opposed to those labelled np for instance). If we write a grammar for a programming language, the well-formed string (i.e. the chunks of program that the compiler understands) are units which can be interpreted as instructions.

To prove that a sentence of a language is grammatical according to a set of rules, we determine whether it is possible to construct a tree according to the description above. The words in the sentence will be the leaves, so each word must be the RHS of a rule. If they do, we then have a list of symbols (the LHSs of the rules). We then try to match groups of symbols from this list to an RHS of a rule. If a match is found, that particular group is replaced by the symbol on the LHS of the relevant rule. If eventually this leaves us with a single symbol left, and this is the distinguished symbol, the sentence is grammatical.

Why Context Free?

The symbols on the RHS of a rule must only ever depend on the symbol on the LHS--any adjacent symbols must not be taken into account. With respect to the production rule

np -> det, noun.
this means that if there is an np symbol, we may try to rewrite it with the rule, regardless of what other symbols may be adjacent to it. Syntactic descriptions of this format are called context free grammars because the rewriting process is completely determined by the symbols mentioned in the rule--the contexts of the symbols are irrelevant.

Advantages of the CFG Format

CFGs have three important characteristics:

Although most modern syntactic theories derive from the CFG framework, many of them have departed from the basic format in such a way that the first two three properties have been severely compromised. An example of such a departure is the move-alpha rule in GB--this rule is not context free because it makes a statement about a whole tree. The production rules in proper CFGs can only make statements about local trees ( i.e. how a node is rewritten into another set of nodes). It is however possible to capture the phenomenon that the move-alpha rule is intended to explain within the CFG framework--this will be explained in Unbounded Dependencies.


<a.von.klopp@bangor.ac.uk>

[Up] [Next]