The term CONTEXT FREE GRAMMAR (CFG) stands for a particular method of describing the syntax of languages. CFGs can be used to describe any language (natural or artificial) provided it conforms to some fairly basic constraints. The constraints relate to how clauses are nested, and say roughly that dependent clauses must be adjacent to the constituent they depend on. Most natural languages appear to obey them, with the possible exception of Swiss German (see Gazdar and Mellish Natural Language Processing in Prolog (19XX:XXX)). Most contemporary theories of natural language syntax derive from the CFG framework.
A CFG is a formal description of the syntax of a language. Specifically, the description is given as a set of PRODUCTION RULES, which define the well-formed sentences of the language. The rules themselves are of course also written in a (formal) language. As all formal languages, this one is defined by a vocabulary and a syntax.
The rules contain three types of symbol:
The sentences in the language of the CFG are the production rules (syntax here refers to the format of the rules themselves, not to the language they describe). A production rule has the following properties:
LHS -> RHS
Production rules are rewrite rules. This means that the single symbol on the LHS can be replaced by the symbol(s) on the RHS (or vice versa--this depends on how the rules are used). Consider the following example:
sIt has a single symbol s on the LHS, and two symbols np and vp on the RHS. This rule says that if we are trying to find an s, that we must look for an np and a vp or alternatively, if we have a symbol np and a symbol vp adjacent to each other, then together, they can be rewritten as the symbol s.
The example avove illustrates a rule which rewrites a symbol into a set of non-terminals (rewritable symbols). But there are also rules which rewrite non-terminals into terminals, as exemplified by the following:
nounSo a terminal is a symbol which cannot occur on the left hand side of a rule--they represent the words of the language that the CFG describes.
Languages are described by giving a set of production rules, which define all grammatical sentences. The well-formed sentences of the language are those which have the following property:
A tree may be constructed for the sentence such thatThe requirement that the root node be labeled with the distinguished symbol means that only strings of certain categories are well-formed. In natural language syntax for instance, only strings that are labeled s are sentences (as opposed to those labelled np for instance). If we write a grammar for a programming language, the well-formed string (i.e. the chunks of program that the compiler understands) are units which can be interpreted as instructions.
- its root node is labelled with the distinguished symbol;
- the leaves are labelled in order with the words of the string; and
- each node in the interior of the tree is ADMITTED by a rule (this means the node's label is the LHS of a rule, and the node's daughters corresponding to the symbols on the RHS of the rule).
To prove that a sentence of a language is grammatical according to a set of rules, we determine whether it is possible to construct a tree according to the description above. The words in the sentence will be the leaves, so each word must be the RHS of a rule. If they do, we then have a list of symbols (the LHSs of the rules). We then try to match groups of symbols from this list to an RHS of a rule. If a match is found, that particular group is replaced by the symbol on the LHS of the relevant rule. If eventually this leaves us with a single symbol left, and this is the distinguished symbol, the sentence is grammatical.
The symbols on the RHS of a rule must only ever depend on the symbol on the LHS--any adjacent symbols must not be taken into account. With respect to the production rule
npthis means that if there is an np symbol, we may try to rewrite it with the rule, regardless of what other symbols may be adjacent to it. Syntactic descriptions of this format are called context free grammars because the rewriting process is completely determined by the symbols mentioned in the rule--the contexts of the symbols are irrelevant.
CFGs have three important characteristics: