Xerox Palo Alto Research Center
Kenneth R. Beesley
Xerox Research Centre Europe
This was the situation in the spring of 1980 when Kimmo Koskenniemi came to a conference on parsing that Lauri Karttunen had organized at the University of Texas at Austin. Also at the same conference were two Xerox researchers from Palo Alto, Ronald M. Kaplan and Martin Kay. The four K's discovered that all of them were interested and had been working on the problem of morphological analysis. Koskenniemi went on to Palo Alto to visit Kay and Kaplan at PARC.
This was the beginning of Two-Level Morphology, the first general model in the history of computational linguistics for the analysis and generation of morphologically complex languages. The language-specific components, the lexicon and the rules, were combined with a runtime engine applicable to all languages. In this article we trace the development of the finite-state technology that Two-Level Morphology is based on.
Traditional phonological grammars, formalized in the 1960s by Noam Chomsky and Morris Halle [Chomsky and Halle, 1968] , consisted of an ordered sequence of rewrite rules that convert abstract phonological representations into surface forms through a series of intermediate representations. Such rules have the general form x -> y / z _ w where x, y, z, and w can be arbitrarily complex strings or feature-matrices. In mathematical linguistics [Partee et al., 1993], such rules are called CONTEXT-SENSITIVE REWRITE RULES, and they are more powerful than regular expressions or context-free rewrite rules.
In 1972, C. Douglas Johnson published his dissertation, Formal Aspects of Phonological Description [Johnson, 1972], wherein he showed that phonological rewrite rules are actually much less powerful than the notation suggests. Johnson observed that while the same context-sensitive rule could be applied several times recursively to its own output, phonologists have always assumed implicitly that the site of application moves to the right or to the left of the string after each application. For example, if the rule x -> y / z _ w is used to rewrite the string ``uzxwv'' as ``uzywv'', any subsequent application of the same rule must leave the ``y'' part unchanged, affecting only ``uz'' or ``wv''. Johnson demonstrated that the effect of this constraint is that the pairs of inputs and outputs of any such rule can be modeled by a finite-state transducer. Unfortunately, this result was largely overlooked at the time and was rediscovered by Ronald M. Kaplan and Martin Kay in around 1980 [Kaplan and Kay, 1981,Kaplan and Kay, 1994].
Johnson was also aware of an important mathematical property of finite-state transducers [Schützenberger, 1961]: there exissts, for any pair of transducers applied sequentially, an equivalent single transducer. Any cascade of rule transducers could in principle be composed into one transducer that maps lexical forms directly into the corresponding surface forms, and vice versa, without any intermediate representations. Later, Kaplan and Kay had the same idea, illustrated in Figure 1.
These theoretical insights did not immediately lead to practical results. The development of a compiler for rewrite rules turned out to be a very complex task. It became clear that it required as a first step a complete implementation of basic finite-state operations such as union, intersection, complementation, and composition. Developing a complete finite-state calculus was a challenge in itself on the computers that were available at the time.
Another reason for the slow progress may have been that there were persistent doubts about the practicality of the approach for morphological analysis. Traditional phonological rewrite rules describe the correspondence between lexical forms and surface forms as a one-directional, sequential mapping from lexical forms to surface forms. Even if it was possible to model the generation of surface forms efficiently by means of finite-state transducers, it was not evident that it would lead to an efficient analysis procedure going in the reverse direction, from surface forms to lexical forms.
Let us consider a simple illustration of the problem with two sequentially applied rewrite rules, N ->m / _ p and p -> m / m _ . The corresponding transducers map the lexical form ``kaNpat'' unambiguously to ``kammat'', with ``kampat'' as the intermediate representation. However if we apply the same transducers in the other direction to the input ``kammat'', we get the three results shown in Figure 2.
This asymmetry is an inherent property of the generative approach to phonological description. If all the rules are deterministic and obligatory and the order of the rules is fixed, each lexical form generates only one surface form. But a surface form can typically be generated in more than one way, and the number of possible analyses grows with the number of rules that are involved. Some of the analyses may turn out to be invalid because the putative lexical forms, say ``kammat'' and ``kampat'' in this case, might not exist in the language. But in order to look them up in the lexicon, the system must first complete the analysis. Depending on the number of rules involved, a surface form could easily have dozens of potential lexical forms, even an infinite number in the case of certain deletion rules.
Although the generation problem had been solved by Johnson, Kaplan and Kay, at least in principle, the problem of efficient morphological analysis in the Chomsky-Halle paradigm was still seen as a formidable challenge. As counterintuitive as it was from a psycholinguistic point of view, it appeared that analysis was much harder computationally than generation. Composing all the rule transducers into a single one would not solve the ``overanalysis'' problem. Because the resulting single transducer is equivalent to the original cascade, the ambiguity remains.
The solution to the overanalysis problem should have been obvious: to formalize the lexicon itself as a finite state transducer and compose the lexicon with the rules. With the lexicon included in the composition, all the spurious ambiguities produced by the rules are eliminated at compile time. The runtime analysis becomes more efficient because the resulting single transducer contains only lexical forms that actually exist in the language.
The idea of composing the lexicon and the rules together is not mentioned in Johnson's book or in the early Xerox work. Although there obviously had to be some interface relating a lexicon component to a rule component, these were traditionally thought of as different types of objects. Furthermore, rules were traditionally conceived as applying to individual word forms; the idea of applying them simultaneously to a lexicon as a whole required a new mindset and computational tools that were not yet available.
The fact that finite-state networks could be used to represent both the inventory of valid lexical forms and the rules for mapping them to surface forms took a while to emerge. When it first appeared in print [Karttunen et al., 1992], it was not in connection with traditional rewrite rules but with an entirely different finite-state formalism that had been introduced in the meantime, Kimmo Koskenniemi's TWO-LEVEL RULES [Koskenniemi, 1983].
In the spring of 1980 when Kimmo Koskenniemi came to the USA for a visit, he learned about Kaplan and Kay's finite-state discovery. (They weren't then aware of Johnson's 1972 publication.) Xerox had begun work on the finite-state algorithms, but they would prove to be many years in the making. Koskenniemi was not convinced that efficient morphological analysis would ever be practical with generative rules, even if they were compiled into finite-state transducers. Some other way to use finite automata might be more efficient.
Back in Finland, Koskenniemi invented a new way to describe phonological alternations in finite-state terms. Instead of cascaded rules with intermediate stages and the computational problems they seemed to lead to, rules could be thought of as statements that directly constrain the surface realization of lexical strings. The rules would not be applied sequentially but in parallel. Each rule would constrain a certain lexical/surface correspondence and the environment in which the correspondence was allowed, required or prohibited. For his 1983 dissertation, Koskenniemi constructed an ingenious implementation of his constraint-based model that did not depend on a rule compiler, composition or any other finite-state algorithm, and he called it TWO-LEVEL MORPHOLOGY. Two-level morphology is based on three ideas:
To illustrate the first two principles we can turn back to the kaNpat example again. A two-level description of the lexical-surface relation is sketched in Figure 3.
Two-level rules may refer to both sides of the context at the same time. The yie alternation in English plural nouns could be described by two rules: one realizes y as i in front of an epenthetic e; the other inserts an epenthetic e between a lexical consonant-y sequence and a morpheme boundary (+) that is followed by an s. Figure 4 illustrates the y:i and 0:e constraints.
Like replace rules, two-level rules describe regular relations; but there is an important difference. Because the zeros in two-level rules are in fact ordinary symbols, a two-level rule represents an equal-length relation. This has an important consequence: Although transducers cannot in general be intersected, Koskenniemi's constraint transducers can be intersected. In fact, the apply function that maps the surface strings to lexical strings, or vice versa, using a set of two-level rules in parallel, simulates the intersection of the rule automata. It also simulates, at the same time, the composition of the input string with the constraint networks, just like the ordinary apply function.
Figure 5 illustrates the application of the N:m and p:m rules sketched in Figure 3 to the input ``kammat''. At each point in the process, all lexical candidates corresponding to the current surface symbol are considered one by one. If both rules accept the pair, the process moves on to the next point in the input. In the situation shown in Figure 5, the pair p:m will be accepted by both rules. The N:m rule accepts the pair because the p on the lexical side is required to license the N:m pair that has tentatively been accepted at the previous step. The p:m rule accepts the p:m pair because the preceding pair has an m on the surface side.
When the pair in question has been accepted, the apply routine moves on to consider the next input symbol and eventually comes back to the point shown in Figure 5 to consider other possible lexical counterparts of a surface m. They will all be rejected by the N:m rule, and the apply routine will return to the previous m in the input to consider other alternative lexical counterparts for it such as p and m. At every point in the input the apply routine must also consider all possible deletions, that is, pairs such as +:0 and e:0 that have a zero on the input side.
Applying the rules in parallel does not in itself solve the overanalysis problem discussed in the previous section. The two constraints sketched above allow ``kammat'' to be analyzed as ``kaNpat'', ``kampat'', or ``kammat''. However, the problem is easy to manage in a system that has only two levels. The possible upper-side symbols are constrained at each step by consulting the lexicon. In Koskenniemi's two-level system, lexical lookup and the analysis of the surface form are performed in tandem. In order to arrive at the point shown in Figure 5, we must have traversed a path in the lexicon that contains the lexical string in question, see Figure 6. The lexicon acts as a continuous lexical filter. The analysis routine only considers symbol pairs whose lexical side matches one of the outgoing arcs in the current state.
In Koskenniemi's 1983 system, the lexicon was represented as a forest of tries (= letter trees), tied together by continuation-class links from leaves of one tree to roots of another tree or trees in the forest.1 Koskenniemi's lexicon can be thought of as a partially deterministic, unminimized simple network. In the Xerox lexc tool, the lexicon is a minimized network, typically a transducer, but the filtering principle is the same. The lookup utility in lexc matches the lexical string proposed by the rules directly against the lower side of the lexicon. It does not pursue analyses that have no matching lexical path. Furthermore, the lexicon may be composed with the rules in lexc to produce a single transducer that maps surface forms directly to lexical forms, and vice versa .
Koskenniemi's two-level morphology was the first practical general model in the history of computational linguistics for the analysis of morphologically complex languages. The language-specific components, the rules and the lexicon, were combined with a universal runtime engine applicable to all languages. The original implementation was primarily intended for analysis, but the model was in principle bidirectional and could be used for generation.
Although the two-level approach to morphological analysis was quickly accepted as a useful practical method, the linguistic insight behind it was not picked up by mainstream linguists. The idea of rules as parallel constraints between a lexical symbol and its surface counterpart was not taken seriously at the time outside the circle of computational linguists. Many arguments had been advanced in the literature to show that phonological alternations could not be described or explained adequately without sequential rewrite rules. It went largely unnoticed that two-level rules could have the same effect as ordered rewrite rules because two-level rules allow the realization of a lexical symbol to be constrained either by the lexical side or by the surface side. The standard arguments for rule ordering were based on the a priori assumption that a rule can refer only to the input context.
But the world has changed. Current phonologists, writing in the framework of OT (Optimality Theory), are sharply critical of the ``serialist'' tradition of ordered rewrite rules that Johnson, Kaplan and Kay wanted to formalize [Prince and Smolensky, 1993,Kager, 1999,McCarthy, 2002].2In a nutshell, OT is a two-level theory with ranked parallel constraints. In contrast to Koskenniemi's ``hard'' constraints, optimality constraints are ``soft'' and violable. There are of course many other differences. Most importantly, OT constraints are meant to be universal. The fact that two-level rules can describe orthographic idiosyncrasies such as the yie alternation in English with no help from universal principles makes the approach uninteresting from the OT point of view.3
In his 1983 dissertation, Koskenniemi defined a formalism for two-level rules. The N:m and p:m rules discussed above are written as N:m <=> _ p: and p:m <=> :m _ , where <=> is an operator that combines a context restriction with the prohibition of any other realization for the lexical symbol of the pair. The colon in the right context of first rule, p:, indicates that it refers to a lexical symbol; the colon in the left context of the second rule, :m, indicates a surface symbol. The i:y rule sketched in Figure 4 that refers to both contexts is formally expressed as y:i <=> _ 0:e.
The semantics of two-level rules were well-defined but there was no rule compiler available at the time. Koskenniemi and other early practitioners of two-level morphology had to compile their rules by hand into finite-state transducers. This is tedious in the extreme and demands a detailed understanding of transducers and rule semantics that few human beings can be expected to grasp. A complex rule with multiple overlapping contexts may take hours of concentrated effort to compile and test, even for an expert human ``compiler''. In practice, linguists using two-level morphology consciously or unconsciously tended to postulate rather surfacy lexical strings, which kept the two-level rules relatively simple.
Although two-level rules are formally quite different from the rewrite rules studied by Kaplan and Kay, the basic finite-state methods that had been developed for compiling rewrite-rules were applicable to two-level rules as well. In both formalisms, the most difficult case is a rule where the symbol that is replaced or constrained appears also in the context part of the rule. This problem Kaplan and Kay had already solved with an ingenious technique for introducing and then eliminating auxiliary symbols to mark context boundaries. Kaplan and Koskenniemi worked out the basic compilation algorithm for two-level rules in the summer of 1985 when Koskenniemi was a visitor at Stanford. The first two-level rule compiler was written in InterLisp by Koskenniemi and Karttunen in 1985-87 using Kaplan's implementation of the finite-state calculus [Koskenniemi, 1986,Karttunen et al., 1987]. The current C-version of the compiler, based on Karttunen's 1989 Common Lisp implementation, was written by Lauri Karttunen, Todd Yampol and Kenneth R. Beesley in consultation with Kaplan at Xerox PARC in 1991-92 [Karttunen and Beesley, 1992]. The landmark 1994 article by Kaplan and Kay on the mathematical foundations of finite-state linguistics gives a compilation algorithm for phonological rewrite rules and for Koskenniemi's two-level rules.4
The Xerox two-level compiler has been used to compile large-scale morphological analyzers for French, English, Spanish, Portuguese, Dutch, Italian and many other languages. In the course of this work, it soon became evident that the two-level formalism was difficult for the linguists to master. It is far too easy to write rules that are in conflict with one another. It was necessary to make the compiler check for, and automatically eliminate, most common types of conflicts.
For example, in Finnish consonant gradation, an intervocalic k generally disappears in the weak grade. However, between two high labial vowels k is realized as a v. Consequently, the genitive of ``maku'' 'taste' is ``maun'', but the genitive of ``puku'' 'suit, dress' is ``puvun''. The problem is that the k in ``puku'' is also an instance of an intervocalic k, and it should be deleted by the general rule. Figure 7 illustrates the conflict.
This situation is not a problem for a derivational phonologist because the rule that turns k into v in the more specific context can be ordered before the deletion rule that applies in the more general environment. This is known as DISJUNCTIVE ordering. In Optimality Theory, cases of this sort are handled by constraint ranking. In a two-level framework, there is seemingly a problem. If one constraint requires the genitive of ``puku'' to be realized as ``puun'' and another accepts only ``puvun'', the result is that ``puku'' cannot be realized in the genitive at all.
This is one of the many types of conflicts that the Xerox compiler detects and resolves without difficulty. It sees that the context of the k:v rule is subsumed by the context of the k:0 rule. The conflict is resolved by compiling the more general rule in such a way that an intervocalic k can be either deleted or realized as v. The compiler takes advantage of UNDERSPECIFICATION, a very useful option in a declarative system of constraints. The general rule relies on the specific one to produce the correct result. In the two-level model, it is not the responsibility of any single rule to get everything right because the other rules will also have their say.
The first implementation [Koskenniemi, 1983] was quickly followed by others. The most influential implementation was by Lauri Karttunen and his students at the University of Texas [Karttunen, 1983,Gajek et al., 1983,Dalrymple et al., 1983]. Published accounts of this project inspired many copies and variations, including those by Beesley [Beesley, 1989,Beesley, 1990]. A copyrighted but freely distributed implementation of classic Two-Level Morphology, called PC-KIMMO, available from the Summer Institute of Linguistics [Antworth, 1990]; runs on PCs, Macs and Unix systems.
In Europe, two-level morphological analyzers became a standard component in several large systems for natural language processing such as the British Alvey project [Black et al., 1987,Ritchie et al., 1987,Ritchie et al., 1992], SRI's CLE Core Language Engine [Carter, 1995], the ALEP Natural Language Engineering Platform [Pulman, 1991] and the MULTEXT project [Armstrong, 1996]. ALEP and MULTEXT were funded by the European Commission. The MMORPH morphology tool [Petitpierre and Russel, 1995] built at ISSCO for MULTEXT is now available under GNU Public License.5
Some of these systems were implemented in Lisp (Alvey), some in Prolog (CLE, ALEP), some in C (MMORPH). They were based on simplified two-level rules, the so-called PARTITION-BASED formalism [Ruessink, 1989], which was claimed to be easier for linguists to learn than the original Koskenniemi notation. But none of these systems had a finite-state rule compiler. Another difference is that morphological parsing could be constrained by feature unification. Because the rules are interpreted at runtime and because of the unification overhead, these systems are not very efficient, and two-level morphology acquired, undeservedly, a reputation for being ``slow''. MMORPH solves the speed problem by allowing the user to run the morphology tool off-line to produce a database of fully inflected word forms and their lemmas. A compilation algorithm has been developed for the partition-based formalism [Grimley-Evans et al., 1996], but to our knowledge there is no publicly available compiler for it.
The pioneers of finite-state morphology knew well that a cascade of transducers compiled from phonological rewrite rules could be composed into a single one (see Figure 1). It was also known from the beginning that a set of two-level transducers could be merged into a single one [Karttunen, 1983,Koskenniemi, 1983] by intersecting them, as in Figure 8.
It was at that time that the researchers at Xerox [Karttunen et al., 1992] realized what should have been obvious to them much earlier. The intersection of two-level rules blows up because it constrains the realization of all the strings in the universal language. But in fact we are typically interested only in the strings of a particular language. If the lexicon is composed with the rules, it filters out all the spurious strings. This method had not been tried earlier because it seemed that the composition of a large lexicon with a large rule system would result in something even larger. It soon became evident that the result of composing a source lexicon with an intersected two-level rule system was never significantly larger than the original source lexicon, and typically much smaller than the intersection of the rules by themselves. Figure 9 illustrates this interesting phenomenon.
Including the lexicon at compile time obviously brings the same benefit in the case of a cascade of rewrite rules. The result does not grow significantly if the composition is done ``top-down'', starting with the lexicon and the first rule of the cascade.
A considerable amount of work has been done, and continues to be done, in the general framework of two-level morphology. However, at Xerox the newer xfst tool that includes an extended regular expression compiler with replace rules has largely supplanted twolc rules in many applications.
If the lexical forms are very different from the corresponding surface forms, it appears that for most ``language engineers'' a cascade of replace rules is conceptually easier to master than an equvalent set of two-level rules. The ordering of the rules seems to be less of a problem than the mental discipline required to avoid rule conflicts in a two-level system, even if the compiler automatically resolves most of them. From a formal point of view there is no substantive difference; a cascade of rewrite rules and a set of parallel two-level constraints are just two different ways to decompose a complex regular relation into a set of simpler relations that are easier to understand and manipulate.
When two-level rules were introduced, the received wisdom was that morphological alternations should be described by a cascade of rewrite-rules. Practitioners of two-level morphology used to write papers pointing out that a two-level account of certain phenomena was no less adquate than a serialist description [Karttunen, 1993]. It is interesting to note how linguistic fashions have changed.
From the current point of view, two-level rules have many interesting properties. They are symbol-to-symbol constraints, not string-to-string relations like general rewrite rules. Two-level rules make it possible to directly constrain deletion and epenthesis sites because the zero is an ordinary symbol. Two-level rules enable the linguist to refer to the input and the output context in the same constraint. The only anachronistic feature is that two-level constraints are inviolable. Perhaps we will see in the future a new finite-state formalism with weighted and violable two-level constraints.
All the systems that have been used for the description of morphological alternations, rewrite rules, two-level rules, and optimality constraints are UNIDIRECTIONAL. They have a generative orientation, viewing surface forms as a realization of the corresponding lexical forms, not the other way around. In the two-level formalism, the left-arrow part of a rule such as N:m <=> _ p: constrains the realization of a lexical N, not the interpretation of a surface m. This is an arbitrary choice, it would be easy to have another operator just like <=> but oriented in the other direction, constraining the potential lexical sources of a surface m. In fact, both types of rules could co-exist within the same BIDIRECTIONAL two-level rule system. This is an interesting possibility, especially for weighted constraints.
This document was generated using the LaTeX2HTML translator Version 99.1 release (March 30, 1999)
Copyright © 1993, 1994, 1995, 1996,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -no_navigation helsinki.tex
The translation was initiated by Lauri Karttunen on 2001-08-12