Building the Estonian finite state morphology
Heli Uibo, University of Tartu
The research on modeling the Estonian morphology by finite state devices has
been influenced mostly by (Koskenniemi 1983), (Karttunen 1993) and
(Beesley&Karttunen 2000). We have used a lexical transducer combined with
two-level rules as a general model for describing Estonian morphology. A
relatively novel aspect of the approach is that the rules are applied to
both sides of the lexical transducer: to the lexical representation and
to the lemma.
The key issue in modeling morphology is the productivity of all kinds of rules
that participate in word inflection, derivation and compounding.
If a rule is absolutely productive, it is easy to formalize either as a
rule or as a part of the network of lexicons. Exceptions always cause problems and
often lead to inelegant solutions in the language description.
Estonian morphology is complex: the number of different inflected
forms is 28 for nouns and about 70 for verbs. Moreover, parallel forms
almost double these numbers for some inflection types. There are many
different inflection types in Estonian: the minimal type system
developed by Ülle Viks (Viks 1992) contains 38 types and 84
exceptions. Some of the inflection types are no longer productive.
The Estonian language is both agglutinative and flective, so it is
natural to model its morphotactics by the network of lexicons and its
stem flexion by two-level rules.
The principal problems and their solutions for building the Estonian finite
state morphology are the following:
- Stem changes
Solution: many in one. All the possible stem variants are encoded as a single
lexical entry, using lexical symbols (morphophonemes) that correspond to
different phonemes on the surface.
- Stem internal changes are handled by lexical
symbols. Two-level rules state the legal correspondences between
lexical and surface phonemes, depending on the current
- Stem final changes are mostly
described by means of continuation lexicons.
- Agglutinative processes occurring in the declension of nouns and
conjugation of verbs
Solution: described by three layers of lexicons
- continuation lexicon for each inflection type.
- allocation of stem variants in the paradigm
- adding of grammatical features and endings
- Inflection types: how much to describe by lexicons and how
much by rules?
Solution: the network of lexicons is based on the type system
provided by (Viks 1992). Regular stem changes, phonotactics,
orthography and morphophonological distribution have been described by
rules. The system is a little unbalanced: the network of lexicons plays
the major role. We could try to add more rules to reduce the
workload of the lexicons but, on the other hand, human-readability of the
system of lexicons is important for being able to update the lexicons.
Moreover, as long as we use Viks's type system we can reuse the
automatic inflection type detection module developed for this
- Absolutely productive derivation modeled within the network
of lexicons. Problems:
- On which step is
the upper side of the lexical transducer (primary form and word class)
- How to handle the words with stem-internal changes?
- Partially productive derivation. Either include the derivatives in the
lexicon or mark the words which can be derived in specific ways. A suitable
tool in finite state technology: flag diacritics.
- Unsolved problem: the balance between productivity and
lexicalization. How complex is it to describe partially productive
derivation types by minilexicons and continuation links (instead of
including the derivatives into stem lexicons as independent stems)?
Which derivation types to consider productive enough? Which are the
formal features that could be used to handle some processes in
derivation by rules?
- Compounding. Solution: build the compounds using the
continuation links between certain lexicons. Problem:
we can use only the features present in the lexicon. What kind of
features are essential? Is Estonian compounding generally formalizable
at all? Proposed solutions: flag diacritics, finite state filters.
- Which additional features to encode in the lexical
- strong/weak grade in consonant gradation: the weak grade is marked by $,
similarly to (Koskenniemi 1983)
- degree of quantity (degrees II and III differ in the written
form for stops (k, p, t : kk, pp, tt) only). Has not been done,
but it could be useful both for text-to-speech systems and for
determination of the inflection type of a word.
- Should the description of Estonian morphology be as universal as
possible or should it be oriented to specific language technology
For example, if we want the morphological description
to be usable for automatic hyphenation, the word boundaries in
compounds should be marked. Another example: spell checking is very
sensitive to overgeneration, whereas information retrieval is not. At
the moment, the description is not oriented to any specific
The Estonian finite state morphology has been implemented using the
Xerox tools LEXC, TWOLC and XFST. There are 45 two-level rules. The
network of lexicons covers all the inflection types. The stem lexicon
contains the ca. 2500 most frequent word roots, based on the frequency
dictionary of Estonian (Kaalep, Muischnek 2002). Additionally, the
network of lexicons includes ca. 200 continuation lexicons, which
describe the stem final changes, noun declension, verb conjugation,
derivation and compounding.
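How continuation lexicons chain a stem, its stem variants and the endings can be illustrated with a small Python sketch. It is not the actual LEXC source: the lexicon names, the tags such as +N+Sg+Gen, the morphophoneme P and the single noun sepp are assumptions chosen for illustration. Only the weak-grade marker $ is taken from the description itself.

```python
# Toy network of continuation lexicons (illustrative names, not the real lexicons).
# Each entry is (upper side, lower side, name of the continuation lexicon).
# "#" marks the end of a word, following the lexc convention.
LEXICONS = {
    # Stem lexicon: lemma on the upper side, lexical stem on the lower side.
    "Root": [("sepp", "seP", "Type_sepp")],
    # Inflection-type lexicon: allocates the stem variants in the paradigm.
    "Type_sepp": [
        ("", "", "NomSg"),
        ("", "a", "StrongStem"),
        ("", "$a", "WeakStem"),   # "$" marks the weak grade
    ],
    # Lexicons that add grammatical features and endings.
    "NomSg": [("+N+Sg+Nom", "", "#")],
    "StrongStem": [("+N+Sg+Part", "", "#")],
    "WeakStem": [("+N+Sg+Gen", "", "#"), ("+N+Pl+Nom", "d", "#")],
}


def generate(lexicon="Root", upper="", lower=""):
    """Yield every (upper, lower) string pair reachable from `lexicon`."""
    for up, low, continuation in LEXICONS[lexicon]:
        if continuation == "#":
            yield upper + up, lower + low
        else:
            yield from generate(continuation, upper + up, lower + low)


pairs = list(generate())
for analysis, lexical_form in pairs:
    print(analysis, lexical_form)
```

The lower-side strings produced here (for example seP$a) are still lexical representations. In the real system the two-level rules would map them to surface forms.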
For modelling derivation, a new solution
has been proposed: the two-level approach has been partly extended to the
upper side of the lexical transducer, namely to the lexical representations of
the lemmas of forms productively derivable from verb roots. The
proposed approach may be applied in describing the morphology of
languages where word stems are subject to change during
productive derivational processes.
It has been shown that the
two-level representation is useful for the description of Estonian
stem-internal changes, especially because the stem flexion type no
longer depends on the phonological shape of a stem in contemporary
Estonian. The network of lexicons, combined with rules
that take effect at morpheme boundaries, naturally describes the
morphotactic processes. The lexicons are also useful for describing
stem-final alternations that are not phonologically conditioned.
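The "many in one" encoding of stem variants can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the TWOLC rule set: the morphophonemes P and G and their realizations are hypothetical, and the context condition is reduced to "the weak-grade marker $ occurs somewhere in the lexical form", whereas real two-level rules constrain lexical-surface correspondences by their left and right contexts.

```python
# Hypothetical morphophonemes: symbol -> (strong-grade, weak-grade) surface string.
MORPHOPHONEMES = {
    "P": ("pp", "p"),  # sepp : sepa
    "G": ("g", ""),    # jalg : jala
}
WEAK_MARKER = "$"      # weak-grade marker, realized as nothing on the surface


def realize(lexical: str) -> str:
    """Map a lexical form to its surface form in this toy model."""
    weak = WEAK_MARKER in lexical
    surface = []
    for symbol in lexical:
        if symbol == WEAK_MARKER:
            continue  # the marker itself has no surface realization
        if symbol in MORPHOPHONEMES:
            strong_form, weak_form = MORPHOPHONEMES[symbol]
            surface.append(weak_form if weak else strong_form)
        else:
            surface.append(symbol)  # ordinary letters map to themselves
    return "".join(surface)


for form in ["seP", "seP$a", "sePa", "jalG", "jalG$a", "jalGa"]:
    print(form, "->", realize(form))
```

A single lexical entry such as seP thus covers both the strong-grade and the weak-grade stem, and the choice between them is made only when the lexical form is mapped to the surface.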
Open problems remain to be solved in the ongoing research on Estonian
finite-state morphology:
- To increase the coverage of root lexicons.
- To guess the analysis of unknown words. The idea, which has so far
been tested only on a very limited lexicon, is to have a root as a regular
expression (e.g. CVVCV) in the root lexicon for each productive
inflection type.
- To constrain the overgeneration of compound
words. The idea is to apply semantic features.
- To include the
finite-state component in practical applications. The most
interesting idea in this perspective is to work on fuzzy information
retrieval that is tolerant to misspellings and typos.
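The idea of a root given as a regular expression can be illustrated with a short Python sketch. It only shows how a shape template such as CVVCV might be expanded and matched against a candidate root. The letter classes are a simplifying assumption, and the function names are hypothetical. Linking a matched shape to a particular inflection type is left out, since that mapping is not specified here.

```python
import re

# Simplified letter classes for Estonian orthography (an assumption of this sketch).
VOWELS = "aeiouõäöü"
CONSONANTS = "bdfghjklmnprsštvzž"
CLASSES = {"C": f"[{CONSONANTS}]", "V": f"[{VOWELS}]"}


def shape_to_regex(shape: str) -> re.Pattern:
    """Expand a shape template such as 'CVVCV' into a compiled regular expression."""
    return re.compile("".join(CLASSES[symbol] for symbol in shape))


def matches_shape(word: str, shape: str = "CVVCV") -> bool:
    """Return True if the whole word fits the given consonant/vowel shape."""
    return shape_to_regex(shape).fullmatch(word.lower()) is not None


for candidate in ["kooli", "sepp", "koolid"]:
    print(candidate, matches_shape(candidate))
```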
K. Beesley, L. Karttunen. 2000. Finite-State Non-Concatenative
Morphotactics. In "Proceedings of SIGPHON-2000" 5th Workshop of the
ACL Special Interest Group in Computational Phonology, Centre
Universitaire, Luxembourg. 1-12.
H.-J. Kaalep, K. Muischnek. 2002. Eesti kirjakeele sagedussõnastik
(The frequency dictionary of written Estonian) University of Tartu
L. Karttunen. 1993. Finite-State Lexicon Compiler. Technical Report.
ISTL-NLTT-1993-04-02. April 1993. Xerox Palo Alto Research
Centre. Palo Alto, California.
K. Koskenniemi. 1983. Two-level Morphology: A General Computational
Model for Word-Form Recognition and Production. University of
Helsinki, Dept of General Linguistics. Publications No. 11. Helsinki.
Ü. Viks. 1992. A Concise Morphological Dictionary of Estonian I:
Introduction & Grammar. Tallinn.
Last modified: Thu Aug 11 12:56:14 EEST 2005