Building the Estonian finite state morphology

Heli Uibo, University of Tartu
heli.uibo@ut.ee
The research on modeling Estonian morphology with finite state devices has been influenced mostly by (Koskenniemi 1983), (Karttunen 1993) and (Beesley & Karttunen 2000). We have used a lexical transducer combined with two-level rules as a general model for describing Estonian morphology. A relatively novel aspect of the approach is that the rules are applied to both sides of the lexical transducer: to the lexical representation and to the lemma.

The key issue in modeling morphology is the productivity of all kinds of rules that participate in word inflection, derivation and compounding. If a rule is absolutely productive, it is easy to formalize either as a rule or as part of the network of lexicons. Exceptions always cause problems and often lead to inelegant solutions in the language description.

Estonian morphology is complex: the number of different inflected forms is 28 for nouns and about 70 for verbs. Moreover, parallel forms almost double these numbers for some inflection types. There are many different inflection types in Estonian: the minimal type system developed by Ülle Viks (Viks 1992) contains 38 types and 84 exceptions. Some of the inflection types are no longer productive. The Estonian language is both agglutinative and flective, so it is natural to model its morphotactics by a network of lexicons and its stem flexion by rules.

The principal problems and their solutions for building the Estonian finite state morphology are the following:

  1. Stem changes
    Solution: "many in one". All the possible stem variants are encoded as a single lexical entry, using lexical symbols (morphophonemes) that correspond to different phonemes on the surface.
    1. Stem internal changes are handled by lexical symbols. Two-level rules state the legal correspondences between lexical and surface phonemes, depending on the current morphophonological context.
    2. Stem final changes are mostly described by the means of continuation lexicons.
  2. Agglutinative processes occurring in the declension of nouns and the conjugation of verbs
    Solution: described by three layers of lexicons:
    1. a continuation lexicon for each inflection type
    2. allocation of stem variants in the paradigm
    3. addition of grammatical features and endings
  3. Inflection types: how much to describe by lexicons and how much by rules?
    Solution: the network of lexicons is based on the type system provided by (Viks 1992). Regular stem changes, phonotactics, orthography and morphophonological distribution have been described by rules. The system is a little unbalanced: the network of lexicons plays the major role. We could try to add more rules to reduce the workload of the lexicons, but on the other hand the human-readability of the system of lexicons is important for keeping the lexicons updatable. Also, as long as we use Viks's type system, we can reuse the automatic inflection type detection module developed for this particular system.
  4. Derivation
    1. Absolutely productive derivation is modeled within the network of lexicons. Problems:
      1. At which step is the upper side of the lexical transducer (primary form and word class) completed?
      2. How to handle the words with stem-internal changes?
    2. Partially productive derivation. Either include the derivatives in the lexicon or mark the words that can be derived in the specific ways. A suitable tool in finite state technology: flag diacritics.
    3. Unsolved problem: the balance between productivity and lexicalization. How complex is it to describe partially productive derivation types by minilexicons and continuation links (instead of including the derivatives in the stem lexicons as independent stems)? Which derivation types should be considered productive enough? Which formal features could be used to handle some derivational processes by rules?
  5. Compounding
    Solution: build the compounds using the continuation links between certain lexicons. Problem: overgeneration!
    We can use only the features present in the lexicon. What kind of features are essential? Is Estonian compounding generally formalizable at all? Proposed solutions: flag diacritics, finite state filters.
  6. Which additional features to encode in the lexical representation?
    1. strong/weak grade in consonant gradation: the weak grade is marked by $, similarly to (Koskenniemi 1983)
    2. degree of quantity (degrees II and III differ in the written form for stops (k, p, t : kk, pp, tt) only). This has not been done, but it could be useful both for text-to-speech systems and for determining the inflection type of a word.
  7. Should the description of Estonian morphology be as universal as possible or should it be oriented to specific language technology application(s)?
    For example, if we want the morphological description to be usable for automatic hyphenation, the word boundaries in compounds should be marked. Another example: spell checking is very sensitive to overgeneration, whereas information retrieval is not. At the moment, the description is not oriented to any specific application.
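
The flag-diacritic idea proposed above for partially productive derivation and for constraining compounds can be illustrated with a small simulation. This is only a sketch in Python, not the Xerox implementation: it interprets four of the flag-diacritic operators (`@P...@` set, `@R...@` require, `@D...@` disallow, `@C...@` clear) along one path of symbols. The feature name `CMP`, the value `OK` and the segmentation of the example words are assumptions made for illustration and do not come from the actual Estonian lexicons.

```python
import re

# Flag diacritics have the shape @OP.FEATURE@ or @OP.FEATURE.VALUE@.
FLAG = re.compile(r"@([PRDC])\.([A-Z]+)(?:\.([A-Z]+))?@")


def accept(path):
    """Return the surface string of a path, or None if a flag check fails."""
    features = {}   # current feature settings along the path
    surface = []
    for symbol in path:
        m = FLAG.fullmatch(symbol)
        if m is None:            # ordinary symbol: contributes to the surface
            surface.append(symbol)
            continue
        op, feat, val = m.groups()
        current = features.get(feat)
        if op == "P":            # positive set
            features[feat] = val
        elif op == "C":          # clear
            features.pop(feat, None)
        elif op == "R":          # require: feature must be set (to val, if given)
            if (current is None) if val is None else (current != val):
                return None
        elif op == "D":          # disallow: feature must not be set (to val)
            if (current is not None) if val is None else (current == val):
                return None
    return "".join(surface)


# Hypothetical marking: only stems flagged CMP.OK may start a compound.
good = ["@P.CMP.OK@", "raamatu", "@R.CMP.OK@", "kogu"]
bad = ["ja", "@R.CMP.OK@", "kogu"]
print(accept(good))  # raamatukogu
print(accept(bad))   # None
```

In this toy setting the path that starts with an unmarked word is rejected, which is the kind of filtering that could reduce compound overgeneration without adding separate lexicons.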

The Estonian finite state morphology has been implemented using the Xerox tools LEXC, TWOLC and XFST. There are 45 two-level rules. The network of lexicons covers all the inflection types. The stem lexicon contains the ca. 2500 most frequent word roots, based on the frequency dictionary of Estonian (Kaalep, Muischnek 2002). Additionally, the network of lexicons includes ca. 200 continuation lexicons, which describe stem final changes, noun declension, verb conjugation, derivation and compounding.
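
How a network of continuation lexicons pairs analyses with word forms can be sketched in a few lines of Python. The sketch below is not LEXC code and not the actual Estonian lexicon network: the lexicon names (`TypeU`, `TypeI`, `NomSg`, `VowelStem`), the tag strings and the two-noun stem lexicon are assumptions chosen for illustration. It loosely follows the layering described in the list above: a lexicon per inflection type, allocation of stem variants, then grammatical features and endings.

```python
# Each lexicon maps to entries (upper side, lower side, continuation lexicon).
# "#" marks the end of a word, as in LEXC.
LEXICONS = {
    "Root": [
        ("raamat", "raamat", "TypeU"),   # 'book'
        ("pilet", "pilet", "TypeI"),     # 'ticket'
    ],
    # One lexicon per (toy) inflection type; it allocates the stem variants.
    "TypeU": [("", "", "NomSg"), ("", "u", "VowelStem")],
    "TypeI": [("", "", "NomSg"), ("", "i", "VowelStem")],
    # Grammatical features (upper side) and endings (lower side).
    "NomSg": [("+N+Sg+Nom", "", "#")],
    "VowelStem": [
        ("+N+Sg+Gen", "", "#"),
        ("+N+Sg+Part", "t", "#"),
        ("+N+Pl+Nom", "d", "#"),
    ],
}


def expand(lexicons, name="Root", upper="", lower=""):
    """Yield every (analysis, word form) pair reachable from lexicon `name`."""
    if name == "#":
        yield upper, lower
        return
    for up, low, continuation in lexicons[name]:
        yield from expand(lexicons, continuation, upper + up, lower + low)


generate = dict(expand(LEXICONS))        # analysis -> word form
analyse = {}                             # word form -> list of analyses
for analysis, form in generate.items():
    analyse.setdefault(form, []).append(analysis)

print(generate["raamat+N+Sg+Gen"])       # raamatu
print(analyse["piletid"])                # ['pilet+N+Pl+Nom']
```

Because endings are shared through continuation links, adding a new stem of an existing type takes a single line in `Root`, which is the human-readability argument made for keeping much of the description in lexicons.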

For modeling derivation a new solution has been proposed: the two-levelness has been partly extended to the upper side of the lexical transducer, that is, to the lexical representations of the lemmas of forms productively derivable from verb roots. The proposed approach may be applied in describing the morphology of languages in which word stems are subject to change during productive derivational processes.

It has been shown that the two-level representation is useful for describing Estonian stem internal changes, especially because in contemporary Estonian the stem flexion type no longer depends on the phonological shape of the stem. The network of lexicons, combined with rules that take effect at morpheme boundaries, naturally describes the morphotactic processes. The lexicons are also useful for describing stem end alternations that are not phonologically conditioned.
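
The two-level treatment of stem internal changes can be illustrated with a toy generator. The Python sketch below is a simplification under stated assumptions: the morphophoneme symbols `G`, `P` and `D`, their surface correspondences and the convention of writing the weak-grade marker `$` directly after the gradating consonant are invented for this example and are not the 45 rules of the actual description. Real two-level rules are declarative constraints that work in both directions, whereas this sketch only maps lexical strings to surface strings.

```python
# Morphophoneme -> (surface in strong grade, surface in weak grade).
# Hypothetical symbols; "" stands for the zero (deleted) realisation.
MORPHOPHONEMES = {
    "G": ("g", ""),
    "P": ("p", ""),
    "D": ("d", "j"),
}
WEAK = "$"  # weak-grade marker, realised as zero on the surface


def realise(lexical):
    """Map a lexical string with morphophonemes to its surface form."""
    surface = []
    for i, symbol in enumerate(lexical):
        if symbol == WEAK:
            continue                                  # $:0
        if symbol in MORPHOPHONEMES:
            strong, weak = MORPHOPHONEMES[symbol]
            is_weak = lexical[i + 1:i + 2] == WEAK    # context: $ follows
            surface.append(weak if is_weak else strong)
        else:
            surface.append(symbol)                    # default: x:x
    return "".join(surface)


# One lexical stem ("jalG") covers both stem variants of 'foot'.
print(realise("jalG"))    # jalg  (nominative singular, strong grade)
print(realise("jalG$a"))  # jala  (genitive singular, weak grade)
print(realise("jalGa"))   # jalga (partitive singular, strong grade)
print(realise("sõD$a"))   # sõja  (genitive singular of sõda 'war')
```

The point of the sketch is the "many in one" encoding: the lexicon stores a single entry per stem, and the context decides which surface phoneme each morphophoneme receives.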

However, some open problems remain to be solved in the ongoing research on Estonian finite-state morphology:

  1. To increase the coverage of the root lexicons.
  2. To guess the analysis of unknown words. The idea, which has so far been tested only on a very limited lexicon, is to have a root as a regular expression (e.g. CVVCV) in the root lexicon for each productive inflection type.
  3. To constrain the overgeneration of compound words. The idea is to apply semantic features.
  4. To include the finite-state component in practical applications. The most interesting idea in this respect is to work on fuzzy information retrieval that is tolerant of misspellings and typos.
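
The guessing idea for unknown words can be sketched as follows, assuming that each productive inflection type is associated with one or more consonant/vowel templates such as CVVCV. The sketch is in Python rather than in the regular expression calculus of XFST, the type labels `TYPE_A` and `TYPE_B` are placeholders and not types of the Viks system, and the letter classes are a simplified assumption about Estonian orthography.

```python
import re

VOWELS = "aeiouõäöü"
CONSONANTS = "bdfghjklmnprsštvzž"   # simplified; foreign letters omitted


def pattern_to_regex(cv_pattern):
    """Turn a template such as 'CVVCV' into a compiled regular expression."""
    classes = {"C": f"[{CONSONANTS}]", "V": f"[{VOWELS}]"}
    return re.compile("".join(classes[ch] for ch in cv_pattern))


# Hypothetical guesser table: (root template, placeholder type label).
GUESSER = [
    ("CVVCV", "TYPE_A"),
    ("CVCVC", "TYPE_B"),
]


def guess_types(root):
    """Return the labels of all templates that match the whole root."""
    return [label for template, label in GUESSER
            if pattern_to_regex(template).fullmatch(root)]


print(guess_types("liiga"))  # ['TYPE_A']
print(guess_types("pilet"))  # ['TYPE_B']
print(guess_types("xyz"))    # []
```

In a lexical transducer the template would stand in the root lexicon in place of a concrete stem, so that a matching unknown root continues to the same continuation lexicons as the known roots of that type.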

References

K. Beesley, L. Karttunen. 2000. Finite-State Non-Concatenative Morphotactics. In "Proceedings of SIGPHON-2000" 5th Workshop of the ACL Special Interest Group in Computational Phonology, Centre Universitaire, Luxembourg. 1-12.

H.-J. Kaalep, K. Muischnek. 2002. Eesti kirjakeele sagedussõnastik (The frequency dictionary of written Estonian). University of Tartu Press, Tartu.

L. Karttunen. 1993. Finite-State Lexicon Compiler. Technical Report. ISTL-NLTT-1993-04-02. April 1993. Xerox Palo Alto Research Centre. Palo Alto, California.

K. Koskenniemi. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. University of Helsinki, Dept of General Linguistics. Publications No. 11. Helsinki.

Ü. Viks. 1992. A Concise Morphological Dictionary of Estonian I: Introduction & Grammar. Tallinn.


Last modified: Thu Aug 11 12:56:14 EEST 2005