Inducing a morphological transducer from inflectional paradigms

Lauri Carlson

Paradigm morphology

A traditional way to represent the morphology of inflectional languages is through paradigms. A paradigm (from the Greek word for 'example') is a table of the inflectional forms of an example word that represents a given inflectional class. The table is indexed by grammatical tags, and the items in the table cells are inflected forms. In linguistic morphology, this approach is known as the WP (Word and Paradigm) model.

Ideally, to find a given form of a new member of the same class, one substitutes the inflectional stem of the new word in place of the stem of the paradigm word and reads off the resulting form. In grammars intended for human consumption, the relation between the paradigm and its representatives may be subject to simple morphophonological rules.

In contrast to concatenative (IA, Item and Arrangement) morphology, the WP model does not define any correspondence between individual tags and morphs. For instance, the plural genitive of Latin nouns in -us is -orum, a single ending that expresses case and number together. In contrast to rule (IP, Item and Process) morphology, there is no explicit treatment of morphophonology.

Paradigms in a WP morphology can be identified by arbitrary labels (declension or conjugation number) or by a set of thematic forms which suffice to identify the paradigm. For instance, the Latin verb amo belongs to the first conjugation, identified by the series amo, amavi, amatum, amare.

Inflectional morphology

Murf is a program intended to induce, from traditional-style paradigm sets, a morphological transducer which (1) produces the forms in the paradigms, (2) does not produce any forms either explicitly or implicitly excluded from the paradigms, and (3) generalises common features of the paradigms, reducing redundancy in the paradigms.

The initial idea is quite simple. Murf reads in a set of tagged forms and tries to place each form in a finite state network, maximising the match of the new form with the existing network. The new form is matched with the existing network at both ends of the net. The match which leaves the least unmatched residue is chosen, and the missing part is added to the net as a new arc.
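The matching step can be illustrated with a minimal sketch. It rests on a simplification that is not part of Murf: the network is replaced by a plain list of forms already entered, and tags are ignored, whereas Murf matches forms and tags together. The helper names `common_prefix_len` and `best_residue` are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_residue(new_form, known_forms):
    """Return (prefix, residue, suffix) with the shortest unmatched residue.

    The prefix is matched against the start of some known form and the
    suffix against the end of some known form (assumed stand-in for
    matching at both ends of the net).
    """
    best = ("", new_form, "")
    for left in known_forms:
        p = common_prefix_len(new_form, left)
        for right in known_forms:
            s = common_prefix_len(new_form[::-1], right[::-1])
            s = min(s, len(new_form) - p)  # matches may not overlap
            cut = len(new_form) - s
            residue = new_form[p:cut]
            if len(residue) < len(best[1]):
                best = (new_form[:p], residue, new_form[cut:])
    return best


print(best_residue("taloissa", ["talossa"]))  # ('talo', 'i', 'ssa')
```

With talossa already entered, only the residue i of taloissa would need a new arc in this simplified setting.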

Given, for instance, a paradigm

Form        Tagging

talossa     talo 1 N SG INE
taloissa    talo 1 N PL INE
talona      talo 1 N SG ESS

Murf correctly infers that the plural essive form is taloina:

0: talo talo 1 N 4
 4  SG 5
  5 ssa INE 1.
  5 na ESS 1.
 4 i PL 5
5

(The number following the base form identifies the base as a member of a given paradigm.) As the net shows, Murf is able to infer a segmentation of the forms into morphs and tags the morphs appropriately. As a side effect of entering the attested form in the network, new, unattested forms may get generated through re-entrances in the net. Call such forms side effects.
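To make the re-entrance concrete, the net listed above can be walked mechanically. The sketch below is only one reading of the listing: it assumes that each line gives a source state, a surface morph, a tag and a target state, that an arc without a surface morph is an epsilon arc, and that "1." marks a transition to a final state 1. The names `ARCS`, `FINAL` and `generate` are hypothetical.

```python
# Arcs of the example net: state -> list of (surface morph, tag, next state).
# The encoding of the listing is an assumption, see the lead-in.
ARCS = {
    0: [("talo", "talo 1 N", 4)],
    4: [("", "SG", 5), ("i", "PL", 5)],
    5: [("ssa", "INE", 1), ("na", "ESS", 1)],
}
FINAL = 1


def generate(state=0, form="", tags=()):
    """Yield every (form, tagging) pair on a path from `state` to FINAL."""
    if state == FINAL:
        yield form, " ".join(tags)
        return
    for morph, tag, nxt in ARCS.get(state, []):
        yield from generate(nxt, form + morph, tags + (tag,))


forms = dict(generate())
print(forms["taloina"])  # talo 1 N PL ESS
```

Three of the four paths correspond to attested forms; the fourth, taloina, is the side effect.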

The initial idea needs a number of refinements to capture familiar morphological phenomena in real data. They include morphotax, complementary distribution, free variation, blocking, defective paradigms and productivity.

Morphotax concerns the admissible orders of tags in a well-formed word. The heuristic Murf follows here is that a proposed match of a new word is not allowed to produce unattested taggings. To guarantee that, Murf first forms a separate morphotax network of the taggings it has encountered. When a new form is considered for entry at a given place of the net, its side effects are checked for morphotax.
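A rough sketch of such a check follows. How the morphotax network generalises over the taggings is not specified here, so the sketch makes a loud assumption: the network is approximated by the set of tags attested in each position of a tagging. All function names are hypothetical.

```python
def build_morphotax(taggings):
    """Collect, per position, the tags attested there (assumed approximation
    of the morphotax network)."""
    slots = []
    for tagging in taggings:
        for i, tag in enumerate(tagging.split()):
            if i == len(slots):
                slots.append(set())
            slots[i].add(tag)
    return slots


def morphotax_ok(tagging, slots):
    """True if every tag of `tagging` is attested in its position."""
    tags = tagging.split()
    return len(tags) == len(slots) and all(t in s for t, s in zip(tags, slots))


def insertion_allowed(side_effect_taggings, slots):
    """An insertion is allowed only if all its side effects pass morphotax."""
    return all(morphotax_ok(t, slots) for t in side_effect_taggings)


slots = build_morphotax(["N SG INE", "N PL INE", "N SG ESS"])
print(morphotax_ok("N PL ESS", slots))  # True
print(morphotax_ok("N INE SG", slots))  # False
```

Under this approximation the tagging N PL ESS is admitted, because each tag is attested in its position, while a tagging with case before number is rejected.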

Complementary distribution is present when any given tagging is realised by just one form, although tags occurring in it have more than one allomorph. For instance, Finnish partitive endings tA and A are in complementary distribution: the former occurs after heavy syllables and the latter after light ones. Identically tagged forms, in contrast, are in free variation. Murf implements a complementary distribution check which prevents production of free variants as a side effect of insertion.

To allow genuine free variation past the complementary distribution check, it suffices to tag the variants as different. For instance, the Finnish third person possessive suffix has two forms, nsA and Vn, which are in free variation after light open syllables. They are tagged as P3/A and P3, respectively.
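The check and its escape hatch can be sketched as follows. The data structure (a dictionary from tagging to form) and the function name are assumptions, and the two taggings of talossaan and talossansa are constructed for illustration only.

```python
def violates_complementary_distribution(attested, side_effects):
    """True if a side effect gives an already realised tagging a second form.

    attested: dict mapping tagging -> form.
    side_effects: iterable of (form, tagging) pairs an insertion would add.
    """
    return any(
        tagging in attested and attested[tagging] != form
        for form, tagging in side_effects
    )


attested = {"talo 1 N SG INE P3": "talossaan"}

# Same tagging, different form: an unwanted free variant.
same_tag = [("talossansa", "talo 1 N SG INE P3")]
print(violates_complementary_distribution(attested, same_tag))  # True

# Tagged as P3/A, the variant passes the check.
other_tag = [("talossansa", "talo 1 N SG INE P3/A")]
print(violates_complementary_distribution(attested, other_tag))  # False
```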

Another distributional gap is that nouns do not occur in the comitative plural without a possessive suffix (adjectives do). To record such gaps, Murf allows the definition of separate networks for exceptions. For instance, the entry

*-	- N PL_COM

disallows a noun ending in plural comitative.

Blocking refers to the phenomenon that a lexicalised exception blocks a productive, regular rule. For instance, the Finnish nominative plural is talot, not taloi, as one might be led to expect from the previous data. Murf accounts for blocking in the following way. When a paradigm is read in, all forms in it are put on a waiting list. Whenever a form is inserted, forms on the waiting list are checked for blocking. An insertion is not allowed if it would produce a side effect blocked by a form on the waiting list.
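A minimal sketch of the waiting list test, assuming that a side effect counts as blocked when a waiting form carries the same tagging with a different surface form. The function name and the tagging strings are illustrative, not Murf's own.

```python
def is_blocked(side_effects, waiting_list):
    """True if any side effect competes with a waiting form for its tagging.

    Both arguments are lists of (form, tagging) pairs.
    """
    expected = {tagging: form for form, tagging in waiting_list}
    return any(
        tagging in expected and expected[tagging] != form
        for form, tagging in side_effects
    )


waiting = [("talot", "talo 1 N PL NOM")]
print(is_blocked([("taloi", "talo 1 N PL NOM")], waiting))    # True
print(is_blocked([("taloina", "talo 1 N PL ESS")], waiting))  # False
```

An insertion whose side effects include taloi would thus be rejected while talot is still waiting, whereas one that only generates taloina would go through.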

Some paradigms are defective in that some forms are missing from an expected cross classification. For instance, Finnish comitative and instrumental (instructive) cases only have one number (plural). From a combinatorial point of view, case and number form in these cases a portmanteau morph instead of two independent morphs. The most straightforward way of recording this gap in distribution is to make the tag combination PL_COM a tag on its own.

Productivity refers to the fact that certain forms by default generalise to new words, while others are by default restricted to a closed set of forms. (This fact is one of the main motivations of paradigm morphology in the first place.) For instance, Finnish nominals have productive vowel stems and less productive consonant stems. A new base form pokemon will automatically go in the productive vowel stem paradigm. Murf allows marking a variant as a nonproductive one as follows:

tienoisiin	tienoo 24 N PL ILL
tienoihin	tienoo 24 N PL_ILL/h!
tienoiden	tienoo 24 N PL GEN
tienoitten	tienoo 24 N PL_GEN/tt!

Nonproductive variants marked with ! will not be generalised into paradigms where they have not been specifically licensed by attested forms.
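The entry format shown above can be read with a few lines of code. This is a sketch under assumptions: an entry is taken to be a form followed by a base form, a paradigm number and a sequence of tags, the part after "/" is taken to be a variant label (as in P3/A), and a trailing "!" marks the tag as nonproductive. The function names are hypothetical.

```python
def parse_tag(tag):
    """Split a tag like 'PL_ILL/h!' into (name, variant, productive)."""
    productive = not tag.endswith("!")
    name, _, variant = tag.rstrip("!").partition("/")
    return name, variant or None, productive


def parse_entry(line):
    """Parse 'form  base paradigm-number tag...' into a dictionary."""
    form, base, number, *tags = line.split()
    return {
        "form": form,
        "base": base,
        "paradigm": int(number),
        "tags": [parse_tag(t) for t in tags],
    }


entry = parse_entry("tienoihin\ttienoo 24 N PL_ILL/h!")
print(entry["tags"])  # [('N', None, True), ('PL_ILL', 'h', False)]
```

A learner could then consult the `productive` flag before letting an arc be reused in a paradigm where the variant is not attested.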

Derivational morphology

Derivational morphology allows concatenating base forms coming from different paradigms. A derivational affix may be specific (at least) to a part of speech. For instance, the Finnish abessive adjective suffix tOn produces an adjective out of a noun.

Murf allows constraining derivational endings with a categorial grammar style tag format X\Y:

onneton	onni 8 N tOn N\A 57 A SG NOM

This constrains tOn to combine with nouns and produce adjectives. (Formally, X\Y is analogous to a portmanteau tag in that it constrains variation at a point in the net.)
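The effect of such a tag can be sketched as a small function. It assumes, in line with the explanation above, that X\Y accepts category X on its left and yields category Y; the function name is hypothetical.

```python
def apply_derivation(category, functor):
    """Apply a tag of the form X\\Y: return Y if `category` is X, else None."""
    source, sep, target = functor.partition("\\")
    if not sep:
        raise ValueError(f"not a categorial tag: {functor!r}")
    return target if category == source else None


print(apply_derivation("N", "N\\A"))  # A
print(apply_derivation("V", "N\\A"))  # None
```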

Experiments

Murf has been tested with Finnish nominal and verb paradigms. The classification of the Modern Dictionary of Finnish (original edition) has ca. 80 nominal and 50 verb paradigms. The set of 80 noun paradigms, producing around 20,000 forms, was coded into a nondeterministic transducer with around 600 states and 1700 arcs, almost 1000 of which were epsilon arcs. Only correct forms get produced.

With little space/time optimisation done so far, adding new paradigms gets slow toward the end of the process. Adding new words to existing paradigms can be made faster. It is also possible to use the net to guess the paradigms of unknown words on the basis of thematic forms.

Order sensitivity

The phenomenon of blocking makes Murf sensitive to the order in which forms are presented to it. If regular paradigms are presented before irregular ones, Murf tends to overgeneralise, and subregularities across irregular paradigms may get missed. The best strategy seems to be to start with paradigms which exhibit central regularities but make significant splits between regular and irregular sets of endings. In Finnish nouns, a good strategy proved to be to start with bisyllabic nouns in paradigm 40 (susi 'wolf', vesi 'water') whose local cases are regular, while irregular grammatical cases show stem allomorphy.

Discussion

Murf differs from some earlier approaches to learning morphology from data. Koskenniemi (1991) considers learning two-level morphophonological rules from surface alternations. Goldsmith (2000) presents a heuristic and a statistical evaluation procedure to find a morphological segmentation for a language from a raw text corpus. Murf, in contrast, expects fully tagged and classified paradigms prepared by a linguist and restricts itself to the task of converting the paradigms into a less redundant transducer form. From a linguistic point of view, Murf can be seen to implement some of the traditional principles of taxonomical morphemic analysis. More robust methods can be thought of to attack the transducer induction task as a purely computational problem.

References

  1. Koskenniemi, K. 1991. "A Discovery Procedure for Two-Level Phonology." In L. Cignoni and C. Peters (editors), Computational Lexicology and Lexicography: A Special Issue Dedicated to Bernard Quemada, 451-465.
  2. Goldsmith, J. 2000. "Unsupervised Learning of the Morphology of a Natural Language." University of Chicago. http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000/Paper/paper.html