In this paper we discuss problems pertaining to non-concatenative processes in Bantu languages. Verbs, for example, undergo productive derivation, reduplication, and inflection, and a verb can have up to 15 morpheme slots. While derivation and inflection can be handled as concatenation of morphemes, reduplication cannot. We are particularly interested in how to handle verbs in disjoining writing systems, where some of the verb morphemes are written as separate words. We suggest that in disjoining writing systems verb structures should first be identified with a specially constructed tokeniser and then analysed with a morphological analyser. In addition to identifying words, punctuation marks and diacritics, such a tokeniser must also identify sequences of 'words' that together form a verb. The test languages used in this study are Kwanyama (Hurskainen and Halme 2001) and Northern Sotho.
The tokenisation is carried out in two phases. In the first phase, verb constructions consisting of more than one 'word' are identified. To begin with, verb candidates in the text are marked. This marking is based on matching verb roots, excluding a small number of monosyllabic roots that cause excessive over-marking. The marked roots are then tested against other criteria of the verb, such as prefixes and suffixes that are written as separate words. If the test succeeds, the construction is marked as a verb. In the second phase, once verb constructions have been identified and marked, the conventional tokenisation operation is carried out.
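The two phases can be illustrated with a minimal Python sketch. It is not the tokeniser used in the study. The root list, the prefix list, the excluded root, the `_` join marker and all function names are assumptions made for illustration, and the toy forms are only loosely modelled on Northern Sotho. The sketch also tests only separately written prefixes, not suffixes.

```python
import re

# Toy data: illustrative assumptions, not the lexicon used in the study.
VERB_ROOTS = {"rek", "bon", "j"}
EXCLUDED_ROOTS = {"j"}  # short roots assumed to cause over-marking
SEPARATE_PREFIXES = {"ke", "o", "ba", "a", "tla"}
FINAL_VOWELS = ("a", "e")
JOIN = "_"  # hypothetical marker that replaces blanks inside a verb


def is_verb_candidate(word):
    """Phase 1a: mark a word as a verb candidate by matching its root."""
    if not word.endswith(FINAL_VOWELS):
        return False
    root = word[:-1]
    return root in VERB_ROOTS and root not in EXCLUDED_ROOTS


def mark_verb_constructions(text):
    """Phase 1b: join separately written prefixes to a following verb."""
    out, pending = [], []
    for chunk in text.split():
        core = chunk.rstrip(".,;:!?")
        tail = chunk[len(core):]
        low = core.lower()
        if is_verb_candidate(low):
            # The test succeeded: prefixes plus candidate form one verb.
            out.append(JOIN.join(pending + [core]) + tail)
            pending = []
        elif low in SEPARATE_PREFIXES and not tail:
            pending.append(core)  # possible verb prefix, decide later
        else:
            out.extend(pending)  # the run of prefixes led to no verb
            pending = []
            out.append(chunk)
    out.extend(pending)
    return " ".join(out)


def tokenise(text):
    """Phase 2: conventional tokenisation of the marked text."""
    return re.findall(r"\w+|[^\w\s]", mark_verb_constructions(text))


print(tokenise("Ke a reka dijo."))
```

Because the marked construction no longer contains blanks, a word-level analyser can treat it as a single token.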
The morphological analysis can now be carried out even with a parser that operates on a word level only, i.e. without crossing blanks. However, in order to handle other non-concatenative features, especially reduplication and restriction of morpheme co-occurrence, we have implemented the morphological analyser with xfst and lexc available in the Xerox tool package (Beesley and Karttunen 2003).
The stem of a Bantu verb, including the extended stem, may be reduplicated. Although only some verb stems, simple or extended, occur in reduplicated forms in practice, it is cumbersome to list them in the lexicon. A more adequate solution is offered by the Xerox tool package. The part of the verb subject to reduplication, i.e. the stem, can be formulated as a regular expression and then repeated. This temporary meta-language is then compiled and replaced by the surface language proper, an operation known as compile-replace. Tests show, however, that memory problems are encountered if the compile-replace operation is applied to a full-size dictionary. Various solutions to the memory problem were tested, including the under-specification of verb prefixes and handling the lexicon in parts so that only the verb stems were subjected to the compile-replace operation. Nevertheless, memory problems seem to prevail.
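The idea of writing the stem once in a temporary meta-language and then replacing it with the repeated surface form can be sketched in Python as follows. This is a string-level illustration only, not the Xerox implementation, which operates on finite-state networks and is where the memory problems arise. The `^[{stem}^2^]` notation loosely imitates the delimiters of the Xerox tools; the function names and the toy stem are assumptions.

```python
import re

# Meta-language pattern: ^[{STEM}^N^] means "STEM repeated N times".
META = re.compile(r"\^\[\{([^}]*)\}\^(\d+)\^\]")


def mark_reduplicated(stem):
    """Wrap a stem in the temporary meta-language (hypothetical helper)."""
    return "^[{" + stem + "}^2^]"


def expand(form):
    """Replace every meta-language expression with its surface string."""
    return META.sub(lambda m: m.group(1) * int(m.group(2)), form)


entry = mark_reduplicated("reka")  # toy stem, for illustration only
print(entry, "->", expand(entry))
```

The lexicon then needs only the plain stem and one reduplication rule, rather than a separate entry for each reduplicated form.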
The co-occurrence of morphemes located on different sides of the verb stem was restricted with flag diacritics. This also added to the memory requirements.
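How flag diacritics restrict co-occurrence across the stem can be sketched in Python. The sketch checks a single path of symbols and covers only a subset of flag types (P, U, R, D) with simplified semantics. The feature name `POL`, the toy morphemes and the function name are assumptions for illustration, not taken from the analyser described here.

```python
import re

# A flag looks like @OP.FEATURE.VALUE@ or @OP.FEATURE@.
FLAG = re.compile(r"@([UPRD])\.([^.@]+)(?:\.([^.@]+))?@")


def surface_if_valid(symbols):
    """Return the surface string of a path, or None if a flag blocks it."""
    features = {}
    surface = []
    for sym in symbols:
        m = FLAG.fullmatch(sym)
        if m is None:
            surface.append(sym)  # ordinary morpheme
            continue
        op, feat, val = m.groups()
        cur = features.get(feat)
        if op == "P":  # set the feature unconditionally
            features[feat] = val
        elif op == "U":  # unify: set if unset, else values must agree
            if cur is None:
                features[feat] = val
            elif cur != val:
                return None
        elif op == "R":  # require the feature (with this value, if given)
            if cur is None or (val is not None and cur != val):
                return None
        elif op == "D":  # disallow the feature (with this value, if given)
            if cur is not None and (val is None or cur == val):
                return None
    return "".join(surface)


# A toy negative prefix licenses a toy negative ending across the stem.
ok = ["@P.POL.NEG@", "ga", "rek", "@R.POL.NEG@", "e"]
bad = ["rek", "@R.POL.NEG@", "e"]
print(surface_if_valid(ok), surface_if_valid(bad))
```

The flags contribute nothing to the surface string; they only decide whether a path through the lexicon is accepted.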
Tests made with Bantu languages show that more research is needed to solve the memory problems in morphological analysis before a full-size analyser, including disambiguation, syntactic and semantic analysis, etc., is feasible.
Hurskainen, Arvi and Halme, Riikka, 2001. Mapping between Disjoining and Conjoining Writing Systems in Bantu Languages: Implementation on Kwanyama. Nordic Journal of African Studies, 10(3): 399-414.