Thematic Training Course on Processing Morphologically Rich Languages

11-15 April, 2011, at RIL in Budapest, Hungary

The University of Helsinki and the Research Institute for Linguistics, Hungarian Academy of Sciences will host a one-week training course on computational processing of morphologically rich languages.

This PhD level course is a part of the thematic training programme offered by the Marie Curie ITN project CLARA. The course consists of lectures, discussions, and hands-on training activities.

Aim and Focus

Morphologically rich languages such as Turkish, Finnish, Hungarian, and Sámi present significant challenges for natural language processing applications due to their relatively free word order and highly productive morphological and morphophonological processes (inflection, agglutination, compounding, vowel harmony). The course will address the resulting problems: large dictionary size, sparse data, poor language model probability estimation, high out-of-vocabulary rates, and information gaps on related lexical items.

The course will introduce advanced modelling techniques addressing these problems, such as decomposition of complex word forms into smaller units, relating inflectional variants to root forms (lemmatization), methods for optimizing the selection of units at different levels of processing, novel probability estimation techniques, and the creation of a new class of data resources and annotation tools. Newer finite-state techniques have proved useful and can be optimized in combination with other methods. The course will conclude with an assessment of present-day standard techniques and a demonstration of practical applications focusing primarily on Hungarian and Finnish.

The course focuses on practical issues and problem-solving strategies in building large-scale language technology (LT) tools for morphologically complex languages. To the extent that time permits, students will work through hands-on tutorials and exercises to gain a better understanding of the complexities involved in building such tools, and a better grasp of the methods taught to handle this complexity.


The course is open to external participants, who are very welcome. It is specifically aimed at PhD students, but is also open to candidates at master's and postdoctoral levels. The course is relevant for researchers who are interested in practical applications of finite-state transducers, or who want to use such transducers for unrestricted text processing. We expect participants to have basic knowledge of finite-state transducers and some familiarity with a Unix working environment.

Candidates who are affiliated with CLARA through one of its member institutions will be given priority, but we also welcome other national and international candidates. The CLARA consortium and its affiliated partners can be found on page:

ECTS credits

Participation in the course gives 5 ECTS credits to candidates at PhD level involved in the CLARA project.


The project CLARA - Common Language Resources and their Applications - has received research funding from the European Community within the Seventh Framework Programme, Marie Curie Actions.

