Under construction

2013-14: The Kone Foundation funds the AKU project (Avointa kieliteknologiaa uralilaisille vähemmistökielille 'Open-source Language Technology for Uralic Minority Languages'), in which we are creating finite-state morphological analyzers with Finnish-language glossing for five minority Uralic languages. Our aim over the two-year span (2013-2014) is to have a working morphological parser for each of them.
The finite-state transducers constructed for the morphological parsers can be used for many other purposes. They have been used in the free Voikko spell checker applications, morphology-savvy net dictionaries, "Oahpa" language learning programs and rule-based machine translation applications. Central components, such as the lemma lists with their Finnish glosses, give a broader public access to Uralic materials. Examples include the "Pelastusdigitointihanke" ('rescue digitization project') funded by the Kone Foundation, with its Uralica and Finno-Ugrian materials from the 1920s and 1930s (Veps, Ingrian, Mari and Mordvin), as well as the Digitization Project of Kindred Languages (Uralic languages), described below.

The languages selected for this project include both Balto-Finnic languages and other members of the Uralic language family. The primary languages targeted in 2013 are Hill Mari (urj-fiu-chm-mrj), Livonian (urj-fiu-xxx-liv), Moksha (urj-fiu-xxx-mdf), Nenets (urj-syd-yrk) and Olonets Karelian (urj-fiu-xxx-olo). Note the ISO 639-5 code urj (Uralic languages) at the start of each code.

The project adheres to open-source principles: in addition to working morphological parsers, free spell checkers and Finnish-language translations are produced for each of the five main languages (to the extent of 20,000 word stems).

The project is physically located at the University of Helsinki, and it is administered through the Department of Modern Languages.
The work is done in the Giellatekno infrastructure at the University of Tromsø, the Sámi language technology infrastructure, where Helsinki Finite-State Transducer Technology (HFST) is being applied to numerous facets of minority language revitalization. The robust morphological analyzers created there serve as the basis for open-source spell checkers that can be used with Voikko, the Finnish-language spell checker and hyphenator.

The primary languages funded by the Kone Foundation in Helsinki, with their ISO codes:
Hill Mari (urj:fiu:chm:mrj) with a Voikko spell checker application zhfst
speller-mrj.zhfst/download,
Livonian (urj:fiu:BALTO-FINNIC1:liv) with a Voikko spell checker application zhfst
speller-liv.zhfst/download,
Moksha (urj:fiu:MORDVINIC1:mdf) with a Voikko spell checker application zhfst
speller-mdf.zhfst/download,
Nenets (urj:syd:yrk) with a Voikko spell checker application zhfst
speller-yrk.zhfst/download,
Olonets or Livvi (urj:fiu:BALTO-FINNIC1:olo) with a Voikko spell checker application zhfst
speller-olo.zhfst/download.
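
As an illustration only, the colon-separated code paths in the list above can be read as a route from the family code (urj) down to the individual language code, and that last element recurs in the speller file name (speller-mrj.zhfst and so on). The Python sketch below assumes exactly this pattern and nothing more. The function names split_code_path and speller_filename are hypothetical and are not part of the Giellatekno, HFST or Voikko tooling.

```python
def split_code_path(path):
    """Split a code path such as 'urj:fiu:chm:mrj' into its levels."""
    levels = path.split(":")
    language = levels[-1]   # last element: the individual language code
    groups = levels[:-1]    # preceding elements: family and subgroups
    # Assumption: group names written in upper case are placeholders
    # for groups that have no ISO 639-5 code of their own.
    uncoded = [g for g in groups if g.isupper()]
    return {"language": language, "groups": groups, "uncoded_groups": uncoded}


def speller_filename(path):
    """Build a file name on the 'speller-xxx.zhfst' pattern seen in the list."""
    return "speller-" + split_code_path(path)["language"] + ".zhfst"


if __name__ == "__main__":
    print(split_code_path("urj:fiu:MORDVINIC:mdf"))
    print(speller_filename("urj:fiu:chm:mrj"))
```

Keeping the group levels separate from the language code makes it easy to spot which parts of a path are placeholders rather than standard codes.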

At Giellatekno in Tromsø, the Sámi language technology infrastructure, morphology-savvy net dictionaries have been developed. You can follow our progress there and reap the fruits of the work done there on Uralic languages (see note 2):
Sámi languages (urj:fiu:smi),
Balto-Finnic languages (urj:fiu:BALTO-FINNIC1),
Mordvinic languages (urj:fiu:MORDVINIC1),
Mari languages (urj:fiu:chm),
Permic languages (urj:fiu:PERMIC1),
Samoyedic languages (urj:syd).

Other Uralic languages whose morphological parsers are also being created or extended include the following:
Erzya (urj:fiu:MORDVINIC:myv) with a Voikko spell checker application zhfst
speller-myv.zhfst/download,
Northern literary Khanty (urj:fiu:UGRIAN:OB-UGRIAN:kca),
Ingrian (1930s) (urj:fiu:BALTO-FINNIC:izh),
Meadow Mari (The Meadow and Eastern Mari literary language) (urj:fiu:chm:mhr),
Veps (urj:fiu:BALTO-FINNIC:vep),
as well as
Komi-Zyrian (urj:fiu:PERMIC:kom:kpv) with a Voikko spell checker application zhfst
speller-kpv.zhfst/download,
Kven (urj:fiu:BALTO-FINNIC:fkv),
Nganasan (urj:syd:nio),
Udmurt (urj:fiu:PERMIC:udm) and
Võro (urj:fiu:BALTO-FINNIC:vro) in the SamEst project in Tartu, Estonia.

Documentation

Testing
Варчамонь нолдавкс
Вилдавны комиӧн кӧ

Activity

Parallel projects central to AKU (Open Language Technology for Minority Uralic Languages) include the Digitization Project of Kindred Languages. In it, readers from the 1930s are being digitized in Veps (urj:fiu:BALTO-FINNIC:vep), Ingrian (urj:fiu:BALTO-FINNIC:izh), Hill Mari (urj:fiu:chm:mrj), Meadow Mari (urj:fiu:chm:mhr), Erzya (urj:fiu:MORDVINIC:myv) and Moksha (urj:fiu:MORDVINIC:mdf), as well as newspapers from the 1920s and 1930s in the Mari languages (urj:fiu:chm:X) and the Mordvinic languages (urj:fiu:MORDVINIC:X, including the Shoksha language form), for use by both researchers and interested citizens.
Now you can page through the newspapers and readers, download them and search them freely: uralica or fennougrica.
Statistics
OXTs for Voikko spellchecker betas
1000 word form heads

1 Macro language names written in upper-case lack ISO 639-5 codes.
ISO code issues in Uralic languages.
2 The Giellatekno infrastructure serves as a home for many language projects, not only in the circumpolar region but also further south.


Contact Jack Rueter: First name dot last name at helsinki dot fi.


Last modified: Thu Jun 8 9:26:17 EEST 2006