COMPUTER CORPORA AT THE UNIVERSITY OF HELSINKI CORPUS SERVER

Indo-European languages, Turkic languages, Uralic languages, Tungusic languages, Mongolic languages, Chukotko-Kamchatkan languages, North East Caucasian languages, Afro-Asiatic and Niger-Congo languages, Lexical databases, Multilingual databases


The University of Helsinki Language Corpus Server (UHLCS) contains machine-readable linguistic data and basic services for using these data. The UHLCS, which is administratively organized under the Department of General Linguistics at the University of Helsinki, was founded late in 1980. Since the beginning of September 2007, the UHLCS has been available at the CSC (the Finnish IT Center for Science). Many of the corpora were compiled and edited by language departments at the University of Helsinki. At present, the UHLCS contains computer corpora of more than 50 languages, including samples of minority languages and extensive corpora representing different text types.



Computer corpora of the Indo-European languages


  1. Iranian languages:

  2. Greek:

  3. Latin:

  4. Germanic languages:

  5. Slavonic languages:



Computer corpora of the Turkic languages


  1. The south-west branch (the Oghuz group):

  2. The north-west branch (the Kipchak group):

  3. The south-east branch (the Karluk group):

  4. The north-east branch:

    • The Yenisey Turkic languages:

  5. The Bulghar (Oghur) group:





Computer corpora of the Uralic languages


  1. Saami languages:
    North Saami, Ume Saami, and Kildin Saami.

  2. Baltic-Finnic languages:
    Finnish, Dvina Karelian, Ingrian, Livvi (Olonets Karelian), Lude, Veps,
    Estonian, and Liv (Livonian).

  3. Mordvin languages:
    Erzya and Moksha.

  4. Mari languages:
    East Mari and West Mari.

  5. Permic languages:

  6. Ugric languages:

  7. Samoyedic languages:


* List of tags used in encoding morpho-syntactic information in the analysis of the Uralic languages

* Samples of the corpora of the Uralic languages




Computer corpora of the Tungusic languages


  1. Northern group:

  2. Middle group:

* Samples of the corpora of various languages




Computer corpora of the Mongolic languages




Computer corpora of the Chukotko-Kamchatkan languages


Chukotko-Kamchatkan group:

* Samples of the corpora of the Chukotko-Kamchatkan languages



Computer corpora of the North East Caucasian languages


  1. The Avar-Andi-Tsez group:

  2. The Lak-Dargva group:

  3. The Lezgian group:





Computer corpora of Afro-Asiatic and Niger-Congo languages


  1. Afro-Asiatic languages:


  2. Niger-Congo languages:



Lexical and Multilingual Databases






Multilingual Databank
P.S. 2002; 2007; Last modified: Thu Jan 22 16:35:17 EET 2009