COMPUTER CORPORA AT THE UNIVERSITY OF HELSINKI CORPUS SERVER
Indo-European languages,
Turkic languages,
Uralic languages,
Tungusic languages,
Mongolic languages,
Chukotko-Kamchatkan languages,
North East Caucasian languages,
Afro-Asiatic and Niger-Congo languages,
Lexical data bases
The University of Helsinki Language Corpus Server (UHLCS) contains
machine-readable linguistic data and basic services for using these
data. The UHLCS, which is administratively organized under the
Department of General Linguistics at the University of Helsinki, was
founded late in 1980. Since the beginning of September 2007, the UHLCS
has been available at the CSC (the Finnish IT Center for Science). Many
of the corpora have been compiled and edited by several language
departments at the University of Helsinki. At present, the UHLCS
contains computer corpora of more than 50 languages, including samples
of minority languages and extensive corpora representing different
text types.
Computer corpora of the Indo-European languages
- Iranian languages:
- Greek:
- Latin:
- Germanic languages:
- Slavonic languages:
Computer corpora of the Turkic languages
- The south-west branch (the Oghuz group):
- The north-west branch (the Kipchak group):
- The south-east branch (the Karluk group):
- The north-east branch:
  The Sayan Turkic languages:
  The Yenisey Turkic languages:
- The Bulghar (Oghur) group:
Computer corpora of the Uralic languages
- Saami languages:
  - North Saami, Ume Saami, and Kildin Saami.
- Baltic-Finnic languages:
  - Finnish, Dvina Karelian, Ingrian, Livvi (Olonets Karelian), Lude, Veps,
  - Estonian, and Liv (Livonian).
- Mordvin languages:
  - Erzya and Moksha.
- Mari languages:
  - East Mari and West Mari.
- Permic languages:
- Ugric languages:
- Samoyedic languages:
List of tags used in encoding morpho-syntactic information in the
analysis of the Uralic languages.
Samples of the corpora of the Uralic languages
Computer corpora of the Tungusic languages
- Northern group:
- Middle group:
Samples of the corpora of various languages
Computer corpora of the Mongolic languages
Computer corpora of the Chukotko-Kamchatkan languages
Chukotko-Kamchatkan group:
Samples of the corpora of the Chukotko-Kamchatkan languages
Computer corpora of the North East Caucasian languages
1. The Avar-Andi-Tsez group:
1.1. The Avar-Andi group:
- Avar
2. The Lak-Dargva group:
3. The Lezgian group:
Computer corpora of Afro-Asiatic and Niger-Congo languages
- Afro-Asiatic languages:
  Semitic languages:
  - North-West Semitic languages, Central group: Hebrew.
- Niger-Congo languages:
  Benue-Congo languages:
  - Bantu languages: Swahili.
Lexical Data Bases
Multilingual Databank
© P.S. 2002; 2007; Last modified: Thu Jan 22 16:35:17 EET 2009