MULTILINGUAL RESOURCE COLLECTION
OF THE UNIVERSITY OF HELSINKI LANGUAGE CORPUS SERVER (UHLCS)

University of Helsinki, Department of General Linguistics

Information on the new location of the UHLCS

The UHLCS contains machine-readable linguistic data and basic services for using these data. The UHLCS, which is maintained by the University of Helsinki, was founded late in 1980. The first corpora at the UHLCS were Finnish, English, and Swedish corpora. One of the first linguistic corpora in the UHLCS was the HKV-corpus, a syntactically analyzed corpus of Finnish (Hakulinen, Karlsson & Vilkuna 1980). Many of the corpora were compiled and edited by language departments at the University of Helsinki.

At present, the UHLCS contains computer corpora of more than 50 languages, including samples of minority languages and extensive corpora representing different text types. The corpora have been provided by various institutions and individual data providers. There are very large corpora of Finnish, Swedish, English, German, Latin, Russian and Swahili; already at the beginning of the 1990s, the Finnish, English and Swahili corpora, for example, totaled several million words (Helsinki Corpora I). The UHLCS also contains samples of morphologically analyzed corpora of most of the Uralic languages (cf. Suihkonen 1998), and corpora of numerous languages spoken in Europe and North and Central Asia (LENCA-group) (cf. Helsinki-Corpora II).

The corpora of Finnish, Swahili, and Swedish are examples of the results of corpus linguistic research done at the University of Helsinki. The Russian corpora are examples of electronic linguistic data prepared by international research groups. Many of the corpora compiled at the University of Helsinki were prepared in the course of various projects. In 2000, the corpora of the Uralic, Turkic, Tungusic, Mongolic, Chukotko-Kamchatkan, Iranian and North-Eastern Caucasian languages were edited for public use with the financial support of the Max Planck Institute for Evolutionary Anthropology, Leipzig.

The corpora located at the UHLCS are organized by language family, and every corpus has a metadata description, which is also linked to the description of the corpus. In summer 2003, the first metadata descriptions for the corpora were prepared with the financial support of the ECHO-project (ECHO = European Cultural Heritage Online).

The UHLCS also contains tools that can be used in analyzing the corpora, including the morphological analyzer of Finnish (cf. Koskenniemi 1983 below) and some concordance tools. The UHLCS runs on a UNIX operating system, and all the tools available on that system can also be used at the UHLCS (cf. also information on the other IT-services). The use of most of the data located at the UHLCS is restricted to research and teaching. When new data are received at the UHLCS, the copyrights and authors' rights of the corpora are specified in contracts. Access to the data at the UHLCS requires a special computer account. The UHLCS is open to data providers.

Note: change of address. From the beginning of September 2007, further information on the technical facilities must be requested from the administration of the CSC (Information on the corpora and the tools at the CSC).

One of the functions of the UHLCS has been to serve as a databank for material collected from endangered minority languages. The function and importance of electronic material for endangered languages are described in the following words:

    "Only languages for which adequate language resources, products and systems have been developed will be available over the Information Society network. On the worst hypothesis, citizens who are not able to communicate in the languages implemented in the global network would be denied full participation in the Information Society. Authoritative sources have already warned that languages for which language technology will not be adequately developed run the risk of losing their status as media of communication in the Information Society: because languages and cultures are inextricably linked, that will seriously threaten one of our most valuable human assets, linguistic and cultural diversity. To avoid this danger it is necessary to support multilinguality." (Antonio Zampolli in Rubio A. & al. 1998. First International Conference on Language Resources and Evaluation. Elra. Granada. Cit. Frantishek Chermak, Nov. 1998, Lecture, Mathesius Courses, Prague).



References

* Hakulinen, Auli, Fred Karlsson, and Maria Vilkuna. 1980. Suomen tekstilauseiden piirteitä: kvantitatiivinen tutkimus. Publications 6. Helsinki: Department of General Linguistics, University of Helsinki.

* Koskenniemi, Kimmo. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Publications 11. Helsinki: Department of General Linguistics, University of Helsinki.

* Suihkonen, Pirkko. 1998. Documentation of the Computer Corpora of the Uralic Languages at the University of Helsinki. Technical Reports TR-2. Helsinki: Department of General Linguistics, University of Helsinki.




© P.S. 2002; 2007; Last modified: Tue Nov 25 15:14:42 EET 2008