Appendix 3:
The computer corpora prepared and/or received during
the project SA 1013 4233, in 1996-1998
This appendix describes the machine-readable data (computer corpora) received during the project SA 1013 4233 in 1996-1998 for adaptation to the University of Helsinki Language Corpus Server (UHLCS) at the Department of General Linguistics. Most of the corpora have been adapted to the UHLCS. The corpora, which will be in public use, are prepared for research and teaching.
The computer corpora given or to be given by the project participants:
1. Laanest, A. (1966). Isuri murdetekste. Tallinn. Chapters 1-18, 28, 32, 37-38, 42-44. The running texts and the morphologically coded word forms have been translated into English.
2. Laanest, A. (1966). Isuri murdetekste. Tallinn. Chapters 1-44. The questions to the informants are included in the texts. The morphologically coded word forms have been translated into English.
3. Virtaranta, Pertti (1967). Lähisukukielten lukemisto. Suomalaisen Kirjallisuuden Seuran Toimituksia 280. Helsinki. 158-165.
4. Nirvi, R.E. In Virtaranta, Pertti (1967). Lähisukukielten lukemisto. Suomalaisen Kirjallisuuden Seuran Toimituksia 280. Helsinki. 138-150. The morphologically coded word forms have been translated into English.
The Kamas corpus is edited by Jarmo Alatalo. The corpus totals 33045 words and 18701 characters. The number of coded word forms is 2770. The corpus is morphologically coded and translated into German by Jarmo Alatalo. The corpus contains the following texts.
Donner, Kai. Manuscripts. In A.J. Joki (ed.) (1944): Kai Donners Kamassisches Wörterbuch nebst Sprachproben und Hauptzügen der Grammatik. Lexica Societatis Fenno-ugricae VIII. Suomalais-Ugrilainen Seura. Helsinki.
The corpus of the Ludian dialect is edited and transliterated by Miikul Pahomov. The size of the corpus is ca. 42390 words including punctuation (285467 characters in total). The texts, which are written in the Finno-Ugric transcription, must be adapted to the Unicode system.
Merja Salo has edited the computer corpora of Khanty. The corpora include the following texts.
The text was originally received from the Institute for Bible Translation.
The corpora, which were received in May 1999, will be adapted to the University of Helsinki Language Corpus Server. The Unicode system will be needed to adapt the Russian translation of the corpus.
Jack Rueter has compiled and edited a new text corpus of Komi Zyrian. The corpus contains the following novel:
Toropov, Ivan (1969). Jujas da s'ölömjas ('Rivers and Hearts'). Syktyvkar.
The size of the corpus is ca. 44 900 words. The corpus, which is adapted to the University of Helsinki Language Corpus Server, will be converted into the Cyrillic alphabet as soon as Unicode is available on the corpus server. The dictionary of Komi that Rueter is preparing for the two-level model of Komi is also adapted to the University of Helsinki Language Corpus Server. The dictionary is not yet in public use.
The computer corpus of Liv includes a sample of Livonian texts (ca. one page) edited by Seppo Suhonen. The text is written in the Finno-Ugric transcription system.
The computer corpora of Hill Mari are compiled and edited by André Hesselbäck. The size of the corpora is ca. 3 800 words. The corpora include the following texts.
Ramstedt, G.J. 1902. Bergtscheremissische Sprachstudien. Mémoires de la Société Finno-Ougrienne 17. Finno-Ugrian Society. Helsinki. 169-201.
The texts are translated into English and German. The corpus is available on the UHLCS.
During the project, and also as private work, Jack Rueter has compiled and edited new text corpora of Erzya Mordvin. He has delivered the following novel and short stories to the UHLCS.
(a) A novel: Kuzjma Abramov. Isjak jakinj Najmanov. Saransk. (2nd edition; 1st edition Najman).
(b) Short stories: (1) Vasili Arapov. Ashtema kov. Saransk 1995, (2) Pjotr Kljuchagin. Cjokanka. Saransk 1997.
The size of the corpora totals ca. 142 700 words.
The computer corpora donated by Dennis Estill consist of word lists of Erzya and Moksha Mordvin. The word lists are published as follows:
Feoktistov, A.P. (ed.) (1971). Russko-mordovskiy slovar. Iz istorii otechestvennoy leksikografii. Izdatelstvo Nauka. Moskva.
The word lists were originally collected at the initiative of Catherine the Great under the direction of Bishop Damaskin and recorded in 1785. The word list of Erzya contains approx. 23,500 words and that of Moksha 300 words. Later, the word lists will be converted to Unicode.
The Tundra Nenets computer corpus is a sample sentence corpus compiled by Tapani Salminen. The size of the corpus is ca. 9 992 sentences and 39 415 words. The source of the corpus:
N. M. Tereshchenko. 1965. Nenecko-russkij slovar´. Sovetskaja Enciklopedija. Moskva.
The computer corpus of Ume Saami is compiled and edited by Olavi Korhonen. The size of the corpus is ca. 18 200 words. The corpus is morphologically coded and translated into Swedish. The Ume Saami texts consist of data recorded and transliterated by Olavi Korhonen.
The writer Kerttu Vuolab has donated the following book to the University of Helsinki Language Corpus Server both as a paper copy and in machine-readable form.
Vuolab, Kerttu (1994). Cheppari cháráhus. Davvi Girji. Karasjok. ISBN 82-7374-229-6. 105 pp.
The agreement on the use of the corpus between Kerttu Vuolab and the University of Helsinki, Department of Linguistics was signed in May 1999. The corpus will be adapted to the University of Helsinki Language Corpus Server.
The computer corpus of Selkup contains the following texts:
The corpora also include data from the Ivankino, Tundra, Tym and Upper Ob dialects. The computer corpora of Selkup are morphologically coded and translated into German. The corpora are compiled and edited by Jarmo Alatalo. The size of the Selkup corpora is ca. 470 000 words. This figure covers the Selkup sentences, their German translations, and the coded word forms.
The computer corpus of Chuvash will contain samples from the following texts:
An Uzbek-English dictionary compiled and edited by Daniel Kimmage (Uzbekistan and Russia) was received from him in 1999. The size of the dictionary is approx. 3000 words. Daniel Kimmage has donated the dictionary for use in research and teaching.