Appendix 3:
The computer corpora prepared and/or received during the project SA 1013 4233, in 1996-1998

DATABANK FOR
ENDANGERED FINNO-UGRIC LANGUAGES

WORKING REPORT

Pirkko Suihkonen


The Computer Corpora Prepared and/or Received during the Project SA 1013 4233, in 1996-1998

This appendix describes the machine-readable data (computer corpora) received during project SA 1013 4233 in 1996-1998 for adaptation to the University of Helsinki Language Corpus Server (UHLCS) at the Department of General Linguistics. Most of the corpora have already been adapted to the UHLCS. The corpora, which will be in public use, are prepared for research and teaching.

The computer corpora given or to be given by the project participants:

  1. Uralic Languages
    1. Ingrian

      The Ingrian corpora are edited by Manja Lehto (Uppsala). The size of the corpora is ca. 30 000 words. The corpora contain the following texts:

      1. Laanest, A. (1966). Isuri murdetekste. Tallinn. Chapters 1-18, 28, 32, 37-38, 42-44. The running texts and the morphologically coded word forms have been translated into English.

      2. Laanest, A. (1966). Isuri murdetekste. Tallinn. Chapters 1-44. The questions to the informants are included in the texts. The morphologically coded word forms have been translated into English.

      3. Virtaranta, Pertti (1967). Lähisukukielten lukemisto. Suomalaisen Kirjallisuuden Seuran Toimituksia 280. Helsinki. 158-165.

      4. Nirvi, R.E. In Virtaranta, Pertti (1967). Lähisukukielten lukemisto. Suomalaisen Kirjallisuuden Seuran Toimituksia 280. Helsinki. 138-150. The morphologically coded word forms have been translated into English.

    2. Kamas

      The Kamas corpus is edited by Jarmo Alatalo. The corpus totals 33045 words and 18701 characters. The number of coded word forms is 2770. The corpus is morphologically coded and translated into German by Jarmo Alatalo. The corpus contains the following texts:

      Donner, Kai. Manuscripts. In A.J. Joki (ed.) (1944): Kai Donners Kamassisches Wörterbuch nebst Sprachproben und Hauptzügen der Grammatik. Lexica Societatis Fenno-ugricae VIII. Suomalais-Ugrilainen Seura. Helsinki.

    4. Karelian: Lude

      The corpus of the Ludian dialect is edited and transliterated by Miikul Pahomov. The size of the corpus is ca. 42390 words including punctuation (285467 characters in total). The texts, which are written in the Finno-Ugric transcription, must be converted to Unicode.

    5. Khanty

      Merja Salo has edited the computer corpora of Khanty. The corpora include the following texts:

      1. Life of Jesus in Khanty (the Kazim dialect). ISBN 952-9790-40-6, ISBN 91-88794-83-0. Institute for Bible Translation. Stockholm & Helsinki 1997.

        The text was originally received from the Institute for Bible Translation.

      2. Steinitz, Wolfgang (1950). Ostjakische Grammatik und Chrestomathie mit Wörterverzeichnis. 112-113 (Synja). Otto Harrassowitz, Leipzig.
      3. Steinitz, Wolfgang (1989). Ostjakologische Arbeiten III. Texte aus dem Nachlass. S. 317-323 (Atlym), 327-386 (Nizjam), 515-522 (Synja), 525-555 (Obdorsk). Mouton de Gruyter, Berlin, New York.
      4. Rugin, R.P. (1990). Shum jo

      There are six different versions of text number 4: (1) the original text in the Cyrillic alphabet; (2) the same text transliterated into the Latin alphabet; the same text translated into (3) Finnish, (4) English and (5) Russian; and (6) the original text in the Latin alphabet, morphologically coded and translated into English.

      The corpora, which were received in May 1999, will be adapted to the University of Helsinki Language Corpus Server. Unicode will be needed to handle the Russian translation of the corpus.

    6. Komi Zyrian

      Jack Rueter has compiled and edited a new text corpus of Komi Zyrian. The corpus contains the following novel:

      Toropov, Ivan (1969). Jujas da s'ölömjas ('Rivers and Hearts'). Syktyvkar.

      The size of the corpus is ca. 44 900 words. The corpus, which has been adapted to the University of Helsinki Language Corpus Server, will be converted into the Cyrillic alphabet as soon as Unicode is available on the corpus server. The dictionary of Komi that Rueter is preparing for the two-level model of Komi has also been adapted to the University of Helsinki Language Corpus Server. The dictionary is not yet in public use.

    8. Liv (Livonian)

      The computer corpus of Liv includes a sample of Livonian texts (ca. one page) edited by Seppo Suhonen. The text is written in the Finno-Ugric transcription system.

    9. Mari: Western Mari

      The computer corpora of Hill Mari are compiled and edited by André Hesselbäck. The size of the corpora is ca. 3 800 words. The corpora include the following texts:

      Ramstedt, G.J. 1902. Bergtscheremissische Sprachstudien. Mémoires de la Société Finno-Ougrienne 17. Finno-Ugrian Society. Helsinki. 169-201.

      The texts are translated into English and German. The corpus is available on the UHLCS.

    11. Mordvin: Erzya and Moksha

      1. The computer corpora edited by Jack Rueter

        During the project, and also as private work, Jack Rueter has compiled and edited new text corpora on Erzya Mordvin. He has delivered the following novel and short stories to the UHLCS.

        (a) A novel: Kuzjma Abramov. Isjak jakinj Najmanov. Saransk. (2nd edition; 1st edition Najman).

        (b) Short stories: (1) Vasili Arapov. Ashtema kov. Saransk 1995, (2) Pjotr Kljuchagin. Cjokanka. Saransk 1997.

        The size of the corpora totals ca. 142 700 words.

      2. The computer corpora edited by Dennis Estill

        The computer corpora donated by Dennis Estill consist of word lists of Erzya and Moksha Mordvin. The word lists are published in:

        Feoktistov, A.P. (ed.) (1971). Russko-mordovskiy slovar. Iz istorii otechestvennoy leksikografii. Izdatelstvo Nauka. Moskva.

        The word lists were originally collected at the initiative of Catherine the Great under the direction of Bishop Damaskin and recorded in 1785. The Erzya word list contains approx. 23,500 words and the Moksha word list 300 words. The word lists will later be converted to Unicode.

    12. Nenets

      The Tundra Nenets computer corpus is a sample sentence corpus compiled by Tapani Salminen. The size of the corpus is ca. 9 992 sentences and 39 415 words. The source of the corpus:

      N. M. Tereshchenko. 1965. Nenecko-russkij slovar´. Sovetskaja Enciklopedija. Moskva.

    13. Saami
      1. Ume Saami

        The computer corpus of Ume Saami is compiled and edited by Olavi Korhonen. The size of the corpus is ca. 18 200 words. The corpus is morphologically coded and translated into Swedish. The Ume Saami texts consist of data recorded and transliterated by Olavi Korhonen.

      2. North Saami

        The writer Kerttu Vuolab has donated the following book to the University of Helsinki Language Corpus Server both as a paper copy and in machine-readable form.

        Vuolab, Kerttu (1994). Cheppari cháráhus. Davvi Girji. Karasjok. ISBN 82-7374-229-6. 105 pp.

        The agreement on the use of the corpus between Kerttu Vuolab and the University of Helsinki, Department of Linguistics was signed in May 1999. The corpus will be adapted to the University of Helsinki Language Corpus Server.

    14. Selkup

      The computer corpus of Selkup contains the following texts:

        h-dialects: (1) Kai Donner: fairy-tales (manuscripts); A.P. Dulson: Jazyki I Toponimija Sibiri I. Tomsk 1966 and Szabó, László: Szölkup szövegek szójegyzékkel (Tymi nyelvjárás). Nyelvtudományi Közlemények 68. Budapest 1966.

        Ket-dialect: Texts collected by Kai Donner in 1911-1912. In addition to these texts, there are also texts collected by N. P. Grigorovski, László Szabó and E. G. Becker.

      The corpora also include data from the Ivankino, Tundra, Tym and Upper Ob dialects. The computer corpora of Selkup are morphologically coded and translated into German. The corpora are compiled and edited by Jarmo Alatalo. The size of the Selkup corpora is ca. 470 000 words. This figure covers the Selkup sentences, their German translations, and the coded word forms.

  2. The Turkic Languages
    1. Chuvash

      The computer corpus of Chuvash will contain samples from the following texts:

        Gebräuche und Volksdichtung der Tschuwassen. Gesammelt von Heikki Paasonen, herausgegeben von Eino Karahka und Martti Räsänen. Mémoires de la Société Finno-Ougrienne XCIV. Suomalais-Ugrilainen Seura. Helsinki 1949.

      The computer corpus of Chuvash, which is compiled, edited and morphologically coded by André Hesselbäck, will be received from him later in 1999. The corpus will be adapted to the University of Helsinki Language Corpus Server for public use.

    2. Uzbek

      An Uzbek-English dictionary compiled and edited by Daniel Kimmage (Uzbekistan and Russia) was received from him in 1999. The size of the dictionary is approx. 3000 words. Daniel Kimmage has donated the dictionary for use in research and teaching.