COMPUTER CORPORA OF THE URALIC LANGUAGES

North Saami, Ume Saami, Kildin Saami, Finnish, Dvina Karelian (North Karelian), Ingrian, Lude, Livvi (aunuksenkarjala), Veps, Estonian, Livonian, Erzya, Moksha, East Mari, West Mari, Komi Zyrian, Komi Permjak, Udmurt, Khanty, Mansi, Tundra Nenets, Enets, Kamass and Selkup.

Metadata descriptions of the Uralic languages

The North Saami computer corpus

The North Saami computer corpus contains the following documents:

  1. Committee report:
    Komiteanmietintö 1985: 66.
    Sámikultuvradoaibmagotti smiehttamush, pp. 1-140.
    Opetusministeriö (Ministry of education in Finland).
    Valtion painatuskeskus, Helsinki 1990.
    The document has been donated to the University of Helsinki Language Corpus Server by Irja Seurujärvi-Kari.

    Document type: running text.
    Size of the corpus: 28,664 words, 214,774 characters.
    Character encoding: ISO 8859-1 (Latin-1).

    For further information, please contact: the address of the contact person.

  2. Cheppari charahus:
    Vuolab, Kerttu (1994).
    Cheppari cháráhus.
    ISBN 82-7374-229-6. 105 pp.
    Davvi Girji.
    Karasjok.
    The document has been donated to the University of Helsinki Language Corpus Server by Kerttu Vuolab.

    Document type: running text.
    Size of the corpus: 17,830 words, 129,034 characters.
    Character encoding: ISO 8859-1 (Latin-1).

The corpora located at the University of Helsinki Language Corpus Server can be used in research and teaching. The corpora must be cited in any paper that uses them as data. For further information, please contact: the address of the contact person.

Metadata descriptions for the North Saami Report and the Novel



The Ume Saami computer corpus

The Ume Saami computer corpus contains the following document:

Morphologically encoded Ume Saami corpus:
The morphologically analyzed and tagged Ume Saami corpus is a preliminary version of the Ume Saami text corpus compiled and encoded by Olavi Korhonen. The texts, told by Lars Sjulsson of Malå, Sweden, are based on data recorded by Olavi Korhonen. The words in the corpus are translated into Swedish. The corpus was prepared during the project Data Bank for Endangered Finno-Ugric Languages.

Document type: running text in the word-per-line format. The words are marked with morphosyntactic information and translated into Swedish.
Size of the corpus: 109,572 words, 561,654 characters (including tags).
Character encoding: ISO 8859-1 (Latin-1).
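The word-per-line format described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own tooling. It assumes that each line holds one word form followed by whitespace-separated fields (tags and the Swedish translation); the catalogue does not document the exact column layout, so the names Token and read_word_per_line and the sample lines are hypothetical.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str         # word form: assumed to be the first field on the line
    rest: List[str]   # remaining fields (tags, Swedish gloss); layout is assumed


def read_word_per_line(raw: bytes) -> Iterator[Token]:
    """Decode ISO 8859-1 bytes and yield one Token per non-empty line."""
    text = raw.decode("iso-8859-1")
    for line in text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            yield Token(fields[0], fields[1:])


# Placeholder data, not real corpus content.
sample = "form1 TAG1 TAG2 översättning1\nform2 TAG3 översättning2\n".encode("iso-8859-1")
tokens = list(read_word_per_line(sample))
print(len(tokens), tokens[0].form, tokens[1].rest[-1])
```

Decoding explicitly as ISO 8859-1 matters here because the files predate UTF-8; reading them with a UTF-8 default would fail or garble characters such as å and ö.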

The use of the corpora at the University of Helsinki Language Corpus Server is restricted to research and teaching. The corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions for the Ume Saami corpus



The Kildin Saami computer corpus

The Kildin Saami computer corpus contains the following document:

Document type: running text.
Size of the corpus: 22,037 words, 146,690 characters.
Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

The Kildin Saami texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) to be used as research material. The corpus must be cited in any paper that uses it as a source. For further information, please contact: the address of the contact person.

Metadata descriptions of the Kildin Saami computer corpus



The Finnish computer corpora

The Finnish databank consists of Finnish corpora compiled and edited during several projects. The following corpora of the Finnish language are located at the University of Helsinki Language Corpus Server:

  1. The HKV-corpus
    The HKV-corpus consists of samples of Finnish literature representing various text types. The corpus is documented in the following publication:

      Auli Hakulinen & Fred Karlsson & Maria Vilkuna. Suomen tekstilauseiden piirteitä: kvantitatiivinen tutkimus. Publications, No. 6. Department of General Linguistics, University of Helsinki, 1980.

    FILES: hkv.txt and hkv.tag:

      hkv.txt: running text.

      hkv.tag: The file is an encoded version, in which the classes of parts of speech are marked. The morpho-syntactic encoding is documented in the following publication:

      Computational morphosyntax: Report on research 1981-84. Publications, No. 13. pp. 115-136. University of Helsinki, Department of General Linguistics, 1985.

  2. The Bibles: The Finnish text corpus contains two editions of the Bible: the old translation from 1938 and the revised translation published in 1992. The Bibles are stored in separate directories: "/KRaamattu38/" and "/KRaamattu92/". In both directories, the files are arranged by chapter.

    Document types: running text. The files in the directory /KRaamattu92/ are pre-processed so that the verses are marked with the characters % and & (%1&, etc.), and the chapters with the characters $ and # ($1Genesis#, etc.). The hierarchy of the chapter titles is marked with numbers from 1 to 4.
    Character encoding: The 1938 edition is in ASCII format, and the 1992 edition in Latin-1 format.

  3. For further information on the corpus, please contact: the address of contact person.

  4. FINCORP:
    FINCORP is a collection of literature of various text types written in Finnish: novels, textbooks, children's books, samples of texts published in newspapers and periodicals, and law texts. The word forms are analyzed morphologically by the automatic analyzer of Finnish (fintwol), and disambiguated with fin-CG (CG = Constraint Grammar). The TEI-header and the structural coding follow the SGML-format. A detailed description of the corpus is available to the users of the corpus.

    Document types:
    The corpora in FINCORP consist of the following data types: (1) texts analyzed morphologically with the automatic morphological analyzer "fintwol" and disambiguated with the constraint grammar "fin-CG" (the names of these files end in ".cl.twol"); (2) morphologically analyzed texts with the TEI-header; (3) plain running texts; (4) plain running texts containing information on the data structure.
    Character encoding: ASCII.

  5. The Finnish newspaper "Helsingin Sanomat" (hs), 1980-1981:
    The hs-corpus (Helsingin Sanomat 1980-1981) is a collection of texts from the Finnish newspaper "Helsingin Sanomat" in 1980-1981. The corpus consists of three files, hs1.txt, hs2.txt, and hs3.txt, which are in plain running text format. The texts were donated to the University of Helsinki by Sanoma Oy (Sanoma-WSOY).

    The size of the corpus: 243,146 words, 2,180,426 characters.
    Document types: Running text.
    Character encoding: ASCII.

  6. The Finnish newspaper "Helsingin Sanomat", 1990 (hs90):
    The hs90-corpus (Helsingin Sanomat 1990) is a collection of texts published in the Finnish newspaper "Helsingin Sanomat" in 1990. The texts were donated to the University of Helsinki by Sanoma Oy (Sanoma-WSOY).

    Document type: Running text in the SGML-format.
    The size of the corpus: 5,553,876 words, 56,967,984 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  7. LE PAROLE:
      The corpora prepared during the project "LE PAROLE" form a sub-corpus of the corpora of languages spoken in Europe. The project was financed by the European Union. The Finnish LE PAROLE corpus is available at the Research Centre of Languages in Finland (http://www.kotus.fi/).

      Text types: (1) Literature: Prose; Reports, State of Art Report. (2) Serial Traditional Document: Newspapers, Periodicals, Series. The list of the ID-information of the original documents is available within the corpus.

      Document type: All the corpora prepared during the project LE PAROLE are in the SGML-format and described with a TEI-header matching the descriptions used in the corpora of the other languages of the project. Separate sub-corpora were prepared from the verbal lexicon described with morpho-syntactic information; the same verbal lexicon was also prepared for the other languages participating in the project. In addition, there are sub-corpora consisting of lemmatized and annotated word forms.
      Character encoding: ISO 8859-1 (Latin-1).

  8. Document type: List of words arranged in the reversed alphabetic order.
    The size of the corpus: 165,123 words, 1,778,997 characters.
    Character encoding: ASCII.

  9. Spoken language in the Helsinki area in 1972-1974:
    The corpus "Language in the Helsinki area in 1972-1974" is based on data recorded during the project directed by Terho Itkonen. The data from the Helsinki area were collected in 1972-74. The project on spoken language in the Helsinki area was combined with the larger project "Nykysuomen puhekielen murros", financed by Valtion humanistinen toimikunta (the Committee of humanistic research in Finland) and directed by Heikki Paunonen. The preliminary phase of that project took place in 1976, and the principal project was carried out in 1977-80.

    The corpus is based on recordings of three social and age groups. Each recording lasted one hour. The number of interviewees was 127. The original recordings are available at the Research Centre of Languages in Finland (http://www.kotus.fi/). (This description is abridged from the README-file on the corpus of Spoken Finnish, Helsinki dialect, written by Pirkko Kukkonen. The full description is available within the corpus.)

    Document type: The recordings of the corpus of Spoken Finnish, Helsinki dialect, were transcribed and converted to machine-readable form at the University of Helsinki, Department of General Linguistics. Documentation of the project, the recordings, and the conversion work is available with the corpus.
    The size of the corpus: 127 x 30 min.
    Character encoding: ASCII.

  10. Document types: The text in the corpus is (1) in the plain text format and (2) in the sentence-per-line format. The texts in the sentence-per-line format are pre-processed.
    The size of the corpus: The material in the sentence-per-line format contains 304,685 words, 2,416,797 characters.
    Character encoding: ASCII.

  11. Suomen Kuvalehti, issues published in 1975 (sk75):
    The corpus "sk75" consists of some issues of the Finnish periodical "Suomen Kuvalehti" from the years 1975 and 1976. The corpus was donated to the University of Helsinki by the publisher "Yhtyneet Kuvalehdet Oy".

    Document Type: running text. The text in the corpus is indexed with running numbers.
    The size of the corpus: 840,672 words, 9,693,042 characters.
    Character encoding: ASCII.

  12. Suomen Kuvalehti, all the issues published in 1987 (sk87):
    The corpus "sk87" consists of all the issues of the Finnish periodical "Suomen Kuvalehti" from the year 1987. The corpus was donated to the University of Helsinki by the publisher "Yhtyneet Kuvalehdet Oy".

    Document Type: running text. The texts are in the plain text and sentence-per-line formats.
    Size of the corpus: 1,730,597 words, 12,520,546 characters.
    Character encoding: ASCII.

  13. Tiede 2000 (t2000.snt):
    The corpus t2000.snt consists of the texts published in the scientific periodical "Tiede 2000", 1990: 1, pp. 39-43.

    Document type: running text.
    The size of the corpus: 68,067 words, 464,792 characters.
    Character encoding: ASCII.

  14. WSOY (wsoy):
    The corpus "wsoy" contains some books and fragments of books published by Werner Söderström Osakeyhtiö (Helsinki - Porvoo).

    Document types: running text in the sentence-per-line format in the sub-directory /snt/.
    Size of the corpus: 979,516 words, 7,086,335 characters.
    Character encoding: ASCII.

  15. For further information, please contact: the address of contact person.
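The verse and chapter markers described for the /KRaamattu92/ files in item 2 above lend themselves to a small parser. The sketch below is illustrative only. It assumes that a heading has the shape $<level><title>#, with the level being the number from 1 to 4 mentioned in the description, and that a verse marker has the shape %<number>&. The function name split_verses and the sample string are hypothetical and are not taken from the corpus.

```python
import re
from typing import List, Tuple

# Assumed marker shapes: "$1Genesis#" (heading, level 1-4) and "%1&" (verse number).
MARKER = re.compile(r"\$(?P<level>[1-4])(?P<title>[^#]*)#|%(?P<verse>\d+)&")


def split_verses(text: str) -> List[Tuple[str, int, str]]:
    """Return (latest heading title, verse number, verse text) triples."""
    verses = []
    heading = ""
    current = None  # (heading, verse number, offset where the verse text starts)
    for m in MARKER.finditer(text):
        if current is not None:
            # Any marker closes the verse that is currently open.
            h, n, start = current
            verses.append((h, n, text[start:m.start()].strip()))
            current = None
        if m.group("verse") is not None:
            current = (heading, int(m.group("verse")), m.end())
        else:
            heading = m.group("title").strip()
    if current is not None:
        # Close the last verse at the end of the text.
        h, n, start = current
        verses.append((h, n, text[start:].strip()))
    return verses


# Placeholder text, not real corpus content.
sample = "$1Genesis# %1& verse one text %2& verse two text"
result = split_verses(sample)
print(result)
```

Only the most recent heading is tracked here; a fuller version could keep a stack of headings indexed by level to reflect the four-level title hierarchy.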

Metadata descriptions of the Finnish corpora



The Dvina Karelian corpus

The Dvina Karelian (North Karelian) corpus contains the following sub-corpora:

  1. Life of Jesus
    "Life of Jesus" in the North Karelian language.
    (The second, corrected edition).
    ISBN 91-88394-68-9, ISBN 952-9790-18-X. 63 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1994.

    Document type: running text.
    Size of the corpus: 4,757 words, 36,417 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  2. Gospel of Mark:
    The Gospel of Mark in North-Karelian language.
    ISBN 952-9790-27-9, ISBN 91-88794-22-9. 75 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1996.

    Document type: running text.
    Size of the corpus: 14,213 words, 111,590 characters.
    Character encoding: ISO 8859-1 (Latin-1).

The documents were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions



The corpus of Lude

The computer corpus of Lude contains samples collected during fieldwork trips in (1988), in Aunus (1989), and in Helsinki (1999). There are seven informants. The corpus was recorded, transliterated, and edited by Miikul Pahomov, a native speaker of the Lude language. The corpus was converted to machine-readable form during the project Data Bank for Endangered Finno-Ugric Languages. The corpus must be cited in any document that uses it as a source.

The original tapes are stored in the recording archive at the Institute for Languages of Finland. A description of the transliteration and information on the informants are enclosed with the corpus.

Document types: Transliterated audio tapes. Content of the documents: ethnographic descriptions.
Size of the documents: 39,361 words, 177,843 characters (no spaces).
Character encoding: ISO 8859-1 (Latin-1).

For further information, please contact: the address of the contact person.

Metadata descriptions for the Lude corpus



The Livvi (Olonets Karelian) corpus

The Livvi (Olonets Karelian) corpus contains the following texts:

  1. Children's Bible
    Children's Bible in Olonets-Karelian language.
    (Lasten Raamattu livviksi (aunuksenkarjalaksi)).
    Arapovich, Borislav & Mattelmäki, Vera (eds.).
    Translation: Dubinina, Zinaida.
    ISBN 91-88394-91-3, ISBN 952-9790-22-8. 552 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1995.

    Document type: running text.
    Size of the corpus: 56,883 words, 407,397 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  2. Gospel of John:
    The Gospel of John in Karelian (Olonets) Language. (Trial edition)
    ISBN 91-88394-54-9, ISBN 952-9790-05-8. 91 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1993.

    Document type: running text. The file is in the sentence-per-line format.
    Size of the corpus: 17,284 words, 120,632 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  3. Gospel of Luke:
    ISBN 952-9790-30-9, ISBN 91-88794-51-2.
    The Gospel of Luke.
    Institute for Bible Translation.
    Stockholm & Helsinki 1996.
    Document type: running text.
    Size of the corpus: 21,965 words, 155,155 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  4. Gospel of Mark:
    The Gospel of Mark in Karelian (Olonets) Language. (Trial edition)
    ISBN 91-88394-17-4, ISBN 952-9790-02-3. 87 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1993.

    Document types: running text. The file is in the sentence-per-line format.
    Size of the corpus: 12,488 words, 91,385 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  5. Gospel of Matthew:
    The Gospel of Matthew in Karelian (Olonets) Language. (Trial edition) ISBN 952-9790-42-2, ISBN 91-88794-87-3.
    Institute for Bible Translation.
    Stockholm & Helsinki 1997.

    Document type: running text.
    Size of the corpus: 20,235 words, 141,937 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  6. Life of Jesus
    "Life of Jesus" in the (Olonets) Karelian language.
    (The second, corrected edition).
    ISBN 952-9790-15-5, ISBN 91-88394-67-0. 63 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1994.

    Document type: running text.
    Size of the corpus: 3,311 words, 25,936 characters.
    Character encoding: ISO 8859-1 (Latin-1).


    The corpora added after September 2008:

    The New Testament in the Karelian (Olonets) language
    ISBN 10: 952-9790-73-2
    Institute for Bible Translation
    Helsinki/Petroskoi 2006
    The document types: htm, rtf, txt
    The size of the corpus: 3.76 MB (calculated from the rtf-file)

The Livvi texts were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The computer corpora were adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. The corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions of the Livvi corpus



The Ingrian computer corpora

The Ingrian corpus contains the following texts:

  1. Laanest, A. (1966). Isuri murdetekste. Tallinn.
    Chapters 1-18, 28, 32, 37-38, 42-44.
    (The running texts and the morphologically coded word forms are translated into English.)

  2. Laanest, A. (1966). Isuri murdetekste. Tallinn.
    Chapters 1-44.
    (The questions to the informants are included in the texts.
    The morphologically coded word forms are translated into English.)

  3. Virtaranta, Pertti (1967). Lähisukukielten lukemisto. Suomalaisen Kirjallisuuden Seuran Toimituksia 280. Helsinki. 158-165.

  4. R.E. Nirvi. In Virtaranta, Pertti (1967). Lähisukukielten lukemisto. Suomalaisen Kirjallisuuden Seuran Toimituksia 280. Helsinki. 138-150.
    (The morphologically coded word forms are translated into English.)

The word forms in the corpora are morphologically encoded and translated into English. An English translation in running-text format is available for part of the corpus.

The file "list-of-abbreviations" contains the list of abbreviations used in encoding the Ingrian computer corpora.

The corpora were prepared by Manja Lehto during the project Data Bank for Endangered Finno-Ugric Languages. The corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions of the Ingrian corpora



The Veps (Vepsian) computer corpus

The Veps computer corpus contains the following sub-corpora:

  1. Children's Bible in the Veps Language
    Children's Bible in the Vepsian Language.
    Arapovich, Borislav & Mattelmäki, Vera (eds.).
    Translation: Nina Zaiceva.
    ISBN 91-88794-11-3. 542 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1996.

    Document type: running text.
    Size of the corpus: 59,092 words, 405,206 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  2. Gospel of John:
    The Gospel of John in the Vepsian Language.
    ISBN 91-88394-56-5, ISBN 952-9790-06-6. 86 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1993.

    Document type: running text.
    Size of the corpus: 17,906 words, 121,650 characters.
    Character encoding: ISO 8859-1 (Latin-1).
  3. Gospel of Mark:
    The Gospel of Mark in the Vepsian Language.
    ISBN 91-88794-29-6, 952-9790-28-7. 95 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1996.

    Document type: running text.
    Size of the corpus: 20,999 words, 142,366 characters.
    Character encoding: ISO 8859-1 (Latin-1).
  4. Gospel of Matthew:
    The Gospel of Matthew in the Vepsian Language.
    ISBN 952-9790-45-7, ISBN 91-88794-90-3.
    (Trial edition).
    Institute for Bible Translation.
    Stockholm & Helsinki 1992.

    Document type: running text.
    Size of the corpus: 13,105 words, 88,709 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  5. "Life of Jesus" in the Vepsian language:
    "Life of Jesus" in the Vepsian language.
    (The second, corrected edition).
    ISBN 91-88934-66-2, ISBN 952-9790-17-1. 63 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki.

    Document type: running text.
    Size of the corpus: 3,316 words, 25,272 characters.
    Character encoding: ISO 8859-1 (Latin-1).


    The corpora added after the end of September 2008:

    The New Testament in the Veps language
    ISBN 13: 978-952-5634-06-8
    ISBN 10: 952-5634-06-8
    Institute for Bible Translation
    Helsinki/Petroskoi 2006
    The document types: htm, rtf, txt, xml
    The size of the corpus: 3.56 MB (calculated from the rtf-file)

The Veps texts were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The computer corpora were adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. The corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions of the Veps corpus



The Estonian computer corpora

The Estonian corpus contains the following sub-corpora:

  1. viro1:
    Text type: Short stories, and articles published in Estonian newspapers and magazines.
    Size of the corpus "viro1": 142,362 words, 1,065,718 characters.
    Document types: Running text.
    Character encoding: Latin-1.

    For further information on the corpus, please contact: the address of contact person.

  2. viro2:

    Text type: Short stories and fragments of novels.
    Size of the corpus: 3,294 sentences, 50,282 words, and 276,269 characters.
    Document types: Running text.
    Character encoding: ASCII.

For further information on the corpus, please contact: the address of contact person.
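Size figures such as those listed above (sentences, words, characters) can be approximated with a short script. The catalogue does not say how its figures were computed, so the sketch below rests on stated assumptions: words are whitespace-separated tokens, characters are counted after decoding and include line breaks, and each non-empty line is treated as one sentence, which only holds for files in a sentence-per-line format. The function name corpus_size and the sample sentences are hypothetical.

```python
from typing import Dict


def corpus_size(raw: bytes, encoding: str = "iso-8859-1") -> Dict[str, int]:
    """Count non-empty lines, whitespace-separated words, and decoded characters."""
    text = raw.decode(encoding)
    lines = [ln for ln in text.splitlines() if ln.strip()]
    words = sum(len(ln.split()) for ln in lines)
    return {"lines": len(lines), "words": words, "characters": len(text)}


# Invented sample sentences, not real corpus content.
sample = "Üks lause.\nTeine lause siin.\n".encode("iso-8859-1")
size = corpus_size(sample)
print(size)
```

Because ISO 8859-1 and ASCII are single-byte encodings, the character count here equals the byte count of the file; counts produced this way may still differ from the published figures if those excluded line breaks or tags.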

Metadata descriptions for the Estonian corpora



The Liv (Livonian) corpus

The computer corpus of Liv (Livonian) contains the following documents:

  1. Life of Jesus
    Children's book "Life of Jesus". Livonian translation.
    Translation: Juha-Lassi Tast.
    ISBN 952-9790-47-3.
    Institute for Bible Translation
    Stockholm, Helsinki 1998.

    Document type: running text.
    Size of the corpus: 5,266 words, 33,361 characters.
    Character encoding: ISO 8859-1 (Latin-1).

    For further information, please contact: the address of the contact person.

  2. Ethnographic texts:
    A sample of Livonian ethnographic texts.

    Document type: running text.
    Size of the corpus: 2,309 words, 13,096 characters.
    Character encoding: ISO 8859-1 (Latin-1).

    For further information, please contact: the address of the contact person.

Metadata descriptions for the Livonian corpus



The Erzya Mordvin corpus

The computer corpus of Erzya Mordvin contains the following documents:

  1. Children's Bible
    Children's Bible in the Erzya Mordvin Language.
    Arapovich, Borislav & Mattelmäki, Vera (eds.).
    Translation: Adushkina, N.S., Shchemerova, V.S. & Nadkin, D.T.
    ISBN 91-88394-23-9, ISBN 952-9790-01-5. 544 pp. &
    Erzya-Russian wordlist for the Erzya-Mordvin
    Children's Bible. 19 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1993.

  2. Document type: Running text that has been converted into a sentence-per-line format, with paragraphs separated by a new line.
    Size of the corpus: 76,817 words, 528,786 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  3. Gospel of Mark
    The Gospel of Mark in Erzya-Mordvin language.
    ISBN 91-88394-58-1, ISBN 952-9790-11-2. 81 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1995.

  4. Document type:
    Size of the corpus: 12,071 words, 93,318 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  5. Gospel of Luke and Acts of the Apostles
    The Gospel of Luke and Acts of the Apostles in the Erzya-Mordvin Language.
    Adushkina, N.S., Bargova, T.S. & Gorbynov, G.I. (eds.).
    Translation: Batkov, G.I., Devyatkin, G.S. & Nadkin, D.T.
    ISBN 91-88794-56-3, 952-9790-33-3. 232 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1996.

  6. Document type: running text.
    Size of the document: 40,289 words, 300,308 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  7. Gospel of Matthew
    The Gospel of Matthew in the Erzya-Mordvin Language.
    Translated by Adushkina, N.S. Edited by Bargova, T.S., Batkov, G.I., Gorbunov, G.I., Devjatkin, G.S.
    ISBN 91-88394-58-1, ISBN 952-9790-11-2.
    Institute for Bible Translation.
    Stockholm & Helsinki 1996.

  8. Document type: running text.
    Size of the document: 23,020 words, 171,502 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).


    The corpora added after September 2008:

    The New Testament in the Mordvin-Erzya language
    ISBN 10: 952-5634-00-0
    Institute for Bible Translation
    Helsinki/Saransk 2006
    The document types: htm, rtf, txt, xml
    The size of the corpus: 4.78 MB (calculated from the rtf-file)

    For further information, please contact: the address of the contact person.
  9. Novels
    Novels written by Kuzjma Abramov, Kalinkin, and Kutorkin.
    Publishing place: Saransk.

  10. Document type: running text.
    Size of the document: 865,007 words, 11,142,079 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  11. Short-stories
    Short-stories written by Vasili Arapov and Petja Kljuchagin.
    Publishing place: Saransk.

  12. Document type: running text.
    Size of the documents: 135,472 words, 1,636,435 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  13. Poetry
    Poems written by Vasili Arapov.
    Publishing place: Saransk.

    Document type: running text.
    Size of the document: 4,862 words, 64,978 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  14. Morphologically encoded text
    A sample of a morphologically coded novel written in the Erzya Mordvin language. The coding is done by Jack Rueter (University of Helsinki, Department of Finno-Ugrian Studies).

  15. Document type: running text.
    Size of the document: 6,292 words, 50,973 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

    For further information, please contact: the addresses of the contact person.
  16. List of words
    List of words collected at the initiative of Catherine the Great under the direction of Bishop Damaskin and recorded in 1785.
    Originally published in A. P. Feoktistov in Russko-mordovskiy slovar, Iz istorii otechestvennoy leksikografii, Izdatel'stvo Nauka, Moskva 1971.

  17. Document type: List of words in the alphabetic order.
    Size of the document: 23,500 words.
    Character encoding: ISO 8859-1 (Latin-1).

    For further information, please contact: the addresses of the contact person.

Sub-corpora 1.-3. were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm). Sub-corpora 4.-6. belong to the large collection of electronic material compiled and edited by Jack Rueter with the assistance of Mordvin research assistants, sub-corpus 7. was prepared by Jack Rueter, and sub-corpus 8. was prepared by Dennis Estill. Sub-corpora 4.-8. have been donated to the University of Helsinki Language Corpus Server by the editors. The use of the corpora located at the University of Helsinki Language Corpus Server is restricted to research and teaching. The computer corpora of Erzya have been adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. The corpora must be cited in any paper that uses them as a source.

Metadata descriptions for the Erzya corpora



The Moksha Mordvin corpus

  1. Children's book "Jesus friend of children"
    "Jesus friend of children" in Mordvin-Moksha language.
    Arapovich, Borislav & Mattelmäki, Vera (eds.).
    ISBN 952-9790-23-6, ISBN 91-88394-94-8. 65 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1995.

    Document type: running text.
    Size of the corpus: 5,935 words, 45,694 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  2. Gospel of Mark
    Gospel of Mark in Mordvin-Moksha language.
    ISBN 952-9790-21-X, ISBN 91-88794-13-X. 78 pp.
    Institute for Bible Translation. Stockholm & Helsinki 1995.

    Document type: running text.
    Size of the corpus: 11,870 words, 88,703 characters.
    Character encoding: ISO 8859-1 (Latin-1).
  3. For further information, please contact: the address of the contact person.

  4. Novels written by Anatolij Tjapaev and Aleksej Mokshoni. For further information, please contact: the address of the contact person.
  5. List of words
    A. P. Feoktistov in Russko-mordovskiy slovar, Iz istorii otechestvennoy leksikografii, Izdatel'stvo Nauka, Moskva 1971.

    Document type: Words in Moksha collected at the initiative of Catherine the Great under the direction of Bishop Damaskin and recorded in 1785.
    Size of the corpus: approx. 300 words.
    Character encoding: ISO 8859-1 (Latin-1).

  6. For further information, please contact: the address of the contact person.

The Moksha Mordvin texts in the sub-corpora 1.-3. were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The list of words was donated by Dennis Estill, who also compiled and edited the corpus. The computer corpora were adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. The use of the corpora located at the University of Helsinki Language Corpus Server is restricted to research and teaching. All the corpora must be cited in any paper that uses them as a source.

Metadata descriptions for Moksha Mordvin corpora



The Mari computer corpus

The East Mari computer corpus contains the following documents:

  1. Bible of Children.
    ISBN 952-9790-36-8, ISBN 91-88794-79-2.
    Institute for Bible Translation.
    Stockholm & Helsinki 1994.

    Document type: running text.
    Size of the document: 59,272 words, 415,375 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  2. The Gospel of Mark in the Mari language. Trial edition.
    ISBN 91-88394-60-3, ISBN 952-9790-13-9. 76 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1994.

    Document type: running text in the sentence-per-line format.
    Size of the document: 12,981 words, 89,879 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  3. The Gospel of Luke in the Mari language. Trial edition.
    ISBN 952-9790-19-8, ISBN 91-88-394-93X.
    Institute for Bible Translation. 106 pp.
    Stockholm & Helsinki 1995.

    Document type: running text.
    Size of the document: 21,159 words, 145,092 characters.
    Character encoding: ISO 8859-1 (Latin-1).

  4. The Gospel according to John in the Mari language. Trial edition.
    ISBN 91-88794-85-7, 952-9790-34-1. 95 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1995.

    Document type: running text.
    Size of the document: 16,483 words, 109,835 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).


  5. The corpora added after September 2008:

    The New Testament in the Mari language
    ISBN 978-952-5634-12-9
    Institute for Bible Translation
    Helsinki/Joshkar-Ola 2007
    The document types: htm, rtf, txt, xml
    The size of the corpus: 7.65 MB (calculated from the rtf-file)

    Update, Sept. 17, 2015:
    Document type: sfm.

    Update, Sept. 17, 2015:

    Genesis in the Mari language.
    Helsinki 2014.
    Document type: pdf, rtf and sfm.

The Eastern Mari texts were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions for the corpora of the Mari languages



The West Mari computer corpus

The following documents written in the West Mari language and the Hill Mari dialect are included in the West Mari corpus:
  1. "Life of Jesus" in the Mari High language.
    ISBN 952-9790-16-3, ISBN 91-88394-69-7. 62 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1994.

    Document type: running text.
    Size of the document: 3,712 words, 28,859 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  2. "Jesus Friend of Children" in the Mari High language.
    Arapovich, B. & Mattelmäki, V. (eds.).
    ISBN 952-9790-24-4, ISBN 91-88394-98-0. 64 pp.
    Institute for Bible Translation.
    Stockholm - Helsinki 1995.

    Document type: running text.
    Size of the document: 6,833 words, 48,008 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  3. Gospel of Mark in the Hill Mari Language.
    ISBN 952-9790-37-6, 91-88794-84-9. 76 pp.
    Institute for Bible Translation.
    Stockholm-Helsinki 1997.

    Document type: running text.
    Size of the document: 12,981 words, 89,879 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

    The documents mentioned above have been donated to the University of Helsinki by the Institute for Bible Translation (Helsinki & Stockholm) to be used as research material.

    For further information, please contact: the addresses of the contact person.

  4. Ramstedt, G.J. (1902). Bergtscheremissische Sprachstudien.
    Mémoires de la Société Finno-Ougrienne 17.
    Finno-Ugrian Society. Helsinki. 169-201.

    The texts are tagged morphologically and translated into English and German. The corpus was edited and encoded by André Hesselbäck (University of Uppsala). The corpus was prepared during the project the Data Bank for Endangered Finno-Ugric Languages.

    For further information, please contact: the addresses of the contact person.

The corpora must be cited in any paper that uses them as a source.

Metadata descriptions for the Mari corpora



The Komi Zyrian computer corpora

The Komi Zyrian computer corpus contains the following sub-corpora:

  1. Jesus Friend of Children.
    ISBN 91-88394-64-6, ISBN 952-9790-13-9.
    Institute for Bible Translation.
    Stockholm & Helsinki 1994. 65 pp.

    Document type: running text.
    Size of the corpus: 7,338 words, 48,883 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  2. Gospel of Mark in Komi-Zyrian language.
    (Preliminary edition.)
    ISBN 91-88394-79-4, ISBN 952-9790-20-1. 71 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1995.

    Document type: running text.
    Size of the corpus: 11,932 words, 86,108 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  3. Gospel of Luke in Komi-Zyrian language.
    (Trial edition.)
    ISBN 91-88794-32-6, 952-9790-32-5. 137 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1996.

    Document type: running text.
    Size of the corpus: 14,677 words, 101,908 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  4. Gospel of John in Komi-Zyrian language.
    (Trial edition.)
    ISBN 91-88794-88-1, 952-9790-44-9. 97 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1997.

    Document type: running text.
    Size of the corpus: 14,769 words, 102,504 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).


    The corpora added after September 2008:

    The New Testament in the Komi-Zyrian language
    ISBN 978-952-5634-16-7
    Institute for Bible Translation
    Helsinki/Izhevsk 2008
    The document types: htm, rtf, txt, xml
    The size of the corpus: 6.65 MB (calculated from the rtf-file)

    Update, Sept. 17, 2015:
    Psalms in the Komi-Zyrian language.
    The document types: sfm, rtf, and pdf.

    For further information, please contact: the addresses of the contact person.

  5. Morphologically encoded Komi text corpus.

    The Komi text corpus is available in three formats: running text, sentence-per-line format, and morphologically encoded format. The text corpora include the following texts:

    (1) two short stories by a Komi writer, (2) the text of a children's booklet, (3) an article from a Komi newspaper, (4) a scientific text (in two parts) from a Komi periodical and (5) religious texts:

    (1) N'ina Kuratova (1983). Bobön'an' kör, Povest'jas, vis'tjas.
    Komi kn'izhnöj izdatel'stvo, Syktyvkar.
    FICT_ST__Ni_Ku_1983_BK_186-197
    FICT_ST_Ni_Ku_1983_BK-198-212

    (2) Rots'ev, Jegor (1987). Mitruk petö tundrays', 3 - 65.
    Komi knizhnoj izdatelstvo, Syktyvkar.
    FAT/FICT_NV_Je_Ro_1987_MPT_3-65

    (3) P. Stolpovskij, SSSR-ys' pisat'el'jas sojuzsa ts'l'en. Komi mu 1991: 4.
    NEWS_P_St_1991_KM_04

    (4) Tsypanov, Jevgenij (1989). VK: 6, 49 - 55.
    SCF_Je_Ts_1989_VK:6_49-55
    SCF_Je_Ts_1989_VK:7_54-59

    For further information, please contact: the addresses of the contact person.

  6. Komi literature text corpus:
    The sub-corpus of Komi literature consists of the following novel:

    Ivan Toropov.
    "Jujas da s'ölömjas ('Rivers and Hearts')"
    Syktyvkar 1969.

    For further information, please contact: the addresses of the contact person.

The texts in 1.-4. were donated to the University of Helsinki by the Institute for Bible Translation, Stockholm and Helsinki. The Komi text corpora in 5. were compiled and edited by Paula Kokkonen with the financial support of the Academy of Finland and the University of Helsinki.

The texts in (1) - (4) have been transliterated from the Komi official alphabet, which is based on the Cyrillic alphabet. The transliteration has been adjusted to the phonological system of Komi. The coding follows these principles:

  - distinctive phonemes are the basic units to be coded;
  - if the orthographic system includes letters that do not belong to the phonemes of the language but are needed, for example, in loan words, they have been added to the inventory of orthographic units, i.e. the alphabet;
  - if it is not possible to code a phoneme with one standardized unit, combinations of several units have been used instead of artificial marks.

The novel in 6. was compiled and edited by Jack Rueter.

The texts under the sub-directories /Books-of-Children/ and /New-Testament/ are in unmodified form. These texts were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The corpora must be cited in any paper that uses them as a source.

Metadata descriptions for the Komi corpora



The Komi Permyak computer corpus

The Komi Permyak computer corpus contains the following documents:

  1. Gospel of Mark in Komi Permyak language.
    ISBN 952-9790-29-5, ISBN 91-88794-24-5. 78 pp.
    Institute for Bible Translation.
    Stockholm 1996.
  2. "Jesus Friend of Children" in the Komi-Permyak Language.
    ISBN 952-9790-38-4, 91-88794-81-4. 65 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1997.

The texts were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions for the Komi corpora



The Udmurt computer corpora

The Udmurt computer corpus contains the following sub-corpora:

  1. "Jesus Friend of Children" in the Udmurt language.
    ISBN 91-88394-65-4, ISBN 952-9790-14-7. 65 pp.
    Institute for Bible Translation.
    Stockholm 1994.

    Document type: running text.
    Size of the corpus: 7,314 words, 50,238 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  2. The Four Gospels in Udmurt Language.
    (Trial edition).
    ISBN 91-88394-21-2, 952-9790-00-7. 279 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1992.

    Document type: running text.
    Size of the corpus:
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  3. Acts of the Apostles in the Udmurt Language.
    Translation: Atamanov, Mikhail.
    ISBN 91-88794-15-6, 952-9790-26-0. 123 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1996.

    Document type: running text.
    Size of the corpus:
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  4. New Testament in the Udmurt Language.
    Translation: Atamanov, Mikhail.
    ISBN 91-88794-82-2, 952-9790-39-2. 780 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1997.

    Document type: running text.
    Size of the corpus: 133,575 words, 1,007,598 characters.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

  5. Udmurt literature:
    The corpus of Udmurt literature contains samples of Udmurt literature of different types: a drama, fairy-tales, legends, novels, poems, short-stories and other stories (cf. Udmurt metadata descriptions). The texts have been collected from various publications.

    Document type: running texts in the sentence-per-line format. In addition, there are copies of the texts as plain running text.
    Size of the corpus: 1. Drama: ; 2. Fairy-tales: 2,118 words, 14,618 characters; 3. Legends: 3,719 words, 25,694 characters; 4. Novels: 30,470 words, 276,092 characters; 5. Poems: 3,448 words, 41,687 characters; 6. Short-stories: 11,592 words, 118,689 characters; and 7. Stories: 11,827 words, 117,866 characters.
    Character encoding: ISO 8859-1 (Latin-1). The texts have been manually edited to correspond to the Udmurt phonemic system.

  6. Udmurt statistical corpus:
    The Udmurt statistical corpus, located in the sub-directory /udmurt-statistical-data/ on the corpus server, includes a statistical corpus of Udmurt and lists of the encoded variables.

    Document type: numerical data in the matrix format. The corpus was prepared from running texts.
    Size of the corpus: ca. 2,330 sentences.
    Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

The Udmurt corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions for the Udmurt corpora



The Khanty computer corpora

The Khanty computer corpus contains the following sub-corpora:

  1. Khanty, Atlym dialect,
  2. Khanty, Kazym dialect,
  3. Khanty, Konda dialect,
  4. Khanty, Nizjam dialect,
  5. Khanty, Obdorsk dialect,
  6. Khanty, Synja dialect.
    The corpora of the Khanty dialects are samples taken from the following text collections:

    Rédei, Károly (1968).
    Nord-ostjakische Texte (Kazym-Dialekt) mit Skizze der Grammatik.
    Gesammelt und herausgegeben von Károly Rédei.
    Abhandlungen der Akademie der Wissenschaften in Göttingen, philologisch-historische Klasse, dritte Folge 71.
    Göttingen.

    Steinitz, Wolfgang (1989).
    Ostjakologische Arbeiten III. Texte aus dem Nachlass.
    Eds.: Hartung, Liselotte, Hauel, Petra, Sauer, Gert & Schulze, Birgitte.
    Janua Linguarum, Series Practica 256.
    Mouton de Gruyter, Berlin.

    Vértes, Edith (1980).
    H. Paasonens südostjakische Textsammlungen.
    Suomalais-Ugrilaisen Seuran Toimituksia 175.
    Suomalais-Ugrilainen Seura, Helsinki.

    The corpora are running texts, and several of them are morphologically analyzed. The morphologically encoded words of the texts are in word-per-line format, and the plain texts are in sentence-per-line format. There are also texts in which the clauses and sentences are marked with information about their location in the text.

  7. Khanty, Textbook:
    Rugin, R.P. (1990).
    Shum jôxan sjun'öng xâtLöt.
    (Shchastlivye den'ki na Shum-jugane.) [Happy days on the Shum River.]
    Kniga dlja dopol'nitel'nogo chtenija v 3-4 klassax xantyjskix shkol (shuryshkarskij dialekt).
    Prosveshchenie, Leningrad.

    The text is available in six versions: (1) the original text in the Cyrillic alphabet; (2) the same text converted to the Latin alphabet; translations of the text into (3) Finnish, (4) English and (5) Russian; and (6) the Latin-alphabet text, morphologically coded and translated into English.

    For further information, please contact: the addresses of the contact person.

  8. Children's books:
    Life of Jesus in Khanty (the Kazim dialect). (Trial edition).
    Translation: Nyomysova, Yevdokiya Andreyevna &
    Lozyamova, Zoya Nikiforovna.
    ISBN 952-9790-25-2, ISBN 91-88394-97-2. 63 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1995.

    Life of Jesus in Khanty (the Kazim dialect). (Second edition).
    Translation: Nyomysova, Yevdokiya Andreyevna &
    Lozyamova, Zoya Nikiforovna.
    ISBN 952-9790-40-6, ISBN 91-88794-83-0. 63 pp.
    Institute for Bible Translation.
    Stockholm & Helsinki 1997.

    For further information, please contact: the addresses of the contact person.

The computer corpora of the Khanty dialects and the textbook were compiled and edited by Merja Salo with the financial support of the Academy of Finland. The adaptation of the texts for public use was done with the financial support of the Department of General Linguistics, University of Helsinki. The children's books were donated to the University of Helsinki by the Institute for Bible Translation, Helsinki and Stockholm. The use of the corpora is restricted to research and teaching. The corpora must be cited in any paper that uses them as a source.

Metadata descriptions for the Khanty corpora



The Mansi computer corpus

The Mansi corpus contains the following document:

"Life of Jesus" in Mansi. (Trial edition).
Kartano, Anne (ed.).
Translation: Afanasyeva, Klavdiya.
ISBN 952-9790-35-X, ISBN 91-88794-52-0. 63 pp.
Institute for Bible Translation.
Stockholm & Helsinki.

Document type: running text.
Size of the document: 3,421 words, 23,273 characters.
Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

The Mansi texts were donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The computer corpus of Mansi has been adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. The corpus must be cited in any paper that uses it as a source. For further information, please contact: the address of the contact person.

Metadata descriptions for the Mansi corpus



The Nenets computer corpora

The Nenets computer corpora contain the following sub-corpora:

  1. Fragments of the Gospel of Luke in the Nenets Language.
    Translation: Barmich, Mariya Yakovlevna.
    ISBN 91-88794-05-9. 32 pp.
    Institute for Bible Translation.
    Stockholm 1995.

    Document type: running text.
    Size of the document: 3,179 words, 23,896 characters.
    Character encoding: ISO 8859-1 (Latin-1).

    The corpus was donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) to be used as research material. For further information, please contact: the address of the contact person.

  2. Tundra Nenets sample sentence corpus:
    The Tundra Nenets sample sentence corpus includes 9,992 sentences, some of them complex, with 39,415 words. Each sentence is preceded by two numbers which refer to its page and place in N. M. Tereshchenko, Nenecko-russkij slovar´ (Moskva: Sovetskaja Ènciklopedija, 1965) [temporarily separated with \]. Each sentence is followed by a transliterated Russian translation [temporarily separated with /]. The corpus was compiled and edited by Tapani Salminen (Data Bank for the Endangered Finno-Ugric Languages). For further information, please contact: the address of the contact person.

The use of the corpora is restricted to research and teaching. The corpora must be cited in any paper that uses them as a source.

Description of the corpus (Tapani Salminen, 1998-09-23)

Tundra Nenets sample sentence corpus
Compiled by Tapani Salminen
Databank of the endangered Finno-Ugrian languages

The corpus includes 9,992 sentences, some of them complex, with 39,415 words. Each sentence is preceded by two numbers which refer to its page and place in N. M. Tereshchenko, Nenecko-russkij slovar´ (Moskva: Sovetskaja Ènciklopedija, 1965) [temporarily separated with \]. Each sentence is followed by a transliterated Russian translation [temporarily separated with /].

Each part of a compound word is marked by a hyphen and given a separate morphological analysis. Clitic particles and supplementary elements are not accounted for by the morphological analysis. Russian words within Tundra Nenets sentences are transliterated in the same way as in the translations rather than transcribed phonologically. The character ý is supposed to appear as a stressed y.
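
The line layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes that each sentence occupies one line, that the two reference numbers are separated from each other by whitespace, and that the line has the form "page place\sentence/translation". The corpus description does not confirm these details, and the names SampleSentence and parse_line as well as the sample line are hypothetical.

```python
from typing import NamedTuple


class SampleSentence(NamedTuple):
    page: int         # page of the dictionary the sentence is taken from
    place: int        # place of the sentence on that page
    sentence: str     # Tundra Nenets sentence
    translation: str  # transliterated Russian translation


def parse_line(line: str) -> SampleSentence:
    # Assumed layout (hypothetical): "<page> <place>\<sentence>/<translation>"
    reference, _, rest = line.rstrip("\n").partition("\\")
    # Split at the first "/"; a line without "/" yields an empty translation.
    sentence, _, translation = rest.partition("/")
    page, place = (int(number) for number in reference.split())
    return SampleSentence(page, place, sentence.strip(), translation.strip())


# Placeholder line, not taken from the corpus.
record = parse_line("12 3\\NENETS SENTENCE/RUSSIAN TRANSLATION")
print(record.page, record.place, record.sentence, record.translation)
```

If the separators in the distributed files differ from this assumption, only the two partition calls need to be changed.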

Metadata descriptions for the Nenets corpora



The Enets computer corpus

The Enets computer corpus contains the following document:

Fragments of the Gospel of Luke translated into Enets.
Translation: Bolina, Darja Spiridonovna.
ISBN 91-88394-99-9. 31 pp.
Institute for Bible Translation.
Stockholm 1995.

Document type: running text.
Size of the document: 3,548 words, 22,931 characters.
Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

The Enets texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) to be used as research material. The corpus must be cited in any paper that uses it as a source. For further information, please contact: the address of the contact person.

Metadata descriptions for the Enets corpus



The Kamas computer corpus

The Kamas computer corpus contains the following documents:

The source of the documents: Donner, Kai. Manuscripts. In A.J. Joki (ed.): Kai Donners Kamassisches Wörterbuch nebst Sprachproben und Hauptzügen der Grammatik. Lexica Societatis Fenno-ugricae VIII. (Suomalais-Ugrilainen Seura. Helsinki 1944).

Document type: morphologically encoded running texts, which are translated into German.
Size of the documents: 38,340 words, 215,521 characters.
Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).

The texts of the Kamas corpus were prepared by Jarmo Alatalo during the project the Data Bank for Endangered Finno-Ugric Languages. The texts are morphologically encoded and translated into German. The use of the corpus is restricted to research and teaching. The corpus must be cited in any paper that uses it as a source. For further information, please contact: the address of the contact person.

Metadata descriptions for the Kamas corpus



The Selkup computer corpus

The Selkup computer corpus contains the following sub-corpora:

  1. H-dialects
    Document type: morphologically encoded running texts translated into German.
    Size of the document: 78,390 words, 421,062 characters.
    Character encoding: ISO 8859-1 (Latin-1).
  2. Ivankino-dialect
    Document type: morphologically encoded running texts translated into German.
    Size of the document: 21,570 words, 112,320 characters.
    Character encoding: ISO 8859-1 (Latin-1).
  3. Ket-dialect
    Document type: morphologically encoded running text translated into German.
    Size of the document: 10,818 words, 55,584 characters.
    Character encoding: ISO 8859-1 (Latin-1).
  4. Tundra-dialect
    Document type: morphologically encoded running texts translated into German.
    Size of the document: 36,669 words, 184,174 characters.
    Character encoding: ISO 8859-1 (Latin-1).
  5. Tym-dialect
    Document type: morphologically encoded running texts translated into German.
    Size of the document: 159,113 words, 826,494 characters.
    Character encoding: ISO 8859-1 (Latin-1).
  6. Upper-Ob-dialect
    Document type: morphologically encoded running texts translated into German.
    Size of the document: 67,888 words, 340,885 characters.
    Character encoding: ISO 8859-1 (Latin-1).

The Selkup corpora were prepared from texts collected by various researchers on fieldwork trips at different times, mostly in the first half of the 20th century. Some of the corpora are based on published materials, but most of the data is held in the archive of the Finno-Ugrian Society.

The texts of the computer corpora of Selkup were compiled and edited during the project the Data Bank for Endangered Finno-Ugric Languages. Part of the work was done with the financial support of the Finno-Ugrian Society, Helsinki. The corpora were adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. The corpora must be cited in any paper that uses them as a source. For further information, please contact: the address of the contact person.

Metadata descriptions for the Selkup corpus

University of Helsinki Language Corpus Server


P.S., 1998; 2002; 2007.