Metadata descriptions of the Uralic languages
The North Saami computer corpus
The North Saami computer corpus contains the following documents:
Committee report:
Komiteanmietintö 1985: 66.
Sámikultuvradoaibmagotti smiehttamush, pp. 1-140.
Opetusministeriö (Ministry of education in Finland).
Valtion painatuskeskus, Helsinki 1990.
The document has been donated to the
University of Helsinki Language Corpus Server by Irja Seurujärvi-Kari.
For further information, please contact: the address of the contact person.
Cheppari charahus:
Vuolab, Kerttu (1994).
Cheppari cháráhus.
ISBN 82-7374-229-6. 105 pp.
Davvi Girji.
Karasjok.
The document has been donated to the University of Helsinki Language
Corpus Server by Kerttu Vuolab.
Document type: running text.
Size of the corpus: 17,830 words, 129,034 characters.
Character encoding: ISO 8859-1 (Latin-1).
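Size figures of this kind can be reproduced approximately with a short script. The following is a minimal sketch, assuming that a word is any whitespace-separated token and that every byte counts as one character. The counting method actually used for the figures in this description is not stated, and `corpus_size` is a hypothetical helper name.

```python
def corpus_size(data: bytes) -> tuple:
    """Return (words, characters) for ISO 8859-1 encoded corpus bytes."""
    # ISO 8859-1 maps every byte to exactly one character,
    # so decoding cannot fail and len(text) equals the byte count.
    text = data.decode("iso-8859-1")
    # Assumption: a "word" is any whitespace-separated token.
    words = len(text.split())
    return words, len(text)


sample = "Cheppari cháráhus\n".encode("iso-8859-1")
print(corpus_size(sample))  # (2, 18)
```

For a real file, the bytes could be read with `open(path, "rb").read()` before calling the function.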
The corpora located at the University of Helsinki Language Corpus Server may be used for research and teaching. Reference to the corpora has to be made in the papers in which they are used as data. For further information, please contact: the address of the contact person.
Metadata descriptions for the North Saami Report and the Novel
The Ume Saami computer corpus contains the following document:
Morphologically encoded Ume Saami corpus:
Document type: running text in the word-per-line
format. The words are marked with morphosyntactic information and
translated into Swedish.
Size of the corpus: 109,572 words, 561,654 characters
(including tags).
Character encoding: ISO 8859-1 (Latin-1).
The use of the corpora at the University of Helsinki Language Corpus Server is restricted to research and teaching. Reference to the corpora has to be made in the papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions for the Ume Saami corpus
The Kildin Saami computer corpus
The Kildin Saami computer corpus contains the following document:
Document type: running text.
Size of the corpus: 22,037 words, 146,690 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
The Kildin Sámi texts are donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) to be used as research material. Reference to the corpus has to be done in the papers in which it is used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions of the Kildin Saami computer corpus
The Finnish databank consists of Finnish corpora compiled and edited during several projects. The following corpora of the Finnish language are located at the University of Helsinki Language Corpus Server:
FILES: hkv.txt and hkv.tag:
For further information on the corpus, please contact: the address of contact person.
The Bibles: The Finnish text corpus contains two editions of the Bible: the old translation from the year 1938, and the revised translation published in 1992. The Bibles are located in separate directories: "/KRaamattu38/" and "/KRaamattu92/". In both directories, the files are arranged according to the chapters.
Document types: running text.
The files in the directory
/KRaamattu92/ are pre-processed so that the verses are
marked with the characters % and & (%1&, etc.), and the chapters with
the characters $ and # ($1Genesis#, etc.). The hierarchy of the chapter
titles is marked with numbers from 1 to 4.
Character encoding: The 1938 edition is in ASCII
format, and the 1992 edition is in Latin-1
format.
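The verse and chapter markers in /KRaamattu92/ lend themselves to simple pattern matching. The sketch below assumes, from the two examples "%1&" and "$1Genesis#" only, that a verse marker is `%<number>&` and that a title marker is `$<level><title>#` with a level from 1 to 4. The function name `parse_markup` and the placeholder verse text are hypothetical, and the real files may contain cases this sketch does not handle.

```python
import re

# Assumed marker shapes, inferred from the examples "%1&" and "$1Genesis#".
VERSE = re.compile(r"%(\d+)&")
TITLE = re.compile(r"\$([1-4])([^#]*)#")


def parse_markup(text):
    """Return (titles, verses) found in a pre-processed text."""
    titles = [(int(level), title.strip()) for level, title in TITLE.findall(text)]
    # Remove the title markers, then split on verse markers.
    # re.split keeps the captured verse numbers in the result list.
    parts = VERSE.split(TITLE.sub("", text))
    verses = [
        (int(number), body.strip())
        for number, body in zip(parts[1::2], parts[2::2])
    ]
    return titles, verses


# Placeholder text, not taken from the corpus.
sample = "$1Genesis#%1&first verse%2&second verse"
titles, verses = parse_markup(sample)
print(titles)  # [(1, 'Genesis')]
print(verses)  # [(1, 'first verse'), (2, 'second verse')]
```

Since the 1992 edition is in Latin-1, a file would be opened with `encoding="iso-8859-1"` before being passed to the function.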
For further information on the corpus, please contact: the address of contact person.
Document types:
The corpora in FINCORP consist of the following data types: (1) texts
analyzed morphologically with the automatic morphological analyzer
"fintwol" and disambiguated with the constraint grammar
"fin-CG" (these texts carry the suffix ".cl.twol"); (2)
morphologically analyzed texts with a TEI header; (3) plain
running texts; (4) plain running texts containing information on the
data structure.
Character encoding: ASCII.
The size of the corpus: 243,146 words, 2,180,426
characters.
Document types: Running text.
Character encoding: ASCII.
Document type: Running text in the SGML-format.
The size of the corpus: 5,553,876 words, 56,967,984 characters.
Character encoding: ISO 8859-1 (Latin-1).
Document type: List of words arranged in the reversed
alphabetic order.
The size of the corpus: 165,123 words, 1,778,997 characters.
Character encoding: ASCII.
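One common reading of "reversed alphabetic order" is the ordering used in reverse dictionaries, where words are sorted by their spelling read from the last letter to the first. The sketch below illustrates that reading only. It is an assumption about this word list, and `reverse_order` is a hypothetical name.

```python
def reverse_order(words):
    """Sort words by their spelling read from the last letter to the first."""
    # Plain code-point comparison is enough for ASCII data.
    return sorted(words, key=lambda word: word[::-1])


words = ["talo", "kala", "sana", "valo", "auto"]
print(reverse_order(words))  # ['kala', 'sana', 'talo', 'valo', 'auto']
```

Words with the same ending end up next to each other, which is what makes such a list useful for studying suffixes and inflection.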
Spoken language in the Helsinki area in 1972-1974:
The corpus of "Language in the Helsinki area in 1972-1974" is based on the data recorded during the project directed by Terho
Itkonen. The data from the Helsinki area were collected in
1972-74. The project on the spoken language in the Helsinki area was
combined with the larger project "Nykysuomen puhekielen murros"
financed by Valtion humanistinen toimikunta (the Committee of
humanistic research in Finland). The project was directed by Heikki
Paunonen. The preliminary phase of the project took place in 1976,
and the principal project was carried out in 1977-80.
The corpus is formed from recordings consisting of three social and age groups. The time of each recording was one hour. The number of interviewees was 127. The original recordings are available at the Research Centre of Languages in Finland (http://www.kotus.fi/). (The description is an abridged README-file on the corpus of Spoken Finnish, Helsinki dialect, written by Pirkko Kukkonen. Description of the corpus is available within the corpus.)
Document type:
The recordings of the corpus of Spoken Finnish, Helsinki dialect, have been
transcribed and converted to machine-readable form at the
University of Helsinki, Department of General
Linguistics. Documentation of the project, the recordings, and the
conversion work is available with the corpus.
The size of the corpus: 127 x 30 min.
Character encoding: ASCII.
Document types:
The text in the corpus is (1) in the plain text format and (2) in the
sentence-per-line format. The texts in the sentence-per-line format
are pre-processed.
The size of the corpus: The material in the sentence-per-line format comprises 304,685
words, 2,416,797 characters.
Character encoding: ASCII.
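The relation between the plain text format and the sentence-per-line format can be illustrated with a naive converter. This is a sketch only: it assumes that a sentence ends at ".", "!" or "?" followed by a space, which mishandles abbreviations and similar cases. The pre-processing actually applied to the corpus is not described here, and `to_sentence_per_line` is a hypothetical name.

```python
import re


def to_sentence_per_line(text):
    """Put each sentence on its own line using a naive punctuation rule."""
    # Collapse line breaks and repeated spaces from the plain-text layout.
    flat = " ".join(text.split())
    # Assumption: a sentence ends at ., ! or ? followed by a single space.
    sentences = re.split(r"(?<=[.!?]) ", flat)
    return "\n".join(s for s in sentences if s)


sample = "This is one sentence. Here is\nanother one! And a third?"
print(to_sentence_per_line(sample))
```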
Document Type: running text. The text in the corpus is indexed with running numbers.
The size of the corpus: 840,672 words, 9,693,042 characters.
Character encoding: ASCII.
Document Type: running text. The texts are in the plain text
and sentence-per-line formats.
Size of the corpus: 1,730,597 words, 12,520,546 characters.
Character encoding: ASCII.
Document type: running text.
The size of the corpus: 68,067 words, 464,792 characters.
Character encoding: ASCII.
WSOY (wsoy):
The corpus "wsoy" contains some books and fragments of books published
by Werner Söderström Osakeyhtiö (Helsinki
- Porvoo).
Document types: running text in the sentence-per-line format
in the sub-directory /snt/.
Size of the corpus: 979,516 words, 7,086,335 characters.
Character encoding: ASCII.
For further information, please contact: the address of contact person.
The Dvina Karelian (North Karelian) corpus contains the following sub-corpora:
Life of Jesus
"Life of Jesus" in the North Karelian language.
(The second, corrected edition).
ISBN 91-88394-68-9, ISBN 952-9790-18-X. 63 pp.
Institute for Bible Translation.
Stockholm & Helsinki 1994.
Document type: running text.
Size of the corpora: 4,757 words, 36,417 characters.
Character encoding: ISO 8859-1 (Latin-1).
Gospel of Mark:
The Gospel of Mark in North-Karelian language.
ISBN 952-9790-27-9, ISBN 91-88794-22-9. 75 pp.
Institute for Bible Translation.
Stockholm & Helsinki 1996.
Document type: running text.
Size of the corpus: 14,213 words, 111,590 characters.
Character encoding: ISO 8859-1 (Latin-1).
The documents are donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. Reference to the corpora has to be made in the papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions
The computer corpus of the Karelian Lude dialect contains samples collected during the fieldwork trips done in Joensuu in Kuujärvi (1988), in Aunus (1989), and in Helsinki (1999). The number of informants is seven (7). The corpus was recorded, transliterated, and edited by Miikul Pahomov, who is a native speaker of the Lude dialect. The corpus was edited into machine-readable form during the databank project Databank for endangered Finno-Ugric languages. Reference to the corpus has to be made in the documents in which it is used as a source.
The original tapes are stored in the recording archive at the Research Institute for the Domestic Languages Spoken in Finland. The Lude samples in Pertti Virtaranta's textbook "Lähisukukielten lukemisto" (Suomalaisen Kirjallisuuden Seura 1967) are used as a model for transcription. A description of the transliteration and information on the informants are enclosed with the corpus.
Document types: Transliterated audio tapes. Content of the
documents: ethnographic descriptions.
Size of the documents: 42,394 words, 285,470 characters.
Character encoding: ISO 8859-1 (Latin-1).
For further information, please contact: the address of the contact person.
Metadata descriptions for the Lude corpus
The Livvi (Olonets Karelian) corpus
The Livvi (Olonets Karelian) corpus contains the following texts:
Document type: running text.
Size of the corpus: Sub-corpus 1. 56,883 words,
407,397 characters.
Character encoding: ISO 8859-1 (Latin-1).
Document type: running text. The file is in the sentence-per-line format.
Size of the corpus: 17,284 words, 120,632 characters.
Character encoding: ISO 8859-1 (Latin-1).
Document type: running text.
Size of the corpus: 21,965 words, 155,155 characters.
Character encoding: ISO 8859-1 (Latin-1).
Document types: running text. The file is in the
sentence-per-line format.
Size of the corpus: 12,488 words, 91,385 characters.
Character encoding: ISO 8859-1 (Latin-1).
Gospel of Matthew:
The Gospel of Matthew in Karelian (Olonets) Language. (Trial edition)
ISBN 952-9790-42-2, ISBN 91-88794-87-3.
Institute for Bible Translation.
Stockholm & Helsinki 1997.
Document type: running text.
Size of the corpus: 20,235 words, 141,937
characters.
Character encoding: ISO 8859-1 (Latin-1).
Life of Jesus
"Life of Jesus" in the (Olonets) Karelian language.
(The second, corrected edition).
ISBN 952-9790-15-5, ISBN 91-88394-67-0. 63 pp.
Institute for Bible Translation.
Stockholm & Helsinki 1994.
Document type: running text.
Size of the corpus: 3,311 words, 25,936 characters.
Character encoding: ISO 8859-1 (Latin-1).
The Livvi texts are donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The computer corpora are adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. Reference to the corpora has to be done in the papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions of the Livvi corpus
The Ingrian corpus contains the following texts:
The corpora were prepared during the project Data Bank for Endangered Finno-Ugric Languages by Manja Lehto. Reference to the corpora has to be made in the papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions of the Ingrian corpora
The Veps (Vepsian) computer corpus
The Veps computer corpus contains the following sub-corpora:
Document type: running text.
Size of the corpus: 13,105
words, 88,709 characters.
Character encoding: ISO 8859-1 (Latin-1).
Document type: running text.
Size of the corpus: 3,316 words, 25,272 characters.
Character encoding: ISO 8859-1 (Latin-1).
The Veps texts are donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The computer corpora are adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. Reference to the corpora has to be done in the papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions of the Veps corpus
The Estonian corpus contains the following sub-corpora:
Text type: Short-stories, and articles published in Estonian
newspapers and magazines.
Size of the corpus "viro1": 142,362 words, 1,065,718 characters.
Document types: Running text.
Character encoding: Latin-1.
For further information on the corpus, please contact: the address of contact person.
Text type: Short-stories and fragments of novels.
Size of the corpus: 3,294 sentences, 50,282 words, and
276,269 characters.
Document types: Running text.
Character encoding: ASCII.
For further information on the corpus, please contact: the address of contact person.
Metadata descriptions for the Estonian corpora
The computer corpus of Liv (Livonian) contains the following documents:
Life of Jesus
Children's book "Life of Jesus". Livonian translation.
Translation: Juha-Lassi Tast.
ISBN 952-9790-47-3.
Institute for Bible Translation
Stockholm, Helsinki 1998.
Document type: running text.
Size of the corpus: 5,266 words, 33,361
characters.
Character encoding: ISO 8859-1 (Latin-1).
For further information, please contact: the address of the contact person.
Ethnographic texts:
A sample of Livonian ethnographic texts.
Document type: running text.
Size of the corpus: 2,309 words, 13,096 characters.
Character encoding: ISO 8859-1 (Latin-1).
For further information, please contact: the address of the contact person.
Metadata descriptions for the Livonian corpus
The computer corpus of Erzya Mordvin contains the following documents:
Children's Bible
Children's Bible in the Erzya Mordvin Language.
Arapovich, Borislav & Mattelmäki, Vera (eds.).
Translation: Adushkina, N.S., Shchemerova, V.S. & Nadkin, D.T.
ISBN 91-88394-23-9, ISBN 952-9790-01-5. 544 pp. &
Erzya-Russian wordlist for the Erzya-Mordvin
Children's Bible. 19 pp.
Institute for Bible Translation.
Stockholm & Helsinki 1993.
Document type: Running text that has been converted to a sentence-per-line format, with the
paragraphs separated by a new line.
Size of the corpus: 76,817 words, 528,786 characters.
Character encoding: Cyrillic alphabet converted to the ISO 8859-1 (Latin-1).
Gospel of Mark
The Gospel of Mark in Erzya-Mordvin language.
ISBN 91-88394-58-1, ISBN 952-9790-11-2. 81 pp.
Institute for Bible Translation.
Stockholm & Helsinki 1995.
Document type:
Size of the corpus: 12,071 words, 93,318 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Gospel of Luke and Acts of the Apostles
The Gospel of Luke and Acts of the Apostles
in the Erzya-Mordvin Language.
Adushkina, N.S., Bargova, T.S. & Gorbynov, G.I. (eds.).
Translation: Batkov, G.I., Devyatkin, G.S. & Nadkin, D.T.
ISBN 91-88794-56-3, 952-9790-33-3. 232 pp.
Institute for Bible Translation.
Stockholm & Helsinki 1996.
Document type: running text.
Size of the document: 40,289 words, 300,308 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Gospel of Matthew
The Gospel of Matthew
in the Erzya-Mordvin Language.
Translated by Adushkina, N.S.
Edited by Bargova, T.S., Batkov, G.I., Gorbunov, G.I.,
Devjatkin, G.S.
ISBN 91-88394-58-1, ISBN 952-9790-11-2.
Institute for Bible Translation.
Stockholm & Helsinki 1996.
Document type: running text.
Size of the document: 23,020 words, 171,502 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Novels
Novels written by Kuzjma Abramov, Kalinkin, and Kutorkin.
Publishing place: Saransk.
Document type: running text.
Size of the document: 865,007 words, 11,142,079 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Short-stories
Short-stories written by Vasili Arapov and
Petja Kljuchagin.
Publishing place: Saransk.
Document type: running text.
Size of the documents: 135,472 words, 1,636,435 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Poetry
Poems written by Vasili Arapov.
Publishing place: Saransk.
Document type: running text.
Size of the document: 4,862 words, 64,978 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Morphologically encoded text
A sample of a morphologically
coded novel written in the Erzya Mordvin language. The coding is done
by Jack Rueter (University of Helsinki, Department of Finno-Ugrian Studies).
Document type: running text.
Size of the document: 6,292 words, 50,973 characters.
Character encoding: Cyrillic alphabet converted to the
ISO 8859-1 (Latin-1).
List of words
List of words collected
at the initiative of Catherine the Great under the
direction of Bishop Damaskin and recorded in 1785.
Originally published by A. P. Feoktistov in "Russko-mordovskiy slovar. Iz istorii
otechestvennoy leksikografii", Izdatelstvo Nauka, Moskva 1971.
Document type: List of words in the alphabetic order.
Size of the document: 23,500 words.
Character encoding: ISO 8859-1 (Latin-1).
Sub-corpora 1.-3. are donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm), sub-corpora 4.-6. belong to the large collection of electronic material compiled and edited by Jack Rueter with the assistance of Mordvin research assistants, sub-corpus 7. was prepared by Jack Rueter, and sub-corpus 8. was prepared by Dennis Estill. Sub-corpora 4.-8. have been donated to the University of Helsinki Language Corpus Server by the editors. The use of the corpora located at the University of Helsinki Language Corpus Server is restricted to research and teaching. The computer corpora of Erzya have been adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. Reference to the corpora has to be made in the papers in which they are used as a source.
Metadata descriptions for the Erzya corpora
Document type: running text.
Size of the corpus: 5,935 words,
45,694 characters.
Character encoding: ISO 8859-1 (Latin-1).
For further information, please contact: the address of the contact person.
Document type: Words in
Moksha collected at the initiative of Catherine the Great under the
direction of Bishop Damaskin and recorded in 1785.
Size of the corpus: approx. 300 words.
Character encoding: ISO 8859-1 (Latin-1).
For further information, please contact: the address of the contact person.
Metadata descriptions for Moksha Mordvin corpora
The East Mari computer corpus contains the following documents:
Document type: running text.
Size of the document: 59,272 words, 415,375 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text in the sentence-per-line format.
Size of the document: 12,981 words, 89,879 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text.
Size of the document: 21,159 words, 145,092 characters.
Character encoding: ISO 8859-1 (Latin-1).
Document type: running text.
Size of the document: 16,483 words, 109,835 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
The Eastern Mari texts are donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. Reference to the corpora has to be made in the papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions for the corpora of the Mari languages
Document type: running text.
Size of the document: 3,712 words, 28,859 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text.
Size of the document: 6,833 words, 48,008 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
The documents mentioned above have been donated to the University of Helsinki by the Institute for Bible Translation (Helsinki & Stockholm) to be used as research material.
For further information, please contact: the address of the contact person.
The texts are tagged morphologically and translated into English and German. The corpus was edited and encoded by Andri Hesselbäck (University of Uppsala) during the project Data Bank for Endangered Finno-Ugric Languages.
For further information, please contact: the address of the contact person.
Reference to the corpora has to be done in the papers in which they are used as a source.
Metadata descriptions for the Mari corpora
The Komi Zyrian computer corpora
The Komi Zyrian computer corpus contains the following sub-corpora:
Document type: running text.
Size of the corpus: 7,338 words, 48,883 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text.
Size of the corpus: 11,932 words, 86,108 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text.
Size of the corpus: 14,677 words, 101,908 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text.
Size of the corpus: 14,769 words, 102,504 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
For further information, please contact: the address of the contact person.
The Komi text corpus is in three formats: running text, sentence-per-line format, and morphologically encoded format. The text corpora include the following texts:
(1) 2 short-stories by a Komi writer, (2) a text of a booklet for children, (3) an article from a Komi newspaper, (4) a scientific text (in two parts) from a Komi periodical and (5) religious texts:
(1) N'ina Kuratova (1983). Bobön'an' kör, Povest'jas, vis'tjas.
Komi kn'izhnöj izdatel'stvo, Syktyvkar.
FICT_ST__Ni_Ku_1983_BK_186-197
FICT_ST_Ni_Ku_1983_BK-198-212
(2) Rots'ev, Jegor (1987). Mitruk petö tundrays', 3 - 65.
Komi knizhnoj izdatelstvo, Syktyvkar.
FAT/FICT_NV_Je_Ro_1987_MPT_3-65
(3) P. Stolpovskij, SSSR-ys' pisat'el'jas sojuzsa ts'l'en. Komi mu 1991: 4.
NEWS_P_St_1991_KM_04
(4) Tsypanov, Jevgenij (1989). VK: 6, 49 - 55.
SCF_Je_Ts_1989_VK:6_49-55
SCF_Je_Ts_1989_VK:7_54-59
For further information, please contact: the address of the contact person.
The texts in 1.-4. are donated to the University of Helsinki by the Institute for Bible Translation, Stockholm and Helsinki. The Komi text corpora in 5. were compiled and edited by Paula Kokkonen with the financial support of The Academy of Finland and the University of Helsinki.
The texts in (1) - (4) have been transliterated from the Komi official alphabet, which is based on the Cyrillic alphabet. The transliteration has been adjusted to the phonological system of Komi. The coding follows three principles: distinctive phonemes are the basic units to be coded; if the orthographic system includes letters that do not correspond to phonemes of the language but are needed, for example, in loan words, they have been added to the inventory of orthographic units, i.e. the alphabet; and if a phoneme cannot be coded with one standardized unit, combinations of several units have been used instead of artificial marks. The novel in 6. was compiled and edited by Jack Rueter.
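The idea of coding a phoneme with a combination of standard units rather than an artificial mark can be illustrated with a small lookup table. The table below is illustrative only and is not the coding table used for the corpus. Its few entries are chosen to agree with spellings that appear in the file list above (for example "kör" and the "zh" in "izdatel'stvo" contexts such as "knizhnoj"), and `transliterate` is a hypothetical name.

```python
# Illustrative table only: the project's actual coding table is not given here.
# Multi-character values show how one sound can be coded with several
# standard Latin-1 units instead of an artificial mark.
TABLE = {
    "к": "k", "р": "r", "н": "n", "и": "i", "й": "j",
    "ӧ": "ö",   # one Latin-1 character is available
    "ж": "zh",  # combination of two units
    "ш": "sh",  # combination of two units
}


def transliterate(text):
    """Replace each known Cyrillic letter; leave anything else unchanged."""
    return "".join(TABLE.get(ch, ch) for ch in text)


result = transliterate("кӧр")
print(result)  # kör
# The result contains only characters that ISO 8859-1 can represent.
result.encode("iso-8859-1")
```

A per-letter table like this cannot express context-dependent rules such as palatalization, which a phonologically adjusted transliteration would also need.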
The texts under the sub-directories /Books-of-Children/ and /New-Testament/ are in the unmodified form. These texts are donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. Reference to the corpora has to be done in the papers in which they are used as a source.
Metadata descriptions for the Komi corpora
The Komi Permyak computer corpus
The Komi Permyak computer corpus contains the following documents:
The texts are donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. Reference to the corpora has to be done in the papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions for the Komi corpora
The Udmurt computer corpus contains the following sub-corpora:
Document type: running text.
Size of the corpus: 7,314 words, 50,238 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text.
Size of the corpus:
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text.
Size of the corpus:
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running text.
Size of the corpus: 133,575 words, 1,007,598 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
Document type: running texts in the sentence-per-line
format. In addition to this, there are copies of the texts, which
are plain running texts.
Size of the corpus: 1. Drama: ; 2. Fairy-tales: 2,118 words,
14,618 characters; 3. Legends: 3,719 words, 25,694 characters;
4. Novels: 30,470
words, 276,092 characters; 5. Poems: 3,448 words, 41,687 characters; 6. Short-stories: 11,592
words, 118,689 characters; and 7. Stories: 11,827 words, 117,866 characters.
Character encoding: ISO 8859-1 (Latin-1). The texts have been
manually edited to correspond to the Udmurt phonemic system.
Reference to the Udmurt corpora has to be done in the papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions for the Udmurt corpora
The Khanty computer corpus contains the following sub-corpora:
The corpora of the Khanty dialects are samples taken from the following text collections:
Rédei, Károly (1968).
Nord-ostjakische Texte (Kazym-Dialekt) mit
Skizze der Grammatik.
Gesammelt und herausgegeben von Károly
Rédei. Abhandlung der Akademie
der Wissenschaften in Göttingen,
philologisch-historische Klasse, dritte Folge 71.
Göttingen.
Steinitz, Wolfgang (1989).
Ostjakologische Arbeiten III. Texte aus dem
Nachlass.
Eds.: Hartung, Liselotte, Hauel, Petra, Sauer, Gert &
Schulze, Birgitte.
Janua Linguarum, Series Practica 256.
Mouton de Gruyter, Berlin.
Vértes, Edith (1980).
H. Paasonens
südostjakische Textsammlungen.
Suomalais-Ugrilaisen Seuran
Toimituksia 175.
Suomalais-Ugrilainen Seura, Helsinki.
Corpora are running texts and several corpora are morphologically analyzed. Morphologically encoded words of the texts are in the word-per-line format, and the plain texts are in sentence-per-line format. There are also texts in which the clauses and the sentences are marked with the information about the location of the sentences in the texts.
The text includes six different versions: (1) one version edited in the original form by using the Cyrillic alphabet; (2) the same text as transformed to the Latin alphabet; the same text as translated into (3) Finnish, (4) English and (5) Russian, and (6) the original text in the Latin format as morphologically coded and translated into English.
For further information, please contact: the address of the contact person.
Life of Jesus in Khanty (the Kazim dialect). (Second edition).
Translation: Nyomysova, Yevdokiya Andreyevna &
Lozyamova, Zoya Nikiforovna.
ISBN 952-9790-40-6, ISBN 91-88794-83-0. 63 pp.
Institute for Bible Translation.
Stockholm & Helsinki 1997.
The computer corpora on the Khanty dialects and the textbook were compiled and edited by Merja Salo with the financial support of the Academy of Finland. The adaptation of the texts for public use was done with the financial support of the Department of General Linguistics, University of Helsinki. The children's books are donated to the University of Helsinki by the Institute for Bible Translation, Helsinki and Stockholm. The use of the corpora is restricted to research and teaching. Reference to the corpora has to be made in the papers in which they are used as a source.
Metadata descriptions for the Khanty corpora
The Mansi corpus contains the following document:
"Life of Jesus" in Mansi. (Trial edition).
Kartano, Anne (ed.).
Translation: Afanasyeva, Klavdiya.
ISBN 952-9790-35-X, ISBN 91-88794-52-0. 63 pp.
Institute for Bible Translation.
Stockholm & Helsinki.
Document type: running text.
Size of the document: 3,421 words, 23,273 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
The Mansi texts are donated to the University of Helsinki by the Institute for Bible Translation (Helsinki and Stockholm) to be used as research material. The computer corpus of Mansi has been adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. Reference to the corpus has to be done in the papers in which it is used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions for the Mansi corpus
The Nenets computer corpora contain the following sub-corpora:
The use of the corpora is restricted to research and teaching. Reference to the corpora has to be made in the papers in which they are used as a source.
The corpus includes 9,992 sentences, some of them complex, with 39,415 words. Each sentence is preceded by two numbers which refer to its page and place in N. M. Tereshchenko, Nenecko-russkij slovar´ (Moskva: Sovetskaja Ènciklopedija, 1965) [temporarily separated with \]. Each sentence is followed by a transliterated Russian translation [temporarily separated with /].
Each part of a compound word is marked by a hyphen and given a separate morphological analysis. Clitic particles and supplementary elements are not accounted for by the morphological analysis. Russian words within Tundra Nenets sentences are transliterated in the same way as in the translations rather than transcribed phonologically. The character ý is supposed to appear as a stressed y.
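The line layout described above can be read with a few string operations. The sketch below assumes that each entry has the shape `<page> <place>\<sentence>/<translation>`, with the two numbers separated by a space, and that neither separator occurs inside the sentence. These details are assumptions, the function name `parse_entry` is hypothetical, and the sample line uses placeholder tokens rather than corpus data.

```python
def parse_entry(line):
    """Split one corpus line into reference numbers, words and translation."""
    # Assumed layout: <page> <place>\<sentence>/<translation>
    reference, rest = line.split("\\", 1)
    sentence, translation = rest.split("/", 1)
    page, place = (int(n) for n in reference.split())
    # Hyphens mark the parts of compound words, so each word
    # becomes a list of one or more parts.
    words = [word.split("-") for word in sentence.split()]
    return {
        "page": page,
        "place": place,
        "words": words,
        "translation": translation.strip(),
    }


# Placeholder tokens, not taken from the corpus.
entry = parse_entry("12 3\\aaa-bbb ccc/perevod")
print(entry["page"], entry["place"])  # 12 3
print(entry["words"])                 # [['aaa', 'bbb'], ['ccc']]
print(entry["translation"])           # perevod
```

Because the separators are described as temporary, a reader of the real files should check which separators the current version uses before relying on this layout.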
Metadata descriptions for the Nenets corpora
The Enets computer corpus contains the following document:
Fragments of the Gospel of Luke translated into Enets.
Translation: Bolina, Darja Spiridonovna.
ISBN 91-88394-99-9. 31 pp.
Institute for Bible Translation.
Stockholm 1995.
Document type: running text.
Size of the document: 3,548 words, 22,931 characters.
Character encoding: Cyrillic alphabet converted to the ISO 8859-1
(Latin-1).
The Enets texts are donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) to be used as research material. Reference to the corpus has to be done in papers in which it is used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions for the Enets corpus
The Kamas computer corpus contains the following documents:
The source of the documents: Donner, Kai. Manuscripts. In A.J. Joki (ed.): Kai Donners Kamassisches Wörterbuch nebst Sprachproben und Hauptzügen der Grammatik. Lexica Societatis Fenno-ugricae VIII. (Suomalais-Ugrilainen Seura. Helsinki 1944).
Document type: morphologically encoded running texts, which are
translated into German.
Size of the documents: 38,340 words, 215,521 characters.
Character encoding: Cyrillic alphabet converted to the ISO
8859-1 (Latin-1).
The texts of the Kamas corpus were prepared by Jarmo Alatalo during the project Data Bank for Endangered Finno-Ugric Languages. The texts are morphologically encoded and translated into German. The use of the corpus is restricted to research and teaching. Reference to the corpus must be made in the papers in which it is used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions for the Kamas corpus
The Selkup computer corpus contains the following sub-corpora:
The Selkup corpora were prepared from texts collected by various researchers on fieldwork trips at several periods, mostly in the first half of the 20th century. Some of the corpora are based on published materials, but most of the source data is held in the archive of the Finno-Ugrian Society.
The texts of the computer corpora of Selkup were compiled and edited during the project the Data Bank for Endangered Finno-Ugric Languages. Part of work was done with the financial support of the Finno-Ugrian Society, Helsinki. The corpora are adapted to the Unix software system with the financial support of the Department of General Linguistics, University of Helsinki. Reference to the corpora has to be done in papers in which they are used as a source. For further information, please contact: the address of the contact person.
Metadata descriptions for the Selkup corpus