Helsinki Corpora II (Dec. 1996)
Computer Corpora of Languages of the Former Soviet Union in the
University of Helsinki Language Corpus Server in 1996. In
Suihkonen, Pirkko (1998). Documentation of the Computer Corpora of
the Uralic Languages at the University of Helsinki. Technical Reports
TR-2. University of Helsinki. Department of General
Linguistics. Appendix III (pp. 63-74).
I. North-East Caucasian languages
1. Avar
II. Iranian languages
1. Kurdish
2. Ossete
3. Tajik
III. Chukotko-Kamchatkan languages
1. Chukchi
2. Koryak
IV. Slavonic languages
1. Ukrainian
V. Tungusic languages
1. Even
2. Evenki
3. Nanai
VI. Turkic languages
1. Balkar
2. Crimean Tatar
3. Khakas
4. Kirghiz
5. Tatar
6. Turkmen
7. Uighur
8. Yakut
VII. Uralic languages
1. Enets
2. Karelian
2.1. Dvina-Karelian
2.2. Olonets-Karelian
3. Khanty
4. Komi
4.1. Komi Permyak
4.2. Komi (Zyrian)
5. Mari
5.1. Eastern Mari
5.2. Western Mari
6. Mordvin
6.1. Erzya
6.2. Moksha
7. Nenets
8. Saami
8.1. Northern Saami
9. Selkup
10. Udmurt
11. Vepsian (Veps)
I. North-East Caucasian languages
1. Avar
Location: /corp/caucasian-lgs/avar/
Files: 1 (plain text)
Words: 12994
Characters: 107818 (including punctuation)
Description: running text.
II. Iranian languages
1. Kurdish
Location: /corp/iranian-lgs/kurdish/
Files: 1 (plain text)
Words: 24917
Characters: 137252 (including punctuation)
Description: running text.
2. Ossete
Location: /corp/iranian-lgs/ossete/
Files: 1 (plain text)
Words: 72507
Characters: 617109 (including punctuation)
Description: running text.
3. Tajik
Location: /corp/iranian-lgs/tajik/
Files: 1 (plain text)
Words: 87654
Characters: 658738 (including punctuation)
Description: running text.
III. Chukotko-Kamchatkan languages
1. Chukchi
Location: /corp/paleo-siberian-lgs/chukchi/
Files: 1 (plain text)
Words: 2918
Characters: 27168 (including punctuation)
Description: running text.
2. Koryak
Location: /corp/paleo-siberian-lgs/koryak/
Files: 1 (plain text)
Words: 2909
Characters: 24577 (including punctuation)
Description: running text.
IV. Slavonic languages
1. Ukrainian
Location: /corp/ukrainian/
Files: 1 (plain text)
Words: 7787
Characters: 46920 (including punctuation)
Description: running text.
V. Tungusic languages
1. Even
Location: /corp/tungusic-lgs/even/
Files: 1 (plain text)
Words: 2849
Characters: 23405 (including punctuation)
Description: running text.
2. Evenki
Location: /corp/tungusic-lgs/evenki/
Files: 1 (plain text)
Words: 1698
Characters: 15839 (including punctuation)
Description: running text.
3. Nanai
Location: /corp/tungusic-lgs/nanai/
Files: 1 (plain text)
Words: 2741
Characters: 21145 (including punctuation)
Description: running text.
VI. Turkic languages
1. Balkar
Location: /corp/turkic-lgs/balkar/
Files: 31 (plain text; a sentence-per-line format)
Paragraphs: 13776
Words: 131509 (including punctuation)
Characters: 1698424
Description: running text.
2. Crimean Tatar
Location: /corp/turkic-lgs/crimean-tatar/
Files: 1 (plain text)
Words: 57054
Characters: 597512 (including punctuation)
Description: running text.
3. Khakas
Location: /corp/turkic-lgs/khakas/
Files: 1 (plain text)
Words: 12826
Characters: 138534 (including punctuation)
Description: running text.
4. Kirghiz
Location: /corp/turkic-lgs/kirghiz
Files: 1 (plain text)
Words: 7006
Characters: 50031 (including punctuation)
Description: running text.
5. Tatar
Location: /corp/turkic-lgs/tatar/
Files: 1 (plain text)
Words: 15237
Characters: 100791 (including punctuation)
Description: running text.
6. Turkmen
Location: /corp/turkic-lgs/turkmen/
Files: 31 (plain text; a sentence-per-line format)
Words: 206867 (including punctuation)
Characters: 1562133
Description: running text.
7. Uighur
Location: /corp/turkic-lgs/uighur/
Files: 1 (plain text)
Words: 22533
Characters: 161503 (including punctuation)
Description: running text.
8. Yakut
Location: /corp/turkic-lgs/yakut/
Files: 3 (plain text)
Words: 60269
Characters: 743763 (including punctuation)
Description: running text.
VII. Uralic languages
1. Enets
Files: 1 (plain text)
Words: 3547
Characters: 22813 (including punctuation)
Description: running text.
2. Karelian
2.1. Dvina-Karelian
Location: /corp/uralic-lgs/karelian/divina-karelian/
Files: 1 (plain text; a sentence-per-line format)
Sentences: 532
Words: 4757
Characters: 36207 (including punctuation)
Description: running text.
Files: 1 (plain text)
Words: 14127
Characters: 44230 (including punctuation)
Description: running text.
2.2. Olonets-Karelian
Location: /corp/uralic-lgs/karelian/olonets-karelian/
Files: 3 (plain text; a sentence-per-line format)
Sentences: 3299
Words: 33083
Characters: 236870 (including punctuation)
Description: running text.
Files: 1 (plain text)
Words: 56932
Characters: 444230 (including punctuation)
Description: running text.
3. Khanty
Location: /corp/uralic-lgs/khanti/
Files: 4 (plain text in a sentence-per-line format)
Sentences: 1385
Words: 24922 (including punctuation)
Characters: 138220
Description: running text.
Files: 3 (a clause-per-line format + references)
Clauses: 4069
Words: 20332 (including punctuation)
Characters: 164415
Description: running text.
Files: 3 (tagged texts in a word-per-line format)
Lines: 11694
Words: 21795 (including punctuation)
Characters: 309242
Description: running text.
Files: 1 (plain text)
Words: 3168
Characters: 18879 (including punctuation)
Description: running text.
4. Komi
4.1. Komi Permyak
Location: /corp/uralic-lgs/komi/permyak/
Files: 1 (plain text)
Words: 12241
Characters: 157105 (including punctuation)
Description: running text.
4.2. Komi (Zyrian)
Location: /corp/uralic-lgs/komi/komi/
Files: 4 (plain text)
Paragraphs: 1001
Words: 31046 (including punctuation)
Characters: 220378
Description: running text.
Files: 4 (a sentence-per-line format)
Sentences: 3421
Words: 38679 (including punctuation)
Characters: 229452
Description: running text.
Files: 3 (tagged data in a word-per-line format + references)
Lines: 6223
Words: 10694 (including punctuation)
Characters: 151541
Description: running text.
Files: 2 (plain text)
Words: 19206
Characters: 134101 (including punctuation)
Description: running text.
5. Mari
5.1. Eastern Mari
Location: /corp/uralic-lgs/mari/eastern-mari/
Files: 1 (a sentence-per-line format)
Sentences: 1512
Words: 16690
Characters: 98489 (including punctuation)
Description: running text.
Files: 2 (plain text)
Words: 21947
Characters: 196622 (including punctuation)
Description: running text.
5.2. Western Mari
Location: /corp/uralic-lgs/mari/western-mari/
Files: 1 (a sentence-per-line format)
Sentences: 505
Words: 3712
Characters: 28626 (including punctuation)
Description: running text.
Files: 1 (plain text)
Words: 6833
Characters: 47887 (including punctuation)
Description: running text.
6. Mordvin
6.1. Erzya
Location: /corp/uralic-lgs/mordvin/erzya/
Files: 15 (a sentence-per-line format)
Sentences: 5570
Words: 76055 (including punctuation)
Characters: 852493
Description: running text.
Files: 3 (plain text)
Words: 12072
Characters: 149567 (including punctuation)
Description: running text.
6.2. Moksha
Location: /corp/uralic-lgs/mordvin/moksha/
Files: 3 (plain text)
Sentences: 946
Words: 12575
Characters: 147877 (including punctuation)
Description: running text.
Files: 1 (plain text)
Words: 5935
Characters: 44801 (including punctuation)
Description: running text.
7. Nenets
Location: /corp/uralic-lgs/nenets/
Files: 1 (plain text)
Words: 3179
Characters: 23786 (including punctuation)
Description: running text.
8. Saami
8.1. Northern Saami
Location: /corp/uralic-lgs/saami/northern-saami/
Files: 1 (plain text in a sentence-per-line format)
Sentences: 2194
Words: 28664 (including punctuation)
Characters: 214119
Description: running text.
9. Selkup
Location: /corp/uralic-lgs/selkup/
Files: 1 (plain text in a sentence-per-line format)
Lines: 5678
Words: 29712 (including punctuation)
Characters: 265509
Description: running text.
Files: 1 (morphologically and syntactically tagged sentences translated into German)
Lines: 68284
Words: 129191
Characters: 745220 (including punctuation)
Description: running text.
10. Udmurt
Location: /corp/uralic-lgs/udmurt
Files: 29 (plain text; a sentence-per-line format)
Sentences: 10111
Words: 67542 (including punctuation)
Characters: 665911
Description: running text.
Files: 25 (a paragraph-per-line format)
Lines: 9710
Words: 150515 (including punctuation)
Characters: 1388496
Description: running text.
Files: 1 + Lists of Variables (a sentence-per-line format)
Sentences: 2330
Characters: 475320
Description: Statistically coded running texts.
Files: 1 (plain text)
Words: 7314
Characters: 49690
Sentences: 2194
Description: running text.
11. Vepsian (Veps)
Location: /corp/uralic-lgs/vepsian/
Files: 3 (plain text in a sentence-per-line format)
Sentences: 3287
Words: 34327
Characters: 234533 (including punctuation)
Description: running text.
Files: 9 (plain text)
Words: 62007
Characters: 793278 (including punctuation)
Description: running text.
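The entries in this listing share a fixed "Field: value" layout, so they can be read mechanically. The following is a minimal Python sketch, not part of the original documentation. The function name parse_entries and the choice of which fields count as numeric are assumptions made for illustration. An entry with no Location line is left without one rather than given a guessed path.

```python
import re

# Field labels as they appear in the catalogue entries.
FIELD = re.compile(
    r"^(Location|Files|Paragraphs|Sentences|Clauses|Lines|Words|Characters|Description)"
    r"\s*:\s*(.*)$"
)
# Headings such as "VII. Uralic languages", "4. Komi", "4.2. Komi (Zyrian)".
HEADING = re.compile(r"^(?:[IVX]+|\d+(?:\.\d+)*)\.\s+\S")
# Fields whose value starts with an integer count (assumption for this sketch).
COUNTS = {"Files", "Paragraphs", "Sentences", "Clauses", "Lines", "Words", "Characters"}


def parse_entries(text):
    """Return one dict per 'Files:' record found in the catalogue text."""
    entries = []
    location = None   # most recent Location under the current heading
    current = None    # record currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        match = FIELD.match(line)
        if match is None:
            if HEADING.match(line):
                # A new heading starts a new language; forget the old path.
                location = None
                current = None
            continue
        key, value = match.group(1), match.group(2)
        if key == "Location":
            location = value
            continue
        if key == "Files":
            # Every 'Files:' line opens a new record under the same Location.
            current = {"Location": location}
            entries.append(current)
        if current is None:
            continue
        if key in COUNTS:
            number = re.match(r"\d+", value)
            current[key] = int(number.group()) if number else None
        else:
            current[key] = value
    return entries


sample = """7. Nenets
Location: /corp/uralic-lgs/nenets/
Files: 1 (plain text)
Words: 3179
Characters: 23786 (including punctuation)
Description: running text.
"""

for entry in parse_entries(sample):
    print(entry["Location"], entry["Words"], entry["Characters"])
```

Only the leading integer of each count is kept, so notes such as "(including punctuation)" are dropped. A fuller version would need to keep them, because the listing attaches that note sometimes to Words and sometimes to Characters.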
University of Helsinki Language Corpus Server
P.S., Aug 27, 1998; 2007