THE COMPUTER CORPORA OF THE INDO-EUROPEAN LANGUAGES

Latin, Greek, Ossete, Tajik, Kurdish (Kurmanji), English, German, Yiddish, Swedish, Russian, Ukrainian


The Ossete computer corpus

The Ossete computer corpus contains the following documents:

  1. Children's Bible in the Ossete Language.
    Arapovich, Borislav & Mattelmäki, Vera (eds.).
    ISBN 91-88394-22-0. 542 pp.
    Institute for Bible Translation.
    Stockholm 1993.
    Size of the document: 64,865 words, 427,239 characters.
    Document type: running text.
    Character encoding: the original uses an extended Cyrillic alphabet system; on the UNIX operating system the text is stored in ISO 8859-1 (Latin-1).

  2. "Jesus Friend of Children" in Ossete.
    ISBN 91-88394-37-9. 65 pp.
    Institute for Bible Translation.
    Stockholm 1994.
    Size of the document: 7,557 words, 51,684 characters.
    Document type: running text.
    Character encoding: the original uses an extended Cyrillic alphabet system; on the UNIX operating system the text is stored in ISO 8859-1 (Latin-1).

  3. The Gospel of Mark in the Ossete Language.
    ISBN 91-88394-96-4, ISBN 5-89116-001-3.
    The Institute for Bible Translation.
    Stockholm.
    Size of the document: 11,650 words, 77,270 characters.
    Document type: running text.
    Character encoding: the original uses an extended Cyrillic alphabet system; on the UNIX operating system the text is stored in ISO 8859-1 (Latin-1).

The Ossete texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm) to be used as research material. The corpora must be cited in any paper that uses them as a source. For further information, please contact: address of the contact person.

Metadata descriptions for the Ossete corpus

The Tajik computer corpus

The Tajik texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm) to be used as research material. The corpora must be cited in any paper that uses them as a source. For further information, please contact: address of the contact person.

Metadata descriptions for the Tajik corpus

The Kurdish computer corpus

The Kurdish computer corpus contains the following documents:

The Kurdish texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) to be used as research material. The corpora must be cited in each paper that uses them as a source.

For further information, please contact: address of the contact person.

Metadata descriptions of the Kurdish corpus



The Greek computer corpus

The Greek corpus contains the following documents:

  1. Greek New Testament text, Nestle-Aland 26th ed. Copyright 1979. Deutsche Bibelgesellschaft, Stuttgart. The corpus can be used as research material by permission of the copyright holder.
  2. Greek New Testament Database, prepared by Paul A. Miller. Copyright 1988. The Gramcord Institute, Trinity Evangelical Divinity School.

The documents are licensed directly from the copyright holders. The analyses in the original versions were corrected when the documents were adapted to the University of Helsinki Language Corpus Server.

Use of the corpora located at the University of Helsinki Corpus Server is restricted to research and teaching. The corpora must be cited in any paper that uses them as a source. For further information, please contact: address of the contact person.

Metadata descriptions for the Greek corpus



The Latin computer corpora

The Latin computer corpora located at the University of Helsinki Language Corpus Server consist of various texts in Latin.

The corpora located at the University of Helsinki Language Corpus Server can be used as research material. The corpora must be cited in any paper that uses them as a source.

For further information, please contact: address of the contact person.



The English computer corpora

The University of Helsinki Language Corpus Server contains the following machine-readable corpora of the English language:

The corpora located at the University of Helsinki Language Corpus Server may be used as research material. Information on the accessibility of the corpora is available on the websites of the UHLCS. The corpora must be cited in any paper that uses them as a source. For further information, please contact: address of the contact person.

Metadata descriptions for the English corpora



The German computer corpus

The corpus of German literature located at the University of Helsinki Language Corpus Server contains samples of German literature. The corpus is available as plain running text and in the sentence-per-line format. Both versions are indexed.

The corpora located at the University of Helsinki Language Corpus Server can be used as research material. Information on the accessibility of the corpora is available on the websites of the UHLCS. The corpora must be cited in any paper that uses them as a source. For further information, please contact: address of the contact person.

Metadata descriptions for the German corpus



The Yiddish corpus

The directory /corp/yiddish/ consists of the following sub-directories:

/yiddish/yiddish-texts/
/yidish/

The sub-directory /yiddish-texts/ includes the following files:

royte_pomerantsen.i
royte_pomerantsen.ii
royte_pomerantsen.iii

The size of the corpus:

          sentences    words   characters
text 1        3,089   20,464      110,345
text 2        3,211   21,485      115,306
text 3        2,132   12,949       70,126
sum           8,432   54,898      295,777
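
Size figures of this kind can be reproduced from a file in the sentence-per-line format. The following Python sketch is only an illustration: the function name corpus_size is hypothetical, and it assumes that words are whitespace-separated tokens and that characters are counted without line breaks. The counting rules actually used for the figures above are not documented here, so results may differ.

```python
from typing import Iterable, Tuple


def corpus_size(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (sentences, words, characters) for sentence-per-line text.

    Assumptions (not taken from the corpus documentation):
    - every non-blank line is one sentence,
    - words are whitespace-separated tokens,
    - characters exclude the trailing line break.
    """
    sentences = words = characters = 0
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # blank lines are not counted as sentences
        sentences += 1
        words += len(text.split())
        characters += len(text)
    return sentences, words, characters


if __name__ == "__main__":
    # Invented sample lines, used only to show the call.
    sample = ["A first sentence.\n", "\n", "A second one.\n"]
    print(corpus_size(sample))
```

For a real file, the same function can be called on an open file object, since iterating over a file yields its lines one at a time.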

The sub-directory /yidish/ is under construction. (P.S., August 7, 1998)

The Yiddish texts were donated to the University of Helsinki by Jussi Karlgern (University of Helsinki, Department of General Linguistics). The corpora must be cited in any paper that uses them as a source. For further information, please contact: address of the contact person.

Metadata descriptions for the Yiddish corpus

The Swedish computer corpora

The University of Helsinki Language Corpus Server contains the following machine-readable corpora of the Swedish language:

  1. The Finland Swedish corpus:
    The Finland Swedish corpus was prepared under the auspices of Mirja Saari in the first half of the 1990s. The corpus is located at the Department of Scandinavian Languages and Literature, University of Helsinki. A copy of the corpus is available at the University of Helsinki Language Corpus Server. For more information on the corpus: http://www.nord.helsinki.fi/press.html.
  2. The Swedish text corpus:
    The University of Helsinki Language Corpus Server contains a Swedish text corpus in electronic form. The corpus consists of plain running texts, which have been preprocessed and converted to the sentence-per-line format. The preprocessed texts are marked with the ending ".pre", and the texts in the sentence-per-line format with the ending ".snt". There is also a morphologically analyzed version of the corpus (cf. the metadata file "Swedish-A.Tagged-Swedish-Text-Corpus").
  3. The tagged Swedish text corpus:
    The Tagged Swedish Text Corpus consists of various text types in the running text format. The corpus was morphologically analyzed with the TWOL analyzer (two-level analyzer) of Swedish. The text types are indicated in the names of the sub-corpora. The tagged Swedish corpus was prepared at the University of Helsinki, Department of General Linguistics, in the first part of the 1990s.
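
The file endings described in item 2 make it possible to tell the two formats apart by name. A minimal Python sketch, assuming nothing beyond the two endings mentioned above; the function name classify and the sample file names are hypothetical and do not refer to actual files on the server.

```python
from pathlib import PurePosixPath

# Endings as described for the Swedish text corpus.
FORMATS = {
    ".pre": "preprocessed running text",
    ".snt": "sentence-per-line",
}


def classify(filename: str) -> str:
    """Map a file name to its format by its ending, or 'unknown'."""
    suffix = PurePosixPath(filename).suffix
    return FORMATS.get(suffix, "unknown")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ["example.pre", "example.snt", "README"]:
        print(name, "->", classify(name))
```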

Character encoding: ASCII.

Use of the corpora located at the University of Helsinki Language Corpus Server is restricted to research and teaching. Information on the accessibility of the corpora is available on the websites of the UHLCS. The corpora must be cited in any paper that uses them as a source. For further information, please contact: address of the contact person.

Metadata descriptions for the Swedish corpora



The Russian computer corpora

The Russian computer corpora located at the University of Helsinki Language Corpus Server contain the following sub-corpora:

  1. The corpus on Russian literature (the Fowler Database)
    The corpus of Russian literature (the Fowler Database) located at the University of Helsinki Language Corpus Server contains original Russian literature of different types (novels, poetry, and short stories) and also Russian translations. The corpus is available in the running text format and in the sentence-per-line format. Both versions are indexed.

    The size of the corpus in the sentence-per-line format (including the indices): 133,198 sentences, 2,736,636 words, 18,866,094 characters.

  2. The Russian Magazine Database: The Russian Magazine Database consists of articles published in the following three magazines: Novoe Vremja, Ogonek and Sputnik. The preprocessed texts are in the sentence-per-line format.

    The size of the database: 110,967 words, and 797,766 characters.
    Text type: Running texts in the sentence-per-line format.

  3. The Russian Ryscard Database:
    The Russian Ryscard Database consists of corpora that were originally prepared at Uppsala University. At the University of Helsinki, Department of General Linguistics, the corpora were adapted to the SGML format. The University of Helsinki, Department of General Linguistics, is a distributor of the Ryscard Database. The Russian Ryscard Database consists of preprocessed texts in the sentence-per-line format.

    The size of the database in the SGML format is 374,804 words and 2,444,605 characters (including the tags and punctuation).

  4. The Russian Corpus of Newspaper Articles:
    The Russian Corpus of Newspaper Articles located at the University of Helsinki Language Corpus Server was received from Lingsoft Oy (Jyri Sooberg). At the University of Helsinki, Department of General Linguistics, the corpus was adapted to the SGML format, and the University of Helsinki, Department of General Linguistics, is a distributor of the corpus. Documentation of the corpus is available at the University of Helsinki Language Corpus Server.

    The size of the corpus: 179,724 words, 1,343,039 characters.

  5. The Russian Uppsala Database:
    The Russian Uppsala Database consists of corpora that were originally prepared at Uppsala University. At the University of Helsinki, Department of General Linguistics, the corpora were adapted to the SGML format. The University of Helsinki, Department of General Linguistics, is a distributor of the Ryscard Database. The Russian Uppsala Database consists of preprocessed texts in the SGML format. More information on the documentation is available within the corpus.

    The size of the database in the SGML format: 1,450,122 words and 9,639,469 characters (including the tags and punctuation).

The corpora located at the University of Helsinki Language Corpus Server may be used as research material. Information on the accessibility of the corpora is available on the websites of the UHLCS. The corpora must be cited in any paper that uses them as a source.

For further information, please contact: address of the contact person.

Metadata descriptions for the Russian corpora



The Ukrainian computer corpus

The Ukrainian computer corpus contains the following document:

The Ukrainian texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) to be used as research material. The corpora must be cited in any paper that uses them as a source. For further information, please contact: address of the contact person.

Metadata descriptions for the Ukrainian corpus



University of Helsinki Language Corpus Server


P.S., 1995; 1998; 2002; 2007.