THE COMPUTER CORPORA OF THE INDO-EUROPEAN LANGUAGES

Latin, Greek, Ossete, Tajik, Kurdish (Kurmanji), English, German, Yiddish, Swedish, Russian, Ukrainian


The Ossete computer corpus

The Ossete computer corpus contains the following documents:

  1. Children's Bible in the Ossete Language.
    Arapovich, Borislav & Mattelmäki, Vera (eds.).
    ISBN 91-88394-22-0. 542 pp.
    Institute for Bible Translation.
    Stockholm 1993.
    Size of the document: 64,865 words, 427,239 characters.
    Document type: running text.
    Character encoding: originally an extended Cyrillic alphabet system; on the UNIX operating system, ISO 8859-1 (Latin-1).

  2. "Jesus Friend of Children" in Ossete.
    ISBN 91-88394-37-9. 65 pp.
    Institute for Bible Translation.
    Stockholm 1994.
    Size of the document: 7,557 words, 51,684 characters.
    Document type: running text.
    Character encoding: originally an extended Cyrillic alphabet system; on the UNIX operating system, ISO 8859-1 (Latin-1).

  3. The Gospel of Mark in the Ossete Language.
    ISBN 91-88394-96-4, ISBN 5-89116-001-3.
    The Institute for Bible Translation.
    Stockholm.
    Size of the document: 11,650 words, 77,270 characters.
    Document type: running text.
    Character encoding: originally an extended Cyrillic alphabet system; on the UNIX operating system, ISO 8859-1 (Latin-1).

The Ossete texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm) for use as research material. Any paper that uses the corpora as a source must cite them. For further information, please contact: address of the contact person.

Metadata descriptions for the Ossete corpus

The Tajik computer corpus

The Tajik texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm) for use as research material. Any paper that uses the corpora as a source must cite them. For further information, please contact: address of the contact person.

Metadata descriptions for the Tajik corpus

The Kurdish computer corpus

The Kurdish computer corpus contains the following documents:

The Kurdish texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) for use as research material. Any paper that uses the corpora as a source must cite them.

For further information, please contact: address of the contact person.

Metadata descriptions of the Kurdish corpus



The Greek computer corpus

The Greek corpus contains the following documents:

  1. Greek New Testament text, Nestle-Aland 26th ed. Copyright 1979. Deutsche Bibelgesellschaft, Stuttgart. The corpus can be used as research material by permission of the copyright holder.
  2. Greek New Testament Database, prepared by Paul A. Miller. Copyright 1988. The Gramcord Institute, Trinity Evangelical Divinity School.

The documents are licensed directly from the copyright holders. The analyses in the original versions were corrected when the documents were adapted to the University of Helsinki Language Corpus Server.

The use of the corpora located at the University of Helsinki Language Corpus Server is restricted to research and teaching. Any paper that uses the corpora as a source must cite them. For further information, please contact: address of the contact person.

Metadata descriptions for the Greek corpus



The Latin computer corpora

The Latin computer corpora located at the University of Helsinki Language Corpus Server consist of various texts in Latin.

The corpora located at the University of Helsinki Language Corpus Server can be used as research material. Any paper that uses the corpora as a source must cite them.

For further information, please contact: address of the contact person.



The English computer corpora

The University of Helsinki Language Corpus Server contains several large corpora of the English language.

The University of Helsinki Language Corpus Server contains the following machine-readable corpora of the English language:

  • Gutenberg corpus:
    For more information on the Gutenberg corpora, see the following web addresses: (1) http://promo.net/pg/history.html (2) http://promo.net/pg/.
  • Susanne corpus:
    For more information on the Susanne corpus, see the following web addresses: (1) http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/public/susanne.html (2) http://www.cogs.susx.ac.uk/users/geoffs/RSue.html.
  • Wall Street Journal:
    The directories "wsj" 1 through 41 contain text from the Wall Street Journal. The directory "bin" holds scripts for word frequency counts and KWIC concordance creation. (Atro Voutilainen, Dec. 13, 1990)
The corpora located at the University of Helsinki Language Corpus Server can be used as research material. Information on the accessibility of the corpora is available on the websites of the UHLCS. Any paper that uses the corpora as a source must cite them. For further information, please contact: address of the contact person.
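
The word-frequency and KWIC-concordance scripts mentioned for the Wall Street Journal directories are not reproduced on this page. As an illustration only, the following is a minimal sketch of what such scripts typically compute. The function names, the tokenization rule, and the context width are assumptions made for this example, not the contents of the "bin" directory.

```python
import re
from collections import Counter


def tokenize(text):
    """Return lowercased word tokens (assumed rule: ASCII letters, inner apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def word_frequencies(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword.

    width is the number of tokens shown on each side (hypothetical default).
    """
    tokens = tokenize(text)
    key = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == key:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    sample = "The market rose. The market fell, and the index followed the market."
    # Print the two most frequent tokens, then one concordance line per hit.
    for word, count in word_frequencies(sample).most_common(2):
        print(word, count)
    for left, key, right in kwic(sample, "market", width=2):
        print(f"{left:>15} [{key}] {right}")
```

A keyword-in-context listing puts every occurrence of the search word in a fixed column with its neighbouring words on either side, which makes recurring patterns easy to scan.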

    Metadata descriptions for the English corpora



    The German computer corpus

    The corpus of German literature located at the University of Helsinki Language Corpus Server contains samples of German literature. It is available as plain running text and in sentence-per-line format, and both versions are indexed.

    The corpora located at the University of Helsinki Language Corpus Server can be used as research material. Information on the accessibility of the corpora is available on the websites of the UHLCS. Any paper that uses the corpora as a source must cite them.

    Metadata descriptions for the German corpus



    The Yiddish corpus

    The size of the Yiddish corpus:

              sentences    words  characters
    text 1        3,089   20,464     110,345
    text 2        3,211   21,485     115,306
    text 3        2,132   12,949      70,126
    sum           8,432   54,898     295,777

    Any paper that uses the corpora as a source must cite them.

    Metadata descriptions for the Yiddish corpus

    The Swedish computer corpora

    The University of Helsinki Language Corpus Server contains several large machine-readable corpora of the Swedish language.

    1. The Finland Swedish corpus:
      The Finland Swedish corpus was prepared under the auspices of Mirja Saari in the first half of the 1990s. The corpus is located at the Department of Scandinavian Languages and Literature, University of Helsinki. A copy of the corpus is available at the University of Helsinki Language Corpus Server. For more information on the corpus: http://www.nord.helsinki.fi/press.html.
    2. The Swedish text corpus:
      The University of Helsinki Language Corpus Server contains a Swedish text corpus in electronic form. The corpus consists of plain running texts, which have been preprocessed and converted to the sentence-per-line format. The preprocessed texts are marked with the ending ".pre", and the texts in the sentence-per-line format with the ending ".snt". There is also a morphologically analyzed version of the corpus (see the metadata file "Swedish-A.Tagged-Swedish-Text-Corpus").
    3. The tagged Swedish text corpus:
      The Tagged Swedish Text Corpus consists of various text types in the running text format. The corpus was morphologically analyzed with the TWOL (two-level) analyzer of Swedish. The text types are indicated in the names of the sub-corpora. The tagged Swedish corpus was prepared at the University of Helsinki, Department of General Linguistics, in the first part of the 1990s.

    Character encoding: ASCII.
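
    In the sentence-per-line format mentioned above, each sentence occupies its own line. How the corpus texts were actually converted is not documented on this page. The following is a minimal sketch only, assuming a naive split after sentence-final punctuation; the function name and the splitting rule are hypothetical and do not handle abbreviations or quotations.

```python
import re

# Assumed rule: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def to_sentence_per_line(text):
    """Return the text with one sentence per line (naive, illustrative splitter)."""
    # Collapse line breaks and repeated spaces of the running text first.
    normalized = " ".join(text.split())
    if not normalized:
        return ""
    return "\n".join(_SENTENCE_BREAK.split(normalized))


if __name__ == "__main__":
    running_text = "Det regnar. Vi ser\nen bil! Var?"
    print(to_sentence_per_line(running_text))
```

    A production converter would need a list of abbreviations and language-specific rules; the sketch only shows the shape of the output format.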

    The use of the corpora located at the University of Helsinki Language Corpus Server is restricted to research and teaching. Information on the accessibility of the corpora is available on the websites of the UHLCS. Any paper that uses the corpora as a source must cite them. For further information, please contact: address of the contact person.

    Metadata descriptions for the Swedish corpora



    The Russian computer corpora

    The following Russian corpora are available at the University of Helsinki Language Corpus Server:

    1. The corpus of Russian literature (the Fowler Russian Database):
      The corpus of Russian literature (the Fowler Database) located on the University of Helsinki Language Corpus Server contains original Russian literature of different types (novels, poetry, and short stories) as well as samples of literature translated into Russian. The corpus consists of running text and is available in the sentence-per-line format and in SGML format. More information on the Fowler Database is available on the following website

    2. The Tampere Corpus:
      The Tampere corpus "contains a corpus of texts that are taken from Russian journals in 1999 and 2000. The texts are collected at the University of Tampere (Department of Translation Studies, Russian Language and Culture)" (README, the Tampere corpus). Some texts are transliterated into the Latin alphabet. The alphabet used in transliteration is given in the README file. The texts were analyzed morphologically with the automatic morphological analyzer of Russian (TWOL).

    3. The Corpus of Spoken Russian:
      The original texts of the Corpus of Spoken Russian come from the Institute of Russian Language in Moscow. The corpus is in SGML format; the SGML version was prepared at the Department of General Linguistics, University of Helsinki. The corpus is documented in the README file.
    4. The Russian Uppsala Corpus:
      The Russian Uppsala Corpus was prepared at the Department of Slavic Studies at Uppsala University, Sweden. The corpus is described in detail on the websites of the Uppsala Corpus at Uppsala University, Department of Slavic Studies. The corpus was converted to SGML format at the University of Helsinki, Department of General Linguistics. Permission to use the corpus at the University of Helsinki Language Corpus Server was obtained from Professor Ingrid Maier and Professor Lennart Lönngren in Sept. 2007 (The Uppsala Russian Corpus).

      The size of the database in SGML format: 1,450,122 words and 9,639,469 characters (including the tags and punctuation).

    The Russian corpora located at the University of Helsinki Language Corpus Server can be used as research material. Any document that uses the corpora as a source must cite them.

    Metadata descriptions for the Russian corpora



    The Ukrainian computer corpus

    The Ukrainian computer corpus contains the following document:

    The Ukrainian texts were donated to the University of Helsinki by the Institute for Bible Translation (Stockholm, Sweden) for use as research material. Any paper that uses the corpora as a source must cite them. For further information, please contact: address of the contact person.

    Metadata descriptions for the Ukrainian corpora



    University of Helsinki Language Corpus Server


    P.S., 1995; 1998; 2002; 2007.