Documentation and conservation of the corpora and categorial encoding


  1. Documentation and metadata descriptions
  2. Structural encoding
  3. Morphological and syntactic encoding
  4. Conservation of the corpora and preparing new versions from the corpora

  1. Documentation and metadata descriptions

    The corpora at the UHLCS have been collected since about the last quarter of the 20th century, which is why the systems used to document them vary. Quite often, the financial resources needed for documentation were not taken into account when the electronic database projects were funded. Publication information (the name of the author, the name of the publisher, the publishing date, the number of the issue, etc.) is recorded for most of the data. Several international projects on the documentation and conservation of electronic data were carried out, particularly in the 1990s; in addition to basic information on documents, they also dealt with detailed information on the structure and semantics of documents. The list below names some of the most important international initiatives for documentation and metadata descriptions of machine-readable data:

    1. TEI: Text Encoding Initiative
    2. EAGLES: Expert Advisory Group on Language Engineering Standards
    3. ISLE: International Standards for Language Engineering
    4. OLAC: Open Language Archives Community
    5. The Open Language Archives Community Archiving and linguistic resources or How to keep your data from becoming endangered
    6. Requirements on the Infrastructure for Open Language Archiving
    7. ISLE tools for documentation and metadata descriptions

    The corpora at the UHLCS that were prepared in close connection with international projects are documented following some version of the TEI project. These corpora were also analyzed structurally according to the principles developed within those projects. The corpora prepared in projects with limited financial resources were documented step by step. In the first phase, only the basic publication data was documented. Later, these data were processed with tools designed for documenting machine-readable data (the ISLE tools).


  2. Structural encoding

    The structural encoding of the corpora varies. Several corpora are plain running text without any structural tagging, while in others the structure is encoded according to standards prepared in international projects (TEI, ISLE, etc.). Where the original structure of the documents has been preserved, the structure of the texts can be distinguished even without additional structural encoding.

    1. /general-linguistics
      1. /afro-asiatic-lgs/.../somali: the corpus consists of texts translated from Finnish into Somali. The corpus is in sentence-per-line format, and the translations are in the same format. Sentences are marked with separate tags, and articles are also separated with tags.
      2. /indo-european-lgs/latin: ASCII, capital letters, /latin: specific numbering of the data,
        /.../english: /gutenberg: sentence-per-line format; /susanne: word-per-line format, the corpus contains grammatical encoding; /WSJ: structural encoding.
        /.../yiddish: /running texts.
        /.../russian: fowler-corpus, spoken, uppsala-corpus: sgml-tags; tampere-corpus: morphologically analyzed data (TWOL)
      3. /uralic-lgs/.../viro-1: reduced structural tagging, viro-2: preprocessed texts;
        /.../finnish: bible: a) running text, b) complex categorial encoding;
        hkv: syntactic analysis, sentence-per-line format.
      4. /multi-lingual-data/words: word-per-line format.

    2. /general-linguistics-kotus
      1. /uralic-lgs/.../finnish: a-contracts: morphologically analyzed data (TWOL), structural encoding (two types), running texts, ftc: morphological encoding (TWOL), running texts, parole: TEI-tags and structural encoding

    3. /language-departments
      1. /indo-european-lgs/.../swedish: fisc: structural encoding, morphological encoding (TWOL), sentence-per-line format and sentence identification, preprocessed texts in sentence-per-line format
      2. /niger-congo-lgs/.../swahili: plain running texts, preprocessed running texts

    4. /multilingual-language-archive
      1. The data received from the IBT: running texts: TEI-encoding can be inserted into the texts from the scripts.
      2. morphologically analyzed corpora:
        1. /chuvash: word-per-line format; the English translation of each word is on the same line as the original word,
        2. /ingrian: word-per-line format, /erzya mordvin: word-per-line format; the English translation of each word is on the same line as the original word,
        3. /hill-mari: word-per-line format; the English translation of each word is on the same line as the original word,
        4. /zyrian: word-per-line format; the Finnish translation of each word is on the same line as the original word; the data are also in sentence-per-line format, and the sentence numbers in the texts are identified,
        5. /khanty (several dialects): word-per-line format; the English translation of each word is on the same line as the original word; the data are also in sentence-per-line format, and the sentence numbers in the texts are identified,
        6. /ume-saami: word-per-line format; the Swedish translation of each word is on the same line as the original word,
        7. /kamas: the sentences are separated with specific delimiters,
        8. /nenets: each sentence is preceded by two numbers which refer to its page and position in the original text,
        9. /selkup (several dialects): morphologically tagged corpus: sentence-per-line and word-per-line formats; the German translation of each word is on the same line as the original word.
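
    The word-per-line format with a translation on the same line can be read with a few lines of code. The sketch below is only an illustration: the source does not state how the original word and its translation are separated, so the tab separator, the function name parse_word_per_line, and the type GlossedToken are assumptions, not the actual UHLCS file layout.

```python
from typing import List, NamedTuple


class GlossedToken(NamedTuple):
    """One line of a word-per-line file: the word and its translation."""
    word: str
    gloss: str  # empty string when the line carries no translation


def parse_word_per_line(text: str, sep: str = "\t") -> List[GlossedToken]:
    """Split word-per-line text into (word, gloss) pairs.

    Assumption: the word and its translation are separated by `sep`
    (a tab by default). Blank lines are skipped.
    """
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # partition() splits at the first separator only, so a
        # translation that itself contains the separator stays whole.
        word, _, gloss = line.partition(sep)
        tokens.append(GlossedToken(word.strip(), gloss.strip()))
    return tokens
```

    If a corpus uses a different separator, it can be passed as the sep argument, for example parse_word_per_line(text, sep=";").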


  3. Morphological and syntactic encoding

    Most of the corpora at the UHLCS are running texts. The corpora from Chuvash, Ingrian, Selkup, and Tundra Nenets, and some corpora from Northern Khanty, Erzya Mordvin, and Komi Zyrian are morphologically analyzed. These corpora were analyzed manually. There are also corpora from Finnish, Swedish, Russian, and Swahili which have been analyzed with automatic morphological analyzers. Automatic morphological analyzers are available for Finnish, English, Swedish, and Swahili, and automatic analyzers for some other languages (e.g. Komi Zyrian and Erzya Mordvin) are under preparation. Syntactic analyzers are available for machine-readable data from Finnish and English. The documentation and analyses of the corpora at the UHLCS are described in the following documents:

    Hakulinen, Auli, Fred Karlsson, and Maria Vilkuna. 1980. Suomen tekstilauseiden piirteitä: kvantitatiivinen tutkimus. Publications No. 6. Helsinki: Department of General Linguistics, University of Helsinki.

    Koskenniemi, Kimmo. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Publications No. 11. Helsinki: Department of General Linguistics, University of Helsinki.

    Suihkonen, Pirkko. 1997. Documentation of the Computer Corpora of Uralic Languages at the University of Helsinki. Technical Reports, No. TR-2. Helsinki: Department of General Linguistics, University of Helsinki. Pp. 16–51.

    Suihkonen, Pirkko. 2003. Metadata descriptions for combining information on multimodal data located at the University of Helsinki Language Corpus Server. In Sándor Darányi (ed.). HOMO 2003 - Information society, cultural heritage and folklore text analysis, 24-26 November 2003, Budapest, Hungary.


  4. Conservation of the corpora and preparing new versions from the corpora

    1. "For the databanks of different languages made up of the electronic texts/language material, the material will be processed both manually and electronically, but their contents shall not be changed. For example, information about clauses, paragraphs, word classes or other linguistic properties which are required for the academic research into the material as well as the data required when processing the material electronically can be linked to the texts/language material" (Data contract form).
    2. The original versions of the corpora and all the versions prepared from them are preserved in separate sub-directories alongside the corpora available for public use. In the preliminary versions prepared for public use, the characters or character combinations adapted to the UNIX operating system correspond to the characters in the original texts. New versions of the corpora have to be prepared according to the principles defined in the contracts between the owners of the corpora and the University of Helsinki, Department of General Linguistics.

    3. New versions of the corpora must also be documented with metadata descriptions. The origin of the corpus and all the phases needed to prepare the new version must be described in detail in the metadata file. The metadata file must also contain the same information on the origin of the corpus as the metadata prepared for the original corpus.
    4. All new corpora must be documented with metadata descriptions.
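
    The requirements in items 3 and 4 can be illustrated with a small data structure. This is a minimal sketch, not the metadata schema actually used at the UHLCS or defined by TEI, OLAC, or ISLE: the class CorpusMetadata, its field names, and the function derive_version are all hypothetical. It shows only the rule that a new version carries over the origin information of the original corpus unchanged and records every processing phase.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CorpusMetadata:
    """Hypothetical metadata record for one version of a corpus."""
    title: str
    # Publication data of the source, e.g. author, publisher, date.
    origin: Dict[str, str]
    # Every phase applied to reach this version, in order.
    processing_steps: List[str] = field(default_factory=list)


def derive_version(original: CorpusMetadata, title: str,
                   steps: List[str]) -> CorpusMetadata:
    """Create the metadata for a new version of a corpus.

    The origin information is copied from the original record, and the
    new phases are appended to the phases already recorded, so the
    original record is left untouched.
    """
    return CorpusMetadata(
        title=title,
        origin=dict(original.origin),
        processing_steps=original.processing_steps + list(steps),
    )
```

    Copying the origin rather than sharing it keeps the record of the original corpus independent of any later changes to the derived record.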


© P.S. 2007; Last modified: Mon Nov 24 17:59:05 EET 2008