The corpora at the UHLCS have been collected since the last quarter of the 20th century, which is why the systems used to document them vary. Quite often, the financial resources needed for documentation were not taken into account when the electronic database projects were funded. For most of the data, the documentation records the publication information (the name of the author, the name of the publisher, the publication date, the issue number, etc.). Several international projects on the documentation and preservation of electronic data were carried out, particularly in the 1990s; in addition to basic information on documents, they also dealt with detailed information on the structure and semantics of documents. The list below contains the addresses of some of the most important international initiatives for documentation and metadata description of machine-readable data:
The UHLCS corpora that were prepared in close connection with international projects are documented according to some version of the TEI guidelines. These corpora were also analyzed structurally according to the principles developed within those projects. The corpora prepared in projects with smaller financial resources were documented step by step. In the first phase, only the basic publication data were documented. Later, these data were processed with tools designed for documenting machine-readable data (ISLE tools).
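As an illustration of the first documentation phase, the basic publication data could be held in a small record such as the one sketched below. This is only a minimal sketch: the class name `PublicationRecord`, its field names, and the `missing_fields` helper are hypothetical and do not reflect the actual UHLCS, TEI, or ISLE formats.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class PublicationRecord:
    """Basic publication data for one corpus document (hypothetical schema)."""
    author: str
    publisher: str
    publication_date: str          # kept as free text, e.g. a year
    issue: Optional[str] = None    # issue number, when the source has one

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder values only; no real corpus document is described here.
record = PublicationRecord(
    author="Example Author",
    publisher="Example Publisher",
    publication_date="",
)
print(record.missing_fields())
```

A record of this kind makes it easy to see which basic fields still need to be filled in before the data are passed on to more detailed metadata tools.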
The structural encoding of the corpora varies. Several corpora are plain running text without any structural tagging, while in others the structure is encoded according to standards developed in international projects (TEI, ISLE, etc.). Where the original structure of the documents has been preserved, the structure of the texts can be distinguished even without additional structural encoding.
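The difference between plain running text and structurally encoded text can be illustrated with a rough heuristic like the one below. It is a sketch under stated assumptions, not a description of how the UHLCS corpora are processed: the function name `encoding_kind` and its three labels are hypothetical, and the check only looks for a TEI root or header element or for any SGML/XML-style tag.

```python
import re

# Matches SGML/XML-style start or end tags such as <p>, </div> or <s n="1">.
TAG_PATTERN = re.compile(r"</?[A-Za-z][\w:.-]*(\s[^<>]*)?>")


def encoding_kind(text: str) -> str:
    """Classify a text sample as 'tei', 'tagged', or 'plain' (heuristic only)."""
    # Assumption: TEI documents contain a <TEI...> root or a <teiHeader> element.
    if "<TEI" in text or "<teiHeader" in text:
        return "tei"
    if TAG_PATTERN.search(text):
        return "tagged"
    return "plain"


print(encoding_kind("<TEI.2><teiHeader/></TEI.2>"))
print(encoding_kind("<p>Some text</p>"))
print(encoding_kind("Plain running text, 3 < 5."))
```

A heuristic of this kind does not recover structure that is only implicit in the original layout (for example, blank lines between paragraphs); it merely separates explicitly tagged material from untagged text.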