Appendix 1. Documentation of the Computer Corpora of Uralic Languages at the University of Helsinki (Pirkko Suihkonen 1998). The document is available at the Department of General Linguistics, University of Helsinki.
Appendix 2. Data contract forms
Appendix 3. The computer corpora compiled and edited during the project SA 1013 4233, in 1996-1998
Appendix 4. The computer corpora and the books received from the Institute for Bible Translation
Project SA 1013 4233, 1996-1998 (Uhanalaisten suomalais-ugrilaisten kielten tietopankki / Data Bank for Endangered Finno-Ugrian Languages), the Academy of Finland and the University of Helsinki, Department of Finno-Ugrian Studies and Department of General Linguistics. Director of the Project: Professor Seppo Suhonen, Department of Finno-Ugrian Studies.
Project SA 1013 4233 was carried out in co-operation with the "Datamaskinell dokumentasjon av utsette uralske språk" project of the Joint Committee of the Nordic Research Councils for the Humanities in 1996-1997. Its general goals were (a) to collect and document linguistic computer corpora and (b) to carry out basic linguistic research. The project participants came from Finland, Sweden and Norway. The part of the project concerning the documentation of the computer corpora of the Uralic languages also formed an extension of project SA 1011 928 (1991-1993).
The tasks of the project were: (1) to advance the collection and annotation of computer corpora, especially of endangered Uralic languages, (2) to adapt the computer corpora of various languages to the University of Helsinki Language Corpus Server, and (3) to carry out basic research on quantification, the deictic systems of languages, and language typology. The research was carried out at the Department of General Linguistics, which provided the tools, such as efficient computers, needed for the practical work. My sincerest thanks go to Fred Karlsson, Jan-Ola Östman, Kimmo Koskenniemi and Seppo Suhonen, and to the project participants, for their support and co-operation during the project.
1. The work on the computer corpora
The work on the computer corpora was divided into several sub-areas, as listed below.
(1) Developing a system to help the other project participants to prepare
computer corpora of the languages they are working with.
(2) Co-operating with the project members who prepared the computer
corpora of the Uralic languages and supporting their work to make the
corpora available for public use.
(3) Planning meetings for the participants and training them in corpus
linguistics.
(4) Taking care of the documentation of the copyrights between the
University of Helsinki Language Corpus Server and the owners of the
corpora.
(5) Compiling and editing the computer corpora prepared by the other participants and
adapting them to the University of Helsinki Language Corpus Server.
(6) Editing the computer corpora of various languages received
especially from the Institute for Bible Translation (Helsinki, Stockholm
and Moscow), and adapting them to the University of Helsinki Language
Corpus Server.
(7) Publicizing the project and preparing information on the project,
the University of Helsinki Language Corpus Server and the computer
corpora available on that corpus server.
(8) Other corpus linguistics activities.
Suihkonen, Pirkko (1998). Documentation of the Computer Corpora of the Uralic Languages at the University of Helsinki. Technical Reports, No 2. University of Helsinki, Department of General Linguistics. (Appendix 1)
Various versions of the report were given to the participants during the project for evaluation, testing and discussion. The coding system described in the report was tested on Udmurt. Part of the coding report had already been in preparation during the previous corpus project in 1991-1993. In order to get information on different types of languages, the coding system was also tested on North Saami (the first version of the sample coding of North Saami and some other languages was made in 1994 and 1995). During the preparation of the report, various coding systems prepared within the framework of the European Union were also considered. Because only two of the participants had professional experience in corpus linguistics, it was decided that the overriding principle in coding the corpora would be to prepare the linguistic data in such a way that the special properties of each language were taken into account in the analysis. Owing particularly to the shortage of time, each participant had to concentrate mainly on analyzing the language s/he was investigating, and methodology and general information on corpus linguistics and language technology were discussed only when time allowed. General background on corpus linguistics was nevertheless one of the most important topics of the meetings of the project participants.
Some of the main principles in preparing the coding systems were as follows: (1) the properties of the languages should be analyzed as accurately as possible, and the complexity of the linguistic units should also be distinguished in the coding system, (2) the coding should be carried out in such a way that information on the structural properties of the original documents is preserved, and (3) it should be possible to transform the coded corpora to the standards developed within other corpus projects. The goal was that the corpora prepared during the project should be coded according to the same principles.
One of the basic tasks was to prepare a system for coding linguistic data collected from minority languages. Another important task concerned co-operation among the project participants, which took the form of meetings and discussions, and the promotion of the study of corpus linguistics.
The meetings were held twice a year. For each meeting, a special program was arranged in order to promote and support the practical work. The topics were as follows:
(1) The topic of the first meeting: How to prepare computer corpora within the framework of the UNIX operating system. The University of Helsinki Language Corpus Server runs on the UNIX operating system. The tutorials consisted of lectures on methodology and practical demonstrations. The teachers were Pirkko Suihkonen and Risto Vilenius (University of Helsinki, Department of General Linguistics).
(2) The second meeting concerned the preparation of computer corpora using automatic morphological analyzers, the use of computer corpora in grammar writing, and the coding of extralinguistic information in the computer corpora. Special attention was paid to the standards used in corpus linguistics carried out under the auspices of the European Union. The teachers were Kimmo Koskenniemi and Atro Voutilainen (University of Helsinki, Department of General Linguistics).
(3) Discussions on the practical encoding of the computer corpora; each of the participants presented his or her work and gave examples of the problems encountered. Particularly during these discussions the coding system prepared for the project was analyzed and evaluated on the basis of experience gained from encoding the corpora of different languages.
(4) Statistical methods in corpus linguistics; the seminar consisted of an introduction to statistical methods used in linguistics. The topics discussed included examples of both descriptive and analytical methods. The teacher of the seminar, which took place at the Biological Research Centre in Tvärminne, was Lauri Tarkkonen (University of Helsinki, Department of Statistics).
(5) Publishing and dissemination of the corpora was the topic of the last general meeting. The corpora were planned to be published in two ways: (a) on the University of Helsinki Language Corpus Server, and (b) on a CD-ROM. The plan was also to publish the working reports of the project participants.
The copyright contracts between the Department of General Linguistics as the representative of the University of Helsinki Language Corpus Server and the owners of the corpora, who are also the editors of the corpora, and some writers and printing houses, set out the principles governing how the computer corpora from different sources can be used on the University of Helsinki Language Corpus Server. The corpora are intended for use in research and teaching, and if they are used for commercial purposes, a separate contract is required. The contract form accepted by all the participants was based on the form prepared for delivering some earlier corpora to the University of Helsinki Language Corpus Server. The contract is written in Finnish, English, Swedish and Russian (Appendix 2). The translation of the contracts into North Saami was also discussed.
The decision to wait until the UNICODE character sets could be used particularly concerned the computer corpora of various languages received from the Institute for Bible Translation (Helsinki, Stockholm and Moscow) (Appendix 4). Most of the corpora obtained by the University of Helsinki Language Corpus Server (UHLCS) during the project were donated by the Institute for Bible Translation, and several of them were also adapted to the server. Most of the corpora represented languages spoken in the area of the former Soviet Union and were written in the Cyrillic alphabet. Some additional problems were caused by the fact that the corpora written in Cyrillic had been prepared using various text editing and make-up programs. In order to support the work on making the Cyrillic alphabet available on the UNIX operating system, Pirkko Suihkonen co-operated with Tuomas Vanhala and Arto Ihantoja (Department of General Linguistics), who worked on the UNICODE system. Because not all the UNICODE character sets were available during the project, it was decided that the final part of adapting the corpora to the UNIX operating system would be carried out after the full UNICODE character sets became available on the server.
The contracts with the Institute for Bible Translation also covered delivering their publications to the library of the Department of General Linguistics (Appendix 4). The contract with the Institute was signed in 1994, and during the project and after it, the Institute has delivered numerous publications to the library of the Department of General Linguistics. Copies of some of the publications used as sources for the corpora edited during the project were also delivered to the library, mainly thanks to the authors of the corpora. It was expected that access to the printed documents would help researchers who transfer corpora written in different alphabets from one software system to another, because any inaccuracies can be checked against the hard copies.
s/^(\*?muitalus(a|si|sa)?)(.*)$/\1+\3_N_+COUNT\$story/g;
s/^(\*?hálddahus(a|si|sa)?)(.*)$/\1+\3_N_+COUNT\$administration/g;
s/^(\*?mearkkahus(a|si|sa)?)(.*)$/\1+\3_N_+COUNT\$meaning/g;
s/^(\*?guoskkahus(a|si|sa)?)(.*)$/\1+\3_N_+COUNT\$contact/g;
During the project, Erja Kujala had time to prepare the rules only for the nouns and part of the translations. The project did not have money to continue the coding, and the work had to be suspended. If the work cannot be continued later, the main benefit of the working period will have been training: during this time Erja Kujala was able to get to know the morphology of the North Saami language very thoroughly. (Erja Kujala completed her MA thesis in 1999.)
Leonid Ivshin also acted as a language informant. He also actively studied Finnish and he participated in Finnish classes.
A Mongolian student, Enkhtuvshin Dorjgotov, was invited with the help of a scholarship from CIMO. Dorjgotov studied computational linguistics in the academic year 1998-1999 and also acted as a language informant.
An important part of the work dealt with publicizing the project. The plan was to present information on the project both to colleagues at various linguistic congresses and to the public. Publicizing and reporting on the project were considered a necessary part of the work. After the first year, a brochure on the computer corpora was prepared (Appendix 5). Information on the project was disseminated at many conferences in Finland, the most important of which were the Days of Linguistics (Kielitieteen päivät) and the Days of Science (Tieteen päivät), and some presentations were also given abroad (Appendix 6). This publicity work mainly consisted of posters, lectures and discussions. The opportunity to present information on the project at a conference which took place at the Institute of Finland in Berlin was also important. The topic of that conference was the roots and origin of the peoples speaking the Uralic languages. Some reports were also given on Finnish television, and in 1998-1999 the demonstration on the Uralic languages in the Science Centre Heureka was updated.
An important task was to prepare a network publication on the University of Helsinki Language Corpus Server (Appendix 6; http://www.ling.helsinki.fi/uhlcs/). This contains general information on the server and the corpora available on it. Because there are no financial resources for providing services to the users of the computer corpora located on the University of Helsinki Language Corpus Server or for publicizing the corpora more extensively, this publication is the most important source of information on the corpora that is permanently available to the public.
2.1. Quantification and deictic systems of languages
The work dealt with databases on quantification and deictic systems. The work is in progress.
2.2. Other related activities consisted of (1) organizing sections on corpus linguistics and on semantics held during the Days of Linguistics (Kielitieteen päivät), and (2) acting as a member of the steering committee of the Department of General Linguistics.
The corpus section was organized during the Days of Linguistics with the co-operation of Seppo Suhonen in 1996, and the Section of Semantics with the assistance of the doctoral students at the Department of General Linguistics in 1997 (cf. Appendix 6). The presentations given during the corpus session were reviewed and the reviews are available at the following address: http://www.ling.Helsinki.Fi/~suihkone/professional-activities/korpussektio.html. Both these sessions formed an important channel for drawing attention to the topics of the project.
As for the activities as a member of the steering committee, in addition to the routine tasks of a committee member, the activities particularly concerned the position of linguistics in Finland in general and the opportunities for linguists to work in the field. The continuation of the work on preparing the linguistic corpora was also discussed from the point of view of financial support from CSC (Centre for Scientific Computing (Tieteellinen laskenta Oy)). The first version of a proposal for a working plan was sent to the CSC in August 1998.
The work on the computer corpora compiled and edited during the project took most of the time, and there was not much time left for my own research work. According to the working plan, a part of the work was to be carried out with the co-operation of a research assistant. The goal was to make the corpora available as soon as possible. Because the project did not have money to hire a research assistant for this purpose, the work with the corpora took more time than planned, and the most important results of my own research work concerned the preparation of research material for further research. The work on the computer corpora continued throughout winter and spring 1999. Moreover, some of the computer corpora prepared by the researchers working on the project were only received in winter and spring 1999, and for that reason they will be adapted to the University of Helsinki Language Corpus Server later.
The most important result of the corpus project was that the computer corpora were made available for public use. The corpora which were edited and encoded during the project form vital research material for the study of these languages. The importance of the fact that the work particularly concerned endangered languages cannot be stressed enough. In addition, the corpora received from various sources, particularly from the Institute for Bible Translation, are extremely important data on these languages. Further, it should not be forgotten that these corpora are globally unique, and most of them represent samples of languages which are not yet known very thoroughly.
Originally, the University of Helsinki Language Corpus Server was an example of the results of collaboration between several linguistic departments at the University of Helsinki. As such, it offers an important channel for developing the methods to be used in linguistic research. The opportunity to use the computer corpora of various languages means that the UHLCS is extremely important for researchers and students. The UHLCS is an example of the activities of people with a great interest in languages and linguistics. It is also a good example of efforts to promote research work in the field. The corpora at the UHLCS also represent professional skills at a very high level. Unfortunately, this is not enough in the competition for resources in the academic world. In order to make reasonable progress in this work, its importance has to be recognized by those making decisions on the distribution of financial resources.
While the computer corpora project was in progress, Prof. Kimmo Koskenniemi (University of Helsinki, Department of General Linguistics) worked actively to establish a new corpus server maintained by the CSC. The corpora there are planned for different kinds of use, including commercial purposes. When considering the use of the corpora in the future, it is useful to support the efforts to give the CSC a copy of the computer corpora of the Uralic languages prepared during the projects. The same applies to the corpora received from the Institute for Bible Translation and from other sources. When thinking of the roles of the University of Helsinki Language Corpus Server and the CSC, it should be noted that much of the progress in the academic field has been made by people who work at research units with real problems and who are involved in research on the topics connected with these problems. The UHLCS, being independent of commercial goals and serving as both a data bank of endangered languages and a multilingual data bank, is extremely important. For these reasons, even if copies of the data are held at the CSC, the existence and activities of the UHLCS should be strongly supported.