DATABANK FOR
ENDANGERED FINNO-UGRIC LANGUAGES


WORKING REPORT

Pirkko Suihkonen


Appendix 1. Documentation of the Computer Corpora of Uralic Languages at the University of Helsinki (Pirkko Suihkonen 1998). The document is available at the Department of General Linguistics, University of Helsinki.

Appendix 2. Data contract forms

Appendix 3. The computer corpora compiled and edited during the project SA 1013 4233, in 1996-1998

Appendix 4. The computer corpora and the books received from the Institute for Bible Translation


Project SA 1013 4233, 1996-1998 (Uhanalaisten suomalais-ugrilaisten kielten tietopankki / Data Bank for Endangered Finno-Ugrian Languages), the Academy of Finland and the University of Helsinki, Department of Finno-Ugrian Studies and Department of General Linguistics. Director of the Project: Professor Seppo Suhonen, Department of Finno-Ugrian Studies.

Project SA 1013 4233 was carried out in co-operation with the "Datamaskinell dokumentasjon av utsette uralske språk" project of the Joint Committee of the Nordic Research Councils for the Humanities in 1996-1997. Its general goals were (a) to collect and document linguistic computer corpora and (b) to carry out basic linguistic research. The project participants came from Finland, Sweden and Norway. The part of the project concerning the documentation of the computer corpora of the Uralic languages also formed an extension of the project SA 1011 928, carried out in 1991-1993.

The tasks of the project were: (1) to advance the collection and annotation of computer corpora, especially of endangered Uralic languages, (2) to adapt the computer corpora of various languages to the University of Helsinki Language Corpus Server, and (3) to carry out basic research concerning quantification, the deictic systems of languages, and language typology. The research was carried out at the Department of General Linguistics, which provided the practical tools for the work, such as efficient computers. My sincerest thanks go to Fred Karlsson, Jan-Ola Östman, Kimmo Koskenniemi and to Seppo Suhonen and the project participants for their support and co-operation during the project.

1. The work on the computer corpora

The work on the computer corpora was divided into several sub-areas as listed below.

(1) Developing a system to help the other project participants to prepare computer corpora of the languages they are working with.
(2) Co-operating with the project members who prepared the computer corpora of the Uralic languages and supporting their work to make the corpora available for public use.
(3) Planning meetings for the participants and training them in corpus linguistics.
(4) Taking care of the documentation of the copyrights between the University of Helsinki Language Corpus Server and the owners of the corpora.
(5) Compiling and editing the computer corpora prepared by the other participants and adapting them to the University of Helsinki Language Corpus Server.
(6) Editing the computer corpora of various languages received especially from the Institute for Bible Translation (Helsinki, Stockholm and Moscow), and adapting them to the University of Helsinki Language Corpus Server.
(7) Advertising the project and preparing information on the project, the University of Helsinki Language Corpus Server and the computer corpora available on that corpus server.
(8) Other corpus linguistics activities.

1.1. A manual for coding the computer corpora of various languages

An important task for the project was to prepare a system showing the other participants how to code and edit their linguistic data in order to turn them into linguistic computer corpora. The result of this work was a report which was published in early 1998:

Suihkonen, Pirkko (1998). Documentation of the Computer Corpora of the Uralic Languages at the University of Helsinki. Technical Reports, No 2. University of Helsinki, Department of General Linguistics. (Appendix 1)

Various versions of the report were given to the participants during the project for evaluation, testing and discussion. The coding system described in the report was tested on Udmurt. Part of the report had already been in preparation during the previous corpus project in 1991-1993. In order to get information on different types of languages, the coding system was also tested on North Saami (the first version of the sample coding of North Saami and some other languages was made in 1994 and 1995). During the preparation of the report, various coding systems prepared within the framework of the European Union were also considered. Because only two of the participants had professional experience in corpus linguistics, it was decided that the guiding principle in coding the corpora would be to prepare the linguistic data in such a way that the special properties of each language were taken into account in the analysis. Particularly owing to the shortage of time, each participant had to concentrate mainly on analyzing the language s/he was investigating, and methodology and general information on corpus linguistics and language technology were discussed only when time allowed. General background on corpus linguistics was nevertheless one of the most important topics of the meetings of the project participants.

Some of the main principles in preparing the coding systems were as follows: (1) the properties of languages would be analyzed as accurately as possible, and the complexity of the linguistic units should also be distinguished in the coding system, (2) the coding should be carried out in such a way that information on the structural properties of the original documents is preserved, and (3) it should be possible to convert the coded corpora to the coding standards developed within other corpus projects. The goal was that the corpora prepared during the project should be coded according to the same principles.

1.2. Co-operation with the members of the project (items (2)-(4) above)

In the working plan, the co-operation with the other project participants was specified under the heading "the goal is to promote the coding of the computer corpora of the Uralic languages and work on corpus linguistics". In practice, the work was organized as follows.

One of the basic tasks was to prepare a system for coding linguistic data collected from minority languages. Another important task concerned the co-operation of the project participants, which was carried out through meetings and discussions and by promoting the study of corpus linguistics.

The meetings were held twice a year. For each meeting, a special program was arranged in order to promote and support the work in practice. The topics were as follows:

(1) The topic of the first meeting: How to prepare computer corpora within the framework of the UNIX operating system. The University of Helsinki Language Corpus Server runs on the UNIX operating system. The tutorials consisted of lectures on methodology and practical demonstrations. The teachers were Pirkko Suihkonen and Risto Vilenius (University of Helsinki, Department of General Linguistics).

(2) The second meeting concerned the preparation of computer corpora using automatic morphological analyzers, the use of computer corpora in grammar writing, and the coding of extralinguistic information in the computer corpora. Special interest was directed towards the standards used in corpus linguistics carried out under the auspices of the European Union. The teachers were Kimmo Koskenniemi and Atro Voutilainen (University of Helsinki, Department of General Linguistics).

(3) Discussions on the practical encoding of the computer corpora; each of the participants presented his or her work and gave examples of the problems encountered. Particularly during these discussions the coding system prepared for the project was analyzed and evaluated on the basis of experience gained from encoding the corpora of different languages.

(4) Statistical methods in corpus linguistics; the seminar consisted of an introduction to statistical methods used in linguistics. The topics discussed included examples of both descriptive and analytical methods. The seminar took place at the Biological Research Centre in Tvärminne, and the teacher was Lauri Tarkkonen (University of Helsinki, Department of Statistics).

(5) Publishing and dissemination of the corpora was the topic of the last general meeting. The corpora were planned to be published in two ways: (a) on the University of Helsinki Language Corpus Server, and (b) on a CD-ROM. The plan was also to publish the working reports of the project participants.

The copyright contracts were made between the Department of General Linguistics, as the representative of the University of Helsinki Language Corpus Server, and the owners of the corpora, who are also the editors of the corpora, as well as some writers and printing houses. The contracts cover the principles on how the computer corpora from different sources can be used on the University of Helsinki Language Corpus Server. The corpora are intended to be used for research and teaching; if they are used for commercial purposes, a separate contract is required. The contract form accepted by all the participants was based on the form prepared for delivering some previous corpora of the University of Helsinki Language Corpus Server. The contract is written in Finnish, English, Swedish and Russian (Appendix 2). The translation of the contracts into North Saami was also discussed.

1.3. Compiling and editing the computer corpora of the University of Helsinki Language Corpus Server (items (5)-(6) above; Appendices 3 and 4)

The corpora received from the project participants (Appendix 3) were adapted to the University of Helsinki Language Corpus Server (UHLCS) in different ways. In principle, new computer corpora were prepared for public use as soon as they were received. Because the corpora had been prepared using various software systems, only those corpora which could be represented with the characters available on the corpus server were fully adapted and edited. Corpora which included characters outside the Latin-1 standard available on the UNIX operating system were stored at the UHLCS as such, and no additional editing work was done. When texts written in the Cyrillic alphabet were adapted to the UHLCS, they had to be converted to the characters available in the Latin-1 character set. The technical tools of the UNIX operating system have developed intensively over the last few years, especially in their support for character standards. During the project, the possibilities of using the Unicode character standard improved. For that reason, the decision was taken to adapt corpora including special characters to the UHLCS only when requested, and to adapt the other corpora to the server only when Unicode character sets covering all the necessary characters are available.
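The distinction made above, between texts that fit the Latin-1 character set and texts that have to be converted first, can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names fits_latin1 and transliterate are hypothetical, and the small Cyrillic-to-Latin table is an invented, deliberately incomplete example, not the conversion scheme actually used at the UHLCS.

```python
# Hypothetical and incomplete mapping, for illustration only; the
# conversion scheme actually used at the UHLCS is not described here.
CYRILLIC_TO_LATIN = {"у": "u", "д": "d", "м": "m", "р": "r", "т": "t"}


def fits_latin1(text: str) -> bool:
    """Return True if every character of `text` exists in Latin-1 (ISO 8859-1)."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


def transliterate(text: str, table: dict = CYRILLIC_TO_LATIN) -> str:
    """Replace each character found in `table`; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for sample in ("Tvärminne", "удмурт"):
        converted = transliterate(sample)
        print(sample, fits_latin1(sample), converted, fits_latin1(converted))
```

A check of this kind only shows whether a text can be stored in Latin-1; a real conversion table would have to cover the full alphabet of each language and be documented so that the original spelling can be recovered.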

The decision to wait until the Unicode character sets could be used particularly concerned the computer corpora of various languages received from the Institute for Bible Translation (Helsinki, Stockholm and Moscow; Appendix 4). Most of the corpora obtained by the UHLCS during the project were donated by the Institute for Bible Translation, and several of them were also adapted to the server. Most of the corpora represented languages spoken in the area of the former Soviet Union and were written in the Cyrillic alphabet. Some additional problems were caused by the fact that the corpora written in Cyrillic had been prepared using various text editing and layout programs. In order to support the work on making the Cyrillic alphabet available on the UNIX operating system, Pirkko Suihkonen co-operated with Tuomas Vanhala and Arto Ihantoja (Department of General Linguistics), who worked on the Unicode system. Because not all the Unicode character sets were available during the project, it was decided that the final part of adapting the corpora to the UNIX operating system would be carried out after the full Unicode character sets are available on the server.

The contracts with the Institute for Bible Translation also covered delivering their publications to the library of the Department of General Linguistics (Appendix 4). The contract with the Institute was signed in 1994, and during the project and after it, the Institute has delivered numerous publications to the library of the Department of General Linguistics. Copies of some of the publications used as sources for the corpora edited during the project were also delivered to the library, mainly thanks to the authors of the corpora. It was expected that access to the printed documents would help researchers who transfer corpora written in different alphabets from one software system to another, because any inaccuracies can be checked against the hard copies.

1.4. Other corpus linguistic activities
1.4.1. Coding the computer corpus of North Saami: research assistant Erja Kujala

During the project, Erja Kujala, a graduate student at the Department of Finno-Ugrian Studies, worked as a research assistant for three months. She prepared rules for the morphological encoding of North Saami texts. The goal was to lay a basis for tools for the morphological encoding of machine-readable texts of North Saami. The encoding system contains lexical and morphological information and translations of the Saami words into English. The encoding was carried out with substitution rules written as "regular expressions", which made it possible to give information about morphological variants (see the sample below).

s/^(\*?muitalus(a|si|sa)?)(.*)$/\1+\3_N_+COUNT\$story/g;
s/^(\*?hálddahus(a|si|sa)?)(.*)$/\1+\3_N_+COUNT\$administration/g;
s/^(\*?mearkkahus(a|si|sa)?)(.*)$/\1+\3_N_+COUNT\$meaning/g;
s/^(\*?guoskkahus(a|si|sa)?)(.*)$/\1+\3_N_+COUNT\$contact/g;
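As a rough illustration of how rules of this kind operate, the following Python sketch applies the four sample rules to single word forms. It is not the tool used in the project: the function name apply_noun_rules and the test words are assumptions made for illustration. The output simply mirrors the replacement part of the rules above: the matched stem, a plus sign, the rest of the word form, the tags, and the English gloss after a dollar sign.

```python
import re

# The four sample rules from the report as (stem, English gloss) pairs.
SAMPLE_NOUNS = [
    ("muitalus", "story"),
    ("hálddahus", "administration"),
    ("mearkkahus", "meaning"),
    ("guoskkahus", "contact"),
]

# One compiled pattern per rule, equivalent to the s/.../.../ expressions above.
# \g<1> is the stem with its optional variant (a|si|sa); \g<3> is the remainder.
RULES = [
    (
        re.compile(r"^(\*?" + re.escape(stem) + r"(a|si|sa)?)(.*)$"),
        r"\g<1>+\g<3>_N_+COUNT$" + gloss,
    )
    for stem, gloss in SAMPLE_NOUNS
]


def apply_noun_rules(word: str) -> str:
    """Return the encoded form of `word`, or `word` unchanged if no rule matches."""
    for pattern, replacement in RULES:
        encoded, count = pattern.subn(replacement, word)
        if count:
            return encoded
    return word


if __name__ == "__main__":
    for form in ("muitalusat", "*hálddahus", "beana"):
        print(form, "->", apply_noun_rules(form))
```

With these rules, the form muitalusat is encoded as muitalusa+t_N_+COUNT$story, while a word not covered by any rule (here beana) is returned unchanged.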

During the project, Erja Kujala had time to prepare the rules only for the nouns and part of the translations. The project did not have money to continue the coding, and the work had to be suspended. Even if the work cannot be continued later, the working period served as training: during this time Erja Kujala was able to get to know the morphology of the North Saami language very thoroughly. (Erja Kujala completed her MA thesis in 1999.)

1.4.2. Preparing an automatic analyzer of Udmurt

During the project, the CIMO (the Centre for International Mobility) granted a scholarship to an Udmurt student, Leonid Ivshin, from the University of Izhevsk. The CIMO granted a scholarship of nine months to be used by Leonid Ivshin at the Department of General Linguistics for the academic year 1999-2000. Leonid Ivshin worked at the Department of General Linguistics during the spring term in 1998. During that period, he worked on preparing an automatic morphological analyzer for Udmurt, beginning with nouns. The goal was to prepare a basis for an automatic morphological analyzer of Udmurt. A copy of this will be available at the University of Helsinki Language Corpus Server (there are large computer corpora of Udmurt on the University of Helsinki Language Corpus Server).

Leonid Ivshin also acted as a language informant. In addition, he actively studied Finnish and participated in Finnish classes.

1.4.3. South Saami

Compiling and editing a corpus of South Saami was one of the activities carried out under the auspices of the NOS-H. The corpus has not yet been delivered to the University of Helsinki Language Corpus Server.

1.4.4. Karelian

A Karelian post-graduate student, Jelena Adel, started her studies at the University of Helsinki, Department of Finno-Ugrian Studies, by working as a research assistant on the project. Jelena Adel was involved in the work on the computer corpora of Karelian, none of which are located on the UHLCS.

1.4.5. Mongolian

A Mongolian student, Enkhtuvshin Dorjgotov, was invited with the help of a scholarship from the CIMO. Dorjgotov studied computational linguistics in the academic year 1998-1999. Enkhtuvshin Dorjgotov has also acted as a language informant.

1.5. Advertising the project

An important part of the work dealt with advertising the project. The plan was to present information on the project to colleagues at various linguistic congresses and also to the public. Advertising and reporting on the project were considered a necessary part of the work. After the first year, a brochure on the computer corpora was prepared (Appendix 5). Information on the project was disseminated at many conferences in Finland, the most important of which were the Days of Linguistics (Kielitieteen päivät) and the Days of Science (Tieteen päivät), and some presentations were also given abroad (Appendix 6). Advertising and reporting mainly consisted of posters, lectures and discussions. The opportunity to present information on the project at a conference which took place at the Institute of Finland in Berlin was also important. The topic of the conference was the roots and origin of the people speaking the Uralic languages. Some reports were also given on Finnish TV, and in 1998-1999, the demonstration on the Uralic languages in the Science Centre Heureka was updated.

An important task was to prepare a network publication on the University of Helsinki Language Corpus Server (Appendix 6; http://www.ling.helsinki.fi/uhlcs/). This contains general information on the server and the corpora available on it. Because there are no financial resources for providing services to the users of the computer corpora located on the University of Helsinki Language Corpus Server or for advertising the corpora more extensively, this publication forms the most important source of information on the corpora which are permanently available to the public.

2. Personal research work of Pirkko Suihkonen

The research work dealt with (1) the semantics of quantification in natural language, (2) the deictic systems of languages, and (3) language typology. The work was organized as follows.

2.1. Quantification and deictic systems of languages

The work dealt with databases on quantification and deictic systems. The work is in progress.

2.2. Other related activities dealt with (1) organizing sections on corpus linguistics and on semantics during the Days of Linguistics (Kielitieteen päivät), and (2) acting as a member of the steering committee of the Department of General Linguistics.

The corpus section was organized during the Days of Linguistics in co-operation with Seppo Suhonen in 1996, and the section on semantics with the assistance of the doctoral students at the Department of General Linguistics in 1997 (cf. Appendix 6). The presentations given during the corpus section were reviewed, and the reviews are available at the following address: http://www.ling.Helsinki.Fi/~suihkone/professional-activities/korpussektio.html. Both sections formed an important channel for drawing attention to the topics of the project.

As for the steering committee, in addition to the routine tasks of a committee member, the activities particularly concerned the position of linguistics in Finland in general and the possibilities for linguists to work in the field. The continuation of the work on preparing the linguistic corpora was also discussed from the point of view of financial support from the CSC (Centre for Scientific Computing (Tieteellinen laskenta Oy)). The first version of a proposal for a working plan was sent to the CSC in August 1998.

3. Comments and evaluation

The work on the computer corpora compiled and edited during the project took most of the time, and there was not much time for my own research. According to the working plan, part of the work was to be carried out in co-operation with a research assistant. The goal was to make the corpora available as soon as possible. Because the project did not have money to hire a research assistant for this purpose, the work with the corpora took more time than planned, and the most important results of my own research consisted of preparing the research material for further research. The work on the computer corpora continued throughout winter and spring 1999. Moreover, some of the computer corpora prepared by the researchers working on the project were only received in winter and spring 1999, and for that reason they will be adapted to the University of Helsinki Language Corpus Server later. The most important result of the corpus project was making the computer corpora available for public use. The corpora which were edited and encoded during the project form vital research material for the study of these languages. The importance of the fact that the work particularly concerned endangered languages cannot be stressed enough. In addition to this, the corpora received from various sources, particularly from the Institute for Bible Translation, are extremely important data on these languages. Further, it should not be forgotten that these corpora are globally unique, and most of the corpora represent samples of languages which are not yet known very thoroughly.

Originally, the University of Helsinki Language Corpus Server was an example of the results of collaboration between several linguistic departments at the University of Helsinki. As such, it offers an important channel for developing the methods to be used in linguistic research. The opportunity to use the computer corpora of various languages means that the UHLCS is extremely important for researchers and students. The UHLCS is an example of activities of people with a great interest in languages and linguistics. It also is a good example of efforts to promote research work in the field. Also the corpora at the UHLCS represent professional skills at a very high level. Unfortunately, this is not enough in competition for resources in the academic world. In order to be able to make reasonable progress in this work, its importance has to be noted by those making decisions on the distribution of the financial resources.

While the computer corpora project was in progress, Prof. Kimmo Koskenniemi (University of Helsinki, Department of General Linguistics) worked actively to establish a new corpus server maintained by the CSC. The corpora on that server are planned for different kinds of use, including commercial purposes. When considering the use of the corpora in the future, it is useful to support the efforts to give the CSC a copy of the computer corpora of the Uralic languages prepared during the projects. The same concerns the corpora received from the Institute for Bible Translation and from other sources. When thinking of the roles of the University of Helsinki Language Corpus Server and the CSC, it should be noted that much of the progress in the academic field has been made by people who work at research units with real problems and who are involved in research on the topics connected with these problems. The UHLCS is extremely important because it is independent of commercial goals and because it serves both as a data bank of endangered languages and as a multilingual data bank. For these reasons, even if copies of the data are held at the CSC, the existence and activities of the UHLCS should be strongly supported.