Pirkko Suihkonen, April 28, 2001

Multimodal Linguistic Data Bank:
a Pilot Project, Leipzig May 2000 - Nov. 2001

The data bank project at the Max Planck Institute for Evolutionary Anthropology, Department of Linguistics, Leipzig, started in April 2000. The work covered the following topics:

* The preparation of a system for the documentation of electronic linguistic data, especially data on endangered languages. This data can be used in research.
* The development of a system for archiving and using electronic linguistic data.

The system for the documentation of electronic linguistic data consists of the following parts: a) adapting data to the UNICODE format, b) preparing information on the data structure, and c) preparing metadata descriptions of the data. The system for adapting data to the electronic archive contains metadata descriptions of electronic linguistic data and a catalogue located in the archive. The metadata descriptions have been developed in cooperation with the international metadata project steered at the Technical Department of the Max Planck Institute for Psycholinguistics, Nijmegen (International Standards for Linguistic Engineering). The contracts made between the data owners, who give data to the bank, and the MPI-EVA, Department of Linguistics, are also included in the archive's data-collecting system.
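As a rough illustration of how a metadata description and a catalogue entry might fit together, the following minimal sketch stores a few descriptive fields per data set and indexes them by identifier. It does not reproduce the project's metadata scheme: the class `MetadataRecord`, its field names, the helper functions, and the sample values are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MetadataRecord:
    """Hypothetical metadata description of one data set."""
    identifier: str      # assumed catalogue key
    language: str        # language of the data
    data_type: str       # e.g. running text, dictionary
    encoding: str        # character encoding of the stored files
    structure_note: str  # short description of the data structure


def build_catalogue(records):
    """Index the metadata records by identifier."""
    return {r.identifier: asdict(r) for r in records}


def find_by_language(catalogue, language):
    """Return the identifiers of all entries for one language."""
    return sorted(k for k, v in catalogue.items() if v["language"] == language)


# Illustrative sample values only, not actual catalogue entries.
records = [
    MetadataRecord("sample-001", "Komi", "running text", "UTF-8",
                   "one sentence per line"),
    MetadataRecord("sample-002", "Udmurt", "dictionary", "UTF-8",
                   "one entry per line, tab-separated fields"),
]
catalogue = build_catalogue(records)
print(json.dumps(catalogue, indent=2, ensure_ascii=False))
```

Keeping the catalogue as a plain mapping makes it easy to serialize and to search by any field, for example `find_by_language(catalogue, "Komi")`.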

The data bank runs on a UNIX operating system. The data types include running texts, dictionaries and material prepared in a relational database format. Data containing texts originally written in a non-Latin-1 script (e.g. in Cyrillic or in IPA characters) has been adapted to the UNICODE format. In addition, the bank includes examples of maps and video data.
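The adaptation of legacy-encoded text to Unicode can be sketched as follows. This is only an illustration under stated assumptions: the source encoding (KOI8-R here) is assumed, since the original encodings of the donated data are not specified, and the helper `to_unicode` is a hypothetical name, not one of the project's tools.

```python
import unicodedata


def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode legacy-encoded bytes and return NFC-normalized Unicode text.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    """
    text = raw.decode(source_encoding)
    # NFC composes base letters and combining marks where a precomposed
    # character exists, which keeps IPA and accented text consistent.
    return unicodedata.normalize("NFC", text)


# A short Cyrillic string stored in an assumed legacy encoding (KOI8-R).
legacy_bytes = "кыл".encode("koi8_r")
converted = to_unicode(legacy_bytes, "koi8_r")

# Store the result as UTF-8, one common serialization of Unicode.
utf8_bytes = converted.encode("utf-8")
```

Normalizing after decoding is a design choice rather than a requirement; it avoids having the same visible character stored as two different code point sequences.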

The basic elements of the system for using data in the data bank are a server that can be used in practical work and tools that interact with the server. The principles of the functions of the server were presented in May 2001. The data bank contains some tools for organizing data and for collecting different kinds of information on data. The data bank can be used at the MPI-EVA, Department of Linguistics, Leipzig. The possibility of using the data bank outside the institute must be discussed separately (cf. the info-page).

* The following researchers and research assistants participated in the project:

Claudia Schmidt (Leipzig) worked as a research assistant. She edited data originally written in the IPA and converted it to the UNICODE format.
Jack Rueter donated the data collections in Mordvin and Komi Zyrian, which he prepared in cooperation with Mordvin and Komi researchers and co-workers. He also participated in the work on adapting data to the UNIX operating system (September 2001).
Hardi Teder and Andrei Gapeyev worked as research assistants (August and September 2000). They prepared several tools to be used in the practical work and adapted data to the UNIX operating system.
Peter Froelich, an IT specialist at the MPI-EVA, Dept. of Linguistics, Leipzig, participated particularly in the work on the software required by the data bank.

* The following researchers have donated data to the data bank:

Anvita Abbi (New Delhi, India): data on languages spoken in India that she has gathered with her students over the last twenty years.
Bernard Comrie (the MPI-EVA, Leipzig): data on the Intercontinental Dictionary Series prepared by various researchers.
Jack Rueter (Helsinki, Finland): data on Mordvin and Komi.
Pirkko Suihkonen, Bibinur Zagulyayeva & Galina Tronina: Udmurt-English-Finnish Dictionary (in progress).

* The contract to be used on receipt of data in the data bank has been drawn up on the basis of the contract used at the University of Helsinki, Department of General Linguistics, for obtaining data for the University of Helsinki Language Corpus Server. Likewise, the computer account application form to be used by people outside the MPI-EVA, Department of Linguistics, Leipzig, is a form that has been used at the University of Helsinki.

* The head of the pilot project was Pirkko Suihkonen. One aim of the pilot project has been to develop a basis for electronic linguistic corpora that can be used in research on language typology.

Pirkko Suihkonen, November 2001
Last modified: Jan. 12, 2002.