On Editing the Corpora


  1. Running text, ASCII:
    If ASCII covers the character set of the writing system of the language, the ASCII characters can be used as such. If ASCII is used to simplify a more complex writing system, this must be taken into account when converting the data back into the original character set.

  2. Running text, Latin-1:
    All corpora written in the Latin-1 character set can be used as such under the UNIX operating system. They can also be edited with the Emacs editor available at the working site.

  3. Running text, Latin-1 format which is converted into Unicode:
    A great number of corpora at the UHLCS are written in the Cyrillic alphabet. For most of the languages, this alphabet contains more characters than the basic Russian alphabet, and several characters are formed with diacritics. When these texts were transferred to the UNIX operating system, the Cyrillic characters were replaced with characters from the Latin-1 character set. These characters are now being converted into Unicode with the help of the scripts available in the directories located within the corpus directories. The work is still in progress.
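
    The conversion described above can be sketched as follows. This is only a minimal illustration and not the scripts stored in the corpus directories: the mapping table LATIN1_TO_CYRILLIC and the function name are hypothetical, and the real replacement tables differ from corpus to corpus.

```python
# Hypothetical mapping from Latin-1 placeholder characters to the Cyrillic
# characters they stand for. The real tables are in the conversion scripts
# of each corpus directory; these three entries are assumptions.
LATIN1_TO_CYRILLIC = {
    "a": "\u0430",       # CYRILLIC SMALL LETTER A
    "b": "\u0431",       # CYRILLIC SMALL LETTER BE
    "\u00e4": "\u04d3",  # a with diaeresis -> CYRILLIC SMALL LETTER A WITH DIAERESIS
}


def latin1_bytes_to_unicode(data: bytes) -> str:
    """Decode Latin-1 bytes and replace placeholder characters one by one."""
    text = data.decode("latin-1")
    return text.translate(str.maketrans(LATIN1_TO_CYRILLIC))


if __name__ == "__main__":
    converted = latin1_bytes_to_unicode(b"ab\xe4")
    # Write the result out as utf-8 bytes, as a converted corpus file would be.
    print(converted.encode("utf-8"))
```

    Characters that are not in the table are passed through unchanged, so an incomplete table leaves Latin-1 characters in the output where they can be found later.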

    1. Using the Unicode character sets in the Emacs editor:

    2. A Unicode character set must already be selected when connecting to the computers of the CSC.
      Some hints:

      1. "The most consistent way to get the Emacs editor to work correctly with files written in utf-8 is to use 'prefer-coding-system' to set utf-8 as the preferred encoding. Other useful commands are 'set-buffer-coding-system', 'set-terminal-coding-system' and 'set-keyboard-coding-system'" (Eero Vitie, http://forums.csc.fi/kitwiki/pilot/view/KitWiki/LinuxToolsUnicode).
      2. "The fonts come from the computer connected to the X-term that is in use. The user has to tell the Emacs editor about the utf-8 environment. It makes a difference whether the Emacs editor runs in a window of its own or in the X-term text mode. In the latter case the X-term supplies the fonts, and the Emacs editor only writes the utf-8 characters to the monitor. The utf-8 characters can be made available in both cases. When the Emacs editor runs without a window of its own, it uses the utf-8 characters; when it runs in a window of its own, it 'smells' whether the file uses the Latin-1 characters or the utf-8 characters. The command 'env LC_CTYPE="fi_FI.utf8"' in the Emacs editor works correctly in the Linux operating system, i.e. it opens the Emacs editor in a window of its own, where it 'smells' whether the file is in the utf-8 form or the Latin-1 form, and the system shows the characters on the basis of this information. It is not possible to use the Latin-1 characters and the utf-8 characters in the same file, because if the Emacs editor finds even one character in the Latin-1 format, the whole file is interpreted to be in that format" (Jyrki Havia, 2007).
      3. The type of the ssh program and the type of the computer matter for displaying the characters. Not all ssh programs support Unicode, but "putty", for example, has settings in which utf-8 must be selected already when opening a session on a CSC computer (/Translations/UTF8 and /SSH/X11 [x] Enable X11 forwarding) (Jack Rueter).

      4. See also "man utf-8".
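
    Because Latin-1 and utf-8 characters cannot be mixed in one file, it can help to check a file before opening it. The following is a minimal sketch of such a check, not a description of how the Emacs editor itself detects the encoding; the function names guess_encoding and find_non_utf8_lines are hypothetical.

```python
from typing import List


def guess_encoding(data: bytes) -> str:
    """Return "utf-8" if the bytes decode cleanly as utf-8, else "latin-1"."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


def find_non_utf8_lines(data: bytes) -> List[int]:
    """Return the 1-based numbers of the lines that are not valid utf-8."""
    bad_lines = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad_lines.append(number)
    return bad_lines


if __name__ == "__main__":
    # Line 2 holds a single Latin-1 byte, line 3 the same letter in utf-8.
    sample = b"abc\n\xe4\n\xc3\xa4\n"
    print(guess_encoding(sample))
    print(find_non_utf8_lines(sample))
```

    A file made up only of ASCII characters is reported as utf-8, since ASCII is valid in both encodings. The line numbers from find_non_utf8_lines show where stray Latin-1 characters remain in a file that is otherwise utf-8.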

  4. The use of the Emacs editor: input of the characters from the keyboard:
    The utf-8 characters can be entered into the texts from the keyboard as follows: ESC - x, ucs, TAB, Carriage Return, and the four-character code of the character. More information on the utf-8 character sets can be found e.g. at the following addresses: University of Helsinki, Department of General Linguistics: Names of characters; UNICODE; What is Unicode?; Fonts and software resources for the Unicode Character Set.
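
    To check which character a given four-character code stands for, a small sketch like the following can be used. It assumes that the code is the hexadecimal Unicode code point of the character; the function name char_from_code is hypothetical.

```python
import unicodedata


def char_from_code(code: str) -> str:
    """Return the character for a hexadecimal Unicode code point such as "04D3"."""
    return chr(int(code, 16))


if __name__ == "__main__":
    ch = char_from_code("04D3")
    print(unicodedata.name(ch))  # official Unicode name of the character
    print(ch.encode("utf-8"))    # the bytes stored in a utf-8 file
```

    The four-character code and the bytes stored in the file are different things: the code identifies the character, while utf-8 determines how many bytes are written for it.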

  5. Displaying the utf-8 characters on the monitor:
    In the Emacs editor, the utf-8 characters are expressed with character codes. The characters themselves cannot be seen on computers that do not support utf-8. Even in this case, the utf-8 characters can be viewed with web browsers.

  6. Preprocessed data:
    Many of the old corpora at the UHLCS are preprocessed, i.e. they have been reformatted so that punctuation characters are separated from the text proper and capital letters are converted into lower-case ones by replacing them with combinations of characters, for example: *a = A, etc. Preprocessing was important because the programs used for analyzing machine-readable data usually treated words written in lower-case and upper-case letters as separate words, and the same applied to words with a punctuation character at the end or at the beginning. Many newer programs are designed so that there is no need to preprocess the data.
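
    The preprocessing convention can be illustrated with a minimal sketch. Only the *a = A notation comes from the description above; the function names, the set of punctuation characters and the use of spaces as separators are assumptions, and the actual corpora may follow other details.

```python
import re

# Assumed set of punctuation characters to be separated from the words.
PUNCTUATION = r"([.,;:!?()\"])"


def preprocess(text: str) -> str:
    """Separate punctuation from words and mark capitals as '*' + lower case."""
    spaced = re.sub(PUNCTUATION, r" \1 ", text)
    marked = "".join("*" + c.lower() if c.isupper() else c for c in spaced)
    # Collapse the extra spaces introduced around the punctuation.
    return " ".join(marked.split())


def restore_capitals(text: str) -> str:
    """Turn each '*' + letter combination back into the capital letter."""
    return re.sub(r"\*(\w)", lambda m: m.group(1).upper(), text)


if __name__ == "__main__":
    line = preprocess("Helsinki is big.")
    print(line)
    print(restore_capitals(line))
```

    restore_capitals only reverses the capital-letter notation; the spaces inserted around punctuation are left as they are.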

  7. Morphologically analyzed data:
    The indices used in morphologically analyzed data are described in the documents that describe the documentation system (the indices used in analyzing the corpora of the Uralic languages).


P.S. 2007; Last modified: Mon Nov 24 18:46:10 EET 2008