On Editing the Corpora
- Running text, ASCII:
If the ASCII character set covers
the writing system of the language, the ASCII characters can be used
as such. If ASCII is used to
simplify a more complex writing system,
this must be taken into account when the data are converted
back into the original character set.
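
The conversion back into the original character set can be sketched roughly as follows. This is only an illustration: the table ASCII_TO_ORIGINAL and the function restore_original are hypothetical, and the stand-ins actually used depend on the corpus in question.

```python
# Hypothetical table of ASCII stand-ins for characters outside ASCII.
# The real correspondences must be taken from the corpus documentation.
ASCII_TO_ORIGINAL = {
    "a:": "ä",
    "o:": "ö",
    "s^": "š",
}


def restore_original(text, table=ASCII_TO_ORIGINAL):
    """Replace each ASCII stand-in with the character it represents."""
    # Longer stand-ins are replaced first so that a short stand-in
    # cannot break up a longer one.
    for ascii_form in sorted(table, key=len, reverse=True):
        text = text.replace(ascii_form, table[ascii_form])
    return text


if __name__ == "__main__":
    print(restore_original("pa:iva:"))
```

A plain replacement of this kind is safe only if the stand-ins never occur in the text with their literal meaning (for example a word followed by a real colon), which is one reason why the simplification has to be taken into account corpus by corpus.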
- Running text, Latin-1:
All corpora written in
the Latin-1 character set can be used as such in the
UNIX operating system. They can also be edited with the Emacs editor available at
the working site.
- Running text, Latin-1 format which is transformed into the
A great number of the corpora at the UHLCS are written in
the Cyrillic alphabet. For most of the languages, the alphabet
contains more characters than the basic Russian alphabet, and
several characters are formed with diacritics. When these texts
were adapted to the UNIX operating system, the Cyrillic characters were
replaced with characters from the Latin-1 character set. These
characters are being converted into the UNICODE character sets with the help
of the scripts
available in the directories located within the corpus
directories. The work is still in progress.
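
The principle of such a conversion can be sketched as follows. This is not one of the scripts mentioned above: the table LATIN1_TO_CYRILLIC and the function convert_bytes are hypothetical, and the real correspondences differ from corpus to corpus.

```python
# Hypothetical mapping from Latin-1 stand-in characters to Cyrillic letters.
# The actual correspondences are defined separately for each corpus.
LATIN1_TO_CYRILLIC = str.maketrans({
    "a": "\u0430",     # CYRILLIC SMALL LETTER A
    "b": "\u0431",     # CYRILLIC SMALL LETTER BE
    "v": "\u0432",     # CYRILLIC SMALL LETTER VE
    "\u00e4": "\u04d3",  # a with diaeresis: Latin-1 form to Cyrillic form
    "\u00f6": "\u04e7",  # o with diaeresis: Latin-1 form to Cyrillic form
})


def convert_bytes(raw: bytes) -> bytes:
    """Decode Latin-1 bytes, map the stand-ins to Cyrillic, encode as UTF-8."""
    text = raw.decode("latin-1")
    return text.translate(LATIN1_TO_CYRILLIC).encode("utf-8")


if __name__ == "__main__":
    sample = "v\u00e4b".encode("latin-1")
    print(convert_bytes(sample).decode("utf-8"))
```

Characters that are missing from the table pass through unchanged, so an incomplete table shows up as leftover Latin-1 letters in the converted text.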
- Using the UNICODE character sets in the Emacs editor:
The UNICODE character set must be defined already when the connection to
the computers of the CSC is opened.
- "The most consistent way to get the Emacs editor to work
correctly in the files written with the utf-8 characters is to use
the "prefer-coding-system" to set
utf-8 as the preferred encoding. Other useful commands are
"set-buffer-coding-system", "set-terminal-coding-system" and
"set-keyboard-coding-system" (Eero Vitie,
- "The fonts come from the computer which is connected to the X-term
that is in use. The user has to tell to the Emacs editor about the
utf-8 environment. It makes a difference, if the Emacs editor works in the
window of its own, or if it is used in the X-term text mode. In
this (last) case the X-term uses the fonts, and the Emacs editor is only
writing the utf-characters on the monitor. The utf-8
characters can be made available in both cases.
In the case that the Emacs editor works
without the window of its own, it uses the utf-8 characters,
but in the case that
it works in a window of its own, it "smells", if the file uses the
Latin-1 characters or the utf-8 characters. The command
'env LC_CTYPE="fi_FI.utf8"' in the Emacs editor works correctly in the Linux
operating system, i.e. it opens the Emacs editor in the window of its own,
and it is "smelled" in this window, if the file is in the utf-8 form or the
Latin-1 form, and the system shows the characters on the basis
of this information. It is not possible to use the Latin-1 characters
and the utf-8 characters in the same file, because,
if the Emacs editor finds even one
character in the Latin-1 format, the whole file is interpreted to be in
that format" (Jyrki Havia, 2007).
The type of the ssh program and the type of the computer affect
how the characters are displayed.
Not all ssh programs support
UNICODE, but "putty", for example, has options in which
utf-8 must be selected already when opening a session on the
computer of the CSC (/Translations/UTF8 and /SSH/X11 [x] Enable X11 forwarding) (Jack Rueter).
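
Whether a file would be taken as utf-8 or as Latin-1 can be checked outside the editor with a short sketch like the one below. It does not reproduce the detection done by the Emacs editor; the function guess_encoding is a hypothetical helper that only illustrates why a single Latin-1 character is enough to make the whole file look like Latin-1.

```python
def guess_encoding(raw: bytes) -> str:
    """Return 'utf-8' if the bytes are valid UTF-8, otherwise 'latin-1'."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # At least one byte sequence is not valid UTF-8, so the whole
        # content is treated as Latin-1 (any byte string decodes as Latin-1).
        return "latin-1"
    return "utf-8"


if __name__ == "__main__":
    utf8_text = "\u00e4".encode("utf-8")
    latin1_text = "\u00e4".encode("latin-1")
    print(guess_encoding(utf8_text))
    print(guess_encoding(latin1_text))
    # A file mixing both encodings is classified as Latin-1.
    print(guess_encoding(utf8_text + latin1_text))
```

Pure ASCII content is valid in both encodings and is reported as utf-8 by this sketch.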
- See also "man utf-8".
- The use of the Emacs editor: input of the characters from the keyboard:
The utf-8 characters can be entered in the texts from the
keyboard as follows: press ESC and x, type ucs, press TAB and then Carriage Return, and type the
code of the character, which consists of four characters. More information on the utf-8 character
sets can be found e.g. at the following addresses: University of Helsinki,
Department of General Linguistics:
What is Unicode?;
software resources for the Unicode Character Set.
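
The relation between a four-character code and the character it produces can be illustrated with the following sketch. It assumes that the code is the hexadecimal Unicode code point of the character; the function describe_code and the example code 04D3 are chosen here only for illustration.

```python
import unicodedata


def describe_code(hex_code: str):
    """Return the character, its Unicode name and its UTF-8 bytes."""
    char = chr(int(hex_code, 16))  # interpret the code as hexadecimal
    return char, unicodedata.name(char), char.encode("utf-8")


if __name__ == "__main__":
    char, name, encoded = describe_code("04D3")
    print(char, name, encoded.hex())
```

The code point and the bytes stored in the file are different things: the four-character code identifies the character, whereas the file contains its utf-8 encoding, which for Cyrillic letters is two bytes long.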
Displaying the utf-8 characters on the monitor:
Emacs-editor, the utf-8 characters are expressed with the
The characters cannot be displayed on computers which do
not support utf-8.
Also in this case, the utf-8 characters can be seen with
- Preprocessed data:
Many of the old corpora at the UHLCS are preprocessed,
i.e. they have been reformatted so that punctuation characters are
separated from the text proper, and capital letters are
converted into small ones by replacing them with combinations of
characters, for example: *a = A, etc. Preprocessing was important
because the programs used in analyzing
machine-readable data usually treated words written with
lower and upper case letters as different words, and the same applied to words
with a punctuation character at the beginning or at the end of
the word. Many newer programs are designed so that there is no
need to preprocess the data.
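
The preprocessing described above can be sketched as follows. This is a minimal illustration, not the program that was used for the corpora: the functions preprocess and restore_capitals are hypothetical, and the set of punctuation characters is an assumption.

```python
import re


def preprocess(text: str) -> str:
    """Separate punctuation from words and mark capitals with '*'."""
    # Put spaces around each punctuation character (assumed set).
    text = re.sub(r"([.,;:!?()\"])", r" \1 ", text)
    # Replace each capital letter with '*' followed by the small letter.
    text = "".join("*" + ch.lower() if ch.isupper() else ch for ch in text)
    # Collapse the extra whitespace created above.
    return " ".join(text.split())


def restore_capitals(text: str) -> str:
    """Turn each '*x' combination back into the capital letter 'X'."""
    return re.sub(r"\*(.)", lambda m: m.group(1).upper(), text)


if __name__ == "__main__":
    processed = preprocess("Hello, World!")
    print(processed)
    print(restore_capitals(processed))
```

After this step, "Hello" and "hello" share the same lower case form and differ only in the "*" marker, and a word followed by a comma is no longer counted as a different word. The sketch does not rejoin the punctuation with the words when the capitals are restored.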
- Morphologically analyzed data:
The indices used in
morphologically analyzed data are described in the documents prepared
for the description of the documentation system
(the indices used in
analyzing the corpora of the Uralic languages).
© P.S. 2007; Last modified: Mon Nov 24 18:46:10 EET 2008