Character Encoding


The character sets used in the corpora at the UHLCS reflect the history of corpus linguistics. The original texts, typically books and newspapers, were prepared with various text-editing and typesetting programs and character sets. Some of the original texts were already in machine-readable form; others were scanned into electronic form. When the data were adapted to the UNIX operating system, information on the original documents was preserved. During the first years the texts were stored on UNIX in seven-bit ASCII. If the original texts contained characters that were not available in the ASCII character set, these characters were replaced with a combination of two or more ASCII characters. When the eight-bit Latin-1 character set, with various extensions, became available on UNIX, the corpora were converted to Latin-1 as well. Later, intensive work was done to convert to Unicode those corpora that contain characters not available in Latin-1. At the UHLCS this concerns in particular the corpora that were originally prepared in the Cyrillic alphabet. The first attempts to convert these corpora to the UTF-8 encoding were made with the financial support of the ECHO project.
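The kind of conversion described above can be illustrated with a minimal Python sketch. It is not the procedure used at the UHLCS: the helper name to_utf8, the sample strings, and the choice of KOI8-R as the legacy Cyrillic encoding are all assumptions made for illustration, since the text does not say which tools or source encodings were involved.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration only.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Sample data (assumed, not taken from any corpus).
latin1_sample = "Jyväskylä".encode("latin-1")   # one byte per character
cyrillic_sample = "корпус".encode("koi8_r")     # KOI8-R is an assumption

latin1_utf8 = to_utf8(latin1_sample, "latin-1")
cyrillic_utf8 = to_utf8(cyrillic_sample, "koi8_r")

# Seven-bit ASCII cannot represent these characters at all, which is why
# the early corpora needed multi-character replacements.
try:
    "ä".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(len(latin1_sample), len(latin1_utf8))
print(len(cyrillic_sample), len(cyrillic_utf8))
print(ascii_can_encode)
```

In UTF-8 the non-ASCII letters take more than one byte each, so converted files grow in size while the text itself is unchanged. The source encoding has to be known in advance, because the bytes alone do not reliably identify it.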

P.S. 2007; Last modified: Mon Nov 24 18:27:40 EET 2008