Character Encoding: /general-linguistics


Character encoding of the corpora

/general-linguistics

  1. /afro-asiatic-lgs
    1. /cushitic-lgs
      1. /somali
        1. Character encoding: Scandinavian alphabet in which the letters with diacritics and the corresponding capital letters are marked with numeric codes: \202, \224, \216, etc.
    1. /semitic-lgs
      1. /hebrew
        1. forthcoming
  2. /indo-european-lgs
    1. /germanic-lgs
      1. /english
        1. /gutenberg
        2. /susanne
        3. /WSJ
        4. Character set: the corpora use only the English alphabet; no additional characters are needed.
      2. /yiddish
        1. /royte-pomerantsen
          1. /yiddish-texts:
            Character set: ASCII.
      3. /greek
        1. forthcoming
      4. /latin-1
        1. /apa:
          Character set: ASCII. The data is written in capital letters.
      5. /slavonic-lgs
        1. /russian
          1. /fowler-corpus
          2. Character set: the corpus is in ASCII. Some Cyrillic characters are replaced with combinations of characters. The character set is described in the README file.
          3. /spoken:
            Character set: the corpus is in ASCII. Some Cyrillic characters are replaced with combinations of characters.
          4. /tampere-corpus:
            Character set: the corpus is in ASCII. The character set is described in detail in the README file.
          5. /uppsala-corpus:
            Character set: the corpus is in seven-bit ASCII. The character set is not described separately.
      6. /multilingual-data
        1. /words:
          Character set: the corpora are in ASCII.
      7. /uralic-lgs
        1. /finno-ugric-lgs
          1. /baltic-finnic-lgs
            1. /estonian
              1. /viro1:
                Character set: the corpus is in Latin-1.
              2. /viro2:
                Character set: the corpus is in seven-bit form. Characters with diacritics are marked with ASCII substitutes: {, |, }, [, \ and ], as well as the sequences *o, *u and <*i>.
            2. /finnish
              1. /bible
                1. /KRaamattu38:
                  Character set: Latin-1. There is also a version in which the characters with diacritics are marked with numeric codes ('ä' = \204, etc.).
                2. /KRaamattu92:
                  Character set: Latin-1.
              2. /hkv:
                Character set: ASCII.
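
Several corpora above mark non-ASCII letters with numeric codes such as \204, while others are stored as seven-bit ASCII or Latin-1. The following is a minimal sketch of how such data might be normalised before processing. It assumes that the numeric codes are octal byte values in the IBM PC code page cp437 (under which \204 is 'ä'); the listing does not name the code page, so this should be checked against each corpus's README file. The function names are hypothetical and not part of any corpus tooling.

```python
import re

# ASSUMPTION: the numeric codes are octal byte values in code page cp437.
# The corpus documentation does not state this; verify before relying on it.
ASSUMED_CODEPAGE = "cp437"

# Matches a backslash followed by three octal digits (values 000 to 377).
_OCTAL_ESCAPE = re.compile(r"\\([0-3][0-7]{2})")


def decode_octal_escapes(text, codepage=ASSUMED_CODEPAGE):
    """Replace each \\NNN octal code with the character it denotes."""
    def _substitute(match):
        byte_value = int(match.group(1), 8)
        return bytes([byte_value]).decode(codepage)
    return _OCTAL_ESCAPE.sub(_substitute, text)


def is_seven_bit(data):
    """Return True if every byte lies in the seven-bit ASCII range."""
    return all(byte < 0x80 for byte in data)


def latin1_to_utf8(data):
    """Re-encode Latin-1 bytes as UTF-8 bytes."""
    return data.decode("latin-1").encode("utf-8")
```

Under the cp437 assumption, `decode_octal_escapes(r"p\204iv\204")` gives `"päivä"`. A check such as `is_seven_bit(raw_bytes)` can confirm that a corpus described as seven-bit ASCII contains no stray eight-bit characters before any substitution table is applied.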


P.S. 2007; Last modified: Mon Nov 24 17:00:16 EET 2008