Character Encoding: /multilingual-language-archive


Character encoding of the corpora

/multilingual-language-archive

All corpora received from the Institute for Bible Translation before 2007 were prepared with various text-editing and layout programs: MS Word, WordPerfect, PageMaker, etc. These corpora were converted manually for use on the UNIX operating system. Basic scripts for converting the adapted data into the UTF-8 format are located in directories at the same level as the data directories; these directories are named XXX-in-preparation, where XXX is the abbreviation of the language name. In addition to the versions adapted into the Latin-1 format, the data directories also contain material converted into the UTF-8 format. In the 1990s, several corpora originally written in the Cyrillic alphabet were converted manually into the Latin-1 format.
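
As a rough illustration of the kind of re-encoding step described above, the following Python sketch converts Latin-1 bytes into UTF-8. It is not one of the archive's own scripts, which are not reproduced here; the names `latin1_to_utf8` and `convert_file` and any file paths passed to them are hypothetical. Note that the sketch only changes the byte encoding. It does not map a manual Latin-1 rendering of Cyrillic text back to Cyrillic characters, which would need a corpus-specific character table.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 (ISO 8859-1) bytes as UTF-8.

    Hypothetical helper for illustration; not the archive's own script.
    """
    # Every byte value is a valid Latin-1 character, so decoding cannot fail.
    text = data.decode("latin-1")
    return text.encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a Latin-1 file and write a UTF-8 copy (paths are assumptions)."""
    with open(src_path, "rb") as src:
        converted = latin1_to_utf8(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)
```

Working on bytes rather than on text-mode file handles keeps the sketch independent of the platform's default encoding.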

  1. /chukotko-kamchatkan-lgs
    1. /chukchi
      1. /New-Testament:
        Character sets: direct conversion; the UNICODE-format (the script).
    2. /koryak
      1. /New-Testament:
        Character sets: direct conversion; the UNICODE-format (the script).
  2. /indo-european-lgs
    1. /iranian-lgs
      1. /west-iranian-lgs
        1. /kurdish
          1. /New-Testament:
            Character sets: direct conversion; the UNICODE-format (the script).
        2. /tajik
          1. /Bible-of-Children:
            Character sets: direct conversion; the UNICODE-format (the script).
          2. /Books-of-Children:
            Character sets: direct conversion; the UNICODE-format (the script).
      2. /east-iranian-lgs
        1. /ossete
          1. /Bible-of-Children:
            Character sets: direct conversion; the UNICODE-format (the script).
          2. /Books-of-Children:
            Character sets: direct conversion; the UNICODE-format (the script).
          3. /New-Testament:
            Character sets: direct conversion; the UNICODE-format (the script).
    2. /slavonic-lgs
      1. /east-slavonic-lgs
        1. /ukrainian
          1. /Books-of-Children:
            Character sets: direct conversion; the UNICODE-format (the script).
  3. /north-east-caucasian-lgs
    1. /avar-andi-tsez-lgs
      1. /avar
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
    2. /lak-dargva-lgs
      1. /lak
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
    3. /lezgi-lgs
      1. /tabassaran
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
  4. /mongolic-lgs
    1. /east-mongolic-lgs
      1. /kalmyk
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
  5. /tungusic-lgs
    1. /north-tungusic-lgs
      1. /even
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
      2. /evenki
        1. /Books-of-Children:
          Character sets: direct conversion; the UNICODE-format (the script).
    2. /south-tungusic-lgs
      1. /nanay
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
  6. /turkic-lgs
    1. /bolgar-turkic-lgs
      1. /chuvash
        1. /paasonen-texts:
          Character sets: the original texts were prepared with the Finno-Ugric transcription system. The corpus has been transformed into the Latin-1 format, and the characters used in the conversion are not available as such in the UNIX operating system. Information on the original publication is available in the README-file.
    2. /north-turkic-lgs
      1. /khakas
        1. /Books-of-Children:
          Character sets: direct conversion; the UNICODE-format (the script).
        2. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
      2. /tuvin
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
      3. /yakut
        1. /Bible-of-Children:
          Character sets: the data is in the UNICODE-format.
        2. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
    3. /north-west-turkic-lgs
      1. /balkar
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
        2. /Psalms:
          Character sets: direct conversion; the UNICODE-format (the script).
      2. /bashkir
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
      3. /crimean-turkish
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
        2. /wordlist:
          Character set: Latin-1.
      4. /kirghiz
        1. /Books-of-Children:
          Character sets: direct conversion; the UNICODE-format (the script).
      5. /tatar
        1. /Books-of-Children:
          Character sets: direct conversion; the UNICODE-format (the script).
        2. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
    4. /south-east-turkic-lgs
      1. /uighur
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
      2. /uzbek
        1. /Bible-of-Children:
          Character sets: direct conversion; the UNICODE-format (the script).
        2. /dictionary:
          Character set: ASCII.
    5. /south-west-turkic-lgs
      1. /azerbaijani
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
      2. /gagauz
        1. /New-Testament:
          Character sets: direct conversion; the UNICODE-format (the script).
      3. /turkmen
        1. /New-Testament:
          Character sets: the data is converted manually into the Latin-1 format. There is also a version in the UNICODE-format (the script).
        2. /Old-Testament:
          Character sets: the data is converted manually into the Latin-1 format. There is also a version in the UNICODE-format (the script).
  7. /uralic-lgs
      1. Finno-Ugric languages:
        1. /saami-lgs
          1. /kildin-saami
            1. /Books-of-Children:
              Character set: direct conversion; the UNICODE format.
          2. /northern-saami
            1. /metadata-descriptions:
            2. report:
              Character set: direct conversion; the Latin-1 format.
            3. /vuolab:
              Character set: the first version of the data adjusted into the Latin-1 format (not proof-read).
          3. /ume-saami
            1. /data:
              Character set: the data is adapted manually into the Latin-1 format.
        2. /baltic-finnic-lgs
          1. /ingrian
            1. /english-translations:
              Character set: Latin-1 format.
            2. /morphologically-tagged-corpora:
              The data of the /texts directory in morphologically encoded form.
            3. /texts:
              Character set: the data is in the Latin-1 format. The corpus was prepared manually from original data written in the Finno-Ugric transcription system. The corpus is described in the README-file.
          2. /karelian
            1. /dvina-karelian
              1. /Books-of-Children:
                Character set: the data is converted manually into the Latin-1 format.
              2. /New-Testament:
                Character set: the data is converted manually into the Latin-1 format.
            2. /livvi
              1. /Bible-of-Children:
                Character set: the data is converted manually into the Latin-1 format.
              2. /Books-of-Children:
                Character set: the data is converted manually into the Latin-1 format.
              3. /New-Testament:
                Gospel-of-John, Gospel-of-Mark:
                Character set: the data is converted manually into the Latin-1 format.
                Gospel-of-Matthew, Gospel-of-Luke:
                Character set: the data is converted directly into the Latin-1 format. The rules for converting the data into the UNICODE-format are available.
              4. New-Testament-all:
                xml-, htm-, and txt-formats.
            3. /lude
              1. /texts:
                Character set: the data is adapted into the Latin-1 format as such. The data is transcribed from audio tapes. The basis of the transcription system and information on the documentation are described in the README-file.
          3. /livonian
            1. /Books-of-Children:
              Character set: the data is adapted directly into the UNIX operating system. The corpus can be converted into the UNICODE-format (the script).
            2. /suhonen:
              Character set: the data is converted directly into the Latin-1 format.
          4. /veps
            1. /Bible-of-Children:
              Character set: the data is adapted directly into the UNIX operating system. There is a script available for converting the data into the UNICODE format.
            2. /Books-of-Children:
              Character set: a copy of the data is converted manually into the Scandinavian alphabet. There is also a version converted into the UNICODE format (the script).
            3. /New-Testament:
              Gospel-of-John, Gospel-of-Mark:
              Character set: a copy of the data is converted manually into the Scandinavian alphabet. There is also a version converted into the UNICODE format.
              Gospel-of-Matthew:
              Character set: the data is converted into the UNICODE format.
            4. New-Testament-all:
              xml-, htm-, and txt-formats.
        3. /mari-lgs
          1. /eastern-mari
            1. /Bible-of-Children:
              Character set: the data is adapted into the UNIX operating system. The corpus is also converted into the UNICODE format (the script).
            2. New-Testament-all
              xml-, htm-, and txt-formats
          2. /western-mari
            1. /Books-of-Children:
              Character set: the data is adapted into the UNIX operating system. The corpus is also converted into the UNICODE format (the script).
            2. /hill-mari-texts:
              Character set: the data was originally written in the Finno-Ugric transcription system. The data is adapted directly into the UNIX operating system.
            3. /New-Testament:
              Character set: the data is adapted into the UNIX operating system. The corpus is also converted into the UNICODE format (the script).
        4. /mordvin-lgs
          1. /erzya
            1. /Bible-of-Children:
              Character set: the data is converted manually into the Latin-1 character set.
            2. /dictionary:
              Character set: the data is prepared to be converted into the UNICODE character set (the script).
            3. /epos:
              Character set: the data is prepared to be converted into the UNICODE character set (the script).
            4. /historical-word-list:
              Character set: the data is adapted manually into the Latin-1 character set. The principal word stress is marked in the data.
            5. /journals:
              Character set: the data is prepared to be converted into the UNICODE character set.
            6. /morphologically-tagged-corpora:
              Character set: the data is converted manually into the Latin-1 format.
            7. /New-Testament:
              Character set: the data is adapted into the UNIX operating system. The corpus can be converted into the UNICODE format (the script).
            8. /novels:
              Character set: the data is prepared to be converted into the UNICODE character set (the files in the directories abra-*). Other files: the data is prepared to be converted into the UNICODE character set.
            9. /poetry:
              Character set: the data is prepared to be converted into the UNICODE character set.
            10. /short-stories:
              Character set: (in the directory /arap-v-*) the data is prepared to be converted into the UNICODE character set. Other files: the data is prepared to be converted into the UNICODE format.
            11. New-Testament-all:
              xml-, htm-, and txt-formats.
          2. /moksha
            1. /Books-of-Children, /historical-word-list:
              Character set: the data is adapted manually into the Latin-1 character set. The principal stress is marked in the words.
            2. /New-Testament:
              ???
            3. /novels ???
        5. /permic-lgs
          1. /komi
            1. /permyak
              1. /Books-of-Children:
                Character set: the data is converted into the UNICODE format.
              2. /New-Testament:
                Character set: the data is converted into the UNICODE format (the script).
            2. /zyrian
              1. /Books-of-Children:
                Character set: the data is adapted into the UNIX operating system. The data can be converted into the UNICODE format with the scripts available for users.
              2. /komi-texts, /komi-texts-snt, /morphologically-tagged-corpora:
                Character set: the data is adapted manually into the Latin-1 character set. A description of the characters is given in the README-file.
              3. /New-Testament:
                Character set: the data is adapted into the UNIX operating system. The data can be converted into the UNICODE format with the scripts available for users.
              4. New-Testament-all:
                xml-, htm-, and txt-formats
              5. /novels:
                Character set: the data is prepared to be converted into the UNICODE format. Information on the system can be requested from the owner of the corpus.
          2. /udmurt
            1. /Books-of-Children:
              Character set: the data is converted into the UNICODE format.
            2. /New-Testament:
              Character set: the data is adapted into the UNIX operating system. The data can be converted into the UNICODE format with the scripts available for users.
            3. New-Testament-all:
              xml-, htm-, and txt-formats.
            4. ??
            5. /novels:
              Character set: the data is prepared to be converted into the UNICODE format. Information on the system can be requested from the editor of the corpus.
            6. /udmurt-snt, /udmurt-texts-unmodified:
              Character set: the data is converted manually into the Latin-1 character set.
            7. /udmurt-statistical-data:
              Character set: the numerical data is prepared manually. The encoding deals with grammatical categories. The encoding system is explained in the README files.
        6. /ugric-lgs
          1. /khanty
            1. /atlym-dialect, /kazym-dialect, /konda-dialect, /nizjam-dialect, /obdorsk-dialect, /synja-dialect:
              Character set: the data is converted manually into the Latin-1 format. The description of the coding system is available in the set.
            2. /Books-of-Children:
              Character set: the data is converted into the UNICODE format.
            3. /rugin:
              Character set: the data is converted manually into the Latin-1 format. The description of the coding system is available in the set.
          2. /mansi
            1. /Books-of-Children:
              Character set: the data is converted into the UNICODE format.
      2. Samoyedic languages:
        1. /samoyedic-lgs
          1. /enets
            1. /New-Testament:
              Character set: the data is converted into the UNICODE format.
          2. /nenets
            1. /tundra-nenets
              1. /New-Testament:
                Character set: the data is converted into the UNICODE format.
              2. /sample-sentences:
                Character set: the data is converted manually into the Latin-1 format. The conversion is described in the README-file.
          3. /kamas
            1. /texts-donner:
              Character set: the data is converted manually into the Latin-1 format. The data is described in the README-file.
          4. /selkup
            1. /h-dialects, /ivankino-dialect, /ket-dialect, /tundra-dialect, /tym-dialect, /upper-ob-dialect:
              Character set: the data is adapted manually into the Latin-1 format. The character set is described in the README-file.
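
The listing above distinguishes ASCII, Latin-1, and UNICODE (UTF-8) data. When it is unclear which of these a given file uses, a check along the following lines may help. This is only a heuristic sketch, and the function name `guess_encoding` is hypothetical: because any byte sequence decodes as Latin-1 without error, Latin-1 can only be a fallback guess, and the README-file of the corpus remains the authoritative source.

```python
def guess_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'latin-1' for a byte string.

    Hypothetical heuristic for illustration. Latin-1 is only a fallback,
    because every byte sequence is decodable as Latin-1.
    """
    # Pure 7-bit data is valid in all three encodings; report the narrowest.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"
```

For example, the bytes of a Cyrillic word encoded as UTF-8 are reported as `utf-8`, while a Latin-1 encoded word containing `ä` fails strict UTF-8 decoding and falls through to `latin-1`.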


    P.S. 2007; Last modified: Mon Nov 24 17:41:07 EET 2008