Character Sets of the Corpora at the UHLCS


Character encoding of the corpora

/general-linguistics

  1. /afro-asiatic-lgs
    1. /cushitic-lgs
      1. /somali
        1. forthcoming: Character encoding: Scandinavian alphabet in which '', '', '' and the corresponding capital letters are marked with numeric codes: \202, \224, \216.
    2. /semitic-lgs
      1. /hebrew
      2. forthcoming
  2. /indo-european-lgs
    1. /germanic-lgs
      1. /english
        1. /gutenberg
        2. /susanne
        3. /World-Street-Journal
        4. Character sets: the corpora use the English alphabet, and no additional characters are needed.
      2. /german
        1. forthcoming: Character set: the corpus uses a seven-bit character set, and the vowels originally written with diacritics are replaced by the combinations 'ae', 'oe', and 'ue'.
      3. /swedish
        1. forthcoming
      4. /yiddish
        1. /royte-pomerantsen
          1. /yiddish-texts: Character set: in the original Yiddish texts, the main stress is marked with the acute accent character. The stress is not marked in the corpus texts.
      5. /greek
        1. forthcoming
      6. /latin-1
        1. /apa: Character set: the seven-bit ASCII code. The data is prepared with capital letters.
      7. /slavonic-lgs
        1. /russian
          1. /fowler-corpus
          2. CHARACTER-MAP: Character sets: the corpus is in the ASCII form. Some of the Cyrillic characters are replaced with combinations of characters. The character set is described in the README-file.
          3. /spoken: Character set: the corpus is in the ASCII form. Some of the Cyrillic characters are replaced with character combinations.
          4. /tampere-corpus: Character set: The corpus is in the ASCII form. The character set is described in detail in the README-file.
          5. /uppsala-corpus: Character set: the corpus is in the seven-bit ASCII form. The character set is not described separately.
      8. /multilingual-data
        1. /words: Character set: the corpora are in the ASCII form.
      9. /uralic-lgs
        1. /baltic-finnic-lgs
          1. /estonian
            1. /viro: Character set: the corpus is in the Latin-1 form.
            2. viro2: Character set: the corpus is in the seven-bit form. The characters formed with diacritics are marked as follows: , , (, , ), {, |, } [, \ and ]; , = *o, *u, <*i>
          2. /finnish
            1. /bible
              1. /KRaamattu38: Character set: Latin-1 character set. There is also a version in which the characters with diacritics are marked with numeric codes: '' = \204.
              2. /KRaamattu92: Character set: Latin-1 character set.
            2. /fintag
              1. forthcoming
              2. /hkv: Character set: the seven-bit ASCII codes.
            3. /spoken
              1. forthcoming
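
The 'ae', 'oe', 'ue' convention mentioned for the German corpus can be sketched in Python as follows. This is only an illustration: the function name and the treatment of capital letters are assumptions, and the listing above does not say how other characters (for example the sharp s) are handled, so they are left untouched here.

```python
import unicodedata

# Hypothetical mapping: the lowercase pairs come from the corpus note above,
# the capitalised pairs are an assumption made for this sketch.
UMLAUT_TO_DIGRAPH = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def to_seven_bit_german(text: str) -> str:
    """Replace German umlaut vowels with two-letter seven-bit combinations."""
    # Normalise first so that 'a' + combining diaeresis is treated like 'ä'.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(UMLAUT_TO_DIGRAPH)


print(to_seven_bit_german("Mädchen hören über"))
```

Note that the replacement cannot be reversed mechanically, since 'ue' or 'oe' also occur in words that never had a diacritic (for example "neue" or "Poet").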


/general-linguistics-kotus

  1. /indo-european-lgs
    1. /germanic-lgs
      1. /swedish
        1. forthcoming
  2. /uralic-lgs
    1. /baltic-finnic-lgs
      1. /finnish
        1. /a-contract: Latin-1
        2. /b-contract: Latin-1
        3. /ftc: Latin-1
        4. /parole: Latin-1

/language-departments

  1. /germanic-lgs
    1. /swedish: Character set: seven-bit ASCII code, for example: anf|r s}som folktingets utl}tande f|ljande
  2. /niger-congo-lgs
    1. /bantu-lgs
      1. /swahili: Character set: seven-bit ASCII code. The corpora are described in the README-file.
  3. /slavonic-and-baltic-lgs
    1. /russian
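
The Swedish sample above appears to follow the seven-bit national convention in which braces, brackets, the vertical bar and the backslash stand for the Scandinavian letters. A minimal sketch of decoding such text, assuming that convention applies to the whole corpus (the table and the function name are illustrative and are not taken from the corpus documentation):

```python
# Assumed mapping of the Swedish/Finnish seven-bit convention.
SEVEN_BIT_TO_LETTER = str.maketrans({
    "{": "ä", "|": "ö", "}": "å",
    "[": "Ä", "\\": "Ö", "]": "Å",
})


def decode_scandinavian_seven_bit(text: str) -> str:
    """Turn seven-bit stand-in characters back into Scandinavian letters."""
    return text.translate(SEVEN_BIT_TO_LETTER)


sample = "anf|r s}som folktingets utl}tande f|ljande"
print(decode_scandinavian_seven_bit(sample))
# prints: anför såsom folktingets utlåtande följande
```

Any genuine brackets, braces or backslashes in the text would be converted as well, so the result should be checked against the README-file of the corpus in question.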

/multilingual-language-archive

All the corpora received from the Institute for Bible Translation before 2007 were written with various text-editing and page-layout programs: MSWord, WordPerfect, Word, PageMaker, etc. The corpora were converted manually for use in the UNIX operating system. With a few exceptions, all the data originally prepared in the Cyrillic alphabet has been converted into the UNICODE format. The conversion work is still in progress, and the available versions do not yet work perfectly. All the scripts that can be used for re-converting the data are available to users; they are located in the directories XXX-in-preparation (XXX = the abbreviation of the name of the language). For the use of the Emacs editor in editing data in the UNICODE format, please see the instructions in the file README-emacs-and-UNICODE. Some corpora originally written in the Cyrillic or Extended Cyrillic alphabet were converted into the Latin-1 format manually.
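
For readers who only need to change the byte encoding of a Latin-1 file, the step can be sketched as below. This is a generic illustration and not one of the archive's own scripts: the function name is hypothetical, and UTF-8 is assumed as the Unicode encoding, which the text above does not specify.

```python
from pathlib import Path


def latin1_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a Latin-1 text file as UTF-8 (assumed target encoding)."""
    # Every byte value is a valid Latin-1 character, so decoding cannot fail.
    text = src.read_bytes().decode("latin-1")
    dst.write_bytes(text.encode("utf-8"))
```

Re-encoding of this kind does not restore Cyrillic letters in corpora that were transliterated into Latin-1; that requires the corpus-specific scripts in the XXX-in-preparation directories.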

  • /chukotko-kamchatkan-lgs
    1. /chukchi
      1. /New-Testament: Character sets: the data is in the UNICODE-format.
    2. /koryak
      1. /New-Testament: Character sets: the data is in the UNICODE-format.
  • /indo-european-lgs
    1. /iranian-lgs
      1. /west-iranian-lgs
        1. /kurdish
          1. /New-Testament: Character sets: the data is in the UNICODE-format.
        2. /tajik
          1. /Bible-of-Children: Character sets: the data is in the UNICODE-format.
          2. /Books-of-Children: Character sets: the data is in the UNICODE-format.
        3. /east-iranian-lgs
          1. /ossete
            1. /Bible-of-Children: Character sets: the data is in the UNICODE-format.
            2. /Books-of-Children: Character sets: the data is in the UNICODE-format.
            3. /New-Testament: Character sets: the data is in the UNICODE-format.
        4. /slavonic-lgs
          1. /east-slavonic-lgs
            1. /ukrainian
              1. /Books-of-Children: Character sets: the data is in the UNICODE-format.
        5. /mongolic-lgs
          1. /west-mongolic-lgs
            1. /kalmyk
              1. /New-Testament: Character sets: the data is in the UNICODE-format.
        6. /north-east-caucasian-lgs
          1. /avar-andi-tsez-lgs
            1. /avar
              1. /New-Testament: Character sets: the data is in the UNICODE-format.
          2. /lak-dargva-lgs
            1. /lak
              1. /New-Testament: Character sets: the data is in the UNICODE-format.
          3. /lezgi-lgs
            1. /tabassaran
              1. /New-Testament: Character sets: the data is in the UNICODE-format.
        7. /quechuan-lgs
          1. /quechua-cuzco: Character sets: the data is scanned from the old paper copies, and the scanned version is waiting for proofreading.
        8. /tungusic-lgs
          1. /north-tungusic-lgs
            1. /even
              1. /New-Testament: Character sets: the data is in the UNICODE-format.
            2. /evenki
              1. /Books-of-Children: Character sets: the data is in the UNICODE-format.
          2. /south-tungusic-lgs
            1. /nanay
              1. /New-Testament: Character sets: the data is in the UNICODE-format.
        9. /turkic-lgs
          1. /bolgar-turkic-lgs
            1. /chuvash
              1. /paasonen-texts: Character sets: the original texts were prepared with the Finno-Ugric transcription system. The corpus is transformed into the Latin-1 format, and the characters used in the conversion are not available as such in the UNIX operating system. Information on the original publication is available in the README-file.
          2. north-turkic-lgs
            1. khakas
              1. Books-of-Children: Character sets: the data is in the UNICODE-format.
              2. New-Testament: Character sets: the data is in the UNICODE-format.
            2. tuvin
              1. New-Testament: Character sets: the data is in the UNICODE-format.
            3. yakut
              1. Bible-of-Children: Character sets: the data is in the UNICODE-format.
              2. New-Testament: Character sets: the data is in the UNICODE-format.
          3. north-west-turkic-lgs
            1. balkar
              1. New-Testament: Character sets: the data is in the UNICODE-format.
              2. Psalms: Character sets: the data is in the UNICODE-format.
            2. bashkir
              1. New-Testament: Character sets: the data is in the UNICODE-format.
            3. crimean-turkish
              1. New-Testament: Character sets: the data is in the UNICODE-format.
              2. wordlist: Character set: Latin-1.
            4. kirghiz
              1. Books-of-Children: Character sets: the data is in the UNICODE-format.
            5. tatar
              1. Books-of-Children: Character sets: the data is in the UNICODE-format.
              2. New-Testament: Character sets: the data is in the UNICODE-format.
          4. south-east-turkic-lgs
          5. uighur
            1. New-Testament: Character sets: the data is in the UNICODE-format.
          6. uzbek
            1. Bible-of-Children:
            2. dictionary
        10. south-west-turkic-lgs
          1. azerbaijani
            1. New-Testament: Character sets: the data is in the UNICODE-format.
          2. gagauz
            1. New-Testament: Character sets: the data is in the UNICODE-format.
          3. turkmen
            1. New-Testament: Character sets: the data is converted manually into the Latin-1 format. There is also a version in the UNICODE-format.
            2. Old-Testament: Character sets: the data is converted manually into the Latin-1 format. There is also a version in the UNICODE-format.
      2. uralic-lgs
        1. baltic-finnic-lgs
          1. ingrian
            1. english-translations: Character set: Latin-1 format.
            2. morphologically-tagged-corpora: Contains the data of the directory /texts in morphologically encoded form.
            3. texts: The character set: the data is in the Scandinavian character set. The corpus is prepared manually from the original data written with the Finno-Ugric transcription system. The corpus is described in the README-file.
          2. dvina-karelian
            1. Books-of-Children: Character set: the data is converted manually into the Latin-1 format.
            2. New-Testament: Character set: the data is converted manually into the Latin-1 format.
          3. livonian
            1. Books-of-Children: Character set: the data is adapted directly into the UNIX operating system. The corpus can be converted into the UNICODE-format.
            2. suhonen: Character set: the data is converted directly into the Latin-1 format.
          4. livvi
            1. Bible-of-Children: Character set: the data is converted manually into the Latin-1 format.
            2. Books-of-Children: Character set: the data is converted manually into the Latin-1 format.
            3. New-Testament:
            4. Gospel-of-John, Gospel-of-Mark: Character set: the data is converted manually into the Latin-1 format.
            5. Gospel-of-Matthew, Gospel-of-Luke: Character set: the data is converted directly into the Latin-1 format. The rules for converting the data into the UNICODE-format are available.
          5. lude
            1. texts: Character set: the data is adapted into the Latin-1 format as such. The data is transcribed from audio tapes. The basis of the transcription system and information on the documentation are described in the README-file.
          6. vepsian
            1. Bible-of-Children: Character set: the data is adapted directly into the UNIX operating system. There is a script available for converting the data into the UNICODE format.
            2. Books-of-Children: Character set: a copy of the data is converted manually into the Scandinavian alphabet. There is also a version converted into the UNICODE format.
            3. New-Testament:
            4. Gospel-of-John, Gospel-of-Mark: Character set: a copy of the data is converted manually into the Scandinavian alphabet. There is also a version converted into the UNICODE format.
            5. Gospel-of-Matthew: Character set: the data is converted into the UNICODE format.
        2. mari-lgs
          1. eastern-mari
            1. Bible-of-Children: Character set: the data is adapted into the UNIX operating system. The corpus is also converted into the UNICODE format.
          2. western-mari
            1. Books-of-Children: Character set: the data is adapted into the UNIX operating system. The corpus is also converted into the UNICODE format.
            2. hill-mari-texts: Character set: the data was originally written in the Finno-Ugric transcription system. The data is adapted directly into the UNIX operating system.
            3. New-Testament: Character set: the data is adapted into the UNIX operating system. The corpus is also converted into the UNICODE format.
        3. mordvin-lgs
          1. erzya
            1. Bible-of-Children: Character set: the data is converted manually into the Latin-1 character set.
            2. dictionary: Character set: the data is prepared to be converted into the UNICODE character set.
            3. epos: Character set: the data is prepared to be converted into the UNICODE character set.
            4. historical-word-list: Character set: the data is adapted manually into the Latin-1 character set. The main word stress is marked in the data.
            5. journals: Character set: the data is prepared to be converted into the UNICODE character set.
            6. morphologically-tagged-corpora: Character set: the data is converted manually into the Latin-1 character set.
            7. New-Testament: Character set: the data is adapted into the UNIX operating system. The corpus can be converted into the UNICODE format.
            8. novels: Character set: the data is prepared to be converted into the UNICODE character set (the files in the directories abra-*). Other files: the data is prepared to be converted into the UNICODE character set.
            9. poetry: Character set: the data is prepared to be converted into the UNICODE character set.
            10. short-stories: Character set: (in the directory /arap-v-*) the data is prepared to be converted into the UNICODE character set. Other files: the data is prepared to be converted into the UNICODE character set.
          2. moksha
            1. Books-of-Children, historical-word-list: Character set: the data is adapted manually into the Latin-1 character set. The main word stress is marked in the data.
            2. New-Testament ???
            3. novels ???
        4. permic-lgs
          1. komi
            1. permyak
              1. Books-of-Children: Character set: the data is converted into the UNICODE format.
              2. New-Testament: Character set: the data is converted into the UNICODE format.
            2. zyrian
              1. Books-of-Children: Character set: the data is adapted into the UNIX operating system. The data can be converted into the UNICODE format with the scripts available for users.
              2. komi-texts, komi-texts-snt, morphologically-tagged-corpora: Character set: the data is adapted into the Latin-1 character set manually. Description of the characters is given in the README-file.
              3. New-Testament: Character set: the data is adapted into the UNIX operating system. The data can be converted into the UNICODE format with the scripts available for users.
              4. novels: Character set: the data is prepared to be converted into the UNICODE format. Information on the system can be requested from the owner of the corpus.
          2. udmurt
            1. Books-of-Children: Character set: the data is converted into the UNICODE format.
            2. New-Testament: Character set: the data is adapted into the UNIX operating system. The data can be converted into the UNICODE format with the scripts available for users. ???
            3. novels: Character set: the data is prepared to be converted into the UNICODE format. Information on the system can be requested from the owner of the corpus.
            4. udmurt-snt, udmurt-texts-unmodified: Character set: the data is converted manually into the Latin-1 character set.
            5. udmurt-statistical-data: Character set: the numerical data is prepared manually. The encoding deals with grammatical categories. The encoding system is explained in the README files.
        5. saami-lgs
          1. kildin-saami
            1. Books-of-Children: Character set: the data is converted into the UNICODE format.
          2. northern-saami
            1. metadata-descriptions:
            2. report: Character set: the data is converted manually into the Latin-1 format.
            3. vuolab: Character set: the first, uncorrected version of the data, adapted into the Latin-1 format.
            4. ume-saami
              1. data: Character set: the data is adapted manually into the Latin-1 format.
          3. samoyedic-lgs
            1. enets
              1. New-Testament: Character set: the data is converted into the UNICODE format.
            2. kamas:
              1. texts-donner: Character set: the data is converted manually into the Latin-1 format. The data is described in the README-file.
            3. nenets
              1. tundra-nenets
                1. New-Testament: Character set: the data is converted into the UNICODE format.
                2. sample-sentences: Character set: the data is converted manually into the Latin-1 format. The conversion is described in the README-file.
            4. selkup
              1. h-dialects, ivankino-dialect, ket-dialect, tundra-dialect, tym-dialect, upper-ob-dialect: Character set: the data is adapted manually into the Latin-1 format. The character set is described in the README-file.
          4. ugric-lgs
            1. khanty
              1. atlym-dialect, kazym-dialect, konda-dialect, nizjam-dialect, obdorsk-dialect, synja-dialect: Character set: the data is converted manually into the Latin-1 format. The coding system is available in the set.
              2. Books-of-Children: Character set: the data is converted into the UNICODE format.
              3. rugin: Character set: the data is converted manually into the Latin-1 format. The coding system is available in the set.
            2. mansi
              1. Books-of-Children: Character set: the data is converted into the UNICODE format.
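
The listing distinguishes three kinds of character set: seven-bit ASCII, Latin-1 and UNICODE. As a rough illustration of why corpora in the Cyrillic alphabet need the last of these, the sketch below reports the narrowest of the three that can represent a given text. The function name and the labels are hypothetical and chosen only to match the terms used in this listing.

```python
def narrowest_charset(text: str) -> str:
    """Return the narrowest of the three character sets that can hold text."""
    for label, codec in (("seven-bit ASCII", "ascii"), ("Latin-1", "latin-1")):
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # a character falls outside this set; try the next one
        return label
    return "UNICODE"


print(narrowest_charset("New Testament"))  # seven-bit ASCII
print(narrowest_charset("päivä"))          # Latin-1
print(narrowest_charset("Новый Завет"))    # UNICODE
```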


    P.S. 30 Nov - 18 Dec 2007