Data-structure scripts for bible text translations

 

 
The Scripts
 
Demonstrations:
- A latin-1 encoded corpus
- The same corpus converted to Unicode
- The same corpus in XML format, tagged with a script
(To see the Unicode corpus properly, you may need to set your browser's character encoding to Unicode (UTF-8).
If your browser does not display the XML file neatly, save the file to your computer and open it in a text editor.)

 

Documentation

 
1. Scripts
2. Perl programs (called from scripts)
3. XML definitions
 
 
 

1.

Scripts:

1.1. structure_paragraphs_by_linefeed.sh
1.2. structure_paragraphs_by_emptyline.sh
1.3. dictionary.sh
 

1.1.

structure_paragraphs_by_linefeed.sh

Script takes three parameters:
- Name of corpus to be analyzed
- Address of metadata description of the corpus
- Name of file to be created

The script adds XML headers (including a reference to the given metadata address), then analyzes the text and tags titles, paragraphs, Bible references, other references and footnotes.
Paragraphs are divided by linefeeds.
The script uses the Perl programs footnotes_references_biblerefs.pl, titles.pl and headers_paragraph_linefeed.pl.

Example of usage:
. structure_paragraphs_by_linefeed.sh Gospel-of-Luke-enets http://www.ling.helsinki.fi/uhlcs/metadata/corpus-metadata/uralic-lgs/enets/Enets-Gospel-of-Luke.imdi Gospel-of-Luke-enets-tagged
 

1.2.

structure_paragraphs_by_emptyline.sh

Does the same as structure_paragraphs_by_linefeed.sh, except paragraphs are divided by empty lines.
Uses perl programs footnotes_references_biblerefs.pl, titles.pl and headers_paragraph_emptyline.pl.

Example of usage:
. structure_paragraphs_by_emptyline.sh Stories-of-God-evenki http://www.ling.helsinki.fi/uhlcs/metadata/corpus-metadata/tungusic-lgs/evenki/Evenki-Stories-of-God.imdi Stories-of-God-evenki-tagged
 

1.3.

dictionary.sh

Tags a dictionary (word list with translations) file with information about the structure.
Uses perl-program dictionary.pl.

Example of usage:
. dictionary.sh
 

2.

Perl programs (called from scripts):

 
2.1. footnotes_references_biblerefs.pl
2.2. titles.pl
2.3.1. headers_paragraph_linefeed.pl
2.3.2. headers_paragraph_emptyline.pl
2.4. titles_utf8.pl
2.5. dictionary.pl
 

2.1.

footnotes_references_biblerefs.pl

Finds and tags
- lines that start with an asterisk "*": tagged with <FOOTNOTE> and </FOOTNOTE>,
- lines that end with two numbers separated by a hyphen, e.g. 23-18: tagged with <BIBLE_REF> and </BIBLE_REF>,
- lines in which non-whitespace characters are separated by a hyphen with a space on each side (" - "): tagged with <REF> and </REF>.

Skips everything inside <PUBLICATION_INFO> and </PUBLICATION_INFO> tags.
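
The three rules above can be sketched as follows. This is a Python illustration, not the Perl source of footnotes_references_biblerefs.pl. The function names, the exact regular expressions, the choice to wrap the whole line in the tag, and the order in which the rules are tried are all assumptions made for the example.

```python
import re

# Assumed patterns, derived from the prose description only.
FOOTNOTE = re.compile(r"^\*")           # line starts with an asterisk
BIBLE_REF = re.compile(r"\d+-\d+\s*$")  # line ends with e.g. 23-18
REF = re.compile(r"\S - \S")            # non-whitespace, " - ", non-whitespace


def tag_line(text):
    """Wrap one line in the first matching tag (rule order is an assumption)."""
    if FOOTNOTE.search(text):
        return f"<FOOTNOTE>{text}</FOOTNOTE>"
    if BIBLE_REF.search(text):
        return f"<BIBLE_REF>{text}</BIBLE_REF>"
    if REF.search(text):
        return f"<REF>{text}</REF>"
    return text


def tag_lines(lines):
    """Tag every line except those inside a PUBLICATION_INFO block."""
    out = []
    inside_pub = False
    for line in lines:
        text = line.rstrip("\n")
        if "<PUBLICATION_INFO>" in text:
            inside_pub = True
        out.append(text if inside_pub else tag_line(text))
        if "</PUBLICATION_INFO>" in text:
            inside_pub = False
    return out
```

Calling tag_lines on a list of text lines returns a new list in which footnote, Bible-reference and reference lines are wrapped, while everything between the PUBLICATION_INFO tags is passed through unchanged.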
 

2.2.

titles.pl

Marks lines with <TITLE> tags if they are preceded by at least one empty line and followed by exactly one empty line.
First text line is tagged <BOOK_TITLE> and if it is followed by another text line, that one is tagged <BOOK_TITLE_2>.

Skips everything inside <PUBLICATION_INFO> and <REF> tags.
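
A rough Python sketch of the title heuristic, written only from the description above (the real program is Perl and may differ). The function name tag_titles, the treatment of the end of the text as "not an empty line", and the way skipped lines are detected are assumptions.

```python
def tag_titles(lines):
    """Tag book titles and section titles based on surrounding empty lines."""
    lines = [line.rstrip("\n") for line in lines]
    n = len(lines)

    def blank(i):
        # Positions outside the text count as not blank (an assumption).
        return 0 <= i < n and lines[i].strip() == ""

    def text(i):
        return 0 <= i < n and lines[i].strip() != ""

    # Lines to leave alone: PUBLICATION_INFO blocks and <REF> lines.
    skip = [False] * n
    inside_pub = False
    for i, line in enumerate(lines):
        if "<PUBLICATION_INFO>" in line:
            inside_pub = True
        skip[i] = inside_pub or line.lstrip().startswith("<REF>")
        if "</PUBLICATION_INFO>" in line:
            inside_pub = False

    out = list(lines)
    first = next((i for i in range(n) if text(i) and not skip[i]), None)
    if first is None:
        return out

    # The first text line is the book title; a directly following text
    # line becomes the second book title.
    out[first] = f"<BOOK_TITLE>{lines[first]}</BOOK_TITLE>"
    body_start = first + 1
    if text(first + 1) and not skip[first + 1]:
        out[first + 1] = f"<BOOK_TITLE_2>{lines[first + 1]}</BOOK_TITLE_2>"
        body_start = first + 2

    for i in range(body_start, n):
        if not text(i) or skip[i]:
            continue
        # Preceded by at least one empty line, followed by exactly one.
        if blank(i - 1) and blank(i + 1) and not blank(i + 2):
            out[i] = f"<TITLE>{lines[i]}</TITLE>"
    return out
```

Note that under this rule a one-line paragraph surrounded by single empty lines is also tagged as a title, which follows from the heuristic as described.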
 

2.3.1.

headers_paragraph_linefeed.pl

Marks paragraphs separated by linefeed with <P> tags.
Adds XML headers and a reference to the metadata file. The metadata file name is given as the last parameter on the command line (after the files to be processed). The XML headers include a reference to the DTD (bible_text.dtd).
Adds <CORPUS> and </CORPUS> tags to the beginning and end of the text. The metadata reference is given as an attribute of <CORPUS>.

Skips anything already tagged.
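
To illustrate the shape of the output, here is a minimal Python sketch of the linefeed variant. It is not the Perl program itself. The exact XML declaration, the DOCTYPE line, the attribute name "metadata" and the function name wrap_corpus are hypothetical; the documentation only states that a DTD reference and a metadata attribute are added.

```python
from xml.sax.saxutils import quoteattr


def wrap_corpus(lines, metadata_url):
    """Wrap each untagged line in <P> and the whole text in <CORPUS>."""
    out = [
        '<?xml version="1.0"?>',                      # assumed header form
        '<!DOCTYPE CORPUS SYSTEM "bible_text.dtd">',  # assumed DOCTYPE form
        # "metadata" is a hypothetical attribute name.
        f"<CORPUS metadata={quoteattr(metadata_url)}>",
    ]
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip() or text.lstrip().startswith("<"):
            # Empty lines and lines that are already tagged pass through.
            out.append(text)
        else:
            # In the linefeed variant every remaining line is a paragraph.
            out.append(f"<P>{text}</P>")
    out.append("</CORPUS>")
    return out
```

Because already tagged lines are passed through, this step is presumably meant to run after the footnote, reference and title tagging.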
 

2.3.2.

headers_paragraph_emptyline.pl

Marks paragraphs separated by empty lines with <P> tags.
Adds XML headers and a reference to the metadata file. The metadata file name is given as the last parameter on the command line (after the files to be processed). The XML headers include a reference to the DTD (bible_text.dtd).
Adds <CORPUS> and </CORPUS> tags to the beginning and end of the text. The metadata reference is given as an attribute of <CORPUS>.

Skips anything already tagged.
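
The only difference from the linefeed variant is how paragraph boundaries are found. A Python sketch of that grouping step, assuming that consecutive untagged lines are joined into one <P> element and that an already tagged line also ends the current paragraph (the function name is hypothetical and the Perl program may behave differently in detail):

```python
def paragraphs_by_empty_line(lines):
    """Group consecutive untagged lines into <P> blocks split by empty lines."""
    out = []
    block = []

    def flush():
        # Emit the collected lines as one paragraph, if there are any.
        if block:
            out.append("<P>" + "\n".join(block) + "</P>")
            block.clear()

    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            flush()               # an empty line closes the paragraph
        elif text.lstrip().startswith("<"):
            flush()               # already tagged: pass through unchanged
            out.append(text)
        else:
            block.append(text)
    flush()
    return out
```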
 

2.4.

titles_utf8.pl

This script uses a function from utf8-package and can only be used with text that has nothing but valid utf-8 encoded characters.

The script will mark any lines that only contain capital letters with <TITLE> and </TITLE> tags.
The interpretation is made based on utf8-function IsUpper.

Skips everything inside <PUBLICATION_INFO> tags.

If the text contains any malformed UTF-8 characters, this script will not work; use titles.pl instead.
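
The same idea can be sketched in Python, where str.isupper plays the role of the IsUpper test. This is an illustration under assumptions, not the Perl code: str.isupper ignores spaces, digits and punctuation, which may be looser than the original check, and the function name is hypothetical. Strict decoding shows why malformed input is a problem.

```python
def tag_uppercase_titles(raw):
    """Tag all-uppercase lines as titles; raw must be valid UTF-8 bytes."""
    # Strict decoding raises UnicodeDecodeError on malformed UTF-8.
    text = raw.decode("utf-8")
    out = []
    inside_pub = False
    for line in text.splitlines():
        if "<PUBLICATION_INFO>" in line:
            inside_pub = True
        # isupper() is True when every cased character is uppercase
        # and the line has at least one cased character.
        if not inside_pub and line.isupper():
            out.append(f"<TITLE>{line}</TITLE>")
        else:
            out.append(line)
        if "</PUBLICATION_INFO>" in line:
            inside_pub = False
    return out
```

A caller could catch UnicodeDecodeError and fall back to the empty-line heuristic of titles.pl for that file.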
 

2.5.

dictionary.pl

Reads a dictionary file three lines at a time. Marks the three-line sections with <WORD> tags. The first line within a section is marked with <ORIGINAL>, the second line with <ENGLISH> and the third line with <FINNISH> tags.
Adds XML headers to the beginning of the file, including a reference to the DTD (dictionary.dtd).
All of the dictionary text is tagged with <DICTIONARY>.
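
A short Python sketch of the three-lines-at-a-time structure described above. The header lines, the escaping of special characters, the handling of a trailing incomplete group and the function name are assumptions; only the tag names and the line order come from the description.

```python
from xml.sax.saxutils import escape


def tag_dictionary(lines):
    """Tag a word list laid out as original / English / Finnish triples."""
    entries = [line.rstrip("\n") for line in lines]
    out = [
        '<?xml version="1.0"?>',                           # assumed header form
        '<!DOCTYPE DICTIONARY SYSTEM "dictionary.dtd">',   # assumed DOCTYPE form
        "<DICTIONARY>",
    ]
    # Step through complete groups of three lines; a trailing incomplete
    # group is ignored here (an assumption).
    for i in range(0, len(entries) - 2, 3):
        original, english, finnish = entries[i:i + 3]
        out += [
            "<WORD>",
            # escape() protects &, < and >; whether the Perl program does
            # this is not stated in the documentation.
            f"<ORIGINAL>{escape(original)}</ORIGINAL>",
            f"<ENGLISH>{escape(english)}</ENGLISH>",
            f"<FINNISH>{escape(finnish)}</FINNISH>",
            "</WORD>",
        ]
    out.append("</DICTIONARY>")
    return out
```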
 

3.

XML-definitions

 
3.1. bible_text.dtd
3.2. dictionary.dtd
 

The structure of the XML files created by the scripts is defined in DTDs.
The scripts structure_paragraphs_by_emptyline.sh and structure_paragraphs_by_linefeed.sh create XML files that follow the Document Type Definition bible_text.dtd, and the script dictionary.sh creates XML files that follow dictionary.dtd.

The DTDs are stored in the same directory as the scripts, currently:
http://www.ling.helsinki.fi/uhlcs/metadata/data-structure/
 

3.1.

bible_text.dtd

 
Defines the structure of XML-files created by structure-scripts.
File bible_text.dtd
Reference to DTD is automatically added to XML-files.
 

3.2.

dictionary.dtd

Defines the structure of XML-files created by dictionary-script.
File dictionary.dtd
Reference to DTD is automatically added to XML-files.
 

July 2003     eeahonen@ling.helsinki.fi    (Documentation version 2003-07-09)