| The Scripts | ||||
| Demonstrations: | ||||
| ||||
| (To view the Unicode corpus properly, you may need to set your browser's character encoding to Unicode (UTF-8). If your browser does not display the XML file neatly, save the file to your computer and open it in a text editor.) | ||||
Documentation | ||||
| 1. | Scripts | |||
| 2. | Perl programs (called from scripts) | |||
| 3. | XML definitions | |||
1. | Scripts: | |||
| 1.1. | structure_paragraphs_by_linefeed.sh | |||
| 1.2. | structure_paragraphs_by_emptyline.sh | |||
| 1.3. | dictionary.sh | |||
1.1. | structure_paragraphs_by_linefeed.sh | |||
|
The script takes three parameters:
- Name of the corpus to be analyzed
- Address of the metadata description of the corpus
- Name of the file to be created
The script adds XML headers (including a reference to the given metadata address), then analyzes the text and tags it with information on titles, paragraphs, Bible references, other references and footnotes. Paragraphs are divided by linefeeds. The script uses the Perl programs footnotes_references_biblerefs.pl, titles.pl and headers_paragraph_linefeed.pl.
Example of usage:
. structure_paragraphs_by_linefeed.sh Gospel-of-Luke-enets http://www.ling.helsinki.fi/uhlcs/metadata/corpus-metadata/uralic-lgs/enets/Enets-Gospel-of-Luke.imdi Gospel-of-Luke-enets-tagged | ||||
1.2. | structure_paragraphs_by_emptyline.sh | |||
|
Does the same as structure_paragraphs_by_linefeed.sh, except that paragraphs are divided by empty lines. Uses the Perl programs footnotes_references_biblerefs.pl, titles.pl and headers_paragraph_emptyline.pl.
Example of usage:
. structure_paragraphs_by_emptyline.sh Stories-of-God-evenki http://www.ling.helsinki.fi/uhlcs/metadata/corpus-metadata/tungusic-lgs/evenki/Evenki-Stories-of-God.imdi Stories-of-God-evenki-tagged | ||||
1.3. | dictionary.sh | |||
|
Tags a dictionary file (a word list with translations) with information about its structure. Uses the Perl program dictionary.pl.
Example of usage:
. dictionary.sh | ||||
2. | Perl programs (called from scripts): | |||
| 2.1. | footnotes_references_biblerefs.pl | |||
| 2.2. | titles.pl | |||
| 2.3.1. | headers_paragraph_linefeed.pl | |||
| 2.3.2. | headers_paragraph_emptyline.pl | |||
| 2.4. | titles_utf8.pl | |||
| 2.5. | dictionary.pl | |||
2.1. | footnotes_references_biblerefs.pl | |||
|
Finds and tags:
- lines that start with an asterisk "*", with <FOOTNOTE> and </FOOTNOTE>;
- lines that end with two numbers separated by a hyphen, e.g. 23-18, with <BIBLE_REF> and </BIBLE_REF>;
- lines that contain non-whitespace characters separated by a hyphen with a space before and after it (" - "), with <REF> and </REF>.
Skips everything inside <PUBLICATION_INFO> and </PUBLICATION_INFO> tags. | ||||
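The three rules above can be illustrated with a short Python sketch. This is not the original Perl program: the regular expressions, the order in which the rules are tried, and the choice to wrap the whole line in the tag are all assumptions made for illustration.

```python
import re

# Assumed patterns; the real Perl expressions may differ.
BIBLE_REF = re.compile(r"\d+-\d+\s*$")  # line ends with e.g. "23-18"
REF = re.compile(r"\S - \S")            # non-whitespace, " - ", non-whitespace


def tag_references(lines):
    """Return a tagged copy of lines; whole lines are wrapped (an assumption)."""
    tagged = []
    in_publication_info = False
    for line in lines:
        if "<PUBLICATION_INFO>" in line:
            in_publication_info = True
        if in_publication_info:
            tagged.append(line)  # left untouched inside PUBLICATION_INFO
        elif line.startswith("*"):
            tagged.append(f"<FOOTNOTE>{line}</FOOTNOTE>")
        elif BIBLE_REF.search(line):
            tagged.append(f"<BIBLE_REF>{line}</BIBLE_REF>")
        elif REF.search(line):
            tagged.append(f"<REF>{line}</REF>")
        else:
            tagged.append(line)
        if "</PUBLICATION_INFO>" in line:
            in_publication_info = False
    return tagged
```

Trying the footnote rule first means that a footnote line which also ends in a number pair is tagged only as a footnote; whether the Perl program behaves the same way is not stated in this documentation.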
2.2. | titles.pl | |||
|
Marks lines with <TITLE> tags if they are preceded by at least one empty line and followed by exactly one empty line. The first text line is tagged <BOOK_TITLE>; if it is immediately followed by another text line, that line is tagged <BOOK_TITLE_2>. Skips everything inside <PUBLICATION_INFO> and <REF> tags. | ||||
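A minimal Python sketch of the title rule, for illustration only (the original program is written in Perl). It assumes that the end of the file does not count as an empty line, and it leaves out the skipping of <PUBLICATION_INFO> and <REF> content to stay short. The function name tag_titles is hypothetical.

```python
def tag_titles(lines):
    """Return a copy of lines with book titles and section titles tagged."""
    def is_blank(i):
        # Positions outside the text count as "not blank" (an assumption).
        return 0 <= i < len(lines) and lines[i].strip() == ""

    tagged = list(lines)
    first = next((i for i, line in enumerate(lines) if line.strip()), None)
    if first is None:
        return tagged  # nothing but empty lines

    tagged[first] = f"<BOOK_TITLE>{lines[first]}</BOOK_TITLE>"
    start = first + 1
    if start < len(lines) and lines[start].strip():
        tagged[start] = f"<BOOK_TITLE_2>{lines[start]}</BOOK_TITLE_2>"
        start += 1

    for i in range(start, len(lines)):
        # Title: text line, empty line before, exactly one empty line after.
        if (lines[i].strip() and is_blank(i - 1)
                and is_blank(i + 1) and not is_blank(i + 2)):
            tagged[i] = f"<TITLE>{lines[i]}</TITLE>"
    return tagged
```

In this sketch a line followed by two or more empty lines is not tagged, which is one reading of "exactly one empty line".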
2.3.1. | headers_paragraph_linefeed.pl | |||
|
Marks paragraphs separated by linefeeds with <P> tags. Adds XML headers and a reference to the metadata file. The metadata file name is given as the last parameter on the command line (after the files to be processed). The XML headers include a reference to the DTD (bible_text.dtd). Adds <CORPUS> and </CORPUS> tags to the beginning and end of the text. The metadata reference is given as an attribute of <CORPUS>. Skips anything already tagged. | ||||
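The output described above might look roughly like what the following Python sketch produces. Several details are assumptions, since this documentation does not spell them out: the exact XML declaration and DOCTYPE line, the attribute name metadata on <CORPUS>, and the test "line starts with <" for text that is already tagged.

```python
from xml.sax.saxutils import quoteattr


def wrap_corpus_by_linefeed(lines, metadata_url):
    """Wrap each untagged text line in <P> and add headers and <CORPUS>."""
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',     # assumed header
        '<!DOCTYPE CORPUS SYSTEM "bible_text.dtd">',  # assumed DOCTYPE form
        # "metadata" is a hypothetical attribute name.
        f"<CORPUS metadata={quoteattr(metadata_url)}>",
    ]
    for line in lines:
        text = line.strip()
        if not text:
            continue              # empty lines carry no paragraph
        if text.startswith("<"):
            out.append(text)      # already tagged, leave as is
        else:
            out.append(f"<P>{text}</P>")
    out.append("</CORPUS>")
    return out
```

Here every non-empty line becomes its own paragraph, which is what "paragraphs separated by linefeeds" amounts to.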
2.3.2. | headers_paragraph_emptyline.pl | |||
|
Marks paragraphs separated by empty lines with <P> tags. Adds XML headers and a reference to the metadata file. The metadata file name is given as the last parameter on the command line (after the files to be processed). The XML headers include a reference to the DTD (bible_text.dtd). Adds <CORPUS> and </CORPUS> tags to the beginning and end of the text. The metadata reference is given as an attribute of <CORPUS>. Skips anything already tagged. | ||||
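The empty-line variant differs only in how lines are grouped into paragraphs. The Python sketch below shows that grouping step alone, leaving out the headers and <CORPUS> tags. It assumes that consecutive text lines are joined with a newline inside one <P> element and that an already tagged line (one starting with "<") also ends the current paragraph. The function name is hypothetical.

```python
def paragraphs_by_empty_line(lines):
    """Group consecutive text lines into <P> elements split at empty lines."""
    out = []
    current = []  # text lines of the paragraph being collected

    def flush():
        if current:
            out.append("<P>" + "\n".join(current) + "</P>")
            current.clear()

    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            flush()               # an empty line closes the paragraph
        elif text.lstrip().startswith("<"):
            flush()
            out.append(text)      # already tagged, leave as is
        else:
            current.append(text)
    flush()                       # close a paragraph that runs to the end
    return out
```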
2.4. | titles_utf8.pl | |||
|
This script uses a function from the utf8 package and can only be used with text that contains nothing but valid UTF-8 encoded characters. It marks any line that contains only capital letters with <TITLE> and </TITLE> tags; whether a character is a capital letter is decided by the utf8 function IsUpper. Skips everything inside <PUBLICATION_INFO> tags. If the text contains any malformed UTF-8 characters, this script will not work; use titles.pl instead. | ||||
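As an illustration, the same idea can be sketched in Python, with str.isupper standing in for Perl's IsUpper. The sketch assumes that whitespace between capital letters is allowed and that digits or punctuation disqualify a line; it also leaves out the <PUBLICATION_INFO> skipping. The helper is_valid_utf8 is a hypothetical pre-check for choosing between this approach and titles.pl.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_all_caps(line):
    """True if the line has at least one character and every
    non-whitespace character is an uppercase letter."""
    chars = [ch for ch in line if not ch.isspace()]
    return bool(chars) and all(ch.isupper() for ch in chars)


def tag_uppercase_titles(lines):
    """Wrap all-capitals lines in <TITLE> tags; leave other lines as is."""
    return [f"<TITLE>{line}</TITLE>" if is_all_caps(line) else line
            for line in lines]
```

Because the check is Unicode-aware, it works for non-Latin scripts such as Cyrillic as well as for Latin text.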
2.5. | dictionary.pl | |||
|
Reads a dictionary file three lines at a time. Marks the three-line sections
with <WORD> tags. The first line within a section is marked with <ORIGINAL>,
the second line with <ENGLISH> and the third line with <FINNISH> tags. Adds XML headers to the beginning of the file, including a reference to the DTD (dictionary.dtd). All of the dictionary text is tagged with <DICTIONARY>. | ||||
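The three-line grouping can be sketched in Python as follows. This is an illustration rather than the original Perl: the header lines are assumed forms, special characters are escaped as a precaution, and an incomplete group at the end of the file is silently dropped, which the real program may handle differently.

```python
from xml.sax.saxutils import escape


def tag_dictionary(lines):
    """Tag a word list read three lines at a time: original, English, Finnish."""
    entries = [line.strip() for line in lines]
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',         # assumed header
        '<!DOCTYPE DICTIONARY SYSTEM "dictionary.dtd">',  # assumed DOCTYPE form
        "<DICTIONARY>",
    ]
    # Step through complete groups of three; a trailing partial group is ignored.
    for i in range(0, len(entries) - 2, 3):
        original, english, finnish = entries[i:i + 3]
        out += [
            "<WORD>",
            f"<ORIGINAL>{escape(original)}</ORIGINAL>",
            f"<ENGLISH>{escape(english)}</ENGLISH>",
            f"<FINNISH>{escape(finnish)}</FINNISH>",
            "</WORD>",
        ]
    out.append("</DICTIONARY>")
    return out
```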
3. | XML-definitions | |||
| 3.1. | bible_text.dtd | |||
| 3.2. | dictionary.dtd | |||
3. | XML-definitions | |||
|
The structure of the XML files created by the scripts is defined in DTDs. The scripts structure_paragraphs_by_emptyline.sh and structure_paragraphs_by_linefeed.sh create XML files that follow the Document Type Definition bible_text.dtd, and the script dictionary.sh creates XML files defined in dictionary.dtd. The DTDs are stored in the same directory as the scripts, at the moment http://www.ling.helsinki.fi/uhlcs/metadata/data-structure/ | ||||
3.1. | bible_text.dtd | |||
|
Defines the structure of the XML files created by the structure scripts. File: bible_text.dtd. A reference to the DTD is automatically added to the XML files. | ||||
3.2. | dictionary.dtd | |||
|
Defines the structure of the XML files created by the dictionary script. File: dictionary.dtd. A reference to the DTD is automatically added to the XML files. | ||||
|
July 2003 eeahonen@ling.helsinki.fi
(Documentation version 2003-07-09) | ||||