Kimmo Koskenniemi 24.1.2012
Let us suppose that we have a plain text file taken from the Gutenberg repository and we would like to produce a list of all distinct word forms in the file together with the frequency of each word form in this text. The file comes as it is, and as we produce the list and the frequencies, we have to explore and examine the data in order to get the results we want.
There may be other tools for this particular task, but here we are interested in using some simple Unix/Linux command line tools. The idea behind this approach is that nontrivial tasks can often be reduced to a sequence of simple-looking elementary tasks using a so-called pipeline.
It is advisable to use Unicode throughout as the default character encoding because it is the only universal character encoding that can handle all presently written languages, and it is widely used and supported by modern data interchange, computing systems and software. Use the command:
iconv -f LATIN1 -t UTF8
to convert the character encoding into UTF-8 if the data is in LATIN1 or some other older encoding.
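As a quick illustration (the sample word below is just a demo, not taken from the novel), converting a couple of Latin-1 bytes to UTF-8 looks like this:

```shell
# 'päivä' with ä as the single Latin-1 byte 0344 (octal);
# iconv re-encodes the stream as UTF-8
printf 'p\344iv\344' | iconv -f LATIN1 -t UTF-8
```

With a whole file one would write e.g. iconv -f LATIN1 -t UTF-8 < input.txt > output.txt (the file names here are hypothetical).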
We use as an example the novel Helsinkiin, written by Juhani Aho in 1920. It is in the public domain because the author died in 1921, i.e. more than 70 years ago. The text is available from the Gutenberg repository: http://www.gutenberg.org/ebooks/13580
The text is already in UTF-8 encoding, but we need to exclude some added description and license text from the beginning and from the end of the file. That can be conveniently done with e.g. the Emacs editor. Suppose that you have fetched the file, trimmed it and named the resulting text file helsinkiin.txt.
We start with a pipeline of two programs, cat and less. The former just reads a file and passes it on unchanged. The latter shows the incoming lines one screenful at a time, i.e. stops when the screen is full:
$ cat helsinkiin.txt | less
The arrow keys and the PageUp and PageDown keys let you move back and forth. Space advances one full page, b goes back one page. In order to go to the end of the file, type > and to go to the beginning, type <. If you want to quit, type q.
If you are not an experienced user of Less, you can ask for help by typing h. Don't be confused by the length of the instructions; just pick out the item you are looking for.
Now you can glance through and browse the file. Looking at the first two pages, you notice that empty or blank lines separate paragraphs and the text looks mostly quite normal. Some lines start with a double hyphen --, an ellipsis is represented as ... and somewhat unusual quotation marks are used, e.g. »Elias Lönnrot». Double hyphens also occur in the middle of some paragraphs, e.g. loppunut.--Tottahan.
Having this overall view, we make a plan for how to tokenize the text into words by (1) eliminating unnecessary punctuation, (2) splitting the lines so that each word stands on a line of its own, and (3) sorting and counting these lines.
The program tr is capable of translating characters into other characters or deleting given characters from the input. In order to avoid incorrect splitting, we may choose to replace punctuation characters with spaces. But before we do so, we need to have a closer look at the data. In order to see how the periods are used, let us use the Less program and its regular expression facility, and start Less as above.
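The two modes of tr can be seen on a small made-up line (not from the novel):

```shell
# Translate: each period and comma becomes a space
echo 'word1,word2.word3' | tr '.,' '  '
# Delete: with -d the listed characters are removed entirely
echo 'word1,word2.word3' | tr -d '.,'
```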
For searching, there is the slash '/' command, which accepts a regular expression. Sentences end in a period followed by a space or a newline, and ellipses consist of three periods in a row. Other kinds of periods are suspect. A Less command such as the following locates the next line with such a period, i.e. a period followed by something other than a space or another period:

/\.[^ .]
By typing n you may move to the following instance satisfying the condition. Actually, you may see them all at once by typing an ampersand & instead of the slash, i.e.:

&\.[^ .]
We notice that there are many instances of .» and some of .-- which pose no problems. In addition, we find j.n.e and n.k. which are abbreviations of three or two words. Thus, we may replace all periods with spaces, and after a similar screening, some other punctuation characters as well:
cat helsinkiin.txt | tr '.,;?!()»' ' ' | less
We have a closer look at the colon : and the apostrophe ' using similar tests in Less, and find that colons occur only at the end of words and that an apostrophe (or single quote) is always part of the word, where it signals unpronounced letters. Thus, we delete the colon as well but leave the apostrophe untouched.
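A small sketch of this decision (the first word is made up for the demo, the second is the one noticed in the text): tr -d removes every colon while the apostrophe passes through as part of the word:

```shell
# The colon is deleted; the apostrophe stays untouched
echo "sanoi: vis-à-vis'ksi" | tr -d ':'
```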
The double hyphen -- needs another tool such as Sed, which is capable of replacing units longer than just one character, thus:

sed 's/--//g'
Sed is a stream editor, and s/xxx/yyy/g is the command for substituting whatever matches the part between the first and the second slash with the part between the second and third slash. The xxx may be a regular expression. The g states that all occurrences, not just the first, must be replaced. Now we have:
cat helsinkiin.txt | tr '.,;:?!()»' ' ' | \
  sed 's/--//g' | less
Here we have divided the long pipeline onto two lines with a backslash.
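On its own, the sed step can be tried on the problem spot noticed earlier (fed in with echo just for the demo):

```shell
# After tr has turned the period into a space, sed deletes the double hyphen
echo 'loppunut.--Tottahan' | tr '.' ' ' | sed 's/--//g'
```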
We decide that capital and lower case letters are to be treated identically, so we translate the uppercase letters into their lower case counterparts:
cat helsinkiin.txt | tr '.,;:?!()»' ' ' | \
  sed 's/--//g' | tr 'A-ZÅÄÖ' 'a-zåäö' | less
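The case folding step can be checked on a made-up line (a side note, not from the original: GNU tr works byte by byte, but for these particular UTF-8 letters the byte-wise mapping happens to produce the correct lower case forms):

```shell
# A-Z and the letters ÅÄÖ are mapped to their lower case counterparts
echo 'Helsinkiin PÄIN' | tr 'A-ZÅÄÖ' 'a-zåäö'
```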
At each step we check closely with Less what the material now looks like. We have been selective in deleting unnecessary characters, but do we know what letters and characters there actually are? We use the regular expressions of Less to find out, e.g. with a search that shows all lines containing something other than the expected letters, spaces, apostrophes and hyphens:

&[^a-zåäö' -]
We notice that the text contains underscores _ which have to be removed. Hyphens are found, but they may remain as they are, i.e. as parts of the word tokens. We found one letter with a diacritic: vis-à-vis'ksi, but that needs no special treatment either.
Now we have lines where the words have been normalized, but there may be several words on a line, and the words are separated from each other by one or more spaces. The tool tr may translate spaces ' ' into newline characters '\n', and its -s option squeezes a run of repeated characters into a single one. We add another step to our pipeline:
cat helsinkiin.txt | tr '.,;:?!()»_' ' ' | \
  sed 's/--//g' | tr 'A-ZÅÄÖ' 'a-zåäö' | \
  tr -s ' ' '\n' | less
Now we have a stream of short lines, each containing a single word form.
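The effect of -s can be seen on a made-up line: two adjacent spaces still yield exactly one line break, not an empty line in between:

```shell
# The double space between 'yksi' and 'kaksi' becomes a single newline
echo 'yksi  kaksi kolme' | tr -s ' ' '\n'
```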
We do not have a single program to do both the sorting and the counting. Instead, sort sorts and uniq -c counts, so we add them to our chain:
cat helsinkiin.txt | tr '.,;:?!()»_' ' ' | \
  sed 's/--//g' | tr 'A-ZÅÄÖ' 'a-zåäö' | \
  tr -s ' ' '\n' | sort | uniq -c | less
Now we have the list of the 7320 distinct word forms in our text (which we can see if we go to the end of the output with > in Less). Each word form is preceded by its frequency.
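The counting step itself can be seen on a tiny made-up input: sort brings identical lines next to each other, and uniq -c collapses each run into one line prefixed by its count:

```shell
# 'talo' occurs twice, 'ja' once
printf 'talo\nja\ntalo\n' | sort | uniq -c
```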
By replacing the less with a redirection of the output into a file, we can store the result if we really need it. Then we would perhaps like to remove the first two lines. The first one, an empty token, we could have removed by adding an extra step that eliminates empty lines. The occurrence of an apostrophe as a token appears to arise from a typo in the original text file.
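A sketch of both fixes, with a demo input standing in for the real stream and a hypothetical output file name: grep -v '^$' drops empty lines before the counting, and the final less is replaced by a redirection such as > wordlist.txt:

```shell
# grep -v '^$' removes empty lines; in the real pipeline this step
# would go between tr -s and sort
printf 'yksi\n\nkaksi\nyksi\n' | grep -v '^$' | sort | uniq -c
```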