University of Helsinki
CLT260: Morphological language processing programs - academic year 2011-2012

Kimmo Koskenniemi 24.1.2012

Counting word frequencies using shell commands

Let us suppose that we have a plain text file taken from the Gutenberg repository and we would like to produce a list of all distinct word forms in the file together with the frequency of each word form in the text. The text file comes as it is, so as we produce the list and the frequencies, we have to explore and examine the data in order to get the results we want.

There may be other tools for this particular task, but here we are interested in using some simple Unix/Linux command line tools. The idea behind this approach is that nontrivial tasks can often be reduced to a sequence of simple-looking elementary tasks using a so-called pipeline.

Encoding

It is advisable to use Unicode throughout, in its UTF-8 encoding, because Unicode is the only universal character coding standard that can handle all presently written languages, and it is widely used and supported by modern data interchange, computing systems and software. Use the command:

iconv -f LATIN1 -t UTF8

to convert the data to UTF-8 if it is in LATIN1 (or, with a different -f argument, in some other older encoding).
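
As a minimal sketch of the conversion, the following wraps the command in a small shell function and applies it to a short Latin-1 sample. The function name to_utf8 and the sample bytes are made up for illustration; ISO-8859-1 is the standard name of the encoding that iconv also accepts as LATIN1.

```shell
# Hypothetical helper: convert Latin-1 (ISO-8859-1) input on stdin to UTF-8.
to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# \344 is the octal code of the single Latin-1 byte for the letter a-umlaut.
# od prints the resulting bytes in hexadecimal so that they can be inspected.
printf 'p\344iv\344\n' | to_utf8 | od -An -tx1
```

In the output each a-umlaut appears as the two bytes c3 a4, which is how UTF-8 represents that letter.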

Text file

As an example we use the novel Helsinkiin, written by Juhani Aho in 1920. It is in the public domain because the author died in 1921, i.e. more than 70 years ago. The text is available from the Gutenberg repository: http://www.gutenberg.org/ebooks/13580

The text is already in UTF-8 encoding, but we need to exclude some added description and license text from the beginning and from the end of the file. That can be done conveniently with e.g. the Emacs editor. Suppose that you have fetched the file, trimmed it and named the resulting text file helsinkiin.txt.
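
The trimming can also be scripted. The sketch below assumes that the downloaded file marks the body of the book with lines beginning "*** START OF" and "*** END OF", as Project Gutenberg files commonly do; the exact wording of these marker lines varies, so check the actual file first. The function name strip_gutenberg and the sample text are illustrative only.

```shell
# Hypothetical helper: print only the lines strictly between the
# "*** START OF ..." and "*** END OF ..." marker lines (assumed format).
strip_gutenberg() {
    awk '
        /^[*][*][*] ?END OF/   { inside = 0 }   # stop before the end marker
        inside                 { print }        # copy the body lines
        /^[*][*][*] ?START OF/ { inside = 1 }   # start after the start marker
    ' "$@"
}

# Small made-up sample standing in for the downloaded file.
printf '%s\n' \
    'License header' \
    '*** START OF THIS PROJECT GUTENBERG EBOOK HELSINKIIN ***' \
    'HELSINKIIN' \
    'Kirj.' \
    '*** END OF THIS PROJECT GUTENBERG EBOOK HELSINKIIN ***' \
    'License footer' | strip_gutenberg
```

With a real download the function would be called with the file name as its argument and the output redirected to helsinkiin.txt. Any front matter that lies between the markers still has to be removed by hand.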

Looking at the file

We start with a pipeline of two programs, cat and less. The former just reads a file and passes it on unchanged. The latter shows the incoming lines one screenful at a time, i.e. it stops when the screen is full:

$ cat helsinkiin.txt | less

The arrow keys and the PageUp and PageDown keys let you move back and forth. Space advances one full page, b goes backward one page. In order to go to the end of the file, type > and to go to the beginning, type <. If you want to quit, type q.

If you are not an experienced user of Less, you can ask for help by typing h. Don't be put off by the length of the instructions; just pick out the item you are looking for.

Now you can browse the file. Looking at the first two pages, you notice that empty or blank lines separate paragraphs and that the text looks mostly quite normal. Some lines start with a double hyphen --, an ellipsis is represented as ... and somewhat unusual quotation marks are used, e.g. »Elias Lönnrot». Double hyphens also occur in the middle of some paragraphs, e.g. loppunut.--Tottahan.

With this overall view we make a plan for tokenizing the text into words by (1) eliminating unnecessary punctuation, (2) splitting lines so that each word stands on a line of its own and (3) sorting and counting such lines.

Eliminating punctuation

The program tr is capable of translating characters to other characters or deleting given characters from the input lines. In order to avoid incorrect splitting we may choose to replace punctuation characters with spaces. But before we do that, we may need to have a closer look at the data. In order to see how the periods are used, let us use the Less program and its regular expression facility, and start Less as above.

For searching, there is the slash / command, which accepts a regular expression. Sentences end in a period followed by a space or a newline, and ellipses consist of three periods in a row. Other kinds of periods are suspect. The following Less command locates the next line containing such a period:

/\.[^ .]

Typing n moves you to the next instance satisfying the condition. You can also get all matching lines at once by typing an ampersand & instead of the slash, i.e.:

&\.[^ .]
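
The same search can also be run non-interactively with grep, which accepts the same kind of regular expression. This is a sketch rather than part of the workflow described here: the function name suspect_periods and the sample lines are made up, and the -o option (print only the matching part) is an extension found in GNU and BSD grep.

```shell
# Hypothetical helper: list, with line numbers, the lines where a period
# is followed by something other than a space or another period.
suspect_periods() {
    grep -n '\.[^ .]' "$@"
}

# Made-up sample: only the second line contains suspect periods.
sample='Tuli ilta. Ja niin...
loppunut.--Tottahan j.n.e. ja n.k. on totta.'

printf '%s\n' "$sample" | suspect_periods

# Tally which characters follow the suspect periods.
printf '%s\n' "$sample" | grep -o '\.[^ .]' | sort | uniq -c
```

A period at the very end of a line is not reported, because the bracket expression requires some character to follow it.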

We notice that there are many instances of and some of .-- which pose no problems. In addition, we find j.n.e and n.k. which are abbreviations of three or two words. Thus, we may replace all periods with spaces, and after a similar screening, some other punctuation characters as well:

cat helsinkiin.txt | tr '.,;?!()[]»' ' ' | less

We take a close look at the colon : and the apostrophe ' using similar tests in Less, and find that colons only occur at the end of words and that apostrophes (or single quotes) are always a part of the word, where they signal unpronounced letters. Thus we also replace the colon with a space but leave the apostrophe untouched.

The double hyphen -- needs another tool, such as Sed, which is capable of replacing units that are longer than just one character, thus:

sed 's/--//g'

Sed is a stream editor, and s/xxx/yyy/g is its command for substituting whatever matches the part between the first and the second slash with the part between the second and the third slash. The xxx may be a regular expression. The g states that all occurrences on a line, not just the first, are to be replaced. Now we have:

cat helsinkiin.txt | tr '.,;:?!()[]»' ' ' | \
  sed 's/--//g' |  less

Here we have split the long pipeline over two lines with a backslash.
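
To make the role of the g flag concrete, here is a small sketch on made-up input. It also shows a variant that replaces -- with a space instead of deleting it. That variant is not the command used above, but it may be worth considering if the text contains double hyphens directly between two words, since deleting them would join the words.

```shell
# Without g, only the first match on each line is replaced.
printf 'yksi--kaksi--kolme\n' | sed 's/--/ /'
# prints: yksi kaksi--kolme

# With g, every match on the line is replaced.
printf 'yksi--kaksi--kolme\n' | sed 's/--/ /g'
# prints: yksi kaksi kolme

# Deleting instead of replacing joins the neighbouring words.
printf 'yksi--kaksi\n' | sed 's/--//g'
# prints: yksikaksi
```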

We decide that capital and lower case letters are to be treated identically, so we translate uppercase letters into their lower case counterparts:

cat helsinkiin.txt | tr '.,;:?!()[]»' ' ' | \
  sed 's/--//g' | tr 'A-ZÅÄÖ' 'a-zåäö' | less

At each step we check with Less what the material looks like. We have been selective in deleting unnecessary characters, but do we know what letters and characters there actually are? We use the regular expressions of Less to find out:

&[^a-zåäö '0-9]

We notice that the text contains underscores _ which have to be removed. Hyphens are found but they may remain as they are, i.e. as parts of the word tokens. We found one letter with diacritics: vis-à-vis'ksi but that needs no special treatment either.
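
A non-interactive way to take the same inventory is to let grep print every character outside the expected set and then count the results. This is only a sketch: leftover_chars is a made-up name, the sample line is invented, grep -o is a GNU/BSD extension, and the letters åäö in the bracket expression are treated as whole characters only in a UTF-8 locale.

```shell
# Hypothetical helper: count each character that is not a lowercase
# letter, a space, an apostrophe or a digit.
leftover_chars() {
    grep -o "[^a-zåäö '0-9]" "$@" | sort | uniq -c
}

# Made-up sample containing underscores and hyphens.
printf '%s\n' "_helsinkiin_ vis-a-vis'ksi" | leftover_chars
```

For this sample the output lists the hyphen and the underscore, each with a count of 2. Run on the output of the pipeline so far, it gives the full list of remaining special characters at a glance.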

Splitting the lines

Now we have lines where the words have been normalized, but there may be several words on a line and the words are separated from each other by one or more spaces. The tool tr can translate spaces into newline characters \n; with the option -s it also squeezes runs of repeated newlines into one. We add another step to our pipeline:

cat helsinkiin.txt | tr '.,;:?!()[]»_' ' ' | \
 sed 's/--//g' | tr 'A-ZÅÄÖ' 'a-zåäö' | \
 tr -s ' ' '\n' | less

Now we have a stream of short lines, each containing a single word form.
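
The effect of this step can be checked on a made-up input with uneven spacing and a blank line. The sketch below runs tr first without and then with the -s option.

```shell
# Without -s every space becomes a newline, so runs of spaces
# leave empty lines in the output.
printf 'yksi  kaksi\n\nkolme\n' | tr ' ' '\n'

# With -s runs of newlines are squeezed, so both the double space
# and the blank line between the input lines disappear.
printf 'yksi  kaksi\n\nkolme\n' | tr -s ' ' '\n'
# prints three lines: yksi, kaksi, kolme
```

An empty first line can still remain if the input begins with a space or a newline, because squeezing reduces a run to one newline rather than to none.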

Sorting and counting

We do not have a single program that does both the sorting and the counting. Instead, sort sorts and uniq -c counts, so we add both to our chain:

cat helsinkiin.txt | tr '.,;:?!()[]»_' ' ' | \
 sed 's/--//g' | tr 'A-ZÅÄÖ' 'a-zåäö' | \
 tr -s ' ' '\n' | sort | uniq -c | less

Now we have the list of the 7320 distinct word forms in our text (as we can see by going to the end of the output with > in Less). Each word form is preceded by its frequency.
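
The reason for sorting first is that uniq only merges identical lines that are adjacent. The sketch below shows the difference on a made-up word list and also counts the distinct forms with wc -l, which is one way of getting the number without paging to the end of the output.

```shell
words='ja
on
ja
ja
on'

# Without sort, separated occurrences of the same word are counted separately.
printf '%s\n' "$words" | uniq -c

# With sort, identical lines become adjacent and each word gets one count.
printf '%s\n' "$words" | sort | uniq -c

# Number of distinct word forms.
printf '%s\n' "$words" | sort | uniq | wc -l
```

For this sample the sorted version reports ja with a count of 3 and on with a count of 2, and the last command prints 2.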

By replacing less with a redirection of the output into a file, we can store the result if needed. We would then probably want to remove the first two lines. The first, an empty line, could have been avoided with an extra step that eliminates empty lines. The second, an apostrophe occurring as a token of its own, appears to arise from a typo in the original text file.
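
Put together, the whole procedure might look like the following sketch. It assumes UTF-8 input. The function name word_freq, the extra grep -v '^$' step that drops empty lines, and the output file name frequencies.txt are additions for illustration, not part of the pipeline above. The sample input is made up so that the block can be run without the novel.

```shell
# Hypothetical wrapper around the pipeline developed above; reads stdin.
word_freq() {
    tr '.,;:?!()[]»_' ' ' |
        sed 's/--//g' |
        tr 'A-ZÅÄÖ' 'a-zåäö' |
        tr -s ' ' '\n' |
        grep -v '^$' |          # extra step: drop empty lines
        sort | uniq -c
}

# Made-up two-line sample; the leading space would otherwise
# produce an empty first line.
printf '%s\n' ' Niin on.' '--Tottahan, tottahan!' | word_freq > frequencies.txt
cat frequencies.txt
```

With the real text the call would be word_freq < helsinkiin.txt > frequencies.txt. Whether tr handles the non-ASCII characters » and ÅÄÖ as intended depends on the tr implementation and the locale, so it is worth checking the output for those characters.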

Copyright © 2003-2005 University of Helsinki. All rights reserved.