University of Helsinki
clt350: Statistical Parsing Methods - academic year 2009-2010


Department of Modern Languages


3. Part-of-Speech Tagging

3.1. OpenNLP POS Tagger

  • The OpenNLP Documentation shows how to build and use the OpenNLP tools. You can pipe output from one OpenNLP tool into the next, for example from the sentence detector into the tokenizer, and from the tokenizer into the POS tagger.
  • This course uses OpenNLP Tools version 1.3.0. If you work on ruuvi, you don't need to build the tools; you can use the script given in the practical work below.
  • If you work on your own computer, you need to build the tools and download the maximum entropy models for English. Please use version 1.3.0 of both the tools and the models.
  • Further reading
  • Practical work
    • Copy the script clt350-opennlp-tagger.sh to your directory and make it executable. This script runs the OpenNLP sentence detector, tokenizer and POS tagger in sequence. It reads its input from stdin and writes its output to stdout.
    • Use it like this to tag Sonnet 130:
      ./clt350-opennlp-tagger.sh <sonnet130.txt >tagged130.txt &
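
Once the tagger has finished, a quick way to inspect tagged130.txt is to count how often each tag occurs. The following is only a minimal sketch: the function name tag_counts is made up for illustration, and it assumes the output consists of whitespace-separated tokens written as word/TAG. If your output looks different, adjust the separator.

```shell
#!/bin/sh
# Count how often each tag occurs in tagger output read from stdin.
# ASSUMPTION: tokens are whitespace-separated and written as word/TAG.
tag_counts() {
    tr -s ' \t' '\n\n' |    # put one token on each line
    grep '/' |              # skip empty lines and untagged tokens
    sed 's|.*/||' |         # keep only the text after the last slash (the tag)
    sort | uniq -c |        # count identical tags
    sort -k1,1nr -k2        # most frequent tag first
}

# Run on the real tagger output only when the file is present.
if [ -f tagged130.txt ]; then
    tag_counts <tagged130.txt
fi
```

Each output line shows a count followed by a tag, so the most frequent tags in the sonnet appear at the top.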

3.2. Stanford POS Tagger

3.3. Comparing Taggers

  • Further reading
  • Practical work
    • The OpenNLP POS tagger and the Stanford POS tagger both use the Penn Treebank tagset. Compare the results of tagging Sonnet 130 with the two taggers. Which words are tagged differently? Which tagger is more accurate?
    • Try tagging bigger texts and corpora: Jane Austen's Northanger Abbey, or half a million words in Jane Austen's six main novels. Which tagger is faster?
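
The comparison can be partly automated. The sketch below lists the tokens that received different tags in two output files. It rests on assumptions that may not hold for your files: both files use whitespace-separated word/TAG tokens, and both taggers split the text into the same sequence of tokens. The function names and the file name stanford130.txt are hypothetical.

```shell
#!/bin/sh
# List tokens that two taggers labelled differently.
# ASSUMPTIONS: both files hold whitespace-separated word/TAG tokens, and
# both taggers tokenized the text identically, so token i in one file
# corresponds to token i in the other.
one_per_line() {
    tr -s ' \t' '\n\n' <"$1" | grep '/'
}

tag_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    one_per_line "$1" >"$a"
    one_per_line "$2" >"$b"
    # paste joins line i of both files with a tab; awk prints the
    # token position and both versions wherever they differ.
    paste "$a" "$b" | awk -F '\t' '$1 != $2 { print NR ": " $1 "  " $2 }'
    rm -f "$a" "$b"
}

# stanford130.txt is a hypothetical name for the Stanford tagger output.
if [ -f tagged130.txt ] && [ -f stanford130.txt ]; then
    tag_diff tagged130.txt stanford130.txt
fi
```

If the two taggers tokenize a word differently, every later line will be reported as a mismatch, so check the first reported difference by hand. If one tagger writes a separator other than a slash, convert it before comparing. For the speed question, prefixing a tagger command with the shell's time keyword reports how long it ran.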
© 2007-2010 Graham Wilcock


Copyright © 2003-2005 University of Helsinki. All rights reserved.