University of Helsinki
clt231: Introduction to Natural Language Processing - academic year 2009-2010

Department of Modern Languages


5. NLTK Text Corpora.

  • Lecture notes
  • Further reading
  • Practical work
    • Open Python IDLE from the Start menu and do:
      >>> import nltk
      >>> from nltk.corpus import gutenberg
      >>> from nltk.corpus import brown
      
    • The Gutenberg Corpus
      How many words (tokens) are there in Jane Austen's novel Persuasion?
      How many times does the word persuasion occur?
      Make a concordance for persuasion in the novel.
      How many characters (letters, punctuation and spaces) are there in the novel?
      How many sentences are there? Find and print the longest sentence.
    • The Brown Corpus
      Make a frequency distribution for the "news" category of the Brown Corpus.
      Print counts of the modal verbs can, could, may, might, must, will.
      Print counts of the wh- words what, when, where, who, how, why.
      Make a frequency distribution for the "romance" category.
      Print counts of the modal verbs and the wh- words.
      Compare the counts for the two genres. Are there any clear differences?
    • Male and female authors
      Make a frequency distribution for Jane Austen's novel Persuasion.
      Print counts of the modal verbs can, could, may, might, must, will.
      Print counts of the personal pronouns he, him, himself, she, her, herself.
      Make a frequency distribution for Herman Melville's novel Moby Dick.
      Print counts of the modal verbs and the personal pronouns.
      Compare the counts for the two authors. Are there any clear differences?
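
As a starting point for the Gutenberg Corpus questions above, the following is a minimal sketch rather than a model answer. The function names `summarize_text` and `summarize_persuasion` are hypothetical, and counting the target word case-insensitively is an assumption, since the exercise does not say whether "Persuasion" and "persuasion" should be counted together. The first function works on any token list; the second applies it to the novel and needs NLTK with the Gutenberg corpus data installed.

```python
def summarize_text(words, sents, raw, target):
    """Return basic counts for one text.

    words  -- sequence of tokens
    sents  -- sequence of sentences, each a list of tokens
    raw    -- the whole text as one string
    target -- the word to count (matched case-insensitively; an assumption)
    """
    target = target.lower()
    # The longest sentence is taken to be the one with the most tokens.
    longest = max(sents, key=len)
    return {
        "tokens": len(words),
        "target_count": sum(1 for w in words if w.lower() == target),
        "characters": len(raw),
        "sentences": len(sents),
        "longest_sentence": " ".join(longest),
    }


def summarize_persuasion():
    """Apply summarize_text to Persuasion (requires NLTK and its corpus data)."""
    import nltk
    from nltk.corpus import gutenberg

    fileid = "austen-persuasion.txt"
    words = gutenberg.words(fileid)
    # Text.concordance prints each occurrence with its surrounding context.
    nltk.Text(words).concordance("persuasion")
    return summarize_text(
        words, gutenberg.sents(fileid), gutenberg.raw(fileid), "persuasion"
    )
```

Calling `summarize_persuasion()` in IDLE prints the concordance and returns the counts as a dictionary. Note that the token count includes punctuation tokens, because that is how the corpus reader splits the text.
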
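
The Brown Corpus and author comparison exercises above both come down to building a frequency distribution and looking up a fixed list of words. One possible sketch follows. The helper names `count_words`, `print_comparison` and `compare_corpora` are hypothetical, tokens are lower-cased before counting (an assumption), and `collections.Counter` is used as a stand-in so the helpers run without NLTK; `nltk.FreqDist` could be used in its place.

```python
from collections import Counter

MODALS = ["can", "could", "may", "might", "must", "will"]
WH_WORDS = ["what", "when", "where", "who", "how", "why"]
PRONOUNS = ["he", "him", "himself", "she", "her", "herself"]


def count_words(tokens, targets):
    """Count how often each target word occurs in tokens (case-insensitive)."""
    freq = Counter(w.lower() for w in tokens)
    # A Counter returns 0 for words that never occur.
    return {t: freq[t] for t in targets}


def print_comparison(label_a, tokens_a, label_b, tokens_b, targets):
    """Print the counts for two token lists side by side."""
    counts_a = count_words(tokens_a, targets)
    counts_b = count_words(tokens_b, targets)
    print(f"{'word':<10}{label_a:>12}{label_b:>12}")
    for t in targets:
        print(f"{t:<10}{counts_a[t]:>12}{counts_b[t]:>12}")


def compare_corpora():
    """Run both comparisons (requires NLTK and its corpus data)."""
    from nltk.corpus import brown, gutenberg

    print_comparison(
        "news", brown.words(categories="news"),
        "romance", brown.words(categories="romance"),
        MODALS + WH_WORDS,
    )
    print_comparison(
        "Austen", gutenberg.words("austen-persuasion.txt"),
        "Melville", gutenberg.words("melville-moby_dick.txt"),
        MODALS + PRONOUNS,
    )
```

The texts being compared differ in length, so raw counts are only a rough guide; dividing each count by the total number of tokens would give a fairer comparison.
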
© 2006-2009 Graham Wilcock


Copyright © 2003-2005 University of Helsinki. All rights reserved.