The following seminar takes place in the year 2000 on the six Tuesdays and Thursdays between November 21 and December 7 at 16-18 (or 4-6 p.m.) in the seminar room F205 at Unioninkatu 38 in Helsinki. The seminar is led by Roman Yangarber, PhD from New York University. Each participant will present one paper from the reading list and take part in the discussions. Two credits for Ctl310.

Unioninkatu 38 is the second block up from the University Main Building, right next to the Department of General Linguistics, but the entrance is from Unioninkatu.

The following description is by Yangarber. You can contact him directly. The local contact is Jussi Piitulainen. Paper copies of the papers will be available for copying at Fabianinkatu 28 shortly.

Quasi-Unsupervised Learning for Natural Language Processing

The objective of this seminar is to explore approaches to Machine Learning which we will call "quasi-unsupervised".

The term quasi-unsupervised is introduced in contrast to "unsupervised" in a strict sense, which in AI refers to methods requiring no manual intervention whatsoever. These largely surface as various clustering algorithms. On the other hand, we have supervised techniques, such as decision trees and lists, HMMs, etc., widely used in NLP. These are characterised by their reliance on annotated resources (corpora), the production of which may involve substantial costs.

In this seminar we will consider a series of problems whose solutions use very large corpora, but require "almost" no manual supervision. Instead of corpus annotation, they require minimal human input at an initial stage, and typically proceed by some form of "bootstrapping" from a few "seed" examples.
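As a rough illustration of the "bootstrapping from seeds" idea, and not the method of any particular paper on the reading list, here is a minimal Python sketch. The function name bootstrap, the toy corpus, the seed words and the min_count threshold are all invented for this example. Starting from two labelled seed words, it repeatedly labels sentences that contain known words and then adopts new words that co-occur with only one label.

```python
from collections import Counter, defaultdict


def bootstrap(sentences, seeds, rounds=5, min_count=2):
    """Grow a {word: label} lexicon from a few seed words (toy sketch).

    All names and thresholds here are illustrative assumptions.
    """
    lexicon = dict(seeds)
    for _ in range(rounds):
        # For each unlabelled word, count the labels of the sentences it occurs in.
        votes = defaultdict(Counter)
        for sentence in sentences:
            words = set(sentence.lower().split())
            labels = {lexicon[w] for w in words if w in lexicon}
            if len(labels) != 1:
                continue  # no evidence, or conflicting evidence: skip
            (label,) = labels
            for w in words:
                if w not in lexicon:
                    votes[w][label] += 1

        # Adopt only words seen with exactly one label, often enough.
        new_entries = {}
        for w, counter in votes.items():
            if len(counter) == 1:
                [(label, count)] = counter.items()
                if count >= min_count:
                    new_entries[w] = label

        if not new_entries:
            break  # nothing new was learned, so stop early
        lexicon.update(new_entries)
    return lexicon


# Hypothetical toy corpus and seeds, for illustration only.
SENTENCES = [
    "loan interest rises",
    "loan interest falls",
    "river water rises",
    "river water flows",
    "interest payment due",
    "interest payment late",
]
SEEDS = {"loan": "finance", "river": "nature"}

lexicon = bootstrap(SENTENCES, SEEDS)
first_round_only = bootstrap(SENTENCES, SEEDS, rounds=1)
```

In this toy run, "payment" never occurs with a seed word; it can only be labelled in a later round, after "interest" has itself been learned from the seeds. The word "rises" occurs with both labels and is therefore never adopted. Redundancy in the corpus is what lets a handful of seeds spread.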

The problems we will study appear in very different areas of NLP - machine translation, word sense disambiguation, dictionary induction, information extraction, etc. No coherent unifying framework for them exists to date. Intuitively, these approaches appear to succeed because they exploit the vast amount of redundancy that is inherent in natural language.

We will study several papers from recent literature.  Besides covering the fundamental information contained in the papers, we will have the following background goals:

  1. attempt to understand some of the common underlying principles, which account for the success of these approaches;
  2. discuss which other NLP problems may be approached in a similar fashion;
  3. discuss how the principles in 1. might apply to the problems in 2.

The process

The seminar will last three weeks. We should have two meetings per week. That will enable us to read six papers, which is substantial. Each session will be 2 hours long.

The requirement for the seminar is that each participant present one paper.

At each session, one student will present a single paper. The presentation consists of an oral exposition of the main topics of the paper, followed by answers to questions from the audience. If there are more than six members in the seminar, we can have two-person teams preparing a single paper: one member of the team will give the exposition and the other will answer questions from the audience.

During the first session, I will introduce the seminar by presenting an overview of the topic, and (hopefully) the first paper.

The performance will be judged based on the quality of the presentation and the quality of the questions.

Background requirements

The seminar should require no special mathematical background beyond a basic understanding of statistics and first-year college-level mathematics. The students should have a general computational background, with a good understanding of algorithms. Whenever possible we will attempt to use standard mathematical techniques as a "black box", but with ample references for further in-depth reading.

Proposed reading list

We will choose six of the following, depending on interest:

Background texts

Links: [Language Technology teaching programme, academic year 2000-2001]
Last Modified: Friday, 27-Oct-2000 13:04:13 EEST