The following seminar takes place in the year 2000 on the six Tuesdays and Thursdays between November 21 and December 7 at 16-18 (or 4-6 p.m.) in the seminar room F205 at Unioninkatu 38 in Helsinki. The seminar is led by Roman Yangarber, PhD from New York University. Each participant will present one paper from the reading list and take part in the discussions. Two credits for Ctl310.
Unioninkatu 38 is the second block up from the University Main Building, right by the Department of General Linguistics, except that the entrance is on Unioninkatu.
The following description is by Yangarber. You can contact him directly at email@example.com. The local contact is Jussi Piitulainen, firstname.lastname@example.org. Copies of the papers will shortly be available for photocopying at Fabianinkatu 28.
The objective of this seminar is to explore approaches to Machine Learning which we will call "quasi-unsupervised".
The term quasi-unsupervised is introduced in contrast to "unsupervised" in a strict sense, which in AI refers to methods requiring no manual intervention whatsoever. These largely surface as various clustering algorithms. On the other hand we have supervised techniques, such as decision trees and lists, HMMs, etc., widely used in NLP. These are characterised by their reliance on annotated resources (corpora), the production of which may involve substantial costs.
In this seminar we will consider a series of problems whose solutions use very large corpora, but require "almost" no manual supervision. Instead of corpus annotation, they require minimal human input at an initial stage, and typically proceed by some form of "bootstrapping" from a few "seed" examples.
The problems we will study appear in very different areas of NLP: machine translation, word sense disambiguation, dictionary induction, information extraction, etc. No coherent unifying framework for them exists to date. Intuitively, these approaches appear to succeed because they exploit the vast amount of redundancy inherent in natural language.
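To make the idea of bootstrapping from seeds concrete, here is a toy sketch in Python. It is not the method of any paper on the reading list. The function name `bootstrap`, the `min_support` threshold, the (previous word, next word) notion of context, and the six-sentence corpus are all assumptions made up for illustration. The loop alternates between two steps: collect the contexts in which known terms occur, then accept every word that fills a context seen often enough.

```python
from collections import Counter


def bootstrap(sentences, seeds, rounds=3, min_support=2):
    """Grow a term set from a few seeds (illustrative toy, not a published method)."""
    known = set(seeds)
    tokenized = [s.lower().split() for s in sentences]
    for _ in range(rounds):
        # Step 1: count the (previous word, next word) contexts of known terms.
        contexts = Counter()
        for toks in tokenized:
            for i in range(1, len(toks) - 1):
                if toks[i] in known:
                    contexts[(toks[i - 1], toks[i + 1])] += 1
        # Trust only contexts supported by at least min_support known terms.
        trusted = {c for c, n in contexts.items() if n >= min_support}
        # Step 2: accept every word that occurs in a trusted context.
        found = set()
        for toks in tokenized:
            for i in range(1, len(toks) - 1):
                if (toks[i - 1], toks[i + 1]) in trusted:
                    found.add(toks[i])
        if found <= known:
            break  # nothing new was learned, so stop early
        known |= found
    return known


# Hypothetical mini-corpus; real applications use very large corpora.
SENTENCES = [
    "offices in helsinki and abroad",
    "offices in london and abroad",
    "offices in paris and abroad",
    "flights to london are cheap",
    "flights to paris are cheap",
    "flights to berlin are cheap",
]

print(sorted(bootstrap(SENTENCES, {"helsinki", "london"})))
# prints ['berlin', 'helsinki', 'london', 'paris']
```

In this toy corpus "paris" is learned in the first round from the context shared with both seeds, and only then does the second context gain enough support to yield "berlin". That chaining is the redundancy being exploited: the same items recur in several contexts, and the same contexts recur with several items.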
We will study several papers from recent literature. Besides covering the fundamental information contained in the papers, we will have the following background goals:
The seminar will last three weeks. We should have two meetings per week. That will enable us to read six papers, which is substantial. Each session will be 2 hours long.
The requirement for the seminar is that each participant present one paper.
At each session, one student will prepare a single paper. The presentation consists of an oral exposition of the main topics of the paper, and answering questions from the audience. If there are more than six members in the seminar, we can have two-person teams preparing a single paper: one member of the team will give the exposition and the other will answer questions from the audience.
During the first session, I will introduce the seminar by presenting an overview of the topic, and (hopefully) the first paper.
The performance will be judged based on the quality of the presentation and the quality of the questions.
The seminar should require no special mathematical background beyond a basic understanding of statistics and first-year college-level mathematics. Students should have a general computational background, with a good understanding of algorithms. Whenever possible we will attempt to use standard mathematical techniques as a "black box", but with ample references for further in-depth reading.
We will choose six of the following, depending on interest:
Ellen Riloff, 1996, "Automatically Generating Extraction Patterns from Untagged Text" (ps.gz), pages 1044-1049 in Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), The AAAI Press/MIT Press.
Avrim Blum and Tom Mitchell, 1998, "Combining labeled and unlabeled data with co-training" (ps.gz), pages 92-100 in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98), ACM Press, New York.
Eugene Agichtein and Luis Gravano, 2000, "Snowball: Extracting Relations from Large Plain-Text Collections" (pdf.gz), Proceedings of the 5th ACM International Conference on Digital Libraries (DL'00).