Word Sense Discovery and Disambiguation

by Krister Lindén

Abstract

The work is based on the assumption that words with similar syntactic usage have similar meaning, which was proposed by Harris (1954; 1968). We study the assumption from two standpoints: firstly, different meanings (word senses) of a word should manifest themselves in different usages (contexts); and secondly, similar usages (contexts) should lead to similar meanings (word senses).

If we start with the different meanings of a word, we should be able to find distinct contexts for the meanings in text corpora. We separate the meanings by grouping and labelling contexts in an unsupervised or weakly supervised manner (Lindén and Lagus 2002; Lindén 2003; Lindén 2004a). We are confronted with the question of how best to represent contexts in order to induce effective classifiers of contexts, because differences in context are the only means we have to separate word senses.
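The idea of separating senses by their contexts can be illustrated with a minimal sketch. It is not the method of the cited publications (whose titles refer to self-organized document maps); it assumes a plain bag-of-words context representation, a handful of hand-labelled seed contexts, and a simple overlap score. The sense labels, the stop-word list, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumption: a tiny hand-picked stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "at", "and", "by", "from"}


def bag_of_words(context, target):
    """Represent a context as counts of its words, minus the target and stop words."""
    tokens = context.lower().split()
    return Counter(t for t in tokens if t != target and t not in STOPWORDS)


def build_sense_profiles(labelled_contexts, target):
    """Sum the bags of all seed contexts that share a sense label."""
    profiles = {}
    for sense, context in labelled_contexts:
        profiles.setdefault(sense, Counter()).update(bag_of_words(context, target))
    return profiles


def disambiguate(context, target, profiles):
    """Pick the sense whose profile shares the most words with the new context."""
    bag = bag_of_words(context, target)

    def overlap(sense):
        profile = profiles[sense]
        return sum(min(count, profile[word]) for word, count in bag.items())

    return max(profiles, key=overlap)


# Hypothetical seed contexts for the ambiguous word "bank".
seeds = [
    ("finance", "she deposited money at the bank"),
    ("finance", "the bank raised its interest rate"),
    ("river", "they fished from the bank of the river"),
    ("river", "the river bank was muddy after rain"),
]
profiles = build_sense_profiles(seeds, "bank")

print(disambiguate("the bank charged interest on the money", "bank", profiles))
print(disambiguate("the muddy bank of the river flooded", "bank", profiles))
```

The sketch makes the dependence on context representation concrete: changing what `bag_of_words` keeps or discards changes which contexts look alike, and therefore which senses can be told apart.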

If we start with similar contexts, we should be able to discover similarities in meaning. We can do this monolingually or bilingually. In the monolingual material, we find synonyms and other related words in an unsupervised way (Lindén and Piitulainen 2004b). In the bilingual material, we find translations by supervised learning of transliterations (Lindén 2004c; Lindén 2005a). In both the monolingual and the bilingual case, we first discover words with similar contexts, i.e., synonym or translation lists. In the monolingual case, we also aim at finding structure in the lists by discovering groups of similar words, e.g., synonym sets.
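The monolingual direction, finding words that occur in similar contexts, can be sketched as follows. This is only an illustration under stated assumptions, not the method of the cited publications: contexts are taken to be the words within a fixed window, similarity is plain cosine over raw co-occurrence counts, and the toy corpus and function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_n=3):
    """Rank all other words by the similarity of their contexts to those of `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy corpus: "doctor" and "physician" share all their contexts.
toy_corpus = [
    "the doctor treated the patient",
    "the physician treated the patient",
    "the doctor examined the child",
    "the physician examined the child",
    "the dog chased the ball",
]
vectors = context_vectors(toy_corpus)
print(most_similar("doctor", vectors))
```

A ranked list such as the one returned by `most_similar` corresponds to the first step described above; grouping such lists into synonym sets would be a separate clustering step that this sketch does not attempt.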

In the introduction (Lindén 2005b), we consider the larger background issues of how meaning arises, how it is quantized into word senses, and how it can be stored. We also consider how to define context, how to collect realistic contexts, and how to represent them. We discuss how to evaluate the context classifiers and word sense classifications, and finally we present the word sense discovery and disambiguation methods proposed in the publications of this work.

This work confirms that the hypothesis proposed by Harris is useful, and we implement three methods for exploiting his hypothesis which have practical consequences for creating thesauruses and translation dictionaries, e.g., for information retrieval and machine translation purposes.

Introduction

Lindén, K. (2005b). Word Sense Discovery and Disambiguation. Publications of the Department of General Linguistics, University of Helsinki, No. 37, June.

Word-Sense Disambiguation

Lindén, K. and Lagus, K. (2002). Word Sense Disambiguation in Document Space. In Proceedings of 2002 IEEE International Conference on Systems, Man and Cybernetics, Hammamet, Tunisia, October 6-9.
Lindén, K. (2003). Word Sense Disambiguation with THESSOM. In Proceedings of WSOM'03 - Intelligent Systems and Innovational Computing, Kitakyushu, Japan, September.
Lindén, K. (2004a). Evaluation of Linguistic Features for Word Sense Disambiguation with Self-Organized Document Maps. Journal of Computers and the Humanities, 38(4):417-435, November.

Word-Sense Discovery

Monolingually

Lindén, K. and Piitulainen, J. (2004b). Discovering Synonyms and Other Related Words. In the Proceedings of CompuTerm 2004, Geneva, Switzerland, August 29.

Cross-lingually

Lindén, K. (2004c). Finding Cross-lingual Spelling Variants. In the Proceedings of SPIRE 2004, Padua, Italy, October 5-8.
Lindén, K. (2005a). Multilingual Modeling of Cross-lingual Spelling Variants. Journal of Information Retrieval. Final revision accepted, April 18, 2005.


Last modified: Tue Jun 27 14:27:03 EEST 2006