This document describes the recognition, classification and coreference resolution of names in BRIEFS. It covers only English text processing.
In this section I will explain some central concepts and problems of the field.
An enormous amount of information exists only in natural language form. Most of the world's news, for example, is released in newspapers and in radio and TV broadcasts. If this information is to be manipulated, analyzed or even filtered automatically, it must first be distilled into a more structured form in which single facts are more accessible (Grishman 1997).
A broad definition of 'information extraction' covers any method of filtering information from large volumes of text. This would include the retrieval of documents from collections and the tagging of particular terms in the texts. However, over the last decades a narrower definition has become popular through a series of Message Understanding Conferences (MUC). In MUC-7 'information extraction' (IE) is defined as the extraction or pulling out of pertinent information from large volumes of natural language texts. This is also sometimes called 'text extraction'. The goal of IE is not to understand all that is said in the text, which is usually referred to as 'text understanding', but to find answers to specific questions presented by the user. The usual primary objective of extraction is to build databases that are more suitable than free text for querying, e.g. using SQL.
According to Jerry R. Hobbs (2000) an IE system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically. Different systems have their own sets of modules, but generally the same functions are performed.
The process of extracting information can be divided into two major parts. First, the single facts are extracted from the text; second, the facts are integrated, producing larger facts or inferring new facts. The individual facts are extracted by matching the text against patterns that describe the linguistic realizations of the sought information. Since natural language is so complex, it is not practical to describe these patterns on the level of words. Therefore it is necessary to find structure in the text, recognizing constituents and relations. This usually means lexical analysis, recognition of names and other special lexical structures such as dates, and syntactic analysis or partial parsing. Once this has been done, fewer patterns are needed to find the relevant pieces of text and the facts manifested in them.
The second part of the process is to combine these single facts into complete descriptions of events. Integrating facts based on their proximity in a text doesn't take us very far. Integration through named entities (objects of interest) is more reliable: it usually tells us how to put the single facts together and, more importantly, it works across multiple texts. If we have several documents describing a deal between the same named participants and the documents' dates are all in the same week, we can safely assume that they all describe the same event.
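The integration heuristic above can be illustrated with a minimal sketch. It assumes each extracted fact is a dictionary with a document id, a list of participant names and a document date. These field names, the dates and the use of the ISO calendar week as "the same week" are assumptions made for the example, not part of BRIEFS.

```python
from collections import defaultdict
from datetime import date


def group_deal_facts(facts):
    """Group deal facts that share the same participants and calendar week."""
    events = defaultdict(list)
    for fact in facts:
        iso = fact["date"].isocalendar()  # (ISO year, ISO week, weekday)
        # Participant order does not matter, so use a frozenset in the key.
        key = (frozenset(fact["participants"]), iso[0], iso[1])
        events[key].append(fact["doc"])
    return dict(events)


# Hypothetical facts; the dates are invented for the illustration.
facts = [
    {"doc": "doc1", "participants": ["ObjectSwitch", "NetToll"], "date": date(2000, 2, 1)},
    {"doc": "doc2", "participants": ["NetToll", "ObjectSwitch"], "date": date(2000, 2, 3)},
    {"doc": "doc3", "participants": ["ObjectSwitch", "NetToll"], "date": date(2000, 2, 14)},
]
events = group_deal_facts(facts)
```

The first two documents fall into one group and the third into another. The grouping only works if the participant names have already been normalized, which is the subject of the rest of this document.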
When reading a text, one name can often be found in several different surface forms. Not only does the form vary according to the grammar, but a person, for example, can be referred to by his/her full name, Christian name or surname, a noun phrase or a pronoun. A company's name can vary even more. There are no clear conventions on how the names should be written. The full name can give indications on the company's type of juridical liability (e.g. "Inc" or "Corp") or home region. These are easily dropped in the text once the named entity has been established and there is no fear of mix-up. In addition to the deliberate variation there can be misspellings and differences in the language used. E.g. names of places or indications of the line of business of the company can be translated from the local language to the document's language, but are not necessarily so.
The task described in this document is to recognize references to named entities in the texts. In MUC sites this is often referred to as discourse processing.
Here are some examples from the texts used in BRIEFS with the reference chain shown in italics. The first one shows how the same company is referred to by a full name, an acronym and the pronoun they in different surface forms.
In addition to the traditional channels like direct sales, traders, integrated or independent Steel Service Centers, two new forms of distribution emerged in the 1990's; steel supermarkets and e-Commerce. As SSCs have developed their processing business, they have steadily increased their integration downstream, meeting the outsourcing trend of their customer.
The second example shows a different variation. Here the name of the event is misspelled, i.e. there is an additional space in the name.
ObjectSwitch Corporation, the emerging leader in fault-tolerant infrastructure software for scaleable e-Services, announced today its participation in the upcoming GS M World Congress (booth #B43). ObjectSwitch will demonstrate its latest technology solutions including the Wireless Application Protocol (WAP) Adapter providing access to applications through a simple WAP browser for network service providers, integrators and network equipment providers. In addition, ObjectSwitch will announce a new strategic partner alliance with NetToll, a provider of innovative new billing and management solutions. The GSM World Congress takes place February 2-4 in Cannes, France.
This example shows a typical convention. A person is referred to first with full name and later with mere surname.
The channel is struggling to keep up with the changing mobile market, said Tommy Morris, branch manager for Computer Tech, Dallas. "The channel needs VARs to handle the little idiosyncrasies," Morris said.
Resolving coreference is essential when one tries to build a general picture from the single facts one has extracted from the texts. In BRIEFS the user can make complex inquiries into the database and these often include differentiating new knowledge from old. If a name is not tied to an already existing entity, it looks as if a new entity has arisen and all information attached to it is new information. Another scenario is when some event or entity is being followed for a longer period. If the names in different documents are not tied together, one may miss some crucial information about the developments, for example the closing of a deal that has been in preparation over a period of time.
Anaphoric expressions can disturb the system even more. If for example the system was searching for facts with the pattern subject phrase with head noun that is a name of a company + verb group with main verb "increase" + object, the phrase "they have steadily increased their integration" in the first example would not be found unless the anaphora was resolved.
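A small sketch may make the point concrete. It assumes a clause has already been reduced to a dictionary with a subject head, the base form of the main verb and an object, and that a resolved pronoun carries an extra "antecedent" entry. All of these names are hypothetical and do not reflect the actual BRIEFS pattern language.

```python
COMPANIES = {"Steel Service Centers"}


def matches_increase_pattern(clause, companies):
    """Pattern: company-name subject + main verb 'increase' + object."""
    # When the subject is a resolved pronoun, test its antecedent instead.
    subject = clause.get("antecedent") or clause["subject"]
    return (
        subject in companies
        and clause["verb_base"] == "increase"
        and clause.get("object") is not None
    )


unresolved = {"subject": "they", "verb_base": "increase", "object": "integration"}
resolved = dict(unresolved, antecedent="Steel Service Centers")

found_without_resolution = matches_increase_pattern(unresolved, COMPANIES)
found_with_resolution = matches_increase_pattern(resolved, COMPANIES)
```

Only the clause with the resolved pronoun satisfies the pattern.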
In this section I introduce the BRIEFS project and some of the tools utilized in BRIEFS.
The Brief Driven Information Retrieval and Extraction for Strategy (BRIEFS) prototype system is developed in cooperation between the TAI Research Center of the Helsinki University of Technology, the Department of General Linguistics in the University of Helsinki and the State Research Center (VTT). There are also participating business partners (Keijola 2002).
The main objective in BRIEFS is to extract computable data from vast masses of text. Besides this the system supports the conceptual modeling of the knowledge domain and evaluates the relevance of the documents for the domain. Once the knowledge domain has been created, it can be used for knowledge discovery.
Figure 2.1. shows the general architecture of the system.
There are two types of input to the system. The operator provides the
system with a document (brief), which describes the domain of the
business. This is exploited in the creation of a domain specific
ontology. On the other hand, the system receives text documents that
contain the knowledge to be discovered. The input is
analyzed both linguistically and statistically, and the
results are used when the relevance of the document is decided and the
information is extracted. The extracted facts are used as evidence of
entities, events and relations, and the final output for the user is inferred on
that evidence.
Figure 2.1. The BRIEFS System architecture (Keijola 2002)
The project is still in progress. Some parts of the system have not yet been implemented completely.
We have adopted a system called Gate (General Architecture for Text Engineering) from the University of Sheffield as our development platform. The system follows the TIPSTER architecture as defined by DARPA's TIPSTER project.
The Gate system administers collections of texts together with their associated annotations. The annotations are generated
and associated with the original text in the course of processing of the text by e.g. the BRIEFS programs. Gate offers a variety
of viewers into the original text and its annotations. It allows for creation of alternative processing pipelines so it is easy to
develop and test alternative algorithms (Keijola 2002). It also allows the running and undoing of single program modules. Figure 2.2. shows the main
window of Gate with the pipelines created for BRIEFS and figure
2.3. shows one of these pipelines.
Figure 2.2. The BRIEFS processing pipelines in Gate (Keijola 2002)
Figure 2.3. The BRIEFS processing pipeline IE with FDG and
CPSL. The boxes represent different program modules and the directed arrows
show the precedence between the modules. (Keijola 2002)
The linguistic modules are the same in all the processing
pipelines for English text.
We use a commercially available parser, En-FDG from Conexor Oy, for basic linguistic processing.
According to Conexor (2002a), an FDG parser enriches text (plain text, XML, SGML, HTML) with functional dependencies that describe sentence-level relations and functions between words and linguistic structures. A light morphosyntactic mark-up is also provided (base forms, morphology and phrasal tags).
Here is an example from one of the input documents in BRIEFS.
Input:
"Scott's appointment to CEO of the WAP Forum comes at a time when WAP has reached a new level of maturity and global acceptance," said Gregory Williams, board chairman of the WAP Forum.
Output:
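| Token | Surface form | Base form | Dependency (function:>head) | Functional tag and morphosyntactic tags |
|---|---|---|---|---|
| 1 | " | |||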
| 2 | Scott's | scott | attr:>3 | @A> %>N N GEN SG +ind |
| 3 | appointment | appointment | subj:>9 | @SUBJ %NH N NOM SG |
| 4 | to | to | mod:>3 | @<NOM %N< PREP |
| 5 | CEO | ceo | pcomp:>4 | @<P %NH ABBR NOM SG +role |
| 6 | of | of | mod:>5 | @<NOM-OF %N< PREP |
| 7 | the | the | det:>8 | @DN> %>N DET |
| 8 | WAP Forum | wap forum | pcomp:>6 | @<P %NH N NOM SG +org |
| 9 | comes | come | main:>0 | @+FMAINV %VA V PRES SG3 |
| 10 | at | at | tmp:>9 | @ADVL %EH PREP |
| 11 | a | a | det:>12 | @DN> %>N DET SG +temp |
| 12 | time | time | dur:>9 | @ADVL %EH N NOM SG +temp |
| 13 | when | when | tmp:>16 | @ADVL %EH ADV WH |
| 14 | WAP | wap | subj:>15 | @SUBJ %NH N NOM SG +prod |
| 15 | has | have | v-ch:>16 | @+FAUXV %AUX V PRES SG3 |
| 16 | reached | reach | mod:>12 | @-FMAINV %VA EN |
| 17 | a | a | det:>19 | @DN> %>N DET SG |
| 18 | new | new | attr:>19 | @A> %>N A ABS |
| 19 | level | level | obj:>16 | @OBJ %NH N NOM SG |
| 20 | of | of | mod:>19 | @<NOM-OF %N< PREP |
| 21 | maturity | maturity | pcomp:>20 | @<P %NH N NOM SG |
| 22 | and | and | cc:>21 | @CC %CC CC |
| 23 | global | global | attr:>24 | @A> %>N A ABS |
| 24 | acceptance | acceptance | cc:>21 | @<P %NH N NOM SG |
| 25 | , | , | ||
| 26 | " | |||
| 27 | said | say | | @+FMAINV %VA V PAST |
| 28 | Gregory Williams | gregory williams | subj:>27 | @SUBJ %NH N NOM SG +ind |
| 29 | , | , | ||
| 30 | board chairman | board chairman | mod:>28 | @APP %NH N NOM SG +role |
| 31 | of | of | mod:>30 | @<NOM-OF %N< PREP |
| 32 | the | the | det:>33 | @DN> %>N DET |
| 33 | WAP Forum | wap forum | pcomp:>31 | @<P %NH N NOM SG +org |
| 34 | . | . | ||
| 35 | <s> | <s> |
En-FDG reads the text and splits it into tokens. The tokens are numbered by their ordinal position inside a sentence and these numbers are used in the dependencies. The next two fields show the surface form and the base form of the word. These are followed by the token's dependency function and the number of the head of the token. These may be missing, but when they are there they should be reliable. The last field consists of the token's functional tag and knowledge on the token's morphosyntactic mark-up (Conexor 2002b).
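To illustrate the field layout described above, here is a minimal sketch of reading one token line into a record. It assumes the fields are separated by tabs. The real En-FDG output format may differ, and the names `Token`, `parse_fdg_line` and `semantic_tag` are invented for this example.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    index: int               # ordinal position inside the sentence
    surface: str             # surface form
    base: str                # base form (may be empty, e.g. for punctuation)
    function: Optional[str]  # dependency function, e.g. "subj"
    head: Optional[int]      # number of the head token, 0 for the main verb
    tags: List[str]          # functional tag and morphosyntactic tags


def parse_fdg_line(line):
    """Parse one tab-separated token line (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))  # pad fields that are missing
    index, surface, base, dep, tags = fields[:5]
    function, head = None, None
    if ":>" in dep:  # the dependency field may be missing
        function, _, head_text = dep.partition(":>")
        head = int(head_text)
    return Token(int(index), surface, base, function, head, tags.split())


def semantic_tag(token):
    """Return the first tag starting with '+', such as '+org', if any."""
    return next((t for t in token.tags if t.startswith("+")), None)


org = parse_fdg_line("8\tWAP Forum\twap forum\tpcomp:>6\t@<P %NH N NOM SG +org")
said = parse_fdg_line("27\tsaid\tsay\t\t@+FMAINV %VA V PAST")
```

Here `org.head` is 6 and `semantic_tag(org)` is `"+org"`, while `said` has neither a dependency function nor a head.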
The version of En-FDG used in BRIEFS also includes a name recognition engine, which affects the tokenization so that names consisting of several words are not split. The names are also given a type (e.g. +org, +ind).
The English FDG has a custom lexicon mechanism, which gives the user a way to affect the working of the parser. The mechanism is used in BRIEFS to improve the recognition and classification of names.
In this section I define the structure and workings of the Briefs name database.
The known names are stored in a database. It represents the knowledge we have on the entities that exist in the domain and what they can be called. The database is not created or updated automatically, but by humans. To ease the maintenance of the name database a tool was created by Lauri Seitsonen (2002a). With BRIEFS Name Tool names can be added, updated and removed from the database. Name Tool can also be used for compiling the custom lexicon of the FDG parser in the BRIEFS system.
The user can give the following attributes to a name being added to the database:
The last attribute is defined automatically, depending on where the new name is saved.
There are also other fields in the actual database but they are filled and updated automatically by the system.
Fifteen categories of names have been defined in the BRIEFS ontology. The types are City, Company, Country, Event, Exchange, Nationality, Organization, Person, Product, Region, Site, Symbol, Technology, Title and Unknown.
The following BRIEFS classes can be given in the Name Tool as the primary class and can be conveyed to FDG: Technology, Company, Organization, Person, Product, Event, Title, Site, City, Country and Region. In addition to these Name Tool has classes for strings of nouns (Compound) and abbreviations (Abbreviation). Compounds and Abbreviations are not considered to be names. The BRIEFS classification has the following classes that cannot be given as Name Type in the Name Tool or given in FDG's custom lexicon: Nationality, Exchange and Symbol. Instead names of type Exchange and Symbol are given the Name type Company, with their accurate class presented in the Subtype. Nationalities are quite comprehensively recognized by FDG, so they pose no problem.
Names are added only in nominative form as the FDG returns words in the text to their basic forms.
The more comprehensive the name database is, the better the result will be. Although the chain of processing can often unify names that differ only a little, it should be treated as a backup system for unknown names and entities.
Several files used in the processing can be created from the name lists in the database with the Name Tool. We will return to the actual use of the name lists later in the chapters on the processing of the documents.
The lists are first used to make a custom lexicon for the parser out of the fields Name and Name type. The lexicon can help FDG to parse the text. The names are not modified in any way when building the custom lexicon, so only the names that match the lexicon exactly are recognized and classified. When FDG compares the text against the custom lexicon, lower case letters in the lexicon match to both lower and upper case letters in the text, and upper case letters in the lexicon match only to upper case letters in the text. Therefore the value of field "Name" should be non-capitalized.
A file is made of the fields Name and Subtype. Briefs Guess fdg uses this file to add the subtype as an attribute to each custom name annotation. If the subtype is Exchange or Symbol, it will be promoted to be the main type of the annotation.
The Name fields in the lists are also used with Unifier. If a name has not been found with the help of custom lexicon and the rules in Briefs Coref fdg have failed to unify it with known names, it is matched to the name list by Unifier to detect possible misspellings.
The lists are used for the last time when the correct writing forms are fetched to unify names between documents. This process is not so strict on the exact form of the name in the text, but is very vulnerable to inaccuracies in the correct writing forms. Therefore it is better to leave a name out of the list than to give it a canonical form that differs from the correct forms of the other surface forms of the same entity.
It should be pointed out that it is not always easy to make a clear-cut decision about the scope or the type of a name. Neither is it easy to decide whether a short and common name should be added at all. If a name can refer to several different entities, it may be best to leave it out altogether. If a name can also be used as a common noun or verb, it should not be added to the list. FDG recognizes the custom names when the text is tokenized, before the parsing, and if for example a verb is misinterpreted as a name, the parsing of the sentence fails.
As an example take a look at the noun phrase "The Mobile Phone unit of Nokia Corporation". "Nokia Corporation" is easy to classify as a name of type Company. How about the company unit the phrase refers to? The simplest answer would be "Mobile Phone unit", but that alone does not refer to any specific unit. Many companies have Mobile Phone units. "Mobile Phone unit of Nokia Corporation" is the least likely candidate to go wrong as it has a very clear and hopefully unique referent. But then we should probably also add the names "Mobile Phones unit of Nokia Corporation", "Nokia Corporation Mobile Phone unit", "Nokia Mobile Phone unit", "Nokia's Mobile Phone unit" etc.
A good example of the names best left out is "Ericsson", which can refer to a person named Ericsson, but in the business world often refers to the company L.M. Ericsson Telefon AB.
In the texts many names are appended with endings such as "-based", "-compatible", "-enabled". If these endings are glued to a preceding name, FDG doesn't recognize the name even if it is part of the custom lexicon. Therefore the preprocessing of the texts should add a space between the name and the ending, before the hyphen.
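The preprocessing step suggested above could be sketched as a single regular expression substitution. The list of endings contains only the three mentioned in the text and would have to be extended in practice. The function name is an assumption of this example.

```python
import re

# Endings taken from the text; a real list would probably be longer.
ENDINGS = ("based", "compatible", "enabled")
_ENDING_RE = re.compile(r"(?<=\w)-(" + "|".join(ENDINGS) + r")\b")


def detach_endings(text):
    """Insert a space before the hyphen of a known ending."""
    return _ENDING_RE.sub(r" -\1", text)


cleaned = detach_endings("a WAP-enabled phone from Helsinki-based Nokia")
```

The result is "a WAP -enabled phone from Helsinki -based Nokia". Hyphenated words with other endings, such as "e-Commerce", are left untouched.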
In this section I explain how names are detected and classified in BRIEFS.
According to Carlson (1988) names are often treated in logic as if they were direct and unique references to some entity. This differs completely from the way other nouns are treated. In practice the difference isn't so total. Names are not unique, but one name can refer to many different entities depending on the context and the same entity can be referred to with several different names.
Proper nouns form an open word class. This means that we occasionally run into names we don't yet know, and therefore lexical knowledge of names will not be enough. In general one can get fairly good results with a few simple rules, but the greater the precision and the more comprehensive the results we pursue, the harder the task gets.
There are many ways to detect proper nouns. Orthography is the way that comes to mind first. Names are often written with an initial upper case letter. Common names can be listed in the lexicon as proper nouns. On the other hand, if some word cannot be found in the lexicon, it may be because it is a previously unknown name. Context can also give clues to classifying some word as a proper noun. Some verbs, for example "love", may demand their subject to be living, and if the subject happens to be a noun like "Stone", it is probable that the subject is not a rock, but the name of a person. The rules on context can be learned automatically or they can be coded manually.
In BRIEFS the recognition of names is based on the parser's name recognition (the FDG from Conexor Oy). According to Voutilainen (2002) FDG recognizes and classifies names based on orthographic, lexical and contextual (syntactic-semantic) information. The starting point for recognition is the parsed text and recognition and classification are founded on an empirically tested linguistic language model.
Apart from its own name recognition engine FDG recognizes names that are enumerated in its custom lexicon. This gives the user the means to improve the efficiency of the name recognition by adding domain specific names and their classes into the lexicon. In BRIEFS this lexicon is made out of the name database.
The parser's output is a list of token, sentence and paragraph annotations. The token annotations that are names can be singled out by their attributes. These attributes include knowledge of the class of the name and whether it was found with the help of the custom lexicon. The custom found names get their class directly from the custom lexicon.
The FDG tags the names it finds that are not mentioned in the custom lexicon with one of the following classes: +money, +time, +person, +product, +location, +organization, +nationality or +role. If the parser can't decide between the classes, the token is given the underspecified tag "+name". Out of these the "+time" and "+money" classes are not treated as names in BRIEFS at this stage. Where the two different classifications coincide, the name annotation is given the BRIEFS class.
The module Briefs Name fdg (figure 2.3.) produces name annotations of these chosen token annotations.
As could be seen in the previous chapters, the recognition and classification of names are closely connected. If we really know a name, we know which entity it refers to and consequently usually know the type of the entity. On the other hand if we recognize a name by its context, its type may also be limited by the same context.
There are many ways to deal with the common problem of classification: for example decision trees and rule sets, neural networks and Bayesian networks. The rules by which the tokens are classified can be constructed by hand or be learned automatically with some learning algorithm if one has a training set available.
As the classifications given in the previous chapter don't quite match, the name's type is refined in the module Briefs Guess fdg (figure 2.3.). Guess uses a set of heuristic rules and word lists to decide on the Briefs class of the underspecified names. If the class of a name cannot be decided, it is given the underspecified type "Unknown". Once the type of the name has been selected, it may get some additional attributes.
The subtype saved in the name database is added to each custom found name as an attribute. If the subcategory is Exchange or Symbol, it is promoted to be the main class of the name.
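The subtype handling can be sketched as follows. The annotation is modelled as a plain dictionary and the subtype file as a mapping from name to subtype. Both representations, and the sample entries, are assumptions made for the illustration.

```python
PROMOTED_SUBTYPES = {"Exchange", "Symbol"}


def apply_subtype(annotation, subtypes):
    """Attach the stored subtype; promote Exchange and Symbol to the main type."""
    subtype = subtypes.get(annotation["name"])
    if subtype is None:
        return annotation  # no subtype stored for this name
    result = dict(annotation, subtype=subtype)
    if subtype in PROMOTED_SUBTYPES:
        result["type"] = subtype
    return result


# Hypothetical name list entries.
subtypes = {"nasdaq": "Exchange", "nokia": "Telecom"}

exchange = apply_subtype({"name": "nasdaq", "type": "Company"}, subtypes)
company = apply_subtype({"name": "nokia", "type": "Company"}, subtypes)
```

The first annotation ends up with the type Exchange, while the second keeps the type Company and only gains the subtype attribute.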
The word lists used are:
Examples of the rules used:
Once we have come this far in the chain of processing a text, we should have the names recognized and classified. Each name has an annotation of type "name", that is tied to the right spot in the text and has attributes indicating the name's type, subtype and other characteristics.
In this section I explain the process of finding chains of coreferring expressions in the texts.
In the texts the name of a single entity can be written in several different forms. The module Briefs Coref fdg (figure 2.3.) tries to normalize the different coreferring expressions into a single consistent form. As the coreferences are found and the different forms of a name are unified, one of the attributes of the name annotations, namely "root", is used as the unifying element, i.e. two name annotations with matching root attributes refer to the same entity.
First the names that were not found in the custom lexicon are matched to the other names in the document by using the Unifier (from VTT), a fuzzy matcher that can manage typing errors. If a good enough match is found, Coref will change the root of the name to match the found referent.
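The Unifier itself is not described here, so the following sketch uses Python's standard `difflib` module as a stand-in to show the general idea of fuzzy matching against known names. The cutoff value and the function name are arbitrary choices for the example.

```python
import difflib


def best_match(name, known_names, cutoff=0.85):
    """Return the known name most similar to `name`, or None if none is close."""
    lowered = {known.lower(): known for known in known_names}
    hits = difflib.get_close_matches(
        name.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


known = ["GSM World Congress", "ObjectSwitch Corporation"]
misspelled = best_match("GS M World Congress", known)
unrelated = best_match("NetToll", known)
```

The misspelled event name from the second example text is matched to "GSM World Congress", while "NetToll" has no sufficiently similar known name and stays unmatched.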
The next step is to use heuristic rules to find the coreferences between names. As the chains of different names for the same entity are discovered the longest name in the chain is considered to be the root name.
Examples of the rules used:
If these do not give result to some name, Unifier is used again to match the name to the name lists in the database. If a good enough match is found, Coref will change the root form of the name to match the found referent.
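The choice of the longest name as the root can be illustrated with a sketch. The single rule used here, that a shorter name corefers with a longer one when its words form the beginning or the end of the longer name, is a hypothetical stand-in for the heuristic rules of Briefs Coref fdg.

```python
def corefers(short, long):
    """Hypothetical rule: the words of `short` begin or end the words of `long`."""
    s, l = short.lower().split(), long.lower().split()
    return 0 < len(s) < len(l) and (l[:len(s)] == s or l[-len(s):] == s)


def assign_roots(names):
    """Map every name to the longest name of its coreference chain."""
    parent = {}
    for name in names:
        longer = [other for other in names if corefers(name, other)]
        parent[name] = max(longer, key=len) if longer else name

    def root(name):
        # Follow the links until a name that points to itself is reached.
        while parent[name] != name:
            name = parent[name]
        return name

    return {name: root(name) for name in names}


roots = assign_roots(
    ["Tommy Morris", "Morris", "ObjectSwitch Corporation", "ObjectSwitch", "NetToll"]
)
```

With the names from the example texts, "Morris" gets the root "Tommy Morris" and "ObjectSwitch" the root "ObjectSwitch Corporation", whereas "NetToll" remains its own root.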
According to Ruslan Mitkov (1999) plenty of different algorithms for anaphora resolution have been developed, but all the approaches, ranging from traditional (from purely syntactic to highly semantic and pragmatic ones) to alternative (statistic, uncertainty-reasoning etc.) and knowledge-poor, offer only approximate solutions. Most of the systems deal with anaphors which have noun phrases as their antecedents. Identifying anaphors which have verb phrases, clauses, sentences or even discourse segments as antecedents is a more complicated task. Typically the system first regards all NP's preceding the anaphor at some predefined scope as possible candidates for antecedents. This set is then reduced by eliminating some candidates with various constraints (such as gender and number agreement, c-command constraints etc.) and the remaining set is rated, giving more preference to some candidates and less to others. For example the center of attention is preferred over other candidates, and candidates that are syntactically or semantically parallel to the anaphor over others.
All the different algorithms have different demands on the available knowledge. The BRIEFS system knows the syntactical structure of the text, but has very little semantic knowledge. Therefore it was easy to choose the general approach to be syntax-based. Of the many available algorithms we chose Lappin and Leass's (1994) RAP to be implemented.
Lappin and Leass's algorithm for anaphora resolution (RAP) is used to resolve the third person pronoun anaphora. It relies on measures of salience derived from the syntactic structure and a simple dynamic model of attentional state to select the antecedent NP of a pronoun from a list of candidates. It does not employ real world knowledge or semantic conditions beyond those implicit in grammatical number and gender.
The algorithm has four major parts
Once the antecedent has been selected, a check is made to see if the chosen NP has a named entity as its main token. If this is the case, a new name annotation is made with the span and the attributes of the pronoun except for the root attribute, which is inherited from the named entity. In every case an attribute that identifies the antecedent NP is added to the pronoun's token annotation.
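The general scheme of filtering candidates by agreement and ranking the rest by salience can be shown with a heavily simplified sketch. It is not the RAP algorithm itself: the syntactic filter and most salience factors are left out, and the position of a candidate inside the sentence is ignored. The weights are the ones usually cited from Lappin and Leass (1994) but should be treated as illustrative here, and the data layout is an assumption.

```python
SENTENCE_RECENCY = 100
ROLE_WEIGHTS = {"subj": 80, "obj": 50}  # subject and direct object emphasis


def agrees(pronoun, candidate):
    """Number must match; gender must not conflict when both are known."""
    genders = {pronoun.get("gender"), candidate.get("gender")} - {None}
    return pronoun["number"] == candidate["number"] and len(genders) <= 1


def salience(candidate, pronoun_sentence):
    """Sum the factors and halve the total for each sentence of distance."""
    score = SENTENCE_RECENCY + ROLE_WEIGHTS.get(candidate["role"], 0)
    return score / 2 ** (pronoun_sentence - candidate["sentence"])


def resolve_pronoun(pronoun, candidates):
    """Return the most salient agreeing candidate, or None."""
    pool = [
        c for c in candidates
        if c["sentence"] <= pronoun["sentence"] and agrees(pronoun, c)
    ]
    if not pool:
        return None
    return max(pool, key=lambda c: salience(c, pronoun["sentence"]))


candidates = [
    {"text": "channels", "number": "pl", "role": "other", "sentence": 0},
    {"text": "SSCs", "number": "pl", "role": "subj", "sentence": 1},
    {"text": "processing business", "number": "sg", "role": "obj", "sentence": 1},
]
they = {"text": "they", "number": "pl", "sentence": 1}
antecedent = resolve_pronoun(they, candidates)
```

For the pronoun "they" in the first example text, the singular candidate is filtered out and "SSCs" outranks the older candidate "channels".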
After this the next obvious thing to do would be to find the noun phrases that refer to a named entity. This however is more easily said than done. We would first have to know which NP's bring new referents to the discourse and which refer back to some earlier mentioned referent or some referent that can be known by world knowledge. Once an appropriate NP was found, a lot of semantic knowledge on the NP and all the possible antecedents would be needed to rule out the semantically incompatible antecedents. In addition to this, the antecedent may be something that is merely implied by the context and not really mentioned in the text. For these reasons there usually are no general mechanisms to calculate this in the existing IE systems. Recently, however, there have been some attempts to handle this task. For example Vieira & Poesio (2000) have presented a system that processes definite noun phrases.
In BRIEFS we have some tentative and experimental rules on the reference between NP's. Most notably, the words "company", "firm" and "corporation" can in some cases be taken to refer to the nearest company name in the text.
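A rule of this kind could look roughly like the sketch below. It assumes that name annotations carry a type, a root and a character offset, and it interprets "nearest" as the smallest absolute distance in characters. The conditions under which the real rule applies are not modelled.

```python
COMPANY_NOUNS = {"company", "firm", "corporation"}


def nearest_company_root(np_head, np_offset, name_annotations):
    """Return the root of the closest Company name for a generic company noun."""
    if np_head.lower() not in COMPANY_NOUNS:
        return None
    companies = [a for a in name_annotations if a["type"] == "Company"]
    if not companies:
        return None
    closest = min(companies, key=lambda a: abs(a["offset"] - np_offset))
    return closest["root"]


# Hypothetical annotations with invented offsets.
annotations = [
    {"root": "ObjectSwitch Corporation", "type": "Company", "offset": 0},
    {"root": "NetToll", "type": "Company", "offset": 400},
    {"root": "Cannes", "type": "City", "offset": 520},
]
referent = nearest_company_root("company", 450, annotations)
```

Here the noun "company" at offset 450 is tied to "NetToll", the closest annotation of type Company.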
At this point of processing all the coreferring name annotations we have managed to find in a text should have identical root attributes. In addition there are coref annotations with discontinuous (or multiple) spans, showing the chains of coreferring expressions more clearly.
All the above processing normalizes names inside a document. In order to accumulate information from masses of documents it is necessary to recognize which entity the names refer to even if the names differ between documents. As this task usually demands real world knowledge, it relies almost solely on the name lists created in Name Tool.
Two tables are made of the name database. The basic table has the names in the lower case as keys and the correct writing forms as values, the second table is similar except the keys are checked for corporate endings and these are dropped from the name. If the keys become ambiguous, they are deleted from the table. The name to be normalized is tested against the lists and step by step modified until a normalized form is found or the following steps have been run through.
The correct form is saved in the annotation as root attribute and the old root is saved as an alias attribute. If the name lists have been built correctly, the root attributes are now unambiguous references to known entities.
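The two lookup tables can be sketched as follows. The list of corporate endings, the sample entries and the two-step lookup are assumptions made for the illustration. The actual sequence of modification steps used in BRIEFS is not reproduced here.

```python
# Assumed list of corporate endings; the real list is not given in the text.
CORPORATE_ENDINGS = {"inc", "corp", "corporation", "ltd", "oy", "ab"}


def strip_ending(name):
    """Lower-case the name and drop a trailing corporate ending, if any."""
    words = name.lower().split()
    if len(words) > 1 and words[-1].strip(".,") in CORPORATE_ENDINGS:
        words = words[:-1]
    return " ".join(words).rstrip(",")


def build_tables(entries):
    """Build the basic table and the table with corporate endings dropped."""
    basic, stripped, ambiguous = {}, {}, set()
    for name, correct in entries:
        basic[name.lower()] = correct
        key = strip_ending(name)
        if key in stripped and stripped[key] != correct:
            ambiguous.add(key)  # the same key would point to two entities
        stripped[key] = correct
    for key in ambiguous:
        del stripped[key]
    return basic, stripped


def normalize(name, basic, stripped):
    """Try the exact lower-case key first, then the key without its ending."""
    key = name.lower()
    if key in basic:
        return basic[key]
    return stripped.get(strip_ending(name))


# Hypothetical name list entries: (name, correct writing form).
entries = [
    ("objectswitch corporation", "ObjectSwitch Corporation"),
    ("acme inc", "Acme Inc"),
    ("acme ltd", "Acme Ltd"),
]
basic, stripped = build_tables(entries)
```

With these entries, `normalize("ObjectSwitch Corp.", basic, stripped)` gives "ObjectSwitch Corporation", while the bare name "Acme" cannot be normalized because its stripped key is ambiguous.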
Now the references to named entities should have been tied to the system's general knowledge of the entities in the domain, making it possible to tie the information found in different documents into a more compact and complete form. But there is still the problem of the names and entities that are not listed in the database, e.g. new entities come into existence all the time. In its current state the system has no recollection of the names that it has encountered in previous texts, only the names listed in the database and the ones found in the current document. Therefore the new names can only be tied together inside a document but not between documents.
Figure 5.1. illustrates the problem of unknown names and multiple sources. In the figure there are two documents describing the same deal between two companies. One of the companies is known to BRIEFS, so its names in the different documents are unified; the other is not, and therefore its names remain separate. As the extraction rules find these deals, they are recorded in the evidence database as two different deals between the known partner and two new companies.
Figure 5.1. The problem of unknown names and multiple sources.
In BRIEFS all text is processed in the same way. In practice this often works poorly with special cases like quoted speech, headlines, tables and pictures.
Another name database of the guesses we have made earlier and the names we have encountered could be created. Summaries of these could be presented to the user as suggestions to be manually verified and added to the main name database. Even without human interference these could be used to normalize unknown names between documents. Information found this way could be treated as less reliable when the facts are pondered as evidence in the inference machine.
A more complete classification of underspecified names could improve results and is implementable. For example Cucchiarelli and Velardi (2002) have presented an algorithm that classifies unknown names based on the classification of known names and their contexts. If the classification of names were sound, the rules limited by class could be made more aggressive. On the other hand errors tend to accumulate and an unreliable classification would probably deteriorate the precision of the system. For the same reason improving the beginning of the processing chain could be profitable.
It would be useful to know the referents of noun phrases. However implementing a general system like Vieira & Poesio (2000) may be too laborious with respect to the resources of the BRIEFS project.
The FDG and its name recognition engine are under constant development.
There has been no evaluation of BRIEFS yet, as the project is still going on. In general we have aimed for precision rather than high recall or speed, and for things that work rather than for linguistically competent theory.
It took 32 hours 21 minutes to run a collection of 2472 documents (53413 lines of text in total) through the processing pipeline shown in figure 2.3. This averages about 47 seconds per document (Seitsonen, Lauri, p.c.). Compared to other IE systems BRIEFS is not very fast, but neither is it embarrassingly slow. For comparison, SRI's text understanding system TACITUS (Hobbs & al 1996) took 36 hours to process 100 documents in MUC-3. The same people later created a very effective IE system, FASTUS, which took 12 minutes to do the same. One reason why BRIEFS does not quite reach the same speed may be that, unlike most IE systems, BRIEFS has a full parser.
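The per-document averages behind this comparison can be checked with a short calculation. The figures are the ones quoted above; the function name `seconds_per_document` is introduced here only for illustration.

```python
def seconds_per_document(hours, minutes, documents):
    """Average processing time per document, in seconds."""
    return (hours * 3600 + minutes * 60) / documents


briefs = seconds_per_document(32, 21, 2472)   # BRIEFS test collection
tacitus = seconds_per_document(36, 0, 100)    # TACITUS in MUC-3
fastus = seconds_per_document(0, 12, 100)     # FASTUS on the same task

print(f"BRIEFS:  {briefs:.1f} s/document")    # 47.1
print(f"TACITUS: {tacitus:.1f} s/document")   # 1296.0
print(f"FASTUS:  {fastus:.1f} s/document")    # 7.2
```

On these numbers BRIEFS falls between the two SRI systems, although the runs used different collections and hardware, so the comparison is only indicative.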
Recognizing names and what they refer to is not an easy task. One only needs to think how often one has to ask in everyday life questions like "Mary who?", "You mean the redhead?" or "Is she the same...?" to realize this. Without asking and being answered we can only guess. Sometimes we go wrong in our guesses and so does the machine.
In this document I use the following concepts as they are defined in the TIPSTER Text Program architecture (TIPSTER: Glossary of terms) and/or Message Understanding Conferences (MUC: Definitions of terms used in Information Extraction).
Annotation - The additional information associated with a document or a collection. Under the TIPSTER concept annotations are the principal way components pass data between them. Annotations are usually the result of extraction processes; however, users may also create annotations. In BRIEFS each annotation spans some piece of the text, a word, a sentence, a paragraph or some linguistic or semantic structure.
Attribute - A characteristic of a collection, document or annotation represented by a single value or set of values. In BRIEFS attributes are properties of annotations, e.g. a word has a surface form and a base form, a verb group has tempus and voice etc.
Collection - A group of documents, usually with some characteristic(s) in common. Under TIPSTER the implementation of a Collection is broad and a Collection may be the actual documents (text) or a list of document identifiers (ID). A document may appear in more than one Collection.
Entity - an object of interest such as a person or organization.
Event - an activity or occurrence of interest such as a terrorist act or an airline crash.
Fact - a relationship held between two or more entities.
Information Extraction - Same as 'text extraction'. The selection of specific types of information from text, e.g., person names, place names, companies, organizations, temporal data, currency data, other entities, co-references, relationships between entities. The latter two items are more difficult.
Named Entity - a named object of interest such as a person, organization, or location.
Pattern - An expression of a specific form that is used for matching text (or annotations) during the extraction process. TIPSTER has a Pattern Specification Language which describes how to write rules to control extraction engines. In BRIEFS Lauri Seitsonen (2002b) has created the Briefs Common Pattern Specification Language (Briefs CPSL), which implements a subset of the CPSL language as defined in Cowie and Appelt (1998) and Appelt (1998), with some modifications.
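The relationship between documents, annotations and attributes described in the definitions above can be sketched as a small data structure. This is an illustrative sketch only: the class and field names (`Annotation`, `Document`, `kind`, `start`, `end`) are hypothetical and do not reproduce the TIPSTER architecture or the BRIEFS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span of text carrying a set of attributes."""
    kind: str      # e.g. "word", "sentence", "name"
    start: int     # offset of the first character
    end: int       # offset one past the last character
    attributes: dict = field(default_factory=dict)


@dataclass
class Document:
    """Raw text plus the annotations that components attach to it."""
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, kind, start, end, **attributes):
        ann = Annotation(kind, start, end, attributes)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]


doc = Document("Mary bought shares.")
name = doc.annotate("name", 0, 4, entity_type="person")
word = doc.annotate("word", 5, 11, surface="bought", base="buy")
print(doc.covered_text(name), name.attributes["entity_type"])
print(doc.covered_text(word), word.attributes["base"])
```

Because annotations refer to the text by offsets rather than copying it, several components can add their own layers to the same document without interfering with each other.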
As the terms anaphor, reference and coreference are often used inconsistently and interchangeably, I present here Rodger Kibble's and Kees van Deemter's definitions of them.
Anaphor - An item with little or no intrinsic meaning or reference which takes its interpretation from another item in the same sentence or discourse, its antecedent. For example, in "I asked Lisa to check the proofs, and she did it", the items "she" and "did it" are anaphors, taking their interpretations from their antecedents "Lisa" and "check the proofs" respectively.
Coreference - The relation which obtains between two NP's (usually two NP's in a single sentence) both of which are interpreted as referring to the same extralinguistic entity. In linguistic representations, coreference is conventionally denoted by coindexing.
Reference - The phenomenon by which some noun phrase in a particular utterance or sentence is associated with some entity in the real or conceptual world, its referent.
APPELT, D. 15.6.1999. "The Complete TextPro Reference Manual". http://www.ai.sri.com/~appelt/TextPro (read 2.5.2002)
CARLSON, L. 1988. "Questions of identity in discourse". Questions and Questioning. Ed. M. Meyer. Walter de Gruyter, Berlin/New York 1988, pp. 144-181.
CONEXOR OY. "Conexor Functional Dependency Grammar". Helsinki: CONEXOR OY. 2002a. http://www.conexoroy.com/fdg.htm (read 20.2.2002)
CONEXOR OY. "Output of Conexor analysers for English". Helsinki: CONEXOR OY. 2002b. http://www.conexoroy.com//docs/en-tags.html (read 20.2.2002)
COWIE, J. and APPELT, D. 16.3.1998. "Pattern Specification Language". TIPSTER Change Request, http://www-nlpir.nist.gov/related_projects/tipster/rfcs/rfc10 (read 26.3.2002)
CUCCHIARELLI, ALESSANDRO and VELARDI, PAOLA. 2002. "Unsupervised Named Entity Recognition Using Syntactic and Semantic Contextual Evidence". Computational Linguistics, MIT Press 2002, pp. 123-131
GATE homepage. University of Sheffield: Natural language processing group. http://gate.ac.uk/ (read 2.5.2002)
GRISHMAN, RALPH. 1997. "Information Extraction: Techniques and Challenges, Materials for Information Extraction" International Summer School SCIE-97, Ed. Maria Teresa Pazienza, Springer-Verlag. http://www.ling.helsinki.fi/courses/ctl310/IR/papers/grishman.ps (read 26.3.2002)
HOBBS, JERRY and APPELT, DOUGLAS and BEAR, JOHN and ISRAEL, DAVID and KAMEYAMA, MEGUMI and STICKEL, MARK and TYSON, MABRY. 1996. "FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text". Finite State Devices for Natural Language Processing, MIT Press. http://www.ai.sri.com/pubs/files/356.pdf (read 18.4.2002)
HOBBS, JERRY R. 31.7.2000. "Generic Information Extraction System". United States: National Institute of Standards and Technology. http://www-nlpir.nist.gov/related_projects/tipster/gen_ie.htm (read 20.2.2002)
KEIJOLA, MATTI 12.2.2002. "BRIEFS General Description". Espoo: HUT/TAI Research Centre. http://briefs.cs.hut.fi/phase4/General/index.html (read 3.5.2002)
KIBBLE, RODGER and VAN DEEMTER, KEES. 2000. "Proceedings of LREC2000: Coreference Annotation: Whither?". ELRA - European Language Resources Association. ftp://ftp.itri.bton.ac.uk/reports/ITRI-00-06.ps.gz (read 26.3.2002)
LAPPIN, SHALOM & LEASS, HERBERT. 1994. "An algorithm for pronominal anaphora resolution". Computational Linguistics, 20(4), pp. 535-561. http://www.wlv.ac.uk/~le1825/anaphora_resolution_papers/lappin_leass_94.ps (read 2.5.2002)
MITKOV, RUSLAN 1999. "Anaphora resolution: the state of the art". Working paper (Based on the COLING'98/ACL'98 tutorial on anaphora resolution). University of Wolverhampton, Wolverhampton. http://www.wlv.ac.uk/~le1825/anaphora_resolution_papers/state.ps (read 2.5.2002)
MUC-7, "Definitions of terms used in Information Extraction". 12.1.2001. United States: National Institute of Standards and Technology. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/info/definitions.html (read 26.3.2002)
MUC-7 (Message Understanding Conference 7), 1998. "MUC 7 Proceedings". United States: National Institute of Standards and Technology. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html (read 26.3.2002)
SEITSONEN, LAURI. 3.1.2002a. "NameTool". Espoo: HUT/TAI Research Centre. http://briefs.cs.hut.fi/phase4/Named_Entities/nametool.html (read 2.5.2002)
SEITSONEN, LAURI. 3.1.2002b. "BRIEFS Common Pattern Specification Language Grammar". Espoo: HUT/TAI Research Centre. http://briefs.cs.hut.fi/phase4/Information_Extraction/appendix_a.html (read 2.5.2002)
TIPSTER, "Tipster Glossary of terms". United States: National Institute of Standards and Technology. http://www-nlpir.nist.gov/related_projects/tipster/gloss.htm (read 20.2.2002)
TIPSTER homepage. United States: National Institute of Standards and Technology. http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/ (read 2.5.2002)
VIEIRA, R. and POESIO, M. 2000. "An empirically-based system for processing definite descriptions". Computational Linguistics v. 26, n.4. http://www.cogsci.ed.ac.uk/~poesio/publications/cl2000.ps (read 28.5.2002)
VOUTILAINEN, ATRO. 15.4.2002. "On recognition and classification of names in FDG". private e-mail.