
Chapter 6


Speech Recognition: javax.speech.recognition
 

A speech recognizer is a speech engine that converts speech to text. The javax.speech.recognition package defines the Recognizer interface to support speech recognition plus a set of supporting classes and interfaces. The basic functional capabilities of speech recognizers, some of the uses of speech recognition and some of the limitations of speech recognizers are described in Section 2.2.

As a type of speech engine, much of the functionality of a Recognizer is inherited from the Engine interface in the javax.speech package and from other classes and interfaces in that package. The javax.speech package and generic speech engine functionality are described in Chapter 4.

The Java Speech API is designed to keep simple speech applications simple and to make advanced speech applications possible for non-specialist developers. This chapter covers both the simple and advanced capabilities of the javax.speech.recognition package. Where appropriate, some of the more advanced sections are marked so that you can choose to skip them. We begin with a simple code example, and then review the speech recognition capabilities of the API in more detail through the following sections:

 


 

6.1     "Hello World!"

The following example shows a simple application that uses speech recognition. For this application we need to define a grammar of everything the user can say, and we need to write the Java software that performs the recognition task.

A grammar is provided by an application to a speech recognizer to define the words that a user can say, and the patterns in which those words can be spoken. In this example, we define a grammar that allows a user to say "Hello World" or a variant. The grammar is defined using the Java Speech Grammar Format. This format is documented in the Java Speech Grammar Format Specification.

Place this grammar into a file.


grammar javax.speech.demo;

public <sentence> = hello world
                  | good morning
                  | hello mighty computer;

This trivial grammar has a single public rule called "sentence". A rule defines what may be spoken by a user. A public rule is one that may be activated for recognition.

The following code shows how to create a recognizer, load the grammar, and then wait for the user to say something that matches the grammar. When it gets a match, it deallocates the engine and exits.


import javax.speech.*;
import javax.speech.recognition.*;
import java.io.FileReader;
import java.util.Locale;

public class HelloWorld extends ResultAdapter {
    static Recognizer rec;

    // Receives RESULT_ACCEPTED event: print it, clean up, exit
    public void resultAccepted(ResultEvent e) {
        Result r = (Result)(e.getSource());
        ResultToken tokens[] = r.getBestTokens();

        for (int i = 0; i < tokens.length; i++)
            System.out.print(tokens[i].getSpokenText() + " ");
        System.out.println();

        // Deallocate the recognizer and exit
        rec.deallocate();
        System.exit(0);
    }

    public static void main(String args[]) {
        try {
            // Create a recognizer that supports English.
            rec = Central.createRecognizer(
                new EngineModeDesc(Locale.ENGLISH));

            // Start up the recognizer
            rec.allocate();

            // Load the grammar from a file, and enable it
            FileReader reader = new FileReader(args[0]);
            RuleGrammar gram = rec.loadJSGF(reader);
            gram.setEnabled(true);

            // Add the listener to get results
            rec.addResultListener(new HelloWorld());

            // Commit the grammar
            rec.commitChanges();

            // Request focus and start listening
            rec.requestFocus();
            rec.resume();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This example illustrates the basic steps which all speech recognition applications must perform. Let's examine each step in detail.

 


 

6.2     Recognizer as an Engine

The basic functionality provided by a Recognizer includes grammar management and the production of results when a user says things that match active grammars. The Recognizer interface extends the Engine interface to provide this functionality.

The following list summarizes the functionality that the javax.speech.recognition package inherits from the javax.speech package and outlines some of the ways in which that functionality is specialized.

 


 

6.3     Recognizer State Systems

6.3.1     Inherited States

As mentioned above, a Recognizer inherits the basic state systems defined in the javax.speech package, particularly through the Engine interface. The basic engine state systems are described in Section 4.4. In this section the two state systems added for recognizers are described. These two state systems represent the status of recognition processing of audio input against grammars, and the recognizer focus.

As a summary, the following state system functionality is inherited from the javax.speech package.

The recognizer adds two sub-state systems to the ALLOCATED state, in addition to the inherited PAUSED and RESUMED sub-state system. The two new sub-state systems represent the current activities of the recognizer's internal processing (the LISTENING, PROCESSING and SUSPENDED states) and the current recognizer focus (the FOCUS_ON and FOCUS_OFF states).

These new sub-state systems are parallel states to the PAUSED and RESUMED states and operate nearly independently as shown in Figure 6-1 (an extension of Figure 4-2).

6.3.2     Recognizer Focus

The FOCUS_ON and FOCUS_OFF states indicate whether this instance of the Recognizer currently has the speech focus. Recognizer focus is a major determining factor in grammar activation, which, in turn, determines what the recognizer is listening for at any time. The role of recognizer focus in activation and deactivation of grammars is described in Section 6.4.3.

A change in engine focus is indicated by a RecognizerEvent (which extends EngineEvent) being issued to RecognizerListeners. A FOCUS_LOST event indicates a change in state from FOCUS_ON to FOCUS_OFF. A FOCUS_GAINED event indicates a change in state from FOCUS_OFF to FOCUS_ON.
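
For example, an application can monitor these focus transitions with a RecognizerListener. The following is a minimal sketch using the RecognizerAdapter convenience class; the class name FocusMonitor is illustrative, and the callback names assume the focusGained and focusLost methods of the RecognizerListener interface.

// Minimal sketch: report changes in recognizer focus.
class FocusMonitor extends RecognizerAdapter {

    // FOCUS_OFF -> FOCUS_ON
    public void focusGained(RecognizerEvent e) {
        System.out.println("Recognizer focus: ON");
    }

    // FOCUS_ON -> FOCUS_OFF
    public void focusLost(RecognizerEvent e) {
        System.out.println("Recognizer focus: OFF");
    }
}

// Attached to the recognizer with:
//     rec.addEngineListener(new FocusMonitor());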

When a Recognizer has focus, the FOCUS_ON bit is set in the engine state. When a Recognizer does not have focus, the FOCUS_OFF bit is set. The following code example tests and monitors the engine focus state:


Recognizer rec;

if (rec.testEngineState(Recognizer.FOCUS_ON)) {
    // we have focus so release it
    rec.releaseFocus();
}

// wait until we lose it
rec.waitEngineState(Recognizer.FOCUS_OFF);

Recognizer focus is relevant to computing environments in which more than one application is using an underlying recognition engine. For example, in a desktop environment a user might be running a single speech recognition product (the underlying engine), but have multiple applications using the speech recognizer as a resource. These applications may be a mixture of Java and non-Java applications. Focus is not usually relevant in a telephony environment or in other speech application contexts in which there is only a single application processing the audio input stream.

The recognizer's focus should track the application to which the user is currently talking. When a user indicates that they want to talk to an application (e.g., by selecting the application window, or explicitly saying "switch to application X"), the application requests speech focus by calling the requestFocus method of the Recognizer.

When speech focus is no longer required (e.g., the application has been iconized) the application should call the releaseFocus method to free up focus for other applications.
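
For example, an application with a graphical window might tie speech focus to window activation. The following is a minimal sketch using an AWT WindowListener; the class name and wiring are illustrative.

// Minimal sketch: follow window activation with speech focus.
class SpeechFocusTracker extends java.awt.event.WindowAdapter {
    private final Recognizer rec;

    SpeechFocusTracker(Recognizer rec) {
        this.rec = rec;
    }

    public void windowActivated(java.awt.event.WindowEvent e) {
        rec.requestFocus();     // the user is now talking to us
    }

    public void windowDeactivated(java.awt.event.WindowEvent e) {
        rec.releaseFocus();     // free focus for other applications
    }
}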

Both methods are asynchronous - they may return before the focus is gained or lost - since the focus change may be deferred. For example, if a recognizer is in the middle of recognizing some speech, it will typically defer the focus change until the result is completed. The focus events and the engine state monitoring methods can be used to determine when focus is actually gained or lost.

The focus policy is determined by the underlying recognition engine - it is not prescribed by the javax.speech.recognition package. In most operating environments it is reasonable to assume a policy in which the last application to request focus gets the focus.

Well-behaved applications adhere to the following convention to maximize recognition performance, to minimize their impact upon other applications and to maintain a satisfactory user interface experience. An application should only request focus when it is confident that the user's speech focus (attention) is directed towards it, and it should release focus when it is not required.

6.3.3     Recognition States

The most important (and most complex) state system of a recognizer represents the current recognition activity of the recognizer. An ALLOCATED Recognizer is always in one of the following three states:

This sub-state system is shown in Figure 6-1. The typical state cycle of a recognizer is triggered by user speech. The recognizer starts in the LISTENING state, moves to the PROCESSING state while a user speaks, moves to the SUSPENDED state once recognition of that speech is completed and while grammars are updated in response to user input, and finally returns to the LISTENING state.

In this first event cycle a Result is typically produced that represents what the recognizer heard. Each Result has a state system and the Result state system is closely coupled to this Recognizer state system. The Result state system is discussed in Section 6.7. Many applications (including the "Hello World!" example) do not care about the recognition state but do care about the simpler Result state system.

The other typical event cycle also starts in the LISTENING state. Upon receipt of a non-speech event (e.g., keyboard event, mouse click, timer event) the recognizer is suspended temporarily while grammars are updated in response to the event, and then the recognizer returns to listening.

Applications in which grammars are affected by more than speech events need to be aware of the recognition state system.

The following sections explain these event cycles in more detail and discuss why speech input events are different in some respects from other event types.

6.3.3.1     Speech Events vs. Other Events

A keyboard event, a mouse event, a timer event, and a socket event are all instantaneous in time - there is a defined instant at which they occur. The same is not true of speech, for two reasons.

Firstly, speech is a temporal activity. Speaking a sentence takes time. For example, a short command such as "reload this web page" will take a second or two to speak, thus, it is not instantaneous. At the start of the speech the recognizer changes state, and as soon as possible after the end of the speech the recognizer produces a result containing the spoken words.

Secondly, recognizers cannot always recognize words immediately when they are spoken and cannot determine immediately when a user has stopped speaking. The reasons for these technical constraints upon recognition are outside the scope of this guide, but knowing about them is helpful in using a recognizer. (Incidentally, the same principles are generally true of human perception of speech.)

A simple example of why recognizers cannot always respond immediately is listening to a currency amount. If the user says "two dollars" or says "two dollars, fifty cents" with a short pause after the word "dollars", the recognizer can't know immediately whether the user has finished speaking after "dollars". What a recognizer must do is wait a short period - usually less than a second - to see if the user continues speaking. A second is a long time for a computer and complications can arise if the user clicks a mouse or does something else in that waiting period. (Section 6.8 explains the time-out parameters that affect this delay.)

A further complication is introduced by the input audio buffering described in Section 6.3.

Putting all this together, there is a requirement for recognizers to explicitly represent internal state through the LISTENING, PROCESSING and SUSPENDED states.

6.3.3.2     Speech Input Event Cycle

The typical recognition state cycle for a Recognizer occurs as speech input occurs. Technically speaking, this cycle represents the recognition of a single Result. The result state system and result events are described in detail in Section 6.7. The cycle described here is a clockwise trip through the LISTENING, PROCESSING and SUSPENDED states of an ALLOCATED recognizer as shown in Figure 6-1.

The Recognizer starts in the LISTENING state with a certain set of grammars enabled and active. When incoming audio is detected that may match an active grammar, the Recognizer transitions from the LISTENING state to the PROCESSING state with a RECOGNIZER_PROCESSING event.

The Recognizer then creates a new Result object and issues a RESULT_CREATED event (a ResultEvent) to provide the result to the application. At this point the result is usually empty: it does not contain any recognized words. As recognition proceeds words are added to the result along with other useful information.

The Recognizer remains in the PROCESSING state until it completes recognition of the result. While in the PROCESSING state the Result may be updated with new information.

The recognizer indicates completion of recognition by issuing a RECOGNIZER_SUSPENDED event to transition from the PROCESSING state to the SUSPENDED state. Once in that state, the recognizer issues a result finalization event to ResultListeners (RESULT_ACCEPTED or RESULT_REJECTED event) to indicate that all information about the result is finalized (words, grammars, audio etc.).

The Recognizer remains in the SUSPENDED state until processing of the result finalization event is completed. Applications will often make grammar changes during the result finalization because the result causes a change in application state or context.

In the SUSPENDED state the Recognizer buffers incoming audio. This buffering allows a user to continue speaking without speech data being lost. Once the Recognizer returns to the LISTENING state the buffered audio is processed to give the user the perception of real-time processing.

Once the result finalization event has been issued to all listeners, the Recognizer automatically commits all grammar changes and issues a CHANGES_COMMITTED event to return to the LISTENING state. (It also issues GRAMMAR_CHANGES_COMMITTED events to GrammarListeners of changed grammars.) The commit applies all grammar changes made at any point up to the end of result finalization, such as changes made in the result finalization events.

The Recognizer is now back in the LISTENING state listening for speech that matches the new grammars.

In this event cycle the first two recognizer state transitions (marked by RECOGNIZER_PROCESSING and RECOGNIZER_SUSPENDED events) are triggered by user actions: starting and stopping speaking. The third state transition (CHANGES_COMMITTED event) is triggered programmatically some time after the RECOGNIZER_SUSPENDED event.

The SUSPENDED state serves as a temporary state in which recognizer configuration can be updated without losing audio data.
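
These recognizer state transitions can be observed with a RecognizerListener. The following is a minimal sketch using the RecognizerAdapter convenience class; the class name and printed output are illustrative, and the callback names assume the recognizerProcessing, recognizerSuspended and changesCommitted methods of the RecognizerListener interface.

// Minimal sketch: log the recognizer transitions of the speech input
// event cycle.
class CycleMonitor extends RecognizerAdapter {

    // LISTENING -> PROCESSING: speech that may match a grammar detected
    public void recognizerProcessing(RecognizerEvent e) {
        System.out.println("PROCESSING");
    }

    // PROCESSING -> SUSPENDED: recognition of the result is complete
    public void recognizerSuspended(RecognizerEvent e) {
        System.out.println("SUSPENDED");
    }

    // SUSPENDED -> LISTENING: grammar changes have been applied
    public void changesCommitted(RecognizerEvent e) {
        System.out.println("LISTENING");
    }
}

// Attached to the recognizer with:
//     rec.addEngineListener(new CycleMonitor());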

6.3.3.3     Non-Speech Event Cycle

For applications that deal only with spoken input the state cycle described above handles most normal speech interactions. For applications that handle other asynchronous input, additional state transitions are possible. Other types of asynchronous input include graphical user interface events (e.g., AWTEvent), timer events, multi-threading events, socket events and so on.

The cycle described here is a temporary transition from the LISTENING state to the SUSPENDED state and back, as shown in Figure 6-1.

When a non-speech event occurs which changes the application state or application data it may be necessary to update the recognizer's grammars. The suspend and commitChanges methods of a Recognizer are used to handle non-speech asynchronous events. The typical cycle for updating grammars in response to a non-speech asynchronous event is as follows.

Assume that the Recognizer is in the LISTENING state (the user is not currently speaking). As soon as the event is received, the application calls suspend to indicate that it is about to change grammars. In response, the recognizer issues a RECOGNIZER_SUSPENDED event and transitions from the LISTENING state to the SUSPENDED state.

With the Recognizer in the SUSPENDED state, the application makes all necessary changes to the grammars. (The grammar changes affected by this event cycle and the pending commit are described in Section 6.4.2.)

Once all grammar changes are completed the application calls the commitChanges method. In response, the recognizer applies the new grammars and issues a CHANGES_COMMITTED event to transition from the SUSPENDED state back to the LISTENING state. (It also issues GRAMMAR_CHANGES_COMMITTED events to all changed grammars.)

Finally, the Recognizer resumes recognition of the buffered audio and then live audio with the new grammars.
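
The following is a minimal sketch of this cycle in code. The variable names are illustrative, and the grammar change shown (disabling a grammar) is just one example of an update made while suspended.

Recognizer rec;
RuleGrammar gram;

// 1. Tell the recognizer we are about to change grammars:
//    it issues RECOGNIZER_SUSPENDED and buffers incoming audio.
rec.suspend();

// 2. Make the grammar changes while in the SUSPENDED state.
gram.setEnabled(false);

// 3. Apply the changes: the recognizer issues CHANGES_COMMITTED,
//    returns to the LISTENING state and processes the buffered audio.
rec.commitChanges();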

The suspend and commit process is designed to provide a number of features to application developers which help give users the perception of a responsive recognition system.

Because audio is buffered from the time of the asynchronous event to the time at which the CHANGES_COMMITTED occurs, the audio is processed as if the new grammars were applied exactly at the time of the asynchronous event. The user has the perception of real-time processing.

Although audio is buffered in the SUSPENDED state, applications should make grammar changes and call commitChanges as quickly as possible. This minimizes the amount of data in the audio buffer and hence the amount of time it takes for the recognizer to "catch up". It also minimizes the possibility of a buffer overrun.

Technically speaking, an application is not required to call suspend prior to calling commitChanges. If the suspend call is omitted, the Recognizer behaves as if suspend had been called immediately prior to calling commitChanges. However, an application that does not call suspend risks a commit occurring unexpectedly while it updates grammars with the effect of leaving grammars in an inconsistent state.

6.3.4     Interactions of State Systems

The three sub-state systems of an allocated recognizer (shown in Figure 6-1) normally operate independently. There are, however, some indirect interactions.

When a recognizer is paused, audio input is stopped. However, recognizers have a buffer between audio input and the internal process that matches audio against grammars, so recognition can continue temporarily after a recognizer is paused. In other words, a PAUSED recognizer may be in the PROCESSING state.

Eventually the audio buffer will empty. If the recognizer is in the PROCESSING state at that time then the result it is working on is immediately finalized and the recognizer transitions to the SUSPENDED state. Since a well-behaved application treats SUSPENDED state as a temporary state, the recognizer will eventually leave the SUSPENDED state by committing grammar changes and will return to the LISTENING state.

The PAUSED/RESUMED state of an engine is shared by multiple applications, so it is possible for a recognizer to be paused and resumed because of the actions of another application. Thus, an application should always leave its grammars in a state that would be appropriate for a RESUMED recognizer.

The focus state of a recognizer is independent of the PAUSED and RESUMED states. For instance, it is possible for a paused Recognizer to have FOCUS_ON. When the recognizer is resumed, it will have the focus and its grammars will be activated for recognition.

The focus state of a recognizer is very loosely coupled with the recognition state. An application that has no GLOBAL grammars (described in Section 6.4.3) will not receive any recognition results unless it has recognition focus.

 


 

6.4     Recognition Grammars

A grammar defines what a recognizer should listen for in incoming speech. Any grammar defines the set of tokens a user can say (a token is typically a single word) and the patterns in which those words are spoken.

The Java Speech API supports two types of grammars: rule grammars and dictation grammars. These grammars differ in how patterns of words are defined. They also differ in their programmatic use: a rule grammar is defined by an application, whereas a dictation grammar is defined by a recognizer and is built into the recognizer.

A rule grammar is provided by an application to a recognizer to define a set of rules that indicates what a user may say. Rules are defined by tokens, by references to other rules and by logical combinations of tokens and rule references. Rule grammars can be defined to capture a wide range of spoken input from users by the progressive combination of simple grammars and rules.

A dictation grammar is built into a recognizer. It defines a set of words (possibly tens of thousands of words) which may be spoken in a relatively unrestricted way. Dictation grammars are closest to the goal of unrestricted natural speech input to computers. Although dictation grammars are more flexible than rule grammars, recognition of rule grammars is typically faster and more accurate.

Support for a dictation grammar is optional for a recognizer. As Section 4.2 explains, an application that requires dictation functionality can request it when creating a recognizer.

A recognizer may have many rule grammars loaded at any time. However, the current Recognizer interface restricts a recognizer to a single dictation grammar. The technical reasons for this restriction are outside the scope of this guide.

6.4.1     Grammar Interface

The Grammar interface is the root interface that is extended by all grammars. The grammar functionality that is shared by all grammars is presented through this interface.

The RuleGrammar interface is an extension of the Grammar interface to support rule grammars. The DictationGrammar interface is an extension of the Grammar interface to support dictation grammars.

The following are the capabilities presented by the grammar interface:

6.4.2     Committing Changes

The Java Speech API supports dynamic grammars; that is, it supports the ability for an application to modify grammars at runtime. In the case of rule grammars any aspect of any grammar can be changed at any time.

After making any change to a grammar through the Grammar, RuleGrammar or DictationGrammar interfaces an application must commit the changes. This applies to changes in definitions of rules in a RuleGrammar, to changing context for a DictationGrammar, to changing the enabled state, or to changing the activation mode. (It does not apply to adding or removing a GrammarListener or ResultListener.)

Changes are committed by calling the commitChanges method of the Recognizer. The commit is required for changes to affect the recognition process: that is, the processing of incoming audio.

The commit changes mechanism has two important properties:

There is one instance in which changes are committed without an explicit call to the commitChanges method. Whenever a recognition result is finalized (completed), an event is issued to ResultListeners (it is either a RESULT_ACCEPTED or RESULT_REJECTED event). Once processing of that event is completed changes are normally committed. This supports the common situation in which changes are often made to grammars in response to something a user says.

The event-driven commit is closely linked to the underlying state system of a Recognizer. The state system for recognizers is described in detail in Section 6.3.

6.4.3     Grammar Activation

A grammar is active when the recognizer is matching incoming audio against that grammar to determine whether the user is saying anything that matches that grammar. When a grammar is inactive it is not being used in the recognition process.

Applications do not directly activate and deactivate grammars. Instead, they are provided with methods for (1) enabling and disabling a grammar, (2) setting the activation mode for each grammar, and (3) requesting and releasing the speech focus of a recognizer (as described in Section 6.3.2).

The enabled state of a grammar is set with the setEnabled method and tested with the isEnabled method. For programmers familiar with AWT or Swing, enabling a speech grammar is similar to enabling a graphical component.

Once enabled, certain conditions must be met for a grammar to be activated. The activation mode indicates when an application wants the grammar to be active. There are three activation modes: RECOGNIZER_FOCUS, RECOGNIZER_MODAL and GLOBAL. For each mode a certain set of activation conditions must be met for the grammar to be activated for recognition. The activation mode is managed with the setActivationMode and getActivationMode methods.

The enabled flag and the activation mode are both parameters of a grammar that need to be committed to take effect. As Section 6.4.2 described, changes need to be committed to affect the recognition process.
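
The following is a minimal sketch of enabling a grammar, setting its activation mode and committing the changes; the variable names are illustrative.

Recognizer rec;
RuleGrammar gram;    // previously loaded or created

// Enable the grammar and select the RECOGNIZER_FOCUS activation mode
gram.setEnabled(true);
gram.setActivationMode(Grammar.RECOGNIZER_FOCUS);

// Neither change affects recognition until it is committed
rec.commitChanges();

// With this mode, the grammar can be activated once the
// recognizer gains the speech focus
rec.requestFocus();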

Recognizer focus is a major determining factor in grammar activation and is relevant in computing environments in which more than one application is using an underlying recognition engine (e.g., desktop computing with multiple speech-enabled applications). Section 6.3.2 describes how applications can request and release focus and monitor focus through RecognizerEvents and the engine state methods.

Recognizer focus is used to turn on and off activation of grammars. The role of focus depends upon the activation mode. The three activation modes are described here in order from highest priority to lowest. An application should always use the lowest priority mode that is appropriate to its user interface functionality.

The current activation state of a grammar can be tested with the isActive method. Whenever a grammar's activation changes either a GRAMMAR_ACTIVATED or GRAMMAR_DEACTIVATED event is issued to each attached GrammarListener. A grammar activation event typically follows a RecognizerEvent that indicates a change in focus (FOCUS_GAINED or FOCUS_LOST), or a CHANGES_COMMITTED RecognizerEvent that indicates that a change in the enabled setting of a grammar has been applied to the recognition process.

An application may have zero, one or many grammars enabled at any time. Thus, an application may have zero, one or many grammars active at any time. As the conventions below indicate, well-behaved applications always minimize the number of active grammars.

The activation and deactivation of grammars is independent of the PAUSED and RESUMED states of the Recognizer. For instance, a grammar can be active even when a recognizer is PAUSED. However, when a Recognizer is paused, audio input to the Recognizer is turned off, so speech won't be detected. Keeping grammars active while paused is still useful, because when the recognizer is resumed, recognition against the active grammars immediately (and automatically) resumes.

Activating too many grammars and, in particular, activating multiple complex grammars has an adverse impact upon a recognizer's performance. In general terms, increasing the number of active grammars and increasing the complexity of those grammars can both lead to slower recognition response time, greater CPU load and reduced recognition accuracy (i.e., more mistakes).

Well-behaved applications adhere to the following conventions to maximize recognition performance and minimize their impact upon other applications:

 


 

6.5     Rule Grammars

6.5.1     Rule Definitions

A rule grammar is defined by a set of rules. These rules are defined by logical combinations of tokens to be spoken and references to other rules. The references may refer to other rules defined in the same rule grammar or to rules imported from other grammars.

Rule grammars follow the style and conventions of grammars in the Java Speech Grammar Format (defined in the Java Speech Grammar Format Specification). Any grammar defined in the JSGF can be converted to a RuleGrammar object. Any RuleGrammar object can be printed out in JSGF. (Note that conversion from JSGF to a RuleGrammar and back to JSGF will preserve the logic of the grammar but may lose comments and may change formatting.)

Since the RuleGrammar interface extends the Grammar interface, a RuleGrammar inherits the basic grammar functionality described in the previous sections (naming, enabling, activation etc.).

The easiest way to load a RuleGrammar, or a set of RuleGrammar objects, is from a Java Speech Grammar Format file or URL. The loadJSGF methods of the Recognizer perform this task. If multiple grammars must be loaded (where a grammar references one or more imported grammars), importing by URL is most convenient. The application must specify the base URL and the name of the root grammar to be loaded.


Recognizer rec;

URL base = new URL("http://www.acme.com/app");
String grammarName = "com.acme.demo";

Grammar gram = rec.loadJSGF(base, grammarName);

The recognizer converts the base URL and grammar name to a URL using the same conventions as ClassLoader (the Java platform mechanism for loading class files). By converting the periods in the grammar name to slashes ('/'), appending a ".gram" suffix and combining with the base URL, the location is "http://www.acme.com/app/com/acme/demo.gram".

If the demo grammar imports sub-grammars, they will be loaded automatically using the same location mechanism.

Alternatively, a RuleGrammar can be created by calling the newRuleGrammar method of a Recognizer. This method creates an empty grammar with a specified grammar name.

Once a RuleGrammar has been loaded, or has been created with the newRuleGrammar method, the following methods of a RuleGrammar are used to create, modify and manage the rules of the grammar.

Table 6-1 RuleGrammar methods for Rule management
Name  Description  
setRule  Assign a Rule object to a rulename.  
getRule  Return the Rule object for a rulename.  
getRuleInternal  Return a reference to the recognizer's internal Rule object for a rulename (for fast, read-only access).  
listRuleNames  List known rulenames.  
isRulePublic  Test whether a rulename is public.  
deleteRule  Delete a rule.  
setEnabled  Enable and disable this RuleGrammar or rules of the grammar.  
isEnabled  Test whether a RuleGrammar or a specified rule is enabled.  

Any of the methods of RuleGrammar that affect the grammar (setRule, deleteRule, setEnabled etc.) take effect only after they are committed (as described in Section 6.4.2).

The rule definitions of a RuleGrammar can be considered as a collection of named Rule objects. Each Rule object is referenced by its rulename (a String). The different types of Rule object are described in Section 6.5.3.

Unlike most collections in Java, the RuleGrammar is a collection that does not share objects with the application. This is because recognizers often need to perform special processing of the rule objects and store additional information internally. The implication for applications is that a call to setRule is required to change any rule. The following code shows an example where changing a rule object does not affect the grammar.


RuleGrammar gram;

// Create a rule for the word blue
// Add the rule to the RuleGrammar and make it public
RuleToken word = new RuleToken("blue");
gram.setRule("ruleName", word, true);

// Change the word
word.setText("green");

// getRule returns blue (not green)
System.out.println(gram.getRule("ruleName"));

To ensure that the changed "green" token is loaded into the grammar, the application must call setRule again after changing the word to "green". Furthermore, for either change to take effect in the recognition process, the changes need to be committed (see Section 6.4.2).
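
Continuing the example, the following is a minimal sketch of that pattern; the Recognizer variable is illustrative.

Recognizer rec;      // the recognizer that owns gram

// Reload the changed rule, then commit so that the change
// reaches the recognition process.
word.setText("green");
gram.setRule("ruleName", word, true);
rec.commitChanges();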

6.5.2     Imports

Complex systems of rules are most easily built by dividing the rules into multiple grammars. For example, a grammar could be developed for recognizing numbers. That grammar could then be imported into two separate grammars that define dates and currency amounts. Those two grammars could then be imported into a travel booking application and so on. This type of hierarchical grammar construction is similar in many respects to object-oriented programming and shares the advantage of easy reuse of grammars.

An import declaration in JSGF and an import in a RuleGrammar are most similar to the import statement of the Java programming language. Unlike a "#include" in the C programming language, the imported grammar is not copied; it is simply made referenceable. (A full specification of import semantics is provided in the Java Speech Grammar Format specification.)

The RuleGrammar interface defines three methods for handling imports as shown in Table 6-2.

Table 6-2 RuleGrammar import methods
Name  Description  
addImport  Add a grammar or rule for import.  
removeImport  Remove the import of a rule or grammar.  
getImports  Return a list of all imported grammars or all rules imported from a specific grammar.  

The resolve method of the RuleGrammar interface is useful in managing imports. Given any rulename, the resolve method returns an object that represents the fully-qualified rulename for the rule that it references.
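
The following is a minimal sketch of import management and rulename resolution. The grammar name com.acme.numbers and the <number> rule are illustrative, and the sketch assumes that both methods take a RuleName object.

RuleGrammar gram;

// Import the <number> rule from another grammar (names illustrative)
gram.addImport(new RuleName("com.acme.numbers.number"));

// Resolve the simple rulename <number> to its fully-qualified form
RuleName fullName = gram.resolve(new RuleName("number"));
System.out.println(fullName);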

6.5.3     Rule Classes

A RuleGrammar is primarily a collection of defined rules. The programmatic rule structure used to control Recognizers follows exactly the definition of rules in the Java Speech Grammar Format. Any rule is defined by a Rule object. It may be any one of the Rule classes described in Table 6-3. The exceptions are the RuleParse class, which is returned by the parse method of RuleGrammar, and the Rule class, which is an abstract class and the parent of all other Rule objects.

Table 6-3 Rule objects
Name  Description  
Rule  Abstract root object for rules.  
RuleName  Rule that references another defined rule. JSGF example: <ruleName>  
RuleToken  Rule consisting of a single speakable token (e.g. a word). JSGF examples: elephant, "New York"  
RuleSequence  Rule consisting of a sequence of sub-rules. JSGF example: buy <number> shares of <company>  
RuleAlternatives  Rule consisting of a set of alternative sub-rules. JSGF example: green | red | yellow  
RuleCount  Rule containing a sub-rule that may be spoken optionally, zero or more times, or one or more times. JSGF examples: <color>*, [optional]  
RuleTag  Rule that attaches a tag to a sub-rule. JSGF example: {action=open}  
RuleParse  Special rule object used to represent results of a parse.  

The following is an example of a grammar in Java Speech Grammar Format. The "Hello World!" example shows how this JSGF grammar can be loaded from a text file. Below we consider how to create the same grammar programmatically.


grammar com.sun.speech.test;

public <test> = [a] test {TAG} | another <rule>;
<rule> = word;

The following code shows the simplest way to create this grammar. It uses the ruleForJSGF method to convert partial JSGF text to a Rule object. Partial JSGF is defined as any legal JSGF text that may appear on the right hand side of a rule definition - technically speaking, any legal JSGF rule expansion.


Recognizer rec;

// Create a new grammar
RuleGrammar gram = rec.newRuleGrammar("com.sun.speech.test");

// Create the <test> rule
Rule test = gram.ruleForJSGF("[a] test {TAG} | another <rule>");

gram.setRule("test",      // rulename
             test,        // rule definition
             true);       // true -> make it public

// Create the <rule> rule
gram.setRule("rule", gram.ruleForJSGF("word"), false);

// Commit the grammar
rec.commitChanges();

6.5.3.1     Advanced Rule Programming

In advanced programs there is often a need to define rules using the set of Rule objects described above. For these applications, using rule objects is more efficient than creating a JSGF string and using the ruleForJSGF method.

To create a rule by code, the detailed structure of the rule needs to be understood. At the top level of our example grammar, the <test> rule is an alternative: the user may say something that matches "[a] test {TAG}" or say something matching "another <rule>". The two alternatives are each sequences containing two items. In the first alternative, the brackets around the token "a" indicate it is optional. The "{TAG}" following the second token ("test") attaches a tag to the token. The second alternative is a sequence with a token ("another") and a reference to another rule ("<rule>").

The code to construct this Grammar follows (this code example is not compact - it is written for clarity of details).


Recognizer rec;
RuleGrammar gram = rec.newRuleGrammar("com.sun.speech.test");

// Rule we are building
RuleAlternatives test;

// Temporary rules
RuleCount r1;
RuleTag r2;
RuleSequence seq1, seq2;

// Create "[a]"
r1 = new RuleCount(new RuleToken("a"), RuleCount.OPTIONAL);

// Create "test {TAG}" - a tagged token
r2 = new RuleTag(new RuleToken("test"), "TAG");

// Join "[a]" and "test {TAG}" into a sequence "[a] test {TAG}"
seq1 = new RuleSequence(r1);
seq1.append(r2);

// Create the sequence "another <rule>";
seq2 = new RuleSequence(new RuleToken("another"));
seq2.append(new RuleName("rule"));

// Build "[a] test {TAG} | another <rule>"
test = new RuleAlternatives(seq1);
test.append(seq2);

// Add <test> to the RuleGrammar as a public rule
gram.setRule("test", test, true);

// Provide the definition of <rule>, a non-public RuleToken
gram.setRule("rule", new RuleToken("word"), false);

// Commit the grammar changes
rec.commitChanges();

6.5.4     Dynamic Grammars

Grammars may be modified and updated. The changes allow an application to account for shifts in the application's context, changes in the data available to it, and so on. This flexibility allows application developers considerable freedom in creating dynamic and natural speech interfaces.

For example, in an email application the list of known users may change during the normal operation of the program. The <sendEmail> command,

<sendEmail> = send email to <user>;

references the <user> rule which may need to be changed as new email arrives. This code snippet shows the update and commit of a change in users.


Recognizer rec;
RuleGrammar gram;

String names[] = {"amy", "alan", "paul"};
Rule userRule = new RuleAlternatives(names);

gram.setRule("user", userRule, false);

// apply the changes
rec.commitChanges();

Committing grammar changes can, in certain cases, be a slow process. It might take a few tenths of a second or up to several seconds. The time to commit changes depends on a number of factors. First, recognizers have different mechanisms for committing changes, making some recognizers faster than others. Second, the time to commit changes may depend on the extent of the changes - more changes may require more time to commit. Third, the time to commit may depend upon the type of changes. For example, some recognizers optimize for changes to lists of tokens (e.g. name lists). Finally, faster computers make changes more quickly.

The other factor which influences dynamic changes is the timing of the commit. As Section 6.4.2 describes, grammar changes are not always committed instantaneously. For example, if the recognizer is busy recognizing speech (in the PROCESSING state), then the commit of changes is deferred until the recognition of that speech is completed.

6.5.5     Parsing

Parsing is the process of matching text to a grammar. Applications use parsing to break down spoken input into a form that is more easily handled in software. Parsing is most useful when the structure of the grammars clearly separates the parts of spoken text that an application needs to process. Examples are given below of this type of structuring.

The text may be in the form of a String or array of String objects (one String per token), or in the form of a FinalRuleResult object that represents what a recognizer heard a user say. The RuleGrammar interface defines three forms of the parse method - one for each form of text.

The parse method returns a RuleParse object (a descendent of Rule) that represents how the text matches the RuleGrammar. The structure of the RuleParse object mirrors the structure of rules defined in the RuleGrammar. Each Rule object in the structure of the rule being parsed against is mirrored by a matching Rule object in the returned RuleParse object.

The difference between the structures comes about because the text being parsed defines a single phrase that a user has spoken whereas a RuleGrammar defines all the phrases the user could say. Thus the text defines a single path through the grammar and all the choices in the grammar (alternatives, and rules that occur optionally or occur zero or more times) are resolved in the parse.

The mapping between the objects in the rules defined in the RuleGrammar and the objects in the RuleParse structure is shown in Table 6-4. Note that except for the RuleCount and RuleName objects (marked with "**"), the objects in the parse tree are of the same type as the rule objects being parsed against, but the internal data may differ.

Table 6-4 Matching Rule definitions and RuleParse objects
Object in definition  Matching object in RuleParse  
RuleToken  Maps to an identical RuleToken object.  
RuleTag  Maps to a RuleTag object with the same tag and with the contained rule mapped according to its rule type.  
RuleSequence  Maps to a RuleSequence object with identical length and with each rule in the sequence mapped according to its rule type.  
RuleAlternatives  Maps to a RuleAlternatives object containing a single item which is the one rule in the set of alternatives that was spoken.  
RuleCount **  Maps to a RuleSequence object containing an item for each time the rule contained by the RuleCount object is spoken. The sequence may have a length of zero, one or more.  
RuleName **  Maps to a RuleParse object with the name in the RuleName object being the fully-qualified version of the original rulename, and with the Rule object contained by the RuleParse object being an appropriate match of the definition of RuleName.  

As an example, take the following simple extract from a grammar. The public rule, <command>, may be spoken in many ways. For example, "open", "move that door" or "close that door please".


public <command> = <action> [<object>] [<polite>];
<action> = open {OP} | close {CL} | move {MV};
<object> = [<this_that_etc>] window | door;
<this_that_etc> = a | the | this | that | the current;
<polite> = please | kindly;

Note how the rules are defined to clearly separate the segments of spoken input that an application must process. Specifically, the <action> and <object> rules indicate how an application must respond to a command. Furthermore, anything said that matches the <polite> rule can be safely ignored, and usually the <this_that_etc> rule can be ignored too.
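
As a minimal sketch, a string of spoken text can be parsed against the public <command> rule of this grammar as follows; the variable names are illustrative and the sketch assumes the string form of parse takes the text and the rulename.

RuleGrammar gram;    // contains the <command> grammar above

try {
    // Parse spoken text against the <command> rule
    RuleParse p = gram.parse("close that door please", "command");

    // The RuleParse structure can now be examined to find the
    // <action> and <object> components the application cares about
    System.out.println(p);
} catch (GrammarException e) {
    e.printStackTrace();
}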

The parse for "open" against <command> has the following structure which matches the structure of the grammar above.


RuleParse(<command> =
  RuleSequence(
    RuleParse(<action> =
      RuleAlternatives(
        RuleTag(
          RuleToken("open"), "OP")))))

The match of the <command> rule is represented by a RuleParse object. Because the definition of <command> is a sequence of 3 items (2 of which are optional), the parse of <command> is a sequence. Because only one of the 3 items is spoken (in "open"), the sequence contains a single item. That item is the parse of the <action> rule.

The reference to <action> in the definition of <command> is represented by a RuleName object in the grammar definition, and this maps to a RuleParse object when parsed. The <action> rule is defined by a set of three alternatives (RuleAlternatives object) which maps to another RuleAlternatives object in the parse but with only the single spoken alternative represented. Since the phrase spoken was "open", the parse matches the first of the three alternatives which is a tagged token. Therefore the parse includes a RuleTag object which contains a RuleToken object for "open".

The following is the parse for "close that door please".


RuleParse(<command> =
  RuleSequence(
    RuleParse(<action> =
      RuleAlternatives(
        RuleTag(
          RuleToken("close"), "CL")))
    RuleSequence(
      RuleParse(<object> =
        RuleSequence(
          RuleSequence(
            RuleParse(<this_that_etc> =
              RuleAlternatives(
                RuleToken("that"))))
          RuleAlternatives(
            RuleToken("door")))))
    RuleSequence(
      RuleParse(<polite> =
        RuleAlternatives(
          RuleToken("please"))))
  ))

There are three parsing issues that application developers should consider.

 


 

6.6     Dictation Grammars

Dictation grammars come closest to the ultimate goal of a speech recognition system that takes natural spoken input and transcribes it as text. Dictation grammars are used for free text entry in applications such as email and word processing.

A Recognizer that supports dictation provides a single DictationGrammar which is obtained from the recognizer's getDictationGrammar method. A recognizer that supports the Java Speech API is not required to provide a DictationGrammar. Applications that require a recognizer with dictation capability can explicitly request dictation when creating a recognizer by setting the DictationGrammarSupported property of the RecognizerModeDesc to true (see Section 4.2 for details).
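 
The following is a minimal sketch of such a request, using the required-properties mechanism described in Section 4.2; the property setter calls shown are assumptions based on that mechanism.

// Minimal sketch: request an English-language recognizer that
// supports dictation (property setters as described in Section 4.2).
RecognizerModeDesc required = new RecognizerModeDesc();
required.setLocale(Locale.ENGLISH);
required.setDictationGrammarSupported(Boolean.TRUE);

Recognizer rec = Central.createRecognizer(required);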

A DictationGrammar is more complex than a rule grammar, but fortunately, a DictationGrammar is often easier to use than a rule grammar. This is because the DictationGrammar is built into the recognizer so most of the complexity is handled by the recognizer and hidden from the application. However, recognition of a dictation grammar is typically more computationally expensive and less accurate than that of simple rule grammars.

The DictationGrammar inherits its basic functionality from the Grammar interface. That functionality is detailed in Section 6.4 and includes grammar naming, enabling, activation, committing and so on.

As with all grammars, changes to a DictationGrammar need to be committed before they take effect. Commits are described in Section 6.4.2.

In addition to the specific functionality described below, a DictationGrammar is typically adaptive. In an adaptive system, a recognizer improves its performance (accuracy and possibly speed) by adapting to the style of language used by a speaker. The recognizer may adapt to the specific sounds of a speaker (the way they say words). Equally importantly for dictation, a recognizer can adapt to a user's normal vocabulary and to the patterns of those words. Such adaptation (technically known as language model adaptation) is a part of the recognizer's implementation of the DictationGrammar and does not affect an application. The adaptation data for a dictation grammar is maintained as part of a speaker profile (see Section 6.9).

The DictationGrammar extends and specializes the Grammar interface by adding the following functionality:

The following methods provided by the DictationGrammar interface allow an application to manage word lists and text context.

Table 6-5 DictationGrammar interface methods
Name  Description  
setContext  Provide the recognition engine with the preceding and following textual context.  
addWord  Add a word to the DictationGrammar.  
removeWord  Remove a word from the DictationGrammar.  
listAddedWords  List the words that have been added to the DictationGrammar.  
listRemovedWords  List the words that have been removed from the DictationGrammar.  

6.6.1     Dictation Context

Dictation recognizers use a range of information to improve recognition accuracy. Learning the words a user speaks and the patterns of those words can substantially improve accuracy.

Because patterns of words are important, context is important. The context of a word is simply the set of surrounding words. As an example, consider the following sentence "If I have seen further it is by standing on the shoulders of Giants" (Sir Isaac Newton). If we are editing this sentence and place the cursor after the word "standing" then the preceding context is "...further it is by standing" and the following context is "on the shoulders of Giants...".

Given this context, the recognizer is able to more reliably predict what a user might say, and greater predictability can improve recognition accuracy. In this example, the user might insert the word "up" but is less likely to insert the word "JavaBeans".

Through the setContext method of the DictationGrammar interface, an application should tell the recognizer the current textual context. Furthermore, if the context changes (for example, due to a mouse click to move the cursor) the application should update the context.

Different recognizers process context differently. The main consideration for the application is the amount of context to provide to the recognizer. As a minimum, a few words of preceding and following context should be provided. However, some recognizers may take advantage of several paragraphs or more.

There are two setContext methods:

void setContext(String preceding, String following);
void setContext(String preceding[], String following[]);

The first form takes plain text context strings. The second version should be used when the result tokens returned by the recognizer are available. Internally, the recognizer processes context according to tokens, so providing tokens makes the use of context more efficient and more reliable because the recognizer does not have to guess the tokenization.
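
The following is a minimal sketch of context updates for the Newton sentence above; the variable names and the exact amount of context passed are illustrative.

DictationGrammar dictation;

// Cursor placed after the word "standing": pass plain text context
dictation.setContext("If I have seen further it is by standing",
                     "on the shoulders of Giants");

// If finalized result tokens are available, pass tokens instead
String preceding[] = {"it", "is", "by", "standing"};
String following[] = {"on", "the", "shoulders", "of", "Giants"};
dictation.setContext(preceding, following);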

 


 

6.7     Recognition Results

A recognition result is provided by a Recognizer to an application when the recognizer "hears" incoming speech that matches an active grammar. The result tells the application what words the user said and provides a range of other useful information, including alternative guesses and audio data.

In this section, both the basic and advanced capabilities of the result system in the Java Speech API are described. The sections relevant to basic rule grammar-based applications are those that cover result finalization (Section 6.7.1), the hierarchy of result interfaces (Section 6.7.2), the data provided through those interfaces (Section 6.7.3), and common techniques for handling finalized rule results (Section 6.7.9).

For dictation applications the relevant sections include those listed above plus the sections covering token finalization (Section 6.7.8), handling of finalized dictation results (Section 6.7.10) and result correction and training (Section 6.7.12).

For more advanced applications relevant sections might include the result life cycle (Section 6.7.4), attachment of ResultListeners (Section 6.7.5), the relationship of recognizer and result states (Section 6.7.6), grammar finalization (Section 6.7.7), result audio (Section 6.7.11), rejected results (Section 6.7.13), result timing (Section 6.7.14), and the loading and storing of vendor formatted results (Section 6.7.15).

6.7.1     Result Finalization

The "Hello World!" example illustrates the simplest way to handle results. In that example, a RuleGrammar was loaded, committed and enabled, and a ResultListener was attached to a Recognizer to receive events associated with every result that matched that grammar. In other words, the ResultListener was attached to receive information about words spoken by a user that is heard by the recognizer.

The following is a modified extract of the "Hello World!" example to illustrate the basics of handling results. In this case, a ResultListener is attached to a Grammar (instead of a Recognizer) and it prints out everything the recognizer hears that matches that grammar. (There are, in fact, three ways in which a ResultListener can be attached: see Section 6.7.5.)


import javax.speech.*;
import javax.speech.recognition.*;

public class MyResultListener extends ResultAdapter {
    // Receives RESULT_ACCEPTED event: print it
    public void resultAccepted(ResultEvent e) {
        Result r = (Result)(e.getSource());
        ResultToken tokens[] = r.getBestTokens();

        for (int i = 0; i < tokens.length; i++)
            System.out.print(tokens[i].getSpokenText() + " ");
        System.out.println();
    }

    // somewhere in app, add a ResultListener to a grammar
    {
        RuleGrammar gram = ...;
        gram.addResultListener(new MyResultListener());
    }
}

The code shows the MyResultListener class which is an extension of the ResultAdapter class. The ResultAdapter class is a convenience implementation of the ResultListener interface (provided in the javax.speech.recognition package). When extending the ResultAdapter class we simply implement the methods for the events that we care about.

In this case, the RESULT_ACCEPTED event is handled. This event is issued to the resultAccepted method of the ResultListener and is issued when a result is finalized. Finalization of a result occurs after a recognizer has completed processing of the result. More specifically, finalization occurs when all information about a result has been produced by the recognizer and when the recognizer can guarantee that the information will not change. (Result finalization should not be confused with object finalization in the Java programming language in which objects are cleaned up before garbage collection.)

There are actually two ways to finalize a result, which are signalled by the RESULT_ACCEPTED and RESULT_REJECTED events. A result is accepted when a recognizer is confident that it has correctly heard the words spoken by a user (i.e., the tokens in the Result exactly represent what a user said).

Rejection occurs when a Recognizer is not confident that it has correctly recognized a result: that is, the tokens and other information in the result do not necessarily match what a user said. Many applications will ignore the RESULT_REJECTED event and most will ignore the detail of a result when it is rejected. In some applications, a RESULT_REJECTED event is used simply to provide users with feedback that something was heard but no action was taken, for example, by displaying "???" or sounding an error beep. Rejected results and the differences between accepted and rejected results are described in more detail in Section 6.7.13.

An accepted result is not necessarily a correct result. As is pointed out in Section 2.2.3, recognizers make errors when recognizing speech for a range of reasons. The implication is that even for an accepted result, application developers should consider the potential impact of a misrecognition. Where a misrecognition could cause an action with serious consequences or could make changes that can't be undone (e.g., "delete all files"), the application should check with users before performing the action. As recognition systems continue to improve the number of errors is steadily decreasing, but as with human speech recognition there will always be a chance of a misunderstanding.

6.7.2     Result Interface Hierarchy

A finalized result can include a considerable amount of information. This information is provided through four separate interfaces and through the implementation of these interfaces by a recognition system.


// Result: the root result interface
interface Result;

// FinalResult: info on all finalized results
interface FinalResult extends Result;

// FinalRuleResult: a finalized result matching a RuleGrammar
interface FinalRuleResult extends FinalResult;

// FinalDictationResult: a final result for a DictationGrammar
interface FinalDictationResult extends FinalResult;

// A result implementation provided by a Recognizer
public class EngineResult
    implements FinalRuleResult, FinalDictationResult;

At first sight, the result interfaces may seem complex. The reasons for providing several interfaces are as follows:

The multitude of interfaces is, in fact, designed to simplify application programming and to minimize the chance of introducing bugs into code by allowing compile-time checking of result calls. The two basic principles for calling the result interfaces are the following:

  1. If it is safe to call the methods of a particular interface then it is safe to call the methods of any of the parent interfaces. For example, for a finalized result matching a RuleGrammar, the methods of the FinalRuleResult interface are safe, so the methods of the FinalResult and Result interfaces are also safe. Similarly, for a finalized result matching a DictationGrammar, the methods of FinalDictationResult, FinalResult and Result can all be called safely.
  2. Use type casting of a result object to ensure compile-time checks of method calls. For example, in events to an unfinalized result, cast the result object to the Result interface. For a RESULT_ACCEPTED finalization event with a result that matches a DictationGrammar, cast the result to the FinalDictationResult interface.
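
The following is a minimal sketch of this casting convention inside a finalization handler; the surrounding listener class is omitted and the grammar test shown is one possible approach.

// Minimal sketch: cast a finalized result according to the type of
// grammar it matches.
public void resultAccepted(ResultEvent e) {
    Result r = (Result) e.getSource();

    if (r.getGrammar() instanceof RuleGrammar) {
        FinalRuleResult frr = (FinalRuleResult) r;
        // FinalRuleResult, FinalResult and Result methods are safe
    } else {
        FinalDictationResult fdr = (FinalDictationResult) r;
        // FinalDictationResult, FinalResult and Result methods are safe
    }
}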

In the next section the different information available through the different interfaces is described. In all the following sections that deal with result states and result events, details are provided on the appropriate casting of result objects.

6.7.3     Result Information

As the previous section describes, different information is available for a result depending upon the state of the result and, for finalized results, depending upon the type of grammar it matches (RuleGrammar or DictationGrammar).

6.7.3.1     Result Interface

The information available through the Result interface is available for any result in any state - finalized or unfinalized - and matching any grammar.

In addition to the information detailed above, the Result interface provides the addResultListener and removeResultListener methods which allow a ResultListener to be attached to and removed from an individual result. ResultListener attachment is described in more detail in Section 6.7.5.

6.7.3.2     FinalResult Interface

The information available through the FinalResult interface is available for any finalized result, including results that match either a RuleGrammar or DictationGrammar.

6.7.3.3     FinalDictationResult Interface

The FinalDictationResult interface contains a single method.

6.7.3.4     FinalRuleResult Interface

Like the FinalDictationResult interface, the FinalRuleResult interface provides alternative guesses. The FinalRuleResult interface also provides some additional information that is useful in processing results that match a RuleGrammar.

6.7.4     Result Life Cycle

A Result is produced in response to a user's speech. Unlike keyboard input, mouse input and most other forms of user input, speech is not instantaneous (see Section 6.3.3.1 for more detail). As a consequence, a speech recognition result is not produced instantaneously. Instead, a Result is produced through a sequence of events starting some time after a user starts speaking and usually finishing some time after the user stops speaking.

Figure 6-2 shows the state system of a Result and the associated ResultEvents. As in the recognizer state diagram (Figure 6-1), the blocks represent states, and the labelled arcs represent transitions that are signalled by ResultEvents.

Every result starts in the UNFINALIZED state when a RESULT_CREATED event is issued. While unfinalized, the recognizer provides information including finalized and unfinalized tokens and the identity of the grammar matched by the result. As this information is added, the RESULT_UPDATED and GRAMMAR_FINALIZED events are issued.

Once all information associated with a result is finalized, the entire result is finalized. As Section 6.7.1 explained, a result is finalized with either a RESULT_ACCEPTED or RESULT_REJECTED event placing it in either the ACCEPTED or REJECTED state. At that point all information associated with the result becomes available including the best guess tokens and the information provided through the three final result interfaces (see Section 6.7.3).

Once finalized, the information available through all the result interfaces is fixed. The only exceptions are for the release of audio data and training data. If audio data is released, an AUDIO_RELEASED event is issued (see detail in Section 6.7.11). If training information is released, a TRAINING_INFO_RELEASED event is issued (see detail in Section 6.7.12).

Applications can track result states in a number of ways. Most often, applications handle results in a ResultListener implementation, which receives ResultEvents as recognition proceeds.

As Section 6.7.3 explains, a recognizer conveys a range of information to an application through the stages of producing a recognition result. However, as the example in Section 6.7.1 shows, many applications only care about the last step and event in that process - the RESULT_ACCEPTED event.

The state of a result is also available through the getResultState method of the Result interface. That method returns one of the three result states: UNFINALIZED, ACCEPTED or REJECTED.
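For example, an application can guard on the result state before using finalized-only information; the helper class here is purely illustrative:


import javax.speech.recognition.*;

class ResultStateCheck {
    // Returns true only for results in the ACCEPTED state, which are the
    // results an application can act upon without considering rejection.
    static boolean isAccepted(Result r) {
        return r.getResultState() == Result.ACCEPTED;
    }
}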

6.7.5     ResultListener Attachment

A ResultListener can be attached in one of three places to receive events associated with results: to a Grammar, to a Recognizer or to an individual Result. The different places of attachment give applications some flexibility in how they handle results.

To support ResultListeners the Grammar, Recognizer and Result interfaces all provide the addResultListener and removeResultListener methods.

Depending upon the place of attachment a listener receives events for different results and different subsets of result events.
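The following sketch shows the three possible attachment points; the recognizer, grammar and result are assumed to have been obtained elsewhere, and the helper class is illustrative:


import javax.speech.recognition.*;

class ListenerAttachmentDemo {
    // Attach a listener at each of the three levels.
    static void attachAll(Recognizer rec, Grammar gram, Result res, ResultListener listener) {
        rec.addResultListener(listener);    // events for all results of the recognizer
        gram.addResultListener(listener);   // events for results matching this grammar
        res.addResultListener(listener);    // events for this individual result only
    }
}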

6.7.6     Recognizer and Result States

The state system of a recognizer is tied to the processing of a result. Specifically, the LISTENING, PROCESSING and SUSPENDED state cycle described in Section 6.3.3 and shown in Figure 6-1 follows the production of each result.

The transition of a Recognizer from the LISTENING state to the PROCESSING state with a RECOGNIZER_PROCESSING event indicates that a recognizer has started to produce a result. The RECOGNIZER_PROCESSING event is followed by the RESULT_CREATED event to ResultListeners.

The RESULT_UPDATED and GRAMMAR_FINALIZED events are issued to ResultListeners while the recognizer is in the PROCESSING state.

As soon as the recognizer completes recognition of a result, it makes a transition from the PROCESSING state to the SUSPENDED state with a RECOGNIZER_SUSPENDED event. Immediately following that recognizer event, the result finalization event (either RESULT_ACCEPTED or RESULT_REJECTED) is issued. While the result finalization event is processed, the recognizer remains suspended. Once the result finalization event has been processed, the recognizer automatically transitions from the SUSPENDED state back to the LISTENING state with a CHANGES_COMMITTED event. Once back in the LISTENING state, the recognizer resumes processing of audio input against the grammars committed by the CHANGES_COMMITTED event.

6.7.6.1     Updating Grammars

In many applications, grammar definitions and grammar activation need to be updated in response to spoken input from a user. For example, if speech is added to a traditional email application, the command "save this message" might result in a window being opened in which a mail folder can be selected. While that window is open, the grammars that control that window need to be activated. Thus during the event processing for the "save this message" command grammars may need to be created, updated and enabled. All this would happen during processing of the RESULT_ACCEPTED event.

For any grammar changes to take effect they must be committed (see Section 6.4.2). Because this form of grammar update is so common while processing the RESULT_ACCEPTED event (and sometimes the RESULT_REJECTED event), recognizers implicitly commit grammar changes after either result finalization event has been processed.

This implicit commit is indicated by the CHANGES_COMMITTED event that is issued when a Recognizer makes a transition from the SUSPENDED state to the LISTENING state following result finalization and the processing of the result finalization event (see Section 6.3.3 for details).

One desirable effect of this form of commit becomes useful in component systems. If changes in multiple components are triggered by a finalized result event, and if many of those components change grammars, then they do not each need to call the commitChanges method. The downside of multiple calls to the commitChanges method is that a syntax check is performed for each call. Checking syntax can be computationally expensive, so multiple checks are undesirable. With the implicit commit, a single check occurs once all components have updated their grammars, and computational costs are reduced.
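The following sketch shows grammar updates made while processing a RESULT_ACCEPTED event. The grammar names are hypothetical and are assumed to have been loaded previously; no explicit call to commitChanges is needed because of the implicit commit described above:


import javax.speech.recognition.*;

class SaveMessageListener extends ResultAdapter {
    private final Recognizer rec;

    SaveMessageListener(Recognizer rec) { this.rec = rec; }

    public void resultAccepted(ResultEvent e) {
        try {
            // Enable the grammar for the folder-selection window and disable
            // the main command grammar while that window is open.
            RuleGrammar folderGrammar = rec.getRuleGrammar("com.acme.mail.folders");
            RuleGrammar commandGrammar = rec.getRuleGrammar("com.acme.mail.commands");
            folderGrammar.setEnabled(true);
            commandGrammar.setEnabled(false);
            // No commitChanges() call: the recognizer commits these changes
            // implicitly once finalization event processing completes.
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}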

6.7.7     Grammar Finalization

At any time during the processing of a result, a GRAMMAR_FINALIZED event can be issued for that result, indicating that the Grammar matched by the result has been determined. This event is issued only once. It is required for any ACCEPTED result, but is optional for a result that is eventually rejected.

As Section 6.7.5 describes, the GRAMMAR_FINALIZED event is the first event received by a ResultListener attached to a Grammar.

The GRAMMAR_FINALIZED event behaves the same for results that match either a RuleGrammar or a DictationGrammar.

Following the GRAMMAR_FINALIZED event, the getGrammar method of the Result interface returns a non-null reference to the matched grammar. By issuing a GRAMMAR_FINALIZED event the Recognizer guarantees that the Grammar will not change.

Finally, the GRAMMAR_FINALIZED event does not change the result's state. A GRAMMAR_FINALIZED event is issued only when a result is in the UNFINALIZED state, and leaves the result in that state.

6.7.8     Token Finalization

A result is a dynamic object while it is being recognized. One way in which a result can be dynamic is that tokens are updated and finalized as recognition of speech proceeds. The result events allow a recognizer to inform an application of changes in either or both the finalized and unfinalized tokens of a result.

The finalized and unfinalized tokens can be updated on any of the following result event types: RESULT_CREATED, RESULT_UPDATED, RESULT_ACCEPTED, RESULT_REJECTED.

Finalized tokens are accessed through the getBestTokens and getBestToken methods of the Result interface. The unfinalized tokens are accessed through the getUnfinalizedTokens method of the Result interface. (See Section 6.7.3 for details.)

A finalized token is a ResultToken in a Result that has been recognized in the incoming speech as matching a grammar. Furthermore, when a recognizer finalizes a token it indicates that it will not change the token at any point in the future. The numTokens method returns the number of finalized tokens.

Many recognizers do not finalize tokens until recognition of an entire result is complete. For these recognizers, the numTokens method returns zero for a result in the UNFINALIZED state.

For recognizers that do finalize tokens while a Result is in the UNFINALIZED state, the following conditions apply:

A result in the UNFINALIZED state may also have unfinalized tokens. An unfinalized token is a token that the recognizer has heard, but which it is not yet ready to finalize. Recognizers are not required to provide unfinalized tokens, and applications can safely choose to ignore unfinalized tokens.

For recognizers that provide unfinalized tokens, the following conditions apply:

Unfinalized tokens are highly changeable, so why are they useful? Many applications can provide users with visual feedback of unfinalized tokens - particularly for dictation results. This feedback informs users of the progress of the recognition and helps the user to know that something is happening. However, because these tokens may change and are more likely than finalized tokens to be incorrect, applications should visually distinguish unfinalized tokens by using a different font, different color or even a different window.

The following is an example of finalized tokens and unfinalized tokens for the sentence "I come from Australia". The lines indicate the token values after the single RESULT_CREATED event, the multiple RESULT_UPDATED events and the final RESULT_ACCEPTED event. The finalized tokens are in bold, the unfinalized tokens are in italics.

  1. RESULT_CREATED: I come
  2. RESULT_UPDATED: I come from
  3. RESULT_UPDATED: I come from
  4. RESULT_UPDATED: I come from a strange land
  5. RESULT_UPDATED: I come from Australia
  6. RESULT_ACCEPTED: I come from Australia

Recognizers can vary in how they support finalized and unfinalized tokens in a number of ways. For an unfinalized result, a recognizer may provide finalized tokens, unfinalized tokens, both or neither. Furthermore, for a recognizer that does support finalized and unfinalized tokens during recognition, the behavior may depend upon the number of active grammars, upon whether the result is for a RuleGrammar or DictationGrammar, upon the length of spoken sentences, and upon other more complex factors. Fortunately, unless there is a functional requirement to display or otherwise process intermediate results, an application can safely ignore all but the RESULT_ACCEPTED event.

6.7.9     Finalized Rule Results

There are some common design patterns for processing accepted finalized results that match a RuleGrammar. First we review what we know about these results.

6.7.9.1     Result Tokens

A ResultToken in a result matching a RuleGrammar contains the same information as the RuleToken object in the RuleGrammar definition. This means that the tokenization of the result follows the tokenization of the grammar definition including compound tokens. For example, consider a grammar with the following Java Speech Grammar Format fragment which contains four tokens:

<rule> = I went to "San Francisco";

If the user says "I went to San Francisco" then the result will contain the four tokens defined by JSGF: "I", "went", "to", "San Francisco".

The ResultToken interface defines more advanced information. Amongst that information the getStartTime and getEndTime methods may optionally return time-stamp values (or -1 if the recognizer does not provide time-alignment information).

The ResultToken interface also defines several methods for a recognizer to provide presentation hints. Those hints are ignored for RuleGrammar results - they are only used for dictation results (see Section 6.7.10.2).

Furthermore, the getSpokenText and getWrittenText methods will return an identical string which is equal to the string defined in the matched grammar.

6.7.9.2     Alternative Guesses

In a FinalRuleResult, alternative guesses are alternatives for the entire result, that is, for a complete utterance spoken by a user. (A FinalDictationResult can provide alternatives for single tokens or sequences of tokens.) Because more than one RuleGrammar can be active at a time, an alternative token sequence may match a rule in a different RuleGrammar than the best guess tokens, or may match a different rule in the same RuleGrammar as the best guess. Thus, when processing alternatives for a FinalRuleResult, an application should use the getRuleGrammar and getRuleName methods to ensure that they analyze the alternatives correctly.

Alternatives are numbered from zero up. The 0th alternative is actually the best guess for the result so FinalRuleResult.getAlternativeTokens(0) returns the same array as Result.getBestTokens(). (The duplication is for programming convenience.) Likewise, the FinalRuleResult.getRuleGrammar(0) call will return the same result as Result.getGrammar().

The following code is an implementation of the ResultListener interface that processes the RESULT_ACCEPTED event. The implementation assumes that a Result being processed matches a RuleGrammar.


import java.io.PrintStream;
import javax.speech.recognition.*;

class MyRuleResultListener extends ResultAdapter
{
    public void resultAccepted(ResultEvent e)
    {
        // Assume that the result matches a RuleGrammar.
        // Cast the result (source of event) appropriately
        FinalRuleResult res = (FinalRuleResult) e.getSource();

        // Print out basic result information
        PrintStream out = System.out;
        out.println("Number guesses: " + res.getNumberGuesses());

        // Print out the best result and all alternatives
        for (int n = 0; n < res.getNumberGuesses(); n++) {
            // Extract the n-best information
            String gname = res.getRuleGrammar(n).getName();
            String rname = res.getRuleName(n);
            ResultToken[] tokens = res.getAlternativeTokens(n);

            out.print("Alt " + n + ": ");
            out.print("<" + gname + "." + rname + "> :");
            for (int t = 0; t < tokens.length; t++)
                out.print(" " + tokens[t].getSpokenText());
            out.println();
        }
    }
}

For a grammar with commands to control a windowing system (shown below), a result might look like:


Number guesses: 3
Alt 0: <com.acme.actions.command>: move the window to the back
Alt 1: <com.acme.actions.command>: move window to the back
Alt 2: <com.acme.actions.command>: open window to the front

If more than one grammar or more than one public rule was active, the <grammarName.ruleName> values could vary between the alternatives.

6.7.9.3     Result Tags

Processing commands generated from a RuleGrammar becomes increasingly difficult as the complexity of the grammar rises. With the Java Speech API, speech recognizers provide two mechanisms to simplify the processing of results: tags and parsing.

A tag is a label attached to an entity within a RuleGrammar. The Java Speech Grammar Format and the RuleTag class define how tags can be attached to a grammar. The following is a grammar for very simple control of windows which includes tags attached to the important words in the grammar.


grammar com.acme.actions;

public <command> = <action> <object> [<where>];

<action> = open {ACT_OP} | close {ACT_CL} | move {ACT_MV};
<object> = [a | an | the] (window {OBJ_WIN} | icon {OBJ_ICON});
<where>  = [to the] (back {WH_BACK} | front {WH_FRONT});

This grammar allows users to speak commands such as

	open window
move the icon
move the window to the back
move window back

The italicized words are the ones that are tagged in the grammar - these are the words that the application cares about. For example, in the third and fourth example commands, the spoken words are different but the tagged words are identical. Tags allow an application to ignore trivial words such as "the" and "to".

The com.acme.actions grammar can be loaded and enabled using the code in the "Hello World!" example. Since the grammar has a single public rule, <command>, the recognizer will listen for speech matching that rule, such as the example results given above.

The tags for the best result are available through the getTags method of the FinalRuleResult interface. This method returns an array of tags associated with the tokens (words) and other grammar entities matched by the result. If the best sequence of tokens is "move the window to the front", the list of tags is the following String array:

	String tags[] = {"ACT_MV", "OBJ_WIN", "WH_FRONT"};

Note how the order of the tags in the result is preserved (forward in time). These tags are easier for most applications to interpret than the original text of what the user said.
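The following sketch processes the tags of the com.acme.actions grammar rather than the raw tokens; the printed messages stand in for whatever actions an application would actually perform:


import javax.speech.recognition.*;

class CommandTagListener extends ResultAdapter {
    public void resultAccepted(ResultEvent e) {
        // Assume the result matches the com.acme.actions grammar shown above.
        FinalRuleResult res = (FinalRuleResult) e.getSource();
        String[] tags = res.getTags();

        // Act on the tags in the order they were matched.
        for (int i = 0; i < tags.length; i++) {
            if (tags[i].equals("ACT_OP"))        System.out.println("open command");
            else if (tags[i].equals("ACT_CL"))   System.out.println("close command");
            else if (tags[i].equals("ACT_MV"))   System.out.println("move command");
            else if (tags[i].equals("OBJ_WIN"))  System.out.println("... a window");
            else if (tags[i].equals("OBJ_ICON")) System.out.println("... an icon");
            else if (tags[i].equals("WH_BACK"))  System.out.println("... to the back");
            else if (tags[i].equals("WH_FRONT")) System.out.println("... to the front");
        }
    }
}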

Tags can also be used to handle synonyms - multiple ways of saying the same thing. For example, "programmer", "hacker", "application developer" and "computer dude" could all be given the same tag, say "DEV". An application that looks at the "DEV" tag will not care which way the user spoke the title.

Another use of tags is for internationalization of applications. Maintaining applications for multiple languages and locales is easier if the code is insensitive to the language being used. In the same way that the "DEV" tag isolated an application from different ways of saying "programmer", tags can be used to provide an application with similar input irrespective of the language being recognized.

The following is a grammar for French with the same functionality as the grammar for English shown above.


grammar com.acme.actions.fr;

public <command> = <action> <object> [<where>];

<action> = ouvrir {ACT_OP} | fermer {ACT_CL} | deplacer {ACT_MV};
<object> = fenetre {OBJ_WIN} | icone {OBJ_ICON};
<where>  = au-dessous {WH_BACK} | au-dessus {WH_FRONT};

For this simple grammar, there are only minor differences in the structure of the grammar (e.g. the "[to the]" tokens in the <where> rule for English are absent in French). However, in more complex grammars the syntactic differences between languages become significant and tags provide a clearer improvement.

Tags do not completely solve internationalization problems. One issue to be considered is word ordering. A simple command like "open the window" can translate to the form "the window open" in some languages. More complex sentences can have more complex transformations. Thus, applications need to be aware of word ordering, and thus tag ordering when developing international applications.

6.7.9.4     Result Parsing

More advanced applications parse results to get even more information than is available with tags. Parsing is the capability to analyze how a sequence of tokens matches a RuleGrammar. Parsing of text against a RuleGrammar is discussed in Section 6.5.5 .

Parsing a FinalRuleResult produces a RuleParse object. The getTags method of a RuleParse object provides the same tag information as the getTags method of a FinalRuleResult. However, the FinalRuleResult provides tag information for only the best-guess result, whereas parsing can be applied to the alternative guesses.

An API requirement that simplifies parsing of results that match a RuleGrammar is that for such a result to be ACCEPTED (not rejected) it must exactly match the grammar - technically speaking, it must be possible to parse a FinalRuleResult against the RuleGrammar it matches. This is not guaranteed, however, if the result was rejected or if the RuleGrammar has been modified since it was committed and produced the result.
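The following sketch parses an alternative guess against the grammar and rule it matched. It assumes the parse(String, String) form of RuleGrammar.parse discussed in Section 6.5.5; the helper class is illustrative:


import javax.speech.recognition.*;

class AlternativeParser {
    // Return the tags of the nth alternative guess, or null if it cannot be parsed.
    static String[] tagsForAlternative(FinalRuleResult res, int n) throws GrammarException {
        RuleGrammar gram = res.getRuleGrammar(n);
        String ruleName = res.getRuleName(n);

        // Build the text of the nth alternative guess from its tokens.
        ResultToken[] tokens = res.getAlternativeTokens(n);
        StringBuffer text = new StringBuffer();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) text.append(' ');
            text.append(tokens[i].getSpokenText());
        }

        RuleParse parse = gram.parse(text.toString(), ruleName);
        return (parse == null) ? null : parse.getTags();
    }
}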

6.7.10     Finalized Dictation Results

There are some common design patterns for processing accepted finalized results that match a DictationGrammar. First we review what we know about these results.

The ResultTokens provided in a FinalDictationResult contain specialized information that includes hints on textual presentation of tokens. Section 6.7.10.2 discusses the presentation hints in detail. In this section the methods for obtaining and using alternative tokens are described.

6.7.10.1     Alternative Guesses

Alternative tokens for a dictation result are most often used by an application for display to users for correction of dictated text. A typical scenario is that a user speaks some text - perhaps a few words, a few sentences, a few paragraphs or more. The user reviews the text and detects a recognition error. This means that the best guess token sequence is incorrect. However, very often the correct text is one of the top alternative guesses. Thus, an application will provide a user the ability to review a set of alternative guesses and to select one of them if it is the correct text. Such a correction mechanism is often more efficient than typing the correction or dictating the text again. If the correct text is not amongst the alternatives an application must support other means of entering the text.

The getAlternativeTokens method is passed a starting and an ending ResultToken. These tokens must have been obtained from the same result either through a call to getBestToken or getBestTokens in the Result interface, or through a previous call to getAlternativeTokens.

ResultToken[][] getAlternativeTokens(
ResultToken fromToken,
ResultToken toToken,
int max);

To obtain alternatives for a single token (rather than alternatives for a sequence), set toToken to null.

The int parameter allows the application to specify the number of alternatives it wants. The recognizer may choose to return any number of alternatives up to the maximum number, including just one alternative (the original token sequence). Applications can indicate in advance the number of alternatives they may request by setting the NumResultAlternatives parameter through the recognizer's RecognizerProperties object.

The two-dimensional array returned by the getAlternativeTokens method is the most difficult aspect of dictation alternatives to understand. The following example illustrates the major features of the return value.

Let's consider a dictation example where the user says "he felt alienated today" but the recognizer hears "he felt alien ate Ted today". The user says four words but the recognizer hears six words. In this example, the boundaries of the spoken words and best-guess align nicely: "alienated" aligns with "alien ate Ted" (incorrect tokens don't always align smoothly with the correct tokens).

Users are typically better at locating and fixing recognition errors than recognizers or applications - they provided the original speech. In this example, the user will likely identify the words "alien ate Ted" as incorrect (tokens 2 to 4 in the best-guess result). By an application-provided method such as selection by mouse and a pull-down menu, the user will request alternative guesses for the three incorrect tokens. The application calls the getAlternativeTokens method of the FinalDictationResult to obtain the recognizer's guess at the alternatives.


// Get 6 alternatives for tokens 2 through 4.
FinalDictationResult r = ...;

ResultToken tok2 = r.getBestToken(2);
ResultToken tok4 = r.getBestToken(4);
ResultToken[][] alt = r.getAlternativeTokens(tok2, tok4, 6);

The return array might look like the following. Each line represents a sequence of alternative tokens to "alien ate Ted". Each word in each alternative sequence represents a ResultToken object in an array.


alt[0] = alien ate Ted     // the best guess
alt[1] = alienate Ted      // the 1st alternative
alt[2] = alienated         // the 2nd alternative
alt[3] = alien hated       // the 3rd alternative
alt[4] = a lion ate Ted    // the 4th alternative

The points to note are:

A complex issue to understand is that the alternatives vary according to how the application (or user) requests them. The 1st alternative to "alien ate Ted" is "alienate Ted". However, the 1st alternative to "alien" might be "a lion", the 1st alternative to "alien ate" might be "alien eight", and the 1st alternative to "alien ate Ted today" might be "align ate Ted to day".

Fortunately for application developers, users learn to select sequences that are likely to give reasonable alternatives, and recognizers are developed to make the alternatives as useful and accurate as possible.

6.7.10.2     Result Tokens

A ResultToken object represents a single token in a result. A token is most often a single word, but multi-word tokens are possible (e.g., "New York") as well as formatting characters and language-specific constructs. For a DictationGrammar the set of tokens is built into the recognizer.

Each ResultToken in a FinalDictationResult provides the following information.

The presentation hints in a ResultToken are important for the processing of dictation results. Dictation results are typically displayed to the user, so using the written form and the capitalization and spacing hints for formatting is important. For example, when dictation is used in word processing, the user will want the printed text to be correctly formatted.

The capitalization hint indicates how the written form of the following token should be formatted. The capitalization hint takes one of four mutually exclusive values. CAP_FIRST indicates that the first character of the following token should be capitalized. The UPPERCASE and LOWERCASE values indicate that the following token should be either all uppercase or lowercase. CAP_AS_IS indicates that there should be no change in capitalization of the following token.

The spacing hint deals with spacing around a token. It is an int value containing three flags which are or'ed together (using the '|' operator). If none of the three spacing hint flags is set, the getSpacingHint method returns the value SEPARATE, which is the value zero.

Every language has conventions for textual representation of a spoken language. Since recognizers are language-specific and understand many of these presentation conventions, they provide the presentation hints (written form, capitalization hint and spacing hint) to simplify applications. However, applications may choose to override the recognizer's hints or may choose to do additional processing.

Table 6-6 shows examples of tokens in which the spoken and written forms are different:

Table 6-6 Spoken and written forms for some English tokens
Spoken Form  Written Form  Capitalization  Spacing  
twenty  20  CAP_AS_IS  SEPARATE  
new line  '\n' '\u000A'  CAP_FIRST  ATTACH_PREVIOUS & ATTACH_FOLLOWING  
new paragraph  '\u2029'  CAP_FIRST  ATTACH_PREVIOUS & ATTACH_FOLLOWING  
no space  null  CAP_AS_IS  ATTACH_PREVIOUS & ATTACH_FOLLOWING  
Space bar   ' ' '\u0020'  CAP_AS_IS  ATTACH_PREVIOUS & ATTACH_FOLLOWING  
Capitalize next  null  CAP_FIRST  SEPARATE  
Period  '.' '\u002E'  CAP_FIRST  ATTACH_PREVIOUS  
Comma  ',' '\u002C'  CAP_AS_IS  ATTACH_PREVIOUS  
Open parentheses  '(' '\u0028'  CAP_AS_IS  ATTACH_FOLLOWING  
Exclamation mark  '!' '\u0021'  CAP_FIRST  ATTACH_PREVIOUS  
dollar sign  '$' '\u0024'  CAP_AS_IS  ATTACH_FOLLOWING & ATTACH_GROUP  
pound sign  '£' '\u00A3'  CAP_AS_IS  ATTACH_FOLLOWING & ATTACH_GROUP  
yen sign  '¥' '\u00A5'  CAP_AS_IS  ATTACH_PREVIOUS & ATTACH_GROUP  

"New line", "new paragraph", "space bar", "no space" and "capitalize next" are all examples of conversion of an implicit command (e.g. "start a new paragraph"). For three of these, the written form is a single Unicode character. Most programmers are familiar with the new-line character '\n' and space ' ', but fewer are familiar with the Unicode character for new paragraph '\u2029'. For convenience and consistency, the ResultToken includes static variables called NEW_LINE and NEW_PARAGRAPH.

Some applications will treat a paragraph boundary as two new-line characters, others will treat it differently. Each of these commands provides hints for capitalization. For example, in English the first letter of the first word of a new paragraph is typically capitalized.

The punctuation characters, "period", "comma", "open parentheses", "exclamation mark" and the three currency symbols convert to a single Unicode character and have special presentation hints.

An important feature of the written form for most of the examples is that the application does not need to deal with synonyms (multiple ways of saying the same thing). For example, "open parentheses" may also be spoken as "open paren" or "begin paren" but in all cases the same written form is generated.

The following is an example sequence of result tokens.

Table 6-7 Sample sequence of result tokens
Spoken Form  Written Form  Capitalization  Spacing  
new line  "\n"  CAP_FIRST  ATTACH_PREVIOUS & ATTACH_FOLLOWING  
the  "the"  CAP_AS_IS  SEPARATE  
uppercase next  null  UPPERCASE  SEPARATE  
index  "index"  CAP_AS_IS  SEPARATE  
is  "is"  CAP_AS_IS  SEPARATE  
seven  "7"  CAP_AS_IS  ATTACH_GROUP  
dash  "-"  CAP_AS_IS  ATTACH_GROUP  
two  "2"  CAP_AS_IS  ATTACH_GROUP  
period  "."  CAP_FIRST  ATTACH_PREVIOUS  

This sequence of tokens should be converted to the following string:

	"\nThe INDEX is 7-2."

Conversion of spoken text to a written form is a complex task and is complicated by the different conventions of different languages and often by different conventions for the same language. The spoken form, written form and presentation hints of the ResultToken interface handle most simple conversions. Advanced applications should consider filtering the results to process more complex patterns, particularly cross-token patterns. For example "nineteen twenty eight" is typically converted to "1928" and "twenty eight dollars" to "$28" (note the movement of the dollar sign to before the numbers).
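The following is a simplified sketch of such a conversion using the written forms and the presentation hints described above. It assumes the accessor names getWrittenText, getCapitalizationHint and getSpacingHint, and it deliberately ignores the UPPERCASE and LOWERCASE hints, ATTACH_GROUP, and cross-token patterns such as number and currency grouping:


import javax.speech.recognition.*;

class DictationFormatter {
    static String format(FinalDictationResult result) {
        ResultToken[] tokens = result.getBestTokens();
        StringBuffer text = new StringBuffer();
        boolean capitalizeNext = false;   // previous token requested CAP_FIRST
        boolean attachToNext = false;     // previous token requested ATTACH_FOLLOWING

        for (int i = 0; i < tokens.length; i++) {
            String written = tokens[i].getWrittenText();
            int spacing = tokens[i].getSpacingHint();

            if (written != null) {
                // Add a space unless this token attaches to the previous token
                // or the previous token attaches to this one.
                if (text.length() > 0 && !attachToNext &&
                        (spacing & ResultToken.ATTACH_PREVIOUS) == 0)
                    text.append(' ');

                if (capitalizeNext && written.length() > 0) {
                    written = Character.toUpperCase(written.charAt(0))
                              + written.substring(1);
                    capitalizeNext = false;
                }
                text.append(written);
            }

            // Hints on this token affect the presentation of the following token.
            if (tokens[i].getCapitalizationHint() == ResultToken.CAP_FIRST)
                capitalizeNext = true;
            attachToNext = (spacing & ResultToken.ATTACH_FOLLOWING) != 0;
        }
        return text.toString();
    }
}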

6.7.11     Result Audio

If requested by an application, some recognizers can provide audio data for results. Audio data has a number of uses. In dictation applications, providing audio feedback to users aids correction of text because the audio reminds users of what they said (it's not always easy to remember exactly what you dictate, especially in long sessions). Audio data also allows storage for future evaluation and debugging.

Audio data is provided for finalized results through the following methods of the FinalResult interface.

Table 6-8 FinalResult interface: audio methods
Name  Description  
getAudio  Get an AudioClip for a token, a sequence of tokens or for an entire result.  
isAudioAvailable  Tests whether audio data is available for a result.  
releaseAudio  Release audio data for a result.  

There are two getAudio methods in the FinalResult interface. One method accepts no parameters and returns an AudioClip for an entire result or null if audio data is not available for this result. The other getAudio method takes a start and end ResultToken as input and returns an AudioClip for the segment of the result including the start and end token or null if audio data is not available.

In both forms of the getAudio method, the recognizer will attempt to return the specified audio data. However, it is not always possible to exactly determine the start and end of words or even complete results. Sometimes segments are "clipped" and sometimes surrounding audio is included in the AudioClip.

Not all recognizers provide access to audio for results. For recognizers that do provide audio data, it is not necessarily provided for all results. For example, a recognizer might only provide audio data for dictation results. Thus, applications should always check for a null return value on a getAudio call.
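The following sketch shows one defensive pattern for handling result audio; the clip itself is not bound to a specific type here because audio support in this release of the API is provisional:


import javax.speech.recognition.*;

class ResultAudioHandling {
    // Check for, use, and then release result audio.
    static void handleAudio(FinalResult result) {
        if (!result.isAudioAvailable())
            return;                          // the recognizer kept no audio

        if (result.getAudio() != null) {
            // ... replay or store the clip for later correction ...
        }

        result.releaseAudio();               // free memory once the audio is no longer needed
    }
}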

The storage of audio data for results potentially requires large amounts of memory, particularly for long sessions. Thus, result audio requires special management. An application that wishes to use result audio should:

A recognizer may choose to release audio data for a result if it is necessary to reclaim memory or other system resources.

When audio is released, either by a call to releaseAudio or by the recognizer, an AUDIO_RELEASED event is issued to the audioReleased method of the ResultListener.

6.7.12     Result Correction

Recognition results are not always correct. Some recognizers can be trained by informing them of the correct tokens for a result - usually when a user corrects a result.

Recognizers are not required to support correction capabilities. If a recognizer does support correction, it does not need to support correction for every result. For example, some recognizers support correction only for dictation results.

Applications are not required to provide recognizers with correction information. However, if the information is available to an application and the recognizer supports correction then it is good practice to inform the recognizer of the correction so that it can improve its future recognition performance.

The FinalResult interface provides the methods that handle correction.

Table 6-9 FinalResult interface: correction methods
Name  Description  
tokenCorrection  Inform the recognizer of a correction in which zero or more tokens replace a token or sequence of tokens.  
MISRECOGNITION, USER_CHANGE, DONT_KNOW  Indicate the type of correction.  
isTrainingInfoAvailable  Tests whether the recognizer has information available to allow it to learn from a correction.  
releaseTrainingInfo  Release training information for a result.  

Often, but certainly not always, a correction is triggered when a user corrects a recognizer by selecting amongst the alternative guesses for a result. Other instances when an application is informed of the correct result are when the user types a correction to dictated text, or when a user corrects a misrecognized command with a follow-up command.

Once an application has obtained the correct result text, it should inform the recognizer. The correction information is provided by a call to the tokenCorrection method of the FinalResult interface. This method indicates a correction of one token sequence to another token sequence. Either token sequence may contain one or more tokens. Furthermore, the correct token sequence may contain zero tokens to indicate deletion of tokens.

The tokenCorrection method accepts a correctionType parameter that indicates the reason for the correction. The legal values are defined by constants of the FinalResult interface:

Why is it useful to tell a recognizer about a USER_CHANGE? Recognizers adapt to both the sounds and the patterns of words of users. A USER_CHANGE correction allows the recognizer to learn about a user's word patterns. A MISRECOGNITION correction allows the recognizer to learn about both the user's voice and the word patterns. In both cases, correcting the recognizer requests it to re-train itself based on the new information.
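The following sketch reports a misrecognition correction to the recognizer. It assumes that tokenCorrection takes the corrected token text, the span of tokens being replaced and a correction type, as Table 6-9 suggests:


import javax.speech.recognition.*;

class CorrectionHelper {
    // Inform the recognizer that the tokens from 'from' to 'to' (for example
    // "alien ate Ted") should have been the single token "alienated".
    static void correctMisrecognition(FinalResult result,
                                      ResultToken from, ResultToken to) {
        if (!result.isTrainingInfoAvailable())
            return;                          // nothing for the recognizer to learn from

        String[] correctTokens = { "alienated" };
        result.tokenCorrection(correctTokens, from, to,
                               FinalResult.MISRECOGNITION);
    }
}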

Training information needs to be managed because it requires substantial memory and possibly other system resources to maintain it for a result. For example, in long dictation sessions, correction data can begin to use excessive amounts of memory.

Recognizers maintain training information only when the recognizer's TrainingProvided parameter is set to true through the RecognizerProperties interface. Recognizers that do not support correction will ignore calls to the setTrainingProvided method.

If the TrainingProvided parameter is set to true, a result may include training information when it is finalized. Once an application believes the training information is no longer required for a specific FinalResult, it should call the releaseTrainingInfo method of FinalResult to indicate the recognizer can release the resources used to store the information.

At any time, the availability of training information for a result can be tested by calling the isTrainingInfoAvailable method.

Recognizers can choose to release training information even without a request to do so by the application. This does not substantially affect an application because performing correction on a result which does not have training information is not an error.

A TRAINING_INFO_RELEASED event is issued to the ResultListener when the training information is released. The event is issued identically whether the application or recognizer initiated the release.

6.7.13     Rejected Results

First, a warning: ignore rejected results unless you really understand them!

Like humans, recognizers don't have perfect hearing and so they make mistakes (recognizers still tend to make more mistakes than people). An application should never completely trust a recognition result. In particular, applications should treat important results carefully, for example, "delete all files".

Recognizers try to determine whether they have made a mistake. This process is known as rejection. But recognizers also make mistakes in rejection! In short, a recognizer cannot always tell whether or not it has made a mistake.

A recognizer may reject incoming speech for a number of reasons:

Rejection is controlled by the ConfidenceLevel parameter of RecognizerProperties (see Section 6.8). The confidence value is a floating point number between 0.0 and 1.0. A value of 0.0 indicates weak rejection - the recognizer doesn't need to be very confident to accept a result. A value of 1.0 indicates strongest rejection, implying that the recognizer will reject a result unless it is very confident that the result is correct. A value of 0.5 is the recognizer's default.

6.7.13.1     Rejection Timing

A result may be rejected with a RESULT_REJECTED event at any time while it is UNFINALIZED: that is, any time after a RESULT_CREATED event but without a RESULT_ACCEPTED event occurring. (For a description of result events see Section 6.7.4.)

This means that the sequence of result events that produces a REJECTED result is as follows:

When a result is rejected, there is a strong probability that the information about a result normally provided through Result, FinalResult, FinalRuleResult and FinalDictationResult interfaces is inaccurate, or more typically, not available.

Some possibilities that an application must consider:

Finally, a repeat of the warning. Only use rejected results if you really know what you are doing!

6.7.14     Result Timing

Recognition of speech is not an instant process. There are intrinsic delays between the time the user starts or ends speaking a word or sentence and the time at which the corresponding result event is issued by the speech recognizer.

The most significant delay for most applications is the time between when the user stops speaking and the RESULT_ACCEPTED or RESULT_REJECTED event that indicates the recognizer has finalized the result.

The minimum finalization time is determined by the CompleteTimeout parameter that is set through the RecognizerProperties interface. This time-out indicates the period of silence after speech that the recognizer should process before finalizing a result. If the time-out is too long, the response of the recognizer (and the application) is unnecessarily delayed. If the time-out is too short, the recognizer may inappropriately break up a result (e.g. finalize a result while the user is taking a quick breath). Typical values are less than a second, but not usually less than 0.3 seconds.

There is also an IncompleteTimeout parameter that indicates the period of silence a recognizer should process if the user has said something that may only partially match an active grammar. This time-out indicates how long a recognizer should wait before rejecting an incomplete sentence. This time-out also indicates how long a recognizer should wait mid-sentence if a result could be accepted, but could also be continued and accepted after more words. The IncompleteTimeout is usually longer than the complete time-out.

Latency is the overall delay between a user finishing speaking and a result being produced. There are many factors that can affect latency. Some effects are temporary, others reflect the underlying design of speech recognizers. Factors that can increase latency include:

6.7.15     Storing Results

Result objects can be stored for future processing. This is particularly useful for dictation applications in which the correction information, audio data and alternative token information are required in future sessions on the same document, because that stored information can assist document editing.

The Result object is recognizer-specific. This is because each recognizer provides an implementation of the Result interface. The implications are that (a) recognizers do not usually understand each other's results, and (b) a special mechanism is required to store and load result objects (standard Java object serialization is not sufficient).

The Recognizer interface defines the methods writeVendorResult and readVendorResult to perform this function. These methods write to an OutputStream and read from an InputStream respectively. If the correction information and audio data for a result are available, then they will be stored by this call. Applications that do not need to store this extra data should explicitly release it before storing a result.


{
    Recognizer rec;
    OutputStream stream;
    Result result;
    ...

    try {
        rec.writeVendorResult(stream, result);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

A limitation of storing vendor-specific results is that a compatible recognizer must be available to read the file. Applications that need to ensure a file containing a result can be read, even if no recognizer is available, should wrap the result data when storing it to the file. When re-loading the file at a later time, the application will unwrap the result data and provide it to a recognizer only if a suitable recognizer is available. One way to perform the wrapping is to provide the writeVendorResult method with a ByteArrayOutputStream to temporarily place the result in a byte array before storing to a file.
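The following sketch shows a matching pair of store and load helpers. The file handling is illustrative, and readVendorResult should only be called when a compatible recognizer is available:


import java.io.*;
import javax.speech.recognition.*;

class ResultStore {
    // Write a result to a file using the recognizer's vendor-specific format.
    static void save(Recognizer rec, Result result, File file) throws Exception {
        OutputStream out = new FileOutputStream(file);
        rec.writeVendorResult(out, result);
        out.close();
    }

    // Read a stored result back; requires a recognizer compatible with the one
    // that originally wrote the data.
    static Result load(Recognizer rec, File file) throws Exception {
        InputStream in = new FileInputStream(file);
        Result result = rec.readVendorResult(in);
        in.close();
        return result;
    }
}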

 


 

6.8     Recognizer Properties

A speech engine has both persistent and run-time adjustable properties. The persistent properties are defined in the RecognizerModeDesc which includes properties inherited from EngineModeDesc (see Section 4.2). The persistent properties are used in the selection and creation of a speech recognizer. Once a recognizer has been created, the same property information is available through the getEngineModeDesc method of a Recognizer (inherited from the Engine interface).

A recognizer also has seven run-time adjustable properties. Applications get and set these properties through RecognizerProperties which extends the EngineProperties interface. The RecognizerProperties for a recognizer are provided by the getEngineProperties method that the Recognizer inherits from the Engine interface. For convenience a getRecognizerProperties method is also provided in the Recognizer interface to return a correctly cast object.

The get and set methods of EngineProperties and RecognizerProperties follow the JavaBeans conventions with the form:

	Type getPropertyName();
void setPropertyName(Type);

A recognizer can choose to ignore unreasonable values provided to a set method, or can provide upper and lower bounds.
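For example, an application might tune a recognizer as follows; the values shown are arbitrary, and a recognizer may bound or ignore them:


import javax.speech.recognition.*;

class PropertySetup {
    static void tune(Recognizer rec) {
        RecognizerProperties props = rec.getRecognizerProperties();
        try {
            props.setConfidenceLevel(0.7f);       // reject more marginal results
            props.setCompleteTimeout(0.5f);       // finalize after 0.5s of silence
            props.setNumResultAlternatives(5);    // request up to 5 alternative guesses
        } catch (Exception ex) {
            // A recognizer may reject a requested value.
            ex.printStackTrace();
        }
    }
}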

Table 6-10 Run-time Properties of a Recognizer
Property  Description  
ConfidenceLevel  float value in the range 0.0 to 1.0. Results are rejected if the engine is not confident that it has correctly determined the spoken text. A value of 1.0 requires a recognizer to have maximum confidence in every result so more results are likely to be rejected. A value of 0.0 requires low confidence indicating fewer rejections. 0.5 is the recognizer's default.  
Sensitivity  float value between 0.0 and 1.0. A value of 0.5 is the default for the recognizer. 1.0 gives maximum sensitivity, making the recognizer sensitive to quiet input but more sensitive to noise. 0.0 gives minimum sensitivity, requiring the user to speak loudly and making the recognizer less sensitive to background noise. Note: some recognizers set the gain automatically during use, or through a setup "Wizard". On these recognizers the sensitivity adjustment should be used only in cases where the automatic settings are not adequate.  
SpeedVsAccuracy  float value between 0.0 and 1.0. 0.0 provides the fastest response. 1.0 maximizes recognition accuracy. 0.5 is the default value for the recognizer which the manufacturer determines as the best compromise between speed and accuracy.  
CompleteTimeout  float value in seconds that indicates the minimum period between when a speaker stops speaking (silence starts) and the recognizer finalizing a result. The complete time-out is applied when the speech prior to the silence matches an active grammar (c.f. IncompleteTimeout). A long complete time-out value delays the result and makes the response slower. A short time-out may lead to an utterance being broken up inappropriately (e.g. when the user takes a breath). Complete time-out values are typically in the range of 0.3 seconds to 1.0 seconds.  
IncompleteTimeout  float value in seconds that indicates the minimum period between when a speaker stops speaking (silence starts) and the recognizer finalizing a result. The incomplete time-out is applied when the speech prior to the silence does not match an active grammar (c.f. CompleteTimeout). In effect, this is the period the recognizer will wait before rejecting an incomplete utterance. The IncompleteTimeout is typically longer than the CompleteTimeout.  
NumResultAlternatives  integer value indicating the preferred maximum number of N-best alternatives in FinalDictationResult and FinalRuleResult objects (see Section 6.7.9). Returning alternatives requires additional computation. Recognizers do not always produce the maximum number of alternatives (for example, because some alternatives are rejected), and the number of alternatives may vary between results and between tokens. A value of 0 or 1 requests that no alternatives be provided - only a best guess.  
ResultAudioProvided  boolean value indicating whether the application wants the recognizer to provide audio with FinalResult objects. Recognizers that do not provide result audio can ignore this call. (See Section 6.7.11 for details.)  
TrainingProvided  boolean value indicating whether the application wants the recognizer to support training with FinalResult objects.  

 


 

6.9     Speaker Management

A Recognizer may, optionally, provide a SpeakerManager object. The SpeakerManager allows an application to manage the SpeakerProfiles of that Recognizer. The SpeakerManager for a Recognizer is obtained through the getSpeakerManager method of the Recognizer interface. Recognizers that do not maintain speaker profiles - known as speaker-independent recognizers - return null for this method.

A SpeakerProfile object represents a single enrollment to a recognizer. One user may have multiple SpeakerProfiles in a single recognizer, and one recognizer may store the profiles of multiple users.

The SpeakerProfile class is a reference to data stored with the recognizer. A profile is identified by three values all of which are String objects:

The SpeakerProfile object is a handle to all the stored data the recognizer has about a speaker in a particular enrollment. Except for the three values defined above, the speaker data stored with a profile is internal to the recognizer.

Typical data stored by a recognizer with the profile might include:

The primary role of stored profiles is in maintaining information that enables a recognizer to adapt to characteristics of the speaker. The goal of this adaptation is to improve the performance of the speech recognizer, including both recognition accuracy and speed.

The SpeakerManager provides management of all the profiles stored in the recognizer. Most often, the functionality of the SpeakerManager is used as a direct consequence of user actions, typically by providing an enrollment window to the user. The functionality provided includes:

An individual speaker profile may be large (perhaps several MByte) so storing, loading, creating and otherwise manipulating these objects can be slow.

The SpeakerManager is one of the capabilities of a Recognizer that is available in the deallocated state. The purpose is to allow an application to indicate the initial speaker profile to be loaded when the recognizer is allocated. To achieve this, the listKnownSpeakers, getCurrentSpeaker and setCurrentSpeaker methods can be called before calling the allocate method.
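The following sketch selects a stored profile before allocation. It simply takes the first listed profile, and it assumes the recognizer supports speaker management and that listKnownSpeakers returns an array of SpeakerProfile objects:


import javax.speech.recognition.*;

class SpeakerSetup {
    static void selectFirstProfile(Recognizer rec) throws Exception {
        SpeakerManager speakers = rec.getSpeakerManager();

        if (speakers != null) {
            // Speaker-dependent recognizer: pick a profile before allocation.
            SpeakerProfile[] profiles = speakers.listKnownSpeakers();
            if (profiles.length > 0)
                speakers.setCurrentSpeaker(profiles[0]);   // legal before allocate()
        }

        rec.allocate();
    }
}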

To facilitate recognizer selection, the list of speaker profiles is also a property of a recognizer presented through the RecognizerModeDesc class. This allows an application to select a recognizer that has already been trained by a user, if one is available.

In most cases, a Recognizer restores the last used speaker profile when it is allocated, unless asked to do otherwise.

 


 

6.10     Recognizer Audio

The current audio functionality of the Java Speech API is incompletely specified. Once a standard mechanism is established for streaming input and output audio on the Java platform the API will be revised to incorporate that functionality.

In this release of the API, the only established audio functionality is provided through the RecognizerAudioListener interface and the RecognizerAudioEvent class. Audio events issued by a recognizer are intended to support simple feedback mechanisms for a user. The three types of RecognizerAudioEvent are SPEECH_STARTED, SPEECH_STOPPED and AUDIO_LEVEL.

All the RecognizerAudioEvents are produced as audio reaches the input to the recognizer. Because recognizers use internal buffers between audio input and the recognition process, the audio events can run ahead of the recognition process.




JavaTM Speech API Programmer's Guide
Copyright © 1997-1998 Sun Microsystems, Inc. All rights reserved
Send comments or corrections to javaspeech-comments@sun.com