
Chapter 2

Speech Technology

As an emerging technology, not all developers are familiar with speech technology. While the basic functions of both speech synthesis and speech recognition take only minutes to understand (after all, most people learn to speak and listen by age two), there are subtle and powerful capabilities provided by computerized speech that developers will want to understand and utilize.

Despite very substantial investment in speech technology research over the last 40 years, speech synthesis and speech recognition technologies still have significant limitations. Most importantly, speech technology does not always meet the high expectations of users familiar with natural human-to-human speech communication. Understanding the limitations - as well as the strengths - is important for effective use of speech input and output in a user interface and for understanding some of the advanced features of the Java Speech API.

An understanding of the capabilities and limitations of speech technology is also important for developers in making decisions about whether a particular application will benefit from the use of speech input and output. Chapter 3 expands on this issue by considering when and where speech input and output can enhance human-to-computer communication.



2.1     Speech Synthesis

A speech synthesizer converts written text into spoken language. Speech synthesis is also referred to as text-to-speech (TTS) conversion.

The major steps in producing speech from text are as follows:

1. Structure analysis: process the input text to determine where paragraphs, sentences and other structures start and end. For most languages, punctuation and formatting data are used in this stage.
2. Text pre-processing: analyze the input text for special constructs of the language. In English, special treatment is required for abbreviations, acronyms, dates, times, numbers, currency amounts, email addresses and many other forms.

The result of these first two steps is a spoken form of the written text. The following examples show the difference between written and spoken text.

    St. Mathews hospital is on Main St.
        -> "Saint Mathews hospital is on Main street."
    Add $20 to account 55374.
        -> "Add twenty dollars to account five five, three seven four."
    Leave at 5:30 on 5/15/99.
        -> "Leave at five thirty on May fifteenth nineteen ninety nine."

The remaining steps convert the spoken text to speech:

3. Text-to-phoneme conversion: convert each word to phonemes. A phoneme is a basic unit of sound in a language.
4. Prosody analysis: process the sentence structure, words and phonemes to determine appropriate prosody for the sentence. Prosody includes much of what makes speech sound natural: pitch (or intonation), timing, pausing, speaking rate and emphasis on words.
5. Waveform production: finally, the phoneme and prosody information are used to produce the audio waveform for each sentence.
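As a rough illustration of the text pre-processing step, the following sketch expands a few special constructs into a spoken form. The class name and expansion table are hypothetical, and a fixed lookup is a deliberate simplification: real synthesizers apply much richer, context-sensitive rules (for example, "St." expands differently before and after a name).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of text pre-processing: expanding special constructs
// (currency, times, abbreviations) into their spoken form via simple lookup.
public class SpokenForm {
    private static final Map<String, String> EXPANSIONS = new LinkedHashMap<>();
    static {
        EXPANSIONS.put("$20", "twenty dollars");
        EXPANSIONS.put("5:30", "five thirty");
        EXPANSIONS.put("Dr.", "Doctor");
    }

    public static String expand(String text) {
        // Apply each expansion in insertion order.
        for (Map.Entry<String, String> e : EXPANSIONS.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        // prints: Add twenty dollars before five thirty.
        System.out.println(expand("Add $20 before 5:30."));
    }
}
```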

2.1.1     Speech Synthesis Limitations

Speech synthesizers can make errors in any of the processing steps described above. Human ears are well-tuned to detecting these errors, so careful work by developers can minimize errors and improve the speech output quality.

The Java Speech API and the Java Speech Markup Language (JSML) provide many ways for an application developer to improve the output quality of a speech synthesizer. Chapter 5 describes programming techniques for controlling a synthesizer through the Java Speech API.

The Java Speech Markup Language defines how to mark up the text input to a speech synthesizer with information that enables the synthesizer to enhance the speech output quality. It is described in detail in the Java Speech Markup Language Specification. In brief, some of its quality-enhancing features include:

- The ability to mark the start and end of paragraphs and sentences.
- The ability to specify pronunciations for any word, acronym, abbreviation or other special text representation.
- Explicit control of pauses, boundaries, emphasis, pitch, speaking rate and loudness to improve the output prosody.

These features allow a developer or user to override the behavior of a speech synthesizer to correct most of the potential errors described above. The following is a description of some of the sources of errors and how to minimize problems.
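For example, a JSML-annotated input might look roughly like the following. This fragment is illustrative only: the element names (SAYAS, EMP, BREAK) come from the JSML specification, but the exact attribute names and syntax should be checked against that specification.

```
<PARA>
  <SENT>Leave at <SAYAS SUB="five thirty">5:30</SAYAS>.</SENT>
  <SENT>Please see <EMP>Doctor</EMP> Mathews <BREAK/> before you leave.</SENT>
</PARA>
```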

2.1.2     Speech Synthesis Assessment

The major feature of a speech synthesizer that affects its understandability, its acceptance by users and its usefulness to application developers is its output quality. Knowing how to evaluate speech synthesis quality and knowing the factors that influence the output quality are important in the deployment of speech synthesis.

Humans are conditioned by a lifetime of listening and speaking. The human ear (and brain) is very sensitive to small changes in speech quality. A listener can detect changes that might indicate a speaker's emotional state, an accent, a speech problem or many other factors. The quality of current speech synthesis remains below that of human speech, so listeners must make more effort than normal to understand synthesized speech and must ignore errors. For new users, listening to a speech synthesizer for extended periods can be tiring and unsatisfactory.

The two key factors a developer must consider when assessing the quality of a speech synthesizer are its understandability and its naturalness. Understandability is an indication of how reliably a listener will understand the words and sentences spoken by the synthesizer. Naturalness is an indication of the extent to which the synthesizer sounds like a human - a characteristic that is desirable for most, but not all, applications.

Understandability is affected by the ability of a speech synthesizer to perform all the processing steps described above because any error by the synthesizer has the potential to mislead a listener. Naturalness is affected more by the later stages of processing, particularly the processing of prosody and the generation of the speech waveform.

Though it might seem counter-intuitive, it is possible to have an artificial-sounding voice that is highly understandable. Similarly, it is possible to have a voice that sounds natural but is not always easy to understand (though this is less common).



2.2     Speech Recognition

Speech recognition is the process of converting spoken language to written text or some similar form. The basic characteristics of a speech recognizer supporting the Java Speech API are:

- It is mono-lingual: it supports a single specified language.
- It processes a single input audio stream.
- It can optionally adapt to the voice of its users.
- Its grammars can be dynamically updated.
- It has a small, defined set of application-controllable properties.

The major steps of a typical speech recognizer are:

1. Grammar design: define the words that may be spoken and the patterns in which they may occur.
2. Signal processing: analyze the spectrum (frequency) characteristics of the incoming audio.
3. Phoneme recognition: compare the spectrum patterns to the patterns of the phonemes of the language being recognized.
4. Word recognition: compare the sequence of likely phonemes against the words and patterns of words specified by the active grammars.
5. Result generation: provide the application with information about the words the recognizer has detected in the incoming audio.

Most of the processes of a speech recognizer are automatic and are not controlled by the application developer. For instance, microphone placement, background noise, sound card quality, system training, CPU power and speaker accent all affect recognition performance but are beyond an application's control.

The primary way in which an application controls the activity of a recognizer is through control of its grammars.

A grammar is an object in the Java Speech API which indicates what words a user is expected to say and in what patterns those words may occur. Grammars are important to speech recognizers because they constrain the recognition process. These constraints make recognition faster and more accurate because the recognizer does not have to check for bizarre sentences, for example, "pink is recognizer speech my".

The Java Speech API supports two basic grammar types: rule grammars and dictation grammars. These grammar types differ in the way in which applications set up the grammars, the types of sentences they allow, the way in which results are provided, the amount of computational resources required, and the way in which they are effectively used in application design. The grammar types are described in more detail below. The programmatic control of grammars is detailed in Chapter 6.

Other speech recognizer controls available to a Java application include pausing and resuming the recognition process, direction of result events and other events relating to the recognition processes, and control of the recognizer's vocabulary.

2.2.1     Rule Grammars

In a rule-based speech recognition system, an application provides the recognizer with rules that define what the user is expected to say. These rules constrain the recognition process. Careful design of the rules, combined with careful user interface design, will produce rules that allow users reasonable freedom of expression while still limiting the range of things that may be said so that the recognition process is as fast and accurate as possible.

Any speech recognizer that supports the Java Speech API must support rule grammars.

The following is an example of a simple rule grammar. It is represented in the Java Speech Grammar Format (JSGF) which is defined in detail in the Java Speech Grammar Format Specification.

    #JSGF V1.0;

    // Define the grammar name
    grammar SimpleCommands;

    // Define the rules
    public <Command> = [<Polite>] <Action> <Object> (and <Object>)*;
    <Action> = open | close | delete;
    <Object> = the window | the file;
    <Polite> = please;

Rule names are surrounded by angle brackets. Words that may be spoken are written as plain text. This grammar defines one public rule, <Command>, that may be spoken by users. This rule is a combination of three sub-rules, <Action>, <Object> and <Polite>. The square brackets around the reference to <Polite> mean that it is optional. The parentheses around "and <Object>" group the word and the rule reference together. The asterisk following the group indicates that it may occur zero or more times.

The grammar allows a user to say commands such as "Open the window" and "Please close the window and the file".
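To make the constraining effect of the grammar concrete, the following stand-alone sketch checks utterances against the SimpleCommands grammar above. This is not Java Speech API code (Chapter 6 covers that); the class name is hypothetical and the matcher is hand-written purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hand-written matcher for the SimpleCommands grammar:
// [<Polite>] <Action> <Object> (and <Object>)*
public class SimpleCommandsMatcher {
    private static final List<String> ACTIONS = Arrays.asList("open", "close", "delete");
    private static final List<String> OBJECTS = Arrays.asList("the window", "the file");

    public static boolean matches(String utterance) {
        String s = utterance.toLowerCase();
        // [<Polite>] - the optional "please" prefix
        if (s.startsWith("please ")) s = s.substring(7);
        // <Action> - exactly one of the action words must come next
        String action = null;
        for (String a : ACTIONS) {
            if (s.startsWith(a + " ")) {
                action = a;
                s = s.substring(a.length() + 1);
                break;
            }
        }
        if (action == null) return false;
        // <Object> (and <Object>)* - one or more objects joined by "and"
        String[] parts = s.split(" and ");
        for (String p : parts) {
            if (!OBJECTS.contains(p.trim())) return false;
        }
        return parts.length > 0;
    }

    public static void main(String[] args) {
        System.out.println(matches("Open the window"));                      // true
        System.out.println(matches("Please close the window and the file")); // true
        System.out.println(matches("pink is recognizer speech my"));         // false
    }
}
```

Note how the grammar rules translate directly into checks: the optional bracketed rule becomes an optional prefix, the alternatives become list lookups, and the starred group becomes a loop.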

The Java Speech Grammar Format Specification defines the full behavior of rule grammars and discusses how complex grammars can be constructed by combining smaller grammars. With JSGF, application developers can reuse grammars, can provide Javadoc-style documentation and can use the other facilities that enable deployment of advanced speech systems.

2.2.2     Dictation Grammars

Dictation grammars impose fewer restrictions on what can be said, bringing them closer to the ideal of free-form speech input. The cost of this greater freedom is that they require more substantial computing resources, need higher-quality audio input and tend to make more errors.

A dictation grammar is typically larger and more complex than a rule grammar. Dictation grammars are usually developed by statistical training on large collections of written text. Fortunately, developers do not need to know any of this detail, because a speech recognizer that supports a dictation grammar through the Java Speech API has a built-in dictation grammar. An application that needs to use that dictation grammar simply requests a reference to it and enables it when the user might say something matching the dictation grammar.

Dictation grammars may be optimized for particular kinds of text. Often a dictation recognizer may be available with dictation grammars for general purpose text, for legal text, or for various types of medical reporting. In these different domains, different words are used, and the patterns of words also differ.

A dictation recognizer in the Java Speech API supports a single dictation grammar for a specific domain. The application and/or user selects an appropriate dictation grammar when the dictation recognizer is selected and created.

2.2.3     Limitations of Speech Recognition

The two primary limitations of current speech recognition technology are that it does not yet transcribe free-form speech input, and that it makes mistakes. The previous sections discussed how speech recognizers are constrained by grammars. This section considers the issue of recognition errors.

Speech recognizers make mistakes. So do people. But recognizers usually make more. Understanding why recognizers make mistakes, the factors that lead to these mistakes, and how to train users of speech recognition to minimize errors are all important to speech application developers.

The reliability of a speech recognizer is most often defined by its recognition accuracy. Accuracy is usually given as a percentage and is most often the percentage of correctly recognized words. Because the percentage can be measured differently and depends greatly upon the task and the testing conditions it is not always possible to compare recognizers simply by their percentage recognition accuracy. A developer must also consider the seriousness of recognition errors: misrecognition of a bank account number or the command "delete all files" may have serious consequences.
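One common way to compute word-level accuracy (a general measurement technique, not something defined by the Java Speech API) is from the word-level edit distance between a reference transcript and the recognizer's output. The following sketch, with a hypothetical class name, illustrates the idea:

```java
// Word-level recognition accuracy computed from edit distance:
// errors = substitutions + deletions + insertions needed to turn
// the hypothesis into the reference; accuracy = 1 - errors / (reference length).
public class WordAccuracy {
    public static double accuracy(String reference, String hypothesis) {
        String[] ref = reference.toLowerCase().split("\\s+");
        String[] hyp = hypothesis.toLowerCase().split("\\s+");
        // Standard dynamic-programming (Levenshtein) edit distance over words.
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return 1.0 - (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        // One substitution out of three reference words: accuracy ~ 0.667
        System.out.println(accuracy("open the window", "open a window"));
    }
}
```

Note that, as the surrounding text warns, a single percentage hides which words were wrong; misrecognizing "delete" matters far more than misrecognizing "the".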

The following are major factors that influence recognition accuracy:

- Recognition accuracy is usually higher in a quiet environment.
- Higher-quality microphones and audio hardware can improve accuracy.
- Users who speak clearly, but naturally, usually achieve better accuracy.
- Users with accents, atypical voices or colds may get lower accuracy.
- Applications with simpler grammars typically get better accuracy.

While these factors can all be significant, their impact can vary between recognizers because each speech recognizer optimizes its performance by trading off various criteria. For example, some recognizers are designed to work reliably in high-noise environments (e.g. factories and mines) but are restricted to very simple grammars. Dictation systems have complex grammars but require good microphones, quieter environments, clearer speech from users and more powerful computers. Some recognizers adapt their process to the voice of a particular user to improve accuracy, but may require training by the user. Thus, users and application developers often benefit by selecting an appropriate recognizer for a specific task and environment.

Only some of these factors can be controlled programmatically. The primary application-controlled factor that influences recognition accuracy is grammar complexity. Recognizer performance can degrade as grammars become more complex, and can degrade as more grammars are active simultaneously. However, making a user interface more natural and usable sometimes requires the use of more complex and flexible grammars. Thus, application developers often need to consider a trade-off between increased usability with more complex grammars and the decreased recognition accuracy this might cause. These issues are discussed in more detail in Chapter 3 which discusses the effective design of user interfaces with speech technology.

Most recognition errors fall into the following categories:

- Rejection: the user speaks, but the recognizer cannot find a match and returns no result.
- Misrecognition: the recognizer returns a result containing words different from those the user spoke.
- Misfire: the user did not speak to the recognizer, but it returns a result anyway.

Table 2-1 lists some of the common causes of the three types of recognition errors.

Table 2-1 Speech recognition errors and possible causes

Rejection or Misrecognition:
- User speaks one or more words not in the vocabulary.
- User's sentence does not match any active grammar.
- User speaks before the system is ready to listen.
- Words in the active vocabulary sound alike and are confused (e.g., "too", "two").
- User pauses too long in the middle of a sentence.
- User speaks with a disfluency (e.g., restarts the sentence, stumbles, "umm", "ah").
- User's voice trails off at the end of the sentence.
- User has an accent or a cold.
- User's voice is substantially different from stored "voice models" (often a problem with children).
- Computer's audio is not configured properly.
- User's microphone is not properly adjusted.

Misfire:
- Non-speech sound (e.g., a cough or laugh).
- Background speech triggers recognition.
- User is talking with another person.
Chapter 6 describes in detail the use of speech recognition through the Java Speech API, and discusses further ways of improving recognition accuracy and reliability. Chapter 3 looks at how developers should account for possible recognition errors in application design to make the user interface more robust and predictable.


JavaTM Speech API Programmer's Guide
Copyright © 1997-1998 Sun Microsystems, Inc. All rights reserved