Speech technology is still an emerging field, and not all developers are familiar with it. While the basic functions of speech synthesis and speech recognition take only minutes to understand (after all, most people learn to speak and listen by age two), computerized speech also provides subtle and powerful capabilities that developers will want to understand and utilize.
Despite very substantial investment in speech technology research over the last 40 years, speech synthesis and speech recognition technologies still have significant limitations. Most importantly, speech technology does not always meet the high expectations of users familiar with natural human-to-human speech communication. Understanding the limitations - as well as the strengths - is important for effective use of speech input and output in a user interface and for understanding some of the advanced features of the Java Speech API.
An understanding of the capabilities and limitations of speech technology is also important for developers in making decisions about whether a particular application will benefit from the use of speech input and output. Chapter 3 expands on this issue by considering when and where speech input and output can enhance human-to-computer communication.
2.1 Speech Synthesis
A speech synthesizer converts written text into spoken language. Speech synthesis is also referred to as text-to-speech (TTS) conversion.
The major steps in producing speech from text are as follows:
- Structure analysis: process the input text to determine where paragraphs, sentences and other structures start and end. For most languages, punctuation and formatting data are used in this stage.
- Text pre-processing: analyze the input text for special constructs of the language. In English, special treatment is required for abbreviations, acronyms, dates, times, numbers, currency amounts, email addresses and many other forms. Other languages need special processing for these forms and most languages have other specialized requirements.
The result of these first two steps is a spoken form of the written text. The following are examples of the difference between written and spoken text.
St. Mathews hospital is on Main St.
  -> "Saint Mathews hospital is on Main street"

Add $20 to account 55374.
  -> "Add twenty dollars to account five five, three seven four."

Leave at 5:30 on 5/15/99.
  -> "Leave at five thirty on May fifteenth nineteen ninety nine."
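The first two steps can be sketched as a simple rule-based normalizer. The abbreviation table and digit-by-digit number expansion below are illustrative assumptions, not the rules of any particular synthesizer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TextNormalizer {
    // Illustrative abbreviation table; real synthesizers use much larger,
    // context-sensitive lexicons.
    private static final Map<String, String> ABBREVIATIONS = new LinkedHashMap<>();
    static {
        ABBREVIATIONS.put("St.", "Saint");   // ambiguous: could also be "Street"
        ABBREVIATIONS.put("Dr.", "Doctor");  // ambiguous: could also be "Drive"
    }

    private static final String[] ONES = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };

    // Expand each token: abbreviations via table lookup, digit strings
    // digit by digit (as an account number would be spoken).
    public static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            if (ABBREVIATIONS.containsKey(token)) {
                out.append(ABBREVIATIONS.get(token));
            } else if (token.matches("\\d+")) {
                StringBuilder digits = new StringBuilder();
                for (char c : token.toCharArray()) {
                    if (digits.length() > 0) digits.append(' ');
                    digits.append(ONES[c - '0']);
                }
                out.append(digits);
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // -> Saint Mathews account five five three seven four
        System.out.println(normalize("St. Mathews account 55374"));
    }
}
```

Note how even this toy version must choose between the ambiguous readings of "St." — the same choice a real synthesizer must make from context.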
The remaining steps convert the spoken text to speech.
- Text-to-phoneme conversion: convert each word to phonemes. A phoneme is a basic unit of sound in a language. US English has around 45 phonemes including the consonant and vowel sounds. For example, "times" is spoken as four phonemes "t ay m s". Different languages have different sets of sounds (different phonemes). For example, Japanese has fewer phonemes including sounds not found in English, such as "ts" in "tsunami".
- Prosody analysis: process the sentence structure, words and phonemes to determine appropriate prosody for the sentence. Prosody includes many of the features of speech other than the sounds of the words being spoken. This includes the pitch (or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Correct prosody is important for making speech sound right and for correctly conveying the meaning of a sentence.
- Waveform production: finally, the phonemes and prosody information are used to produce the audio waveform for each sentence. There are many ways in which the speech can be produced from the phoneme and prosody information. Most current systems do it in one of two ways: concatenation of chunks of recorded human speech, or formant synthesis using signal processing techniques based on knowledge of how phonemes sound and how prosody affects those phonemes. The details of waveform generation are not typically important to application developers.
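Text-to-phoneme conversion is often dictionary-driven. The following sketch illustrates the lookup idea with a toy lexicon; the entries and phoneme symbols (loosely following the "t ay m s" style above) are assumptions for illustration, and real synthesizers combine large lexicons with letter-to-sound rules for unknown words:

```java
import java.util.HashMap;
import java.util.Map;

public class PhonemeLookup {
    // Tiny illustrative pronunciation lexicon. Real lexicons hold tens of
    // thousands of entries plus letter-to-sound rules for words not listed.
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("times", "t ay m s");
        LEXICON.put("open", "ow p ax n");
        LEXICON.put("window", "w ih n d ow");
    }

    // Look up each word; unknown words are flagged with '?' where a real
    // synthesizer would guess a pronunciation.
    public static String toPhonemes(String sentence) {
        StringBuilder out = new StringBuilder();
        for (String word : sentence.toLowerCase().split("\\s+")) {
            if (out.length() > 0) out.append(" | ");
            out.append(LEXICON.getOrDefault(word, "?" + word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPhonemes("open window"));  // ow p ax n | w ih n d ow
    }
}
```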
2.1.1 Speech Synthesis Limitations
Speech synthesizers can make errors in any of the processing steps described above. Human ears are well-tuned to detecting these errors, so careful work by developers can minimize errors and improve the speech output quality.
The Java Speech API and the Java Speech Markup Language (JSML) provide many ways for an application developer to improve the output quality of a speech synthesizer. Chapter 5 describes programming techniques for controlling a synthesizer through the Java Speech API.
The Java Speech Markup Language defines how to mark up text input to a speech synthesizer with information that enables the synthesizer to enhance the speech output quality. It is described in detail in the Java Speech Markup Language Specification. In brief, its quality-enhancing features include:
- Ability to specify pronunciations for any word, acronym, abbreviation or other special text representation.
- Explicit control of pauses, boundaries, emphasis, pitch, speaking rate and loudness to improve the output prosody.
These features allow a developer or user to override the behavior of a speech synthesizer to correct most of the potential errors described above. The following is a description of some of the sources of errors and how to minimize problems.
- Structure analysis: punctuation and formatting do not consistently indicate where paragraphs, sentences and other structures start and end. For example, the final period in "U.S.A." might be misinterpreted as the end of a sentence.
Try: Explicitly marking paragraphs and sentences in JSML reduces the number of structural analysis errors.
- Text pre-processing: it is not possible for a synthesizer to know all the abbreviations and acronyms of a language. Nor is it always possible for a synthesizer to determine how to process dates and times: for example, is "8/5" the "eighth of May" or the "fifth of August"? Should "1998" be read as "nineteen ninety eight" (a year), as "one thousand nine hundred and ninety eight" (a regular number) or as "one nine nine eight" (part of a telephone number)? Special constructs such as email addresses are particularly difficult to interpret; for example, should a synthesizer say "email@example.com" as "Ted Wards", as "T. Edwards", as "Cat dot com" or as "C. A. T. dot com"?
Try: The SAYAS element of JSML supports substitution of text for abbreviations, acronyms and other idiosyncratic textual forms.
- Text-to-phoneme conversion: most synthesizers can pronounce tens of thousands or even hundreds of thousands of words correctly. However, there are always new words for which a synthesizer must guess a pronunciation (especially proper names of people, companies, products, etc.), and words for which the pronunciation is ambiguous (for example, "object" as "OBject" or "obJECT", or "row" as a line or as a fight).
Try: The SAYAS element of JSML can be used to provide phonetic pronunciations for unusual and ambiguous words.
- Prosody analysis: correctly phrasing a sentence, producing the correct melody for it, and correctly emphasizing words ideally require an understanding of meaning that computers do not possess. Instead, speech synthesizers must try to guess what a human might produce, and at times the guess sounds artificial and unnatural.
Try: The EMP, BREAK and PROS elements of JSML can be used to indicate preferred emphasis, pausing and prosodic rendering, respectively, for text.
- Waveform production: without lips, mouths, lungs and the other apparatus of human speech, a speech synthesizer will often produce speech which sounds artificial, mechanical or otherwise different from human speech. In some circumstances a robotic sound is desirable, but for most applications speech that sounds as close to human as possible is easier to understand and easier to listen to for long periods of time.
Try: The Java Speech API and JSML do not directly address this issue.
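As an illustration of the SAYAS substitution mentioned above, marked-up input might look like the following. This is a hypothetical fragment; consult the Java Speech Markup Language Specification for the authoritative element and attribute forms:

```xml
<SAYAS SUB="Saint Mathews hospital">St. Mathews hospital</SAYAS>
is on
<SAYAS SUB="Main Street">Main St.</SAYAS>
```

By providing the substitution explicitly, the developer removes the ambiguity between "Saint" and "Street" rather than relying on the synthesizer's guess.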
2.1.2 Speech Synthesis Assessment
The major feature of a speech synthesizer that affects its understandability, its acceptance by users and its usefulness to application developers is its output quality. Knowing how to evaluate speech synthesis quality and knowing the factors that influence the output quality are important in the deployment of speech synthesis.
Humans are conditioned by a lifetime of listening and speaking. The human ear and brain are very sensitive to small changes in speech quality. A listener can detect changes that might indicate a speaker's emotional state, an accent, a speech problem or many other factors. The quality of current speech synthesis remains below that of human speech, so listeners must make more effort than normal to understand synthesized speech and must ignore errors. For new users, listening to a speech synthesizer for extended periods can be tiring and unsatisfactory.
The two key factors a developer must consider when assessing the quality of a speech synthesizer are its understandability and its naturalness. Understandability is an indication of how reliably a listener will understand the words and sentences spoken by the synthesizer. Naturalness is an indication of the extent to which the synthesizer sounds like a human - a characteristic that is desirable for most, but not all, applications.
Understandability is affected by the ability of a speech synthesizer to perform all the processing steps described above because any error by the synthesizer has the potential to mislead a listener. Naturalness is affected more by the later stages of processing, particularly the processing of prosody and the generation of the speech waveform.
Though it might seem counter-intuitive, it is possible to have an artificial-sounding voice that is highly understandable. Similarly, it is possible to have a voice that sounds natural but is not always easy to understand (though this is less common).
2.2 Speech Recognition
Speech recognition is the process of converting spoken language to written text or some similar form.
The major steps of a typical speech recognizer are:
- Grammar design: recognition grammars define the words that may be spoken by a user and the patterns in which they may be spoken. A grammar must be created and activated for a recognizer to know what it should listen for in incoming audio. Grammars are described below in more detail.
- Signal processing: analyze the spectrum (frequency) characteristics of the incoming audio.
- Phoneme recognition: compare the spectrum patterns to the patterns of the phonemes of the language being recognized. (A brief description of phonemes is provided in the "Speech Synthesis" section in the discussion of text-to-phoneme conversion.)
- Word recognition: compare the sequence of likely phonemes against the words and patterns of words specified by the active grammars.
- Result generation: provide the application with information about the words the recognizer has detected in the incoming audio. The result information is always provided once recognition of a single utterance (often a sentence) is complete, but may also be provided during the recognition process. The result always indicates the recognizer's best guess of what a user said, but may also indicate alternative guesses.
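The phoneme-recognition and word-recognition steps can be caricatured as lookup against an inverse pronunciation lexicon. The phoneme strings below are assumptions for illustration, and the sketch ignores the scoring of many competing hypotheses that a real recognizer performs:

```java
import java.util.HashMap;
import java.util.Map;

public class WordMatcher {
    // Inverse pronunciation lexicon (phoneme sequence -> word);
    // illustrative entries only.
    private static final Map<String, String> BY_PHONEMES = new HashMap<>();
    static {
        BY_PHONEMES.put("ow p ax n", "open");
        BY_PHONEMES.put("k l ow s", "close");
        BY_PHONEMES.put("w ih n d ow", "window");
    }

    // Map a sequence of recognized phoneme groups back to words. Groups
    // that match no word are marked rejected, where a real recognizer
    // would instead fall back to its next-best hypothesis.
    public static String recognize(String[] phonemeGroups) {
        StringBuilder out = new StringBuilder();
        for (String group : phonemeGroups) {
            if (out.length() > 0) out.append(' ');
            out.append(BY_PHONEMES.getOrDefault(group, "<rejected>"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // -> open window
        System.out.println(recognize(new String[]{"ow p ax n", "w ih n d ow"}));
    }
}
```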
Most of the processes of a speech recognizer are automatic and are not controlled by the application developer. Similarly, factors such as microphone placement, background noise, sound card quality, system training, CPU power and speaker accent all affect recognition performance but are beyond an application's control.
The primary way in which an application controls the activity of a recognizer is through control of its grammars.
A grammar is an object in the Java Speech API which indicates what words a user is expected to say and in what patterns those words may occur. Grammars are important to speech recognizers because they constrain the recognition process. These constraints make recognition faster and more accurate because the recognizer does not have to check for bizarre sentences, for example, "pink is recognizer speech my".
The Java Speech API supports two basic grammar types: rule grammars and dictation grammars. These grammar types differ in the way in which applications set up the grammars, the types of sentences they allow, the way in which results are provided, the amount of computational resources required, and the way in which they are effectively used in application design. The grammar types are described in more detail below. The programmatic control of grammars is detailed in Chapter 6.
Other speech recognizer controls available to a Java application include pausing and resuming the recognition process, direction of result events and other events relating to the recognition processes, and control of the recognizer's vocabulary.
2.2.1 Rule Grammars
In a rule-based speech recognition system, an application provides the recognizer with rules that define what the user is expected to say. These rules constrain the recognition process. Careful design of the rules, combined with careful user interface design, will produce rules that allow users reasonable freedom of expression while still limiting the range of things that may be said so that the recognition process is as fast and accurate as possible.
Any speech recognizer that supports the Java Speech API must support rule grammars.
The following is an example of a simple rule grammar. It is represented in the Java Speech Grammar Format (JSGF) which is defined in detail in the Java Speech Grammar Format Specification.
#JSGF V1.0;

// Define the grammar name
grammar SimpleCommands;

// Define the rules
public <Command> = [<Polite>] <Action> <Object> (and <Object>)*;
<Action> = open | close | delete;
<Object> = the window | the file;
<Polite> = please;
Rule names are surrounded by angle brackets. Words that may be spoken are written as plain text. This grammar defines one public rule, <Command>, that may be spoken by users. The rule is a combination of three sub-rules: <Action>, <Object> and <Polite>. The square brackets around the reference to <Polite> mean that it is optional. The parentheses group the word "and" and the reference to <Object> together, and the asterisk following the group indicates that it may occur zero or more times.
The grammar allows a user to say commands such as "Open the window" and "Please close the window and the file".
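To illustrate which sentences the SimpleCommands grammar accepts, the <Command> rule can be approximated with a regular expression. This is only a sketch of the matching behavior, not how a recognizer or a JSGF parser is implemented:

```java
import java.util.regex.Pattern;

public class SimpleCommands {
    // Regular-expression rendering of the <Command> rule:
    //   [<Polite>] <Action> <Object> (and <Object>)*
    private static final String OBJECT = "the (window|file)";
    private static final Pattern COMMAND = Pattern.compile(
        "(please )?(open|close|delete) " + OBJECT + "( and " + OBJECT + ")*");

    public static boolean matches(String sentence) {
        return COMMAND.matcher(sentence.toLowerCase()).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("open the window"));                      // true
        System.out.println(matches("please close the window and the file")); // true
        System.out.println(matches("the window open"));                      // false
    }
}
```

The last example shows the constraint at work: a word-order the grammar does not define is simply not a candidate sentence, which is what makes grammar-constrained recognition faster and more accurate.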
The Java Speech Grammar Format Specification defines the full behavior of rule grammars and discusses how complex grammars can be constructed by combining smaller grammars. With JSGF application developers can reuse grammars, can provide Javadoc-style documentation and can use the other facilities that enable deployment of advanced speech systems.
2.2.2 Dictation Grammars
Dictation grammars impose fewer restrictions on what can be said, making them closer to providing the ideal of free-form speech input. The cost of this greater freedom is that they require more substantial computing resources, require higher quality audio input and tend to make more errors.
A dictation grammar is typically larger and more complex than a rule grammar. Dictation grammars are typically developed by statistical training on large collections of written text. Fortunately, developers don't need to know any of this because a speech recognizer that supports dictation through the Java Speech API has a built-in dictation grammar. An application that needs dictation simply requests a reference to that grammar and enables it when the user might say something matching it.
Dictation grammars may be optimized for particular kinds of text. Often a dictation recognizer may be available with dictation grammars for general purpose text, for legal text, or for various types of medical reporting. In these different domains, different words are used, and the patterns of words also differ.
A dictation recognizer in the Java Speech API supports a single dictation grammar for a specific domain. The application and/or user selects an appropriate dictation grammar when the dictation recognizer is selected and created.
2.2.3 Limitations of Speech Recognition
The two primary limitations of current speech recognition technology are that it does not yet transcribe free-form speech input, and that it makes mistakes. The previous sections discussed how speech recognizers are constrained by grammars. This section considers the issue of recognition errors.
Speech recognizers make mistakes. So do people. But recognizers usually make more. Understanding why recognizers make mistakes, the factors that lead to these mistakes, and how to train users of speech recognition to minimize errors are all important to speech application developers.
The reliability of a speech recognizer is most often defined by its recognition accuracy. Accuracy is usually given as a percentage and is most often the percentage of correctly recognized words. Because the percentage can be measured differently and depends greatly upon the task and the testing conditions it is not always possible to compare recognizers simply by their percentage recognition accuracy. A developer must also consider the seriousness of recognition errors: misrecognition of a bank account number or the command "delete all files" may have serious consequences.
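Word-level accuracy of this kind is typically computed by aligning the recognized words against a reference transcript using an edit-distance alignment. The following is a minimal sketch; exact scoring conventions vary between evaluations:

```java
public class WordAccuracy {
    // Minimum word-level edit distance (substitutions, insertions,
    // deletions) between reference and hypothesis transcripts.
    static int editDistance(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[ref.length][hyp.length];
    }

    // Percentage accuracy: 100 * (1 - errors / reference length).
    public static double accuracy(String reference, String hypothesis) {
        String[] ref = reference.split("\\s+");
        String[] hyp = hypothesis.split("\\s+");
        return 100.0 * (1.0 - (double) editDistance(ref, hyp) / ref.length);
    }

    public static void main(String[] args) {
        System.out.println(accuracy("open the window", "open the window")); // 100.0
        System.out.println(accuracy("open the window", "open a window"));
    }
}
```

Such a percentage says nothing about the seriousness of individual errors, which is why it must be weighed alongside the consequences of misrecognition described above.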
The following is a list of major factors that influence recognition accuracy.
- Grammar confusability: applications with less confusable grammars typically get better accuracy, because similar-sounding words are harder to distinguish.
While these factors can all be significant, their impact can vary between recognizers because each speech recognizer optimizes its performance by trading off various criteria. For example, some recognizers are designed to work reliably in high-noise environments (e.g. factories and mines) but are restricted to very simple grammars. Dictation systems have complex grammars but require good microphones, quieter environments, clearer speech from users and more powerful computers. Some recognizers adapt their process to the voice of a particular user to improve accuracy, but may require training by the user. Thus, users and application developers often benefit by selecting an appropriate recognizer for a specific task and environment.
Only some of these factors can be controlled programmatically. The primary application-controlled factor that influences recognition accuracy is grammar complexity. Recognizer performance can degrade as grammars become more complex, and can degrade as more grammars are active simultaneously. However, making a user interface more natural and usable sometimes requires the use of more complex and flexible grammars. Thus, application developers often need to consider a trade-off between increased usability with more complex grammars and the decreased recognition accuracy this might cause. These issues are discussed in more detail in Chapter 3 which discusses the effective design of user interfaces with speech technology.
Most recognition errors fall into the following categories:
- Rejection: the user speaks but the recognizer cannot understand what was said. The outcome is that the recognizer does not produce a successful recognition result. In the Java Speech API, applications receive an event that indicates the rejection of a result.
- Misrecognition: the recognizer returns a result with words that are different from what the user spoke. This is the most common type of recognition error.
- Misfire: the recognizer returns a result although the user did not speak to it, for example, when a non-speech sound or background speech triggers recognition.
Table 2-1 lists some of the common causes of the three types of recognition errors.
Table 2-1 Speech recognition errors and possible causes

Problem: Rejection or Misrecognition
Possible causes:
- User speaks one or more words not in the vocabulary.
- User's sentence does not match any active grammar.
- User speaks before the system is ready to listen.
- Words in the active vocabulary sound alike and are confused (e.g., "too", "two").
- User pauses too long in the middle of a sentence.
- User speaks with a disfluency (e.g., restarts the sentence, stumbles, "umm", "ah").
- User's voice trails off at the end of the sentence.
- User has an accent or a cold.
- User's voice is substantially different from stored "voice models" (often a problem with children).
- Computer's audio is not configured properly.
- User's microphone is not properly adjusted.

Problem: Misfire
Possible causes:
- Non-speech sound (e.g., cough, laugh).
- Background speech triggers recognition.
- User is talking with another person.
Chapter 6 describes in detail the use of speech recognition through the Java Speech API. Ways of improving recognition accuracy and reliability are discussed further. Chapter 3 looks at how developers should account for possible recognition errors in application design to make the user interface more robust and predictable.
Speech API Programmer's Guide
Copyright © 1997-1998 Sun Microsystems, Inc. All rights reserved
Send comments or corrections to firstname.lastname@example.org