
Chapter 5


Speech Synthesis: javax.speech.synthesis
 

A speech synthesizer is a speech engine that converts text to speech. The javax.speech.synthesis package defines the Synthesizer interface to support speech synthesis plus a set of supporting classes and interfaces. The basic functional capabilities of speech synthesizers, some of the uses of speech synthesis and some of the limitations of speech synthesizers are described in Section 2.1.

As a type of speech engine, much of the functionality of a Synthesizer is inherited from the Engine interface in the javax.speech package and from other classes and interfaces in that package. The javax.speech package and generic speech engine functionality are described in Chapter 4.

This chapter describes how to write Java applications and applets that use speech synthesis. We begin with a simple example, and then review the speech synthesis capabilities of the API in more detail.


5.1     "Hello World!"

The following code shows a simple use of speech synthesis to speak the string "Hello World".


import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

public class HelloWorld {
    public static void main(String args[]) {
        try {
            // Create a synthesizer for English
            Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.ENGLISH));

            // Get it ready to speak
            synth.allocate();
            synth.resume();

            // Speak the "Hello world" string
            synth.speakPlainText("Hello, world!", null);

            // Wait till speaking is done
            synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

            // Clean up
            synth.deallocate();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This example illustrates the four basic steps which all speech synthesis applications must perform. Let's examine each step in detail.

- Create: the Central class is used to obtain a synthesizer that matches the requirements of a SynthesizerModeDesc, here one that speaks English.
- Allocate and Resume: the allocate and resume methods prepare the Synthesizer to produce speech.
- Generate: the speakPlainText method requests output of the string, placing it on the synthesizer's speech output queue.
- Deallocate: the waitEngineState method blocks until speaking is complete, and the deallocate method frees the synthesizer's resources.


5.2     Synthesizer as an Engine

The basic functionality provided by a Synthesizer is speaking text, managing a queue of text to be spoken, and producing events as these functions proceed. The Synthesizer interface extends the Engine interface to provide this functionality.

The following is a list of the functionality that the javax.speech.synthesis package inherits from the javax.speech package, with some of the ways in which that functionality is specialized (a brief sketch follows the list):

- Engine selection and creation through the Central class, using a SynthesizerModeDesc to describe the required synthesizer.
- The engine allocation states and the allocate and deallocate methods that move a Synthesizer between them.
- The PAUSED and RESUMED states, which a Synthesizer extends with the parallel QUEUE_EMPTY and QUEUE_NOT_EMPTY states described in Section 5.4.
- Engine events delivered to an EngineListener; the SynthesizerListener interface and SynthesizerEvent class specialize this mechanism.
- Run-time engine properties, which the SynthesizerProperties interface extends as described in Section 5.6.
- Access to the AudioManager and VocabManager.
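As a minimal sketch of this inherited behavior, the following code exercises only methods that a Synthesizer inherits from the Engine interface once it has been created (the synthesizer-specific methods are covered in the rest of this chapter):

import java.util.Locale;
import javax.speech.*;
import javax.speech.synthesis.*;

public class EngineLifeCycle {
    public static void main(String args[]) {
        try {
            // Creation and selection: the inherited Central mechanism
            Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.ENGLISH));

            // Allocation and the engine state system
            synth.allocate();
            synth.waitEngineState(Engine.ALLOCATED);

            // Pause and resume are generic Engine behavior
            synth.pause();
            synth.resume();

            // testEngineState checks a state without blocking
            if (synth.testEngineState(Engine.RESUMED))
                System.out.println("Ready to speak.");

            synth.deallocate();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}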


5.3     Speaking Text

The Synthesizer interface provides four methods for submitting text to a speech synthesizer to be spoken. These methods differ in the formatting of the provided text and in the type of object from which the text is produced. All four methods share one feature: they allow a listener to be passed that will receive notifications as output of the text proceeds.

The simplest method - speakPlainText - takes text as a String object. This method is illustrated in the "Hello World!" example at the beginning of this chapter. As the method name implies, this method treats the input text as plain text without any of the formatting described below.

The remaining three speaking methods - all named speak - treat the input text as being specially formatted with the Java Speech Markup Language (JSML). JSML is an application of XML (eXtensible Markup Language), a data format for structured document interchange on the internet. JSML allows application developers to annotate text with structural and presentation information to improve the speech output quality. JSML is defined in detail in a separate technical document, "The Java Speech Markup Language Specification."

The three speak methods retrieve the JSML text from different Java objects. The three methods are:

void speak(Speakable text, SpeakableListener listener);
void speak(URL text, SpeakableListener listener);
void speak(String text, SpeakableListener listener);

The first version accepts an object that implements the Speakable interface. The Speakable interface is a simple interface defined in the javax.speech.synthesis package that contains a single method: getJSMLText. This method should return a String containing text formatted with JSML.

Virtually any Java object can implement the Speakable interface by implementing the getJSMLText method. For example, the cells of a spreadsheet, the text of an editing window, or extended AWT classes could all implement the Speakable interface.

The Speakable interface is intended to provide the spoken version of the toString method of the Object class. That is, Speakable allows an object to define how it should be spoken. For example:


public class MyAWTObj extends Component implements Speakable {
    ...

    public String getJSMLText() {
        ...
    }
}

{
    MyAWTObj obj = new MyAWTObj();
    synthesizer.speak(obj, null);
}

The second variant of the speak method accepts a URL from which JSML text is loaded, allowing a document to be retrieved directly from a web site and spoken.

The third variant of the speak method takes a JSML string. Its use is straightforward.

For each of the three speak methods that accept JSML-formatted text, a JSMLException is thrown if any formatting errors are detected. Developers familiar with editing HTML documents should note that XML is much stricter about syntax. It is generally advisable to check XML documents (such as JSML) with XML tools before publishing them.
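As a brief sketch, the following code speaks both a JSML string and JSML text loaded from a URL, catching JSMLException separately; the URL shown is a hypothetical placeholder:

import java.net.URL;
import javax.speech.synthesis.*;

public class SpeakJSML {
    static void speakSources(Synthesizer synth) {
        try {
            // Speak a JSML string directly
            synth.speak("Press <EMP>start</EMP> to begin.", null);

            // Load and speak JSML text from a web site
            // (the URL here is a placeholder)
            synth.speak(new URL("http://www.example.com/welcome.jsml"), null);
        } catch (JSMLException e) {
            // The JSML text contained formatting errors
            System.err.println("Bad JSML: " + e);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}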

The following sections describe the speech output queue onto which objects are placed by calls to the speak methods, and the mechanisms for monitoring and managing that queue.


5.4     Speech Output Queue

Each call to the speak and speakPlainText methods places an object onto the synthesizer's speech output queue. The speech output queue is a FIFO queue: first-in-first-out. This means that objects are spoken in the order in which they are received.

The item at the top of the queue is the head of the queue: it is the item currently being spoken, or the item that will be spoken next when a paused synthesizer is resumed.

The Synthesizer interface provides a number of methods for manipulating the output queue. The enumerateQueue method returns an Enumeration object containing a SynthesizerQueueItem for each object on the queue. The first object in the enumeration is the top of queue. If the queue is empty the enumerateQueue method returns null.

Each SynthesizerQueueItem in the enumeration contains four properties, each with an accessor method: getSource returns the object placed on the queue (a Speakable object, a URL or a String), getText returns the item's text as a String, isPlainText indicates whether the item is plain text or JSML, and getSpeakableListener returns the listener provided when the item was queued (or null if none was provided).
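The following is a minimal sketch that walks the queue with these accessor methods (printing each item is an arbitrary choice for illustration):

import java.util.Enumeration;
import javax.speech.synthesis.*;

public class QueueInspector {
    static void printQueue(Synthesizer synth) {
        // enumerateQueue returns null for an empty queue
        Enumeration e = synth.enumerateQueue();
        if (e == null) {
            System.out.println("Queue is empty");
            return;
        }
        // The first element is the top of the queue
        while (e.hasMoreElements()) {
            SynthesizerQueueItem item =
                (SynthesizerQueueItem) e.nextElement();
            String type = item.isPlainText() ? "plain" : "JSML";
            System.out.println(type + ": " + item.getText());
        }
    }
}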

The state of the queue is an explicit state of the Synthesizer. The Synthesizer interface defines a state system for QUEUE_EMPTY and QUEUE_NOT_EMPTY. Any Synthesizer in the ALLOCATED state must be in one and only one of these two states.

The QUEUE_EMPTY and QUEUE_NOT_EMPTY states are parallel states to the PAUSED and RESUMED states. These two state systems operate independently as shown in Figure 5-1 (an extension of Figure 4-2).

The SynthesizerEvent class extends the EngineEvent class with the QUEUE_UPDATED and QUEUE_EMPTIED events which indicate changes in the queue state.

The "Hello World!" example shows one use of the queue status. It calls the waitEngineState method to test when the synthesizer returns to the QUEUE_EMPTY state. This test determines when the synthesizer has completed output of all objects on the speech output queue.

The queue status and transitions in and out of the ALLOCATED state are linked. When a Synthesizer is newly ALLOCATED it always starts in the QUEUE_EMPTY state since no objects have yet been placed on the queue. Before a synthesizer is deallocated (before leaving the ALLOCATED state) a synthesizer must return to the QUEUE_EMPTY state. If the speech output queue is not empty when the deallocate method is called, all objects on the speech output queue are automatically cancelled by the synthesizer. By contrast, the initial and final states for PAUSED and RESUMED are not defined because the pause/resume state may be shared by multiple applications.

The Synthesizer interface defines three cancel methods that allow an application to request that one or more objects be removed from the speech output queue:

void cancel();
void cancel(Object source);
void cancelAll();

The first of these three methods cancels the object at the top of the speech output queue. If that object is currently being spoken, the speech output is stopped and then the object is removed from the queue. The SpeakableListener for the item receives a SPEAKABLE_CANCELLED event. The SynthesizerListener receives a QUEUE_UPDATED event, unless the item was the last one on the queue in which case a QUEUE_EMPTIED event is issued.

The second cancel method requires that a source object be specified. The object should be one of the items currently on the queue: a Speakable, a URL, or a String. The actions are much the same as for the first cancel method except that if the item is not top-of-queue, then speech output is not affected.

The final cancel method - cancelAll - removes all items from the speech output queue. Each item receives a SPEAKABLE_CANCELLED event and the SynthesizerListener receives a QUEUE_EMPTIED event. The SPEAKABLE_CANCELLED events are issued to items in the order of the queue.
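As a small sketch, an application could withdraw one specific pending item and then clear the remainder of the queue; whether cancel throws IllegalArgumentException for an object that is not on the queue is an assumption worth checking against the implementation's documentation:

import javax.speech.synthesis.*;

public class CancelDemo {
    static void cancelPending(Synthesizer synth, String pendingText) {
        try {
            // Remove one item; current speech output is unaffected
            // unless this item is at the top of the queue
            synth.cancel(pendingText);
        } catch (IllegalArgumentException e) {
            // Assumed behavior when the object is not on the queue
            System.err.println("Item was not queued: " + e);
        }

        // Remove everything else; each item's listener receives a
        // SPEAKABLE_CANCELLED event, in queue order
        synth.cancelAll();
    }
}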


5.5     Monitoring Speech Output

All the speak and speakPlainText methods accept a SpeakableListener as the second input parameter. To request notification of events as the speech object is spoken, an application provides a non-null listener.

Unlike a SynthesizerListener that receives synthesizer-level events, a SpeakableListener receives events associated with output of individual text objects: output of Speakable objects, output of URLs, output of JSML strings, or output of plain text strings.

The mechanism for attaching a SpeakableListener through the speak and speakPlainText methods differs slightly from the normal attachment and removal of listeners. There are, however, addSpeakableListener and removeSpeakableListener methods on the Synthesizer interface. These methods attach and remove listeners that receive notifications of events for all objects spoken by the Synthesizer.
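For example, a minimal sketch of a synthesizer-wide listener (logging completed items to standard output is an arbitrary choice):

import javax.speech.synthesis.*;

public class OutputLogger extends SpeakableAdapter {
    // Called for every item the synthesizer finishes speaking
    public void speakableEnded(SpeakableEvent e) {
        System.out.println("Finished speaking: " + e.getSource());
    }

    static void attach(Synthesizer synth) {
        // Receives events for all spoken objects, not just one item
        synth.addSpeakableListener(new OutputLogger());
    }
}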

The SpeakableEvent class defines eight events that indicate progress of spoken output of a text object. For each of these eight event types, there is a matching method in the SpeakableListener interface. For convenience, a SpeakableAdapter implementation of the SpeakableListener interface is provided with trivial (empty) implementations of all eight methods.

The normal sequence of events as an object is spoken is as follows: a TOP_OF_QUEUE event when the item reaches the top of the speech output queue, a SPEAKABLE_STARTED event as audio output of the item begins, WORD_STARTED and MARKER_REACHED events as individual words are spoken and JSML markers are reached during output, and a SPEAKABLE_ENDED event once output of the item is complete.

The remaining event types - SPEAKABLE_PAUSED, SPEAKABLE_RESUMED and SPEAKABLE_CANCELLED - are modifications to the normal event sequence.

The following is an example of the use of the SpeakableListener interface to monitor the progress of speech output. It shows how a training application could synchronize speech synthesis with animation.

It places two JSML string objects onto the output queue and requests notifications to itself. The speech output will be:

"First, use the mouse to open the file menu.
Then, select the save command."

At the start of output of each string, the speakableStarted method will be called. By checking the source of the event, we can determine which text is being spoken and trigger the appropriate animation code.


public class TrainingApp extends SpeakableAdapter {

    String openMenuText = "First, use the mouse to open the file menu.";

    // The EMP element indicates emphasis of a word
    String selectSaveText = "Then, select the <EMP>save</EMP> command.";

    public void sendText(Synthesizer synth) throws JSMLException {
        // Insert the two objects into the speech queue
        // specifying self as recipient of SpeakableEvents.
        synth.speak(openMenuText, this);
        synth.speak(selectSaveText, this);
    }

    // Override the empty method in SpeakableAdapter
    public void speakableStarted(SpeakableEvent e) {
        if (e.getSource() == openMenuText) {
            // animate the opening of the file menu
        } else if (e.getSource() == selectSaveText) {
            // animate the selection of 'save'
        }
    }
}


5.6     Synthesizer Properties

The SynthesizerProperties interface extends the EngineProperties interface described in Section 4.6.1. The JavaBeans property mechanisms, the asynchronous application of property changing, and the property change event notifications are all inherited engine behavior and are described in that section.

The SynthesizerProperties object is obtained by calling the getEngineProperties method (inherited from the Engine interface) or the getSynthesizerProperties method. Both methods return the same object instance, but the latter is more convenient since it is an appropriately cast object.

The SynthesizerProperties interface defines five synthesizer properties that can be modified during operation of a synthesizer to affect speech output.

The voice property is used to control the speaking voice of the synthesizer. The set of voices supported by a synthesizer can be obtained by the getVoices method of the synthesizer's SynthesizerModeDesc object. Each voice is defined by a voice name, gender, age and speaking style. Selection of voices is described in more detail in Section 5.6.1, Selecting Voices.

The remaining four properties control prosody. Prosody is a set of features of speech including the pitch and intonation, rhythm and timing, stress and other characteristics which affect the style of the speech. The prosodic features controlled through the SynthesizerProperties interface are the baseline pitch (in hertz), the pitch range (in hertz), the speaking rate (in words per minute) and the volume (on a scale from 0.0 to 1.0).

The following code shows how to increase the speaking rate for a synthesizer by 30 words per minute.


float increaseSpeakingRate(Synthesizer synth)
        throws java.beans.PropertyVetoException {
    SynthesizerProperties props = synth.getSynthesizerProperties();
    float newSpeakingRate = props.getSpeakingRate() + 30.0f;
    props.setSpeakingRate(newSpeakingRate);
    return newSpeakingRate;
}

As with all engine properties, changes to synthesizer properties are not necessarily instant. The change should take effect as soon as the synthesizer can apply it. Depending on the underlying technology, a property change may take effect immediately, or at the next phoneme, word, phrase or sentence boundary, or at the beginning of output of the next item in the synthesizer's queue.

So that an application knows when the change has actually taken effect, the synthesizer generates a property change event for each call to a set method in the SynthesizerProperties interface.
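A sketch of how an application might observe these notifications, using the JavaBeans listener mechanism inherited from EngineProperties (printing the old and new values is an arbitrary choice):

import java.beans.PropertyChangeEvent;
import java.beans.PropertyChangeListener;
import javax.speech.synthesis.*;

public class PropertyWatcher {
    static void watch(Synthesizer synth) {
        SynthesizerProperties props = synth.getSynthesizerProperties();
        props.addPropertyChangeListener(new PropertyChangeListener() {
            // Called once a property change has actually taken effect
            public void propertyChange(PropertyChangeEvent e) {
                System.out.println(e.getPropertyName() + ": "
                    + e.getOldValue() + " -> " + e.getNewValue());
            }
        });
    }
}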

5.6.1     Selecting Voices

Most speech synthesizers are able to produce a number of voices. In most cases voices attempt to sound natural and human, but some voices may be deliberately mechanical or robotic.

The Voice class is used to encapsulate the four features that describe each voice: voice name, gender, age and speaking style. The voice name and speaking style are both String objects and the contents of those strings are determined by the synthesizer. Typical voice names might be "Victor", "Monica", "Ahmed", "Jose", "My Robot" or something completely different. Speaking styles might include "casual", "business", "robotic" or "happy" (or similar words in other languages) but the API does not impose any restrictions upon the speaking styles. For both voice name and speaking style, synthesizers are encouraged to use strings that are meaningful to users so that they can make sensible judgements when selecting voices.

By contrast the gender and age are both defined by the API so that programmatic selection is possible. The gender of a voice can be GENDER_FEMALE, GENDER_MALE, GENDER_NEUTRAL or GENDER_DONT_CARE. Male and female are hopefully self-explanatory. Gender neutral is intended for voices that are not clearly male or female such as some robotic or artificial voices. The "don't care" values are used when selecting a voice and the feature is not relevant.

The age of a voice can be AGE_CHILD (up to 12 years), AGE_TEENAGER (13-19), AGE_YOUNGER_ADULT (20-40), AGE_MIDDLE_ADULT (40-60), AGE_OLDER_ADULT (60+), AGE_NEUTRAL, and AGE_DONT_CARE.

Both gender and age are OR'able values for both applications and engines. For example, an engine could specify a voice as:

Voice("name", GENDER_MALE, AGE_CHILD | AGE_TEENAGER, "style");

In the same way that mode descriptors are used by engines to describe themselves and by applications to select from amongst available engines, the Voice class is used both for description and selection. The match method of Voice allows an application to test whether an engine-provided voice has suitable properties.

The following code shows the use of the match method to identify voices of a synthesizer that are either male or female voices and that are younger or middle adults (between 20 and 60). The SynthesizerModeDesc object may be one obtained through the Central class or through the getEngineModeDesc method of a created Synthesizer.


SynthesizerModeDesc desc = ...;
Voice[] voices = desc.getVoices();

// Look for male or female voices that are young/middle adult
Voice myVoice = new Voice();
myVoice.setGender(Voice.GENDER_MALE | Voice.GENDER_FEMALE);
myVoice.setAge(Voice.AGE_YOUNGER_ADULT | Voice.AGE_MIDDLE_ADULT);

for (int i = 0; i < voices.length; i++)
    if (voices[i].match(myVoice))
        doAction(voices[i]);

The Voice object can also be used in the selection of a speech synthesizer. The following code illustrates how to create a synthesizer with a young female Japanese voice.


SynthesizerModeDesc required = new SynthesizerModeDesc();
Voice voice = new Voice(null, Voice.GENDER_FEMALE,
                        Voice.AGE_CHILD | Voice.AGE_TEENAGER, null);
required.addVoice(voice);
required.setLocale(Locale.JAPAN);

Synthesizer synth = Central.createSynthesizer(required);

5.6.2     Property Changes in JSML

In addition to control of speech output through the SynthesizerProperties interface, all five synthesizer properties can be controlled in JSML text provided to a synthesizer. The advantage of control through JSML text is that property changes can be finely controlled within a text document. By contrast, control of the synthesizer properties through the SynthesizerProperties interface is not appropriate for word-level changes but is instead useful for setting the default configuration of the synthesizer. Control through the SynthesizerProperties interface is often presented to the user as a graphical configuration window.

Applications that generate JSML text should respect the default settings of the user. To do this, relative settings of parameters such as pitch and speaking rate should be used rather than absolute settings.

For example, users with vision impairments often set the speaking rate extremely high - up to 500 words per minute - so high that most people cannot understand the synthesized speech. If a document applies an absolute speaking-rate change (say, to 200 words per minute, which is fast for most users but far slower than this user's preference), the user will be frustrated.
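For instance, a relative rate change such as the first line below scales whatever rate the user has chosen, whereas an absolute setting such as the second replaces it (an illustrative sketch; the exact attribute forms are defined in the JSML specification):

The next step is <PROS RATE="-20%">very important</PROS>.
The next step is <PROS RATE="200">very important</PROS>.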

Changes made to the synthesizer properties through the SynthesizerProperties interface are persistent: they affect all succeeding speech output. Changes in JSML are explicitly localized (all property changes in JSML have both start and end tags).

5.6.3     Controlling Prosody

The prosody and voice properties can be used within JSML text to substantially improve the clarity and naturalness of the speech output. For example, one time to change prosodic settings is when providing new, important or detailed information. In this instance it is typical for a speaker to slow down, emphasise more words and often add extra pauses. Putting equivalent changes into synthetic speech will help a listener understand the message.

For example, in response to the question "How many Acme shares do I have?", the answer might be "You currently have 1,500 Acme shares." The number will be spoken more slowly because it is new information. To represent this in JSML text the <PROS> element is used:

You currently have <PROS RATE="-20%">1500</PROS> Acme shares.

The following example illustrates how an email message header object can implement the Speakable interface and generate JSML text with prosodic controls to improve understandability.


public class MailHeader implements Speakable {
    public String subject;
    public String sender;   // sender of the message, e.g. John Doe
    public String date;

    /** getJSMLText is the only method of Speakable */
    public String getJSMLText() {
        StringBuffer buf = new StringBuffer();

        // Speak the sender's name slower to be clearer
        buf.append("Message from " +
                   "<PROS RATE=\"-30\">" + sender + ",</PROS>");

        // Make sure the date is interpreted correctly
        // But we don't need it slow - it's not so important
        buf.append(" delivered " +
                   "<SAYAS class=\"date\">" + date + "</SAYAS>");

        // Subject slower too
        buf.append(", with subject: " +
                   "<PROS RATE=\"-30\">" + subject + "</PROS>");

        return buf.toString();
    }
}

public class myMailApp {
    ...

    void newMessageReceived(MailHeader header) {
        synth.speakPlainText("You have new mail!", null);
        synth.speak(header, mySpeakableListener);
    }
}



JavaTM Speech API Programmer's Guide
Copyright © 1997-1998 Sun Microsystems, Inc. All rights reserved
Send comments or corrections to javaspeech-comments@sun.com