Microsoft SAPI 5 compliance
Vocalizer provides support for the Microsoft Speech API (SAPI), version 5. For information on the functions documented below, see the Microsoft documentation for the Microsoft Speech SDK 5.1. For information on Vocalizer support for Microsoft SAPI 5 phoneme sets, see the Vocalizer user guide for each Vocalizer language.
SAPI 5 text-to-speech engine interface
Vocalizer implements the SAPI 5 Text-To-Speech Engine interface. Applications do not directly call this interface; they instead call into SAPI 5 components that are implemented on top of this interface.
The following table lists the SAPI 5 Text-To-Speech Engine interfaces and methods, describing which of these are supported by Vocalizer.
Interface | Function name | Availability |
---|---|---|
ISpTTSEngine | Speak | Supported |
ISpTTSEngine | GetOutputFormat | Supported |
ISpTTSEngineSite | ISpEventSink | Supported |
ISpTTSEngineSite | GetActions | Supported |
ISpTTSEngineSite | Write | Supported |
ISpTTSEngineSite | GetRate | Supported |
ISpTTSEngineSite | GetVolume | Supported |
ISpTTSEngineSite | GetSkipInfo | Supported |
ISpTTSEngineSite | CompleteSkip | Supported |
SAPI 5 text-to-speech application interface
Applications directly call this interface (ISpVoice), which is implemented by SAPI 5 components. Those Microsoft components then call into the Vocalizer SAPI 5 Text-To-Speech Engine implementation.
ISpVoice is the only public SAPI 5 interface for application access to the Text-To-Speech engine. The ISpVoice interface enables an application to perform text synthesis operations. Applications can speak text strings and text files, or play audio files through this interface. All of these can be done synchronously or asynchronously.
Applications can choose a specific TTS voice using ISpVoice::SetVoice. The state of the voice (for example, rate, pitch, and volume) can be modified using SAPI XML tags that are embedded into the spoken text. Some attributes, like rate and volume, can also be changed in real time using SAPI API methods such as ISpVoice::SetRate and ISpVoice::SetVolume. Voices can be set to different priorities using ISpVoice::SetPriority.
ISpVoice inherits from the ISpEventSource interface. An ISpVoice object forwards events back to the application when the corresponding audio data has been rendered to the output device.
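As a minimal sketch of the real-time rate and volume control described above, the following Python helper drives a voice object through the SAPI 5 automation layer rather than the COM ISpVoice interface itself. It assumes the object exposes the SpVoice automation members Rate, Volume, and Speak(text, flags), which correspond to ISpVoice::SetRate, ISpVoice::SetVolume, and ISpVoice::Speak. The helper name `speak_with_settings` is hypothetical and is not part of SAPI or Vocalizer.

```python
# SpeechVoiceSpeakFlags value from the SAPI 5 automation interface.
SVS_FLAGS_ASYNC = 1  # SVSFlagsAsync: Speak returns without waiting for the audio


def clamp(value, low, high):
    """Keep value inside the inclusive range [low, high]."""
    return max(low, min(high, value))


def speak_with_settings(voice, text, rate=0, volume=100, asynchronous=True):
    """Apply rate and volume to a SAPI voice object, then speak the text.

    `voice` is assumed to expose the SpVoice automation members
    Rate, Volume, and Speak(text, flags).
    """
    voice.Rate = clamp(rate, -10, 10)      # SAPI rate range is -10..10
    voice.Volume = clamp(volume, 0, 100)   # SAPI volume range is 0..100
    flags = SVS_FLAGS_ASYNC if asynchronous else 0
    return voice.Speak(text, flags)
```

On a Windows machine with the pywin32 package installed, a voice object of this kind can typically be obtained with `win32com.client.Dispatch("SAPI.SpVoice")`; any object with the same three members works for experimentation.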
The following table lists the methods of the SAPI 5 Text-To-Speech application interface (ISpVoice), describing which of these are supported when the Vocalizer engine is used. It also includes notes on Vocalizer-specific extensions or limitations. For a description of each member function, see the Microsoft Speech SDK v5.1 Reference chapter on Text-To-Speech interfaces (ISpVoice).
ISpVoice function name | Availability | Vocalizer-specific notes |
---|---|---|
SetOutput | Supported | Vocalizer only supports 8 kHz and 22 kHz voices. If the application chooses another sampling frequency, the Microsoft SAPI 5 layer uses the conversion software installed on the PC, which might degrade speech quality and performance. If other frequencies are required, speak to your Nuance sales representative to see if a different Vocalizer product variant would be a better match for your application. |
GetOutputObjectToken | Supported | See ISpVoice::SetOutput. |
GetOutputStream | Supported | |
Pause | Supported | |
Resume | Supported | |
SetVoice | Supported | |
GetVoice | Supported | |
Speak | Supported | |
SpeakStream | Supported | |
GetStatus | Supported | |
Skip | Not supported | |
SetPriority | Supported | |
GetPriority | Supported | |
SetAlertBoundary | Supported | |
GetAlertBoundary | Supported | |
SetRate | Supported | |
GetRate | Supported | |
SetVolume | Supported | SAPI enforces its own default volume. This means that the baseline.xml configuration file does not support specifying the default rate and volume, and that the Vocalizer default volume when used via SAPI is 100 (the maximum volume) instead of 80. |
GetVolume | Supported | See ISpVoice::SetVolume note. |
WaitUntilDone | Supported | |
SetSyncSpeakTimeout | Supported | |
GetSyncSpeakTimeout | Supported | |
SpeakCompleteEvent | Supported | |
IsUISupported | Not supported | This member function is not supported by Vocalizer as Vocalizer does not provide a custom SAPI 5 Control Panel User Interface. |
DisplayUI | Not supported | This member function is not supported by Vocalizer as Vocalizer does not provide a custom SAPI 5 Control Panel User Interface. |
SAPI 5 XML tag support
The following SAPI 5 Text-To-Speech XML tags can be embedded in the input text to change the Text-To-Speech output. For each XML tag, you will find the following information:
- Description: Gives a description of the XML tag.
- Syntax: Displays the syntax of the XML tag.
- Comments: Gives remarks that are specific to Vocalizer’s support of the XML tag.
- Example: Shows how to use the XML tag.
Note: Incorrectly specified SAPI XML control tags are ignored and treated as white space. Also note that SAPI XML control tags that are not described in the following table are not supported.
Tip: To learn about the use and syntax of SAPI 5 XML tags, see the Text-To-Speech Interface chapter of the Microsoft Speech SDK v5.1 Reference.
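Because incorrectly specified tags are silently ignored, it can help to generate them programmatically instead of typing them by hand. The following is a minimal sketch, assuming the input text is assembled in Python before it is passed to the voice. The helper name `sapi_tag` is hypothetical; it only builds strings and does not check whether Vocalizer supports a given tag or attribute.

```python
from xml.sax.saxutils import escape, quoteattr


def sapi_tag(name, text=None, **attributes):
    """Build one SAPI 5 XML tag with quoted attributes and escaped content.

    With text=None the tag is emitted in its empty form (<name .../>).
    """
    attrs = "".join(
        f" {key}={quoteattr(str(value))}" for key, value in attributes.items()
    )
    if text is None:
        return f"<{name}{attrs}/>"
    # escape() replaces &, < and > so the content is not read as markup.
    return f"<{name}{attrs}>{escape(text)}</{name}>"


print(sapi_tag("bookmark", mark="bookmark_one"))
print(sapi_tag("volume", "This is a sentence.", level=50))
```

Note that the content is escaped, so this helper cannot nest one tag inside another; nested markup would need to be concatenated by the caller.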

Bookmark
Indicates a bookmark in the text.
Syntax:
<bookmark mark=string/>
Example:
This sentence contains a <bookmark mark="bookmark_one"/> bookmark.

Context
Sets the context for the text that follows, determining how to speak specific strings. This is equivalent to the <ESC>\tn\ native control sequence. For the supported <ESC>\tn\ types, see the appropriate Language Supplement.
Syntax:
<Context ID=string> Input Text </Context>
Example:
Today is <context ID="date">12/22/99</context>.

Emph
Emphasizes a sentence to be spoken.
Syntax:
<Emph> Input text </Emph>
Comments:
Vocalizer only supports emphasizing the whole sentence.
Example:
<emph>John and Peter are coming tomorrow</emph>.

Lang
Indicates a language change in the text. This tag is handled by the SAPI 5 layer.
Syntax:
<lang langid=string> Input text </lang>
Example:
<lang langid="409"> A U.S. English voice speaks this sentence. </lang>

Partofsp
Indicates the part-of-speech of the next word. This tag is effective only when the word is in a SAPI 5 lexicon and has the same part-of-speech setting as in the lexicon.
Syntax:
<Partofsp Part=string> word </Partofsp>
Comments:
The following part-of-speech types are supported:
- <Partofsp Part="noun">
- <Partofsp Part="verb">
- <Partofsp Part="modifier">
- <Partofsp Part="function">
- <Partofsp Part="interjection">
- <Partofsp Part="unknown">
Example:
<Partofsp Part="noun"> A </Partofsp> is the first letter of the alphabet.

Pitch
Controls the pitch of a voice.
Syntax:
<Pitch Absmiddle=string> Input Text </Pitch>
Comments:
Vocalizer does not support this tag.
Example:
<pitch absmiddle="5">This is a test.</pitch>

Pron
Inserts a specified pronunciation. The voice processes the sequence of phonemes exactly as they are specified. This tag can be empty, or it can have content. If it has content, it is interpreted as providing the pronunciation for the enclosed text. That is, the enclosed text is not processed as it normally would be.
The Pron tag has one attribute, sym, whose value is a string of white space separated SAPI 5 phonemes (not native Vocalizer L&H+ phonemes).
Syntax:
<pron sym=phonetic string/>
or
<pron sym=phonetic string>Input text</pron>
Comments:
The supported SAPI 5 phoneme symbols can be found in the Vocalizer user guide for each language. If no phoneme table is available for a specific language, then this tag is not supported for that language.
Example:
<pron sym="h eh 1 l ow & w er 1 l d"> hello world </pron>

Rate
Controls the rate of a voice. The tag can be empty, in which case it applies to all subsequent text, or it can have content, in which case it only applies to that content.
The Rate tag has two attributes, Speed and AbsSpeed, one of which must be present. The value of both attributes is an integer between -10 and 10.
- The AbsSpeed attribute controls the absolute rate of the voice, so a value of 10 always sets the rate to 10, regardless of the current rate.
- The Speed attribute controls the relative rate of the voice. The voice speaks at a rate found by adding the value of the speed attribute to the current rate.
Syntax:
<rate absspeed=number>Input text</rate>
or
<rate speed=number>Input text</rate>
Examples:
<rate absspeed="5">This text is spoken at rate 5.</rate>
<rate speed="5">This text is spoken at rate 10 when the current rate is 5.</rate>
<rate speed="-5">This text is spoken at rate 5 when the current rate is 10.</rate>
<rate absspeed="8"/>
All following text is spoken at rate 8.
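The arithmetic behind the two attributes can be sketched in a few lines of Python. The relative examples only produce the stated rates for a particular rate already in effect (5 for the speed="5" example, 10 for the speed="-5" example). The function name `effective_rate` is hypothetical, and two behaviors are assumptions rather than documented facts: that exactly one attribute is given per tag, and that the resulting rate is clamped to the -10 to 10 range.

```python
def effective_rate(current_rate, speed=None, absspeed=None):
    """Return the rate in effect inside a Rate tag.

    absspeed replaces the current rate; speed is added to it.
    Assumption: the result is clamped to the -10..10 range.
    """
    if (speed is None) == (absspeed is None):
        raise ValueError("specify exactly one of speed or absspeed")
    new_rate = absspeed if absspeed is not None else current_rate + speed
    return max(-10, min(10, new_rate))


print(effective_rate(0, absspeed=5))   # absolute: rate becomes 5
print(effective_rate(5, speed=5))      # relative: 5 + 5
print(effective_rate(10, speed=-5))    # relative: 10 - 5
```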

Silence
The Silence tag inserts a specified number of milliseconds of silence into the output audio stream. This tag must be empty, and must have one attribute, Msec.
Syntax:
<silence msec=number/>
Example:
<silence msec="500"/> This is a sentence.

Spell
Forces the voice to spell out all text instead of applying its default word and sentence breaking rules, normalization rules, and so on. All characters are expanded to their corresponding words (including punctuation, numbers, and so forth). The Spell tag cannot be empty.
Syntax:
<spell>Input text</spell>
Example:
<spell>UN</spell>
To speak “123” as “one two three” instead of “one hundred and twenty-three,” use the Spell tag.
<spell>123</spell>

Voice
The Voice tag is completely implemented by SAPI 5 components, with no control by Vocalizer other than publishing the Vocalizer voice attributes.
The Voice tag selects a voice based on its attributes, Age, Gender, Language, Name, Vendor, and VendorPreferred. The tag can be empty, in which case it changes the voice for all subsequent text, or it can have content, in which case it only changes the voice for that content.
The Voice tag has two attributes: Required and Optional. These correspond exactly to the required and optional attributes for the EnumerateTokens and SpFindBestToken methods in the SAPI 5 ISpObjectTokenCategory interface. The selected voice follows exactly the same rules as the latter of these two functions. That is, it selects a voice where all the required attributes are present, and where more optional attributes are present than with the other installed voices (if several voices have equal numbers of optional attributes, one is selected at random).
For more details, see Object Tokens and Registry Settings in the Microsoft Speech SDK v5.1 Reference.
In addition, the attributes of the current voice are always added as optional attributes when the Voice tag is used. This means that a voice that is more similar to the current voice will be selected over one that is less similar.
If no voice is found that matches all of the required attributes, no voice change will occur.
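The selection rule described above can be sketched as follows. This is an illustration of the rule as stated in this section, not the SAPI implementation: voices are modeled as plain dictionaries of attribute names to values, attribute comparison is a simple exact string match, and the function names `parse_conditions`, `satisfies`, and `select_voice` are hypothetical.

```python
import random


def parse_conditions(spec):
    """Parse 'Gender=Female;Age!=Child' into (name, value, must_match) tuples."""
    conditions = []
    for part in filter(None, (p.strip() for p in spec.split(";"))):
        if "!=" in part:
            name, value = part.split("!=", 1)
            conditions.append((name.strip(), value.strip(), False))
        else:
            name, value = part.split("=", 1)
            conditions.append((name.strip(), value.strip(), True))
    return conditions


def satisfies(voice, condition):
    """True when the voice meets one 'name=value' or 'name!=value' condition."""
    name, value, must_match = condition
    return (voice.get(name) == value) == must_match


def select_voice(voices, required="", optional="", current=None, rng=random):
    """Pick a voice: all required conditions must hold, most optional ones win."""
    required_conds = parse_conditions(required)
    optional_conds = parse_conditions(optional)
    if current:
        # The current voice's attributes always count as optional attributes.
        optional_conds += [(k, v, True) for k, v in current.items()]
    candidates = [v for v in voices if all(satisfies(v, c) for c in required_conds)]
    if not candidates:
        return current  # no match on required attributes: no voice change

    def score(voice):
        return sum(satisfies(voice, c) for c in optional_conds)

    best = max(score(v) for v in candidates)
    # Ties between equally good voices are broken at random.
    return rng.choice([v for v in candidates if score(v) == best])
```

For instance, with a female adult as the current voice, `select_voice(voices, required="Age=Teen", current=current)` prefers a female teen over a male teen, because the current voice's Gender attribute is counted as an optional attribute.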
Syntax:
<voice required=type of info.=info.>Input text</voice>
or
<voice optional=type of info.=info.>Input text</voice>
Examples:
<voice required="Gender=Female;Age!=Child">A female non-child speaks this sentence, if one exists. </voice>
<voice required="Age=Teen">A teen speaks this sentence. If the current voice is a female non-child and a female, non-child teen is present, she will be selected over a male teen, for example. </voice>

Volume
The Volume tag controls the volume of a voice. The tag can be empty, in which case it applies to all subsequent text, or it can have content, in which case it only applies to that content.
The Volume tag has one required attribute: Level. The value of this attribute is an integer between zero and one hundred. Values outside this range will be truncated.
Syntax:
<volume level=number>Input text</volume>
Example:
<volume level="50">This is a sentence.</volume>
SAPI 5 lexicons
SAPI 5 provides lexicons so that users and applications can specify pronunciation and part of speech information for particular words. As such, SAPI compliant text-to-speech engines like Vocalizer need to support these lexicons to guarantee uniformity of pronunciation and part of speech information.
There are two types of lexicons in SAPI: user lexicons and application lexicons.
- User lexicons: Each user who logs in to a computer has a user lexicon. Initially, this lexicon is empty; words can be added either programmatically, or by using an engine’s add/remove words UI component. (For example, the sample application Dictation Pad provides an Add/Remove Words dialog.)
- Application lexicons: Applications can create and ship their own lexicons of specialized words. These lexicons are fixed and cannot be edited.
Detailed information on how to use the Microsoft SAPI 5 lexicons can be found in the Microsoft Speech SDK v5.1 Reference chapter entitled “ISpLexicon Interface”.
The Vocalizer SAPI 5 engine can be configured to automatically load Vocalizer proprietary user dictionaries that tune Vocalizer pronunciations, as well as Vocalizer rulesets and ActivePrompt databases. This mechanism is a better alternative to the Microsoft-defined SAPI 5 lexicon solution for two reasons: Vocalizer user dictionaries can specify native Vocalizer phoneme strings for more accurate pronunciations, and they are stored in files, which are easier to manage than registry-based Microsoft SAPI 5 lexicons.
To configure default Vocalizer user dictionaries, modify the <default_dictionaries> element within the Vocalizer SAPI 5 configuration file (install_path\config\tts_config.xml in the Vocalizer installation directory).
Using SSML via SAPI
Vocalizer supports W3C SSML markup for text submitted via SAPI. However, the application must use special methods to do so, and the available methods depend on the installed version of the SAPI components (sapi.dll). Nuance does not redistribute these components; they are a standard part of Windows. Contact Microsoft for more information.
SAPI 5.2 and earlier do not have native support for SSML markup. By default, they try to interpret the SSML as SAPI 5 XML tags, strip out the markup, and speak the resulting text. This behavior results in incorrect text-to-speech output. To work around this, specify the SPF_IS_NOT_XML flag when calling the ISpVoice::Speak method. While this flag may seem counter-intuitive, it blocks the SAPI XML parser, allowing the SSML markup to be passed to Vocalizer as-is. As long as the SSML markup is encoded as little-endian UTF-16 and has a valid SSML header, Vocalizer’s SAPI integration auto-detects the SSML markup and parses it with Vocalizer’s internal SSML parser (with the SSML conformance described in Vocalizer SSML support).
SAPI 5.3 and later have native support for SSML markup, where the SAPI library parses the SSML markup itself, then passes parsed SSML fragments to Vocalizer for synthesis. (This behavior is similar to how all versions of SAPI handle SAPI 5 XML tags.) However, this conversion loses some SSML information, and SAPI 5.3 currently has bugs that block proper handling for some SSML constructs, so SSML conformance is lower than with Vocalizer’s built-in SSML parser. Thus, the SPF_IS_NOT_XML method described for SAPI 5.2 above is recommended; this method continues to work with SAPI 5.3.
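The recommended SPF_IS_NOT_XML method can be sketched in Python through the SAPI 5 automation layer. Several points here are assumptions: the voice object is assumed to expose the SpVoice automation method Speak(text, flags); the header produced by `build_ssml` is a generic W3C SSML 1.0 header, and whether it matches what Vocalizer's auto-detection expects should be checked against Vocalizer SSML support; and the function names are hypothetical. The flag value 16 is the automation constant SVSFIsNotXML, which has the same value as SPF_IS_NOT_XML.

```python
from xml.sax.saxutils import escape

# SpeechVoiceSpeakFlags values (same numeric values as the SPF_* constants).
SVSF_ASYNC = 1        # SVSFlagsAsync
SVSF_IS_NOT_XML = 16  # SVSFIsNotXML: stop SAPI from parsing the markup itself


def build_ssml(text, lang="en-US"):
    """Wrap plain text in an SSML 1.0 document (header form is an assumption)."""
    return (
        '<?xml version="1.0"?>'
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f"{escape(text)}"
        "</speak>"
    )


def to_utf16_le(ssml):
    """Encode markup as little-endian UTF-16 for callers that pass raw buffers."""
    return ssml.encode("utf-16-le")


def speak_ssml(voice, ssml, asynchronous=False):
    """Pass SSML through SAPI untouched by setting the is-not-XML flag."""
    flags = SVSF_IS_NOT_XML | (SVSF_ASYNC if asynchronous else 0)
    return voice.Speak(ssml, flags)
```

When the automation layer is used from Python, strings are handed to SAPI as wide-character strings, so `to_utf16_le` is mainly relevant when the markup is written to a file or buffer for a native caller.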