Microsoft SAPI 5 compliance

Vocalizer provides support for the Microsoft Speech API (SAPI), version 5. For information on the functions documented below, see the Microsoft documentation for the Microsoft Speech SDK 5.1. For information on Vocalizer support for Microsoft SAPI 5 phoneme sets, see the Vocalizer user guide for each Vocalizer language.

SAPI 5 text-to-speech engine interface

Vocalizer implements the SAPI 5 Text-To-Speech Engine interface. Applications do not directly call this interface; they instead call into SAPI 5 components that are implemented on top of this interface.

The following table lists the SAPI 5 Text-To-Speech Engine interfaces and methods, describing which of these are supported by Vocalizer.

Interface	Function name	Availability
ISpTTSEngine	Speak	Supported
ISpTTSEngine	GetOutputFormat	Supported
ISpTTSEngineSite	ISpEventSink	Supported
	GetActions	Supported
	Write	Supported
	GetRate	Supported
	GetVolume	Supported
	GetSkipInfo	Supported
	CompleteSkip	Supported

SAPI 5 text-to-speech application interface

Applications directly call this interface (ISpVoice), which is implemented by SAPI 5 components. Those Microsoft components then call into the Vocalizer SAPI 5 Text-To-Speech Engine implementation.

ISpVoice is the only public SAPI 5 interface for application access to the Text-To-Speech engine. The ISpVoice interface enables an application to perform text synthesis operations. Applications can speak text strings and text files, or play audio files through this interface. All of these can be done synchronously or asynchronously.

Applications can choose a specific TTS voice using ISpVoice::SetVoice. The state of the voice (for example, rate, pitch, and volume) can be modified using SAPI XML tags that are embedded into the spoken text. Some attributes, like rate and volume, can also be changed in real time using SAPI API methods such as ISpVoice::SetRate and ISpVoice::SetVolume. Voices can be set to different priorities using ISpVoice::SetPriority.
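As a rough illustration of the real-time controls described above, the following Python sketch drives a voice through the SAPI automation object (SAPI.SpVoice), whose Rate and Volume properties are the automation counterparts of ISpVoice::SetRate and ISpVoice::SetVolume. It assumes a Windows machine with the pywin32 package installed. The helper names (clamp_rate, clamp_volume, apply_settings, speak_with_settings) are hypothetical and not part of SAPI or Vocalizer. SAPI accepts rate adjustments from -10 to 10 and volumes from 0 to 100.

```python
def clamp_rate(rate):
    """Clamp a rate adjustment to the SAPI range of -10..10."""
    return max(-10, min(10, int(rate)))


def clamp_volume(volume):
    """Clamp a volume to the SAPI range of 0..100."""
    return max(0, min(100, int(volume)))


def apply_settings(voice, rate=0, volume=100):
    """Set Rate and Volume on any object exposing those two properties."""
    voice.Rate = clamp_rate(rate)
    voice.Volume = clamp_volume(volume)
    return voice.Rate, voice.Volume


def speak_with_settings(text, rate=0, volume=100):
    """Speak text synchronously through SAPI (Windows with pywin32 only)."""
    import win32com.client  # imported lazily so the helpers above work anywhere

    voice = win32com.client.Dispatch("SAPI.SpVoice")
    apply_settings(voice, rate, volume)
    voice.Speak(text)
```

Clamping before assignment keeps out-of-range values from reaching the SAPI layer; whether to clamp or reject such values is a design choice for the application.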

ISpVoice inherits from the ISpEventSource interface. An ISpVoice object forwards events back to the application when the corresponding audio data has been rendered to the output device.

The following table lists the ISpVoice methods, describing which of these are supported when the Vocalizer engine is used. It also includes notes on Vocalizer-specific extensions or limitations. For a description of each member function, see the Microsoft Speech SDK v5.1 Reference chapter on Text-To-Speech interfaces (ISpVoice).

ISpVoice function name	Availability	Vocalizer-specific notes
SetOutput	Supported	Vocalizer only supports 8 kHz and 22 kHz voices. If the application chooses another frequency, the Microsoft SAPI 5 layer uses conversion software installed on the PC, which might degrade both speech quality and performance. If other frequencies are required, speak to your Nuance sales representative to see if a different Vocalizer product variant would be a better match for your application.
GetOutputObjectToken	Supported	See ISpVoice::SetOutput.
GetOutputStream	Supported
Pause	Supported
Resume	Supported
SetVoice	Supported
GetVoice	Supported
Speak	Supported
SpeakStream	Supported
GetStatus	Supported
Skip	Not supported
SetPriority	Supported
GetPriority	Supported
SetAlertBoundary	Supported
GetAlertBoundary	Supported
SetRate	Supported
GetRate	Supported
SetVolume	Supported	SAPI enforces its own default volume. This means that the baseline.xml configuration file does not support specifying the default rate and volume, and that the Vocalizer default volume when used via SAPI is 100 (the maximum volume) instead of 80.
GetVolume	Supported	See ISpVoice::SetVolume note.
WaitUntilDone	Supported
SetSyncSpeakTimeout	Supported
GetSyncSpeakTimeout	Supported
SpeakCompleteEvent	Supported
IsUISupported	Not supported	This member function is not supported by Vocalizer as Vocalizer does not provide a custom SAPI 5 Control Panel User Interface.
DisplayUI	Not supported	This member function is not supported by Vocalizer as Vocalizer does not provide a custom SAPI 5 Control Panel User Interface.

SAPI 5 XML tag support

The following SAPI 5 Text-To-Speech XML tags can be embedded in the input text to change the Text-To-Speech output. For each XML tag, you will find the following information:

Description	Gives a description of the XML tag.
Syntax	Displays the syntax of the XML tag.
Comments	Gives remarks that are specific to Vocalizer’s support of the XML tag.
Example	Shows how to use the XML tag.

Note: Incorrectly specified SAPI XML control tags are ignored and treated as white space. Also note that SAPI XML control tags that are not described in the following table are not supported.

Tip: To learn about the use and syntax of SAPI 5 XML tags, see the Text-To-Speech Interface chapter of the Microsoft Speech SDK v5.1 Reference.

Voice

The Voice tag is completely implemented by SAPI 5 components, with no control by Vocalizer other than publishing the Vocalizer voice attributes.

The Voice tag selects a voice based on its attributes: Age, Gender, Language, Name, Vendor, and VendorPreferred. The tag can be empty, in which case it changes the voice for all subsequent text, or it can have content, in which case it only changes the voice for that content.

The Voice tag has two attributes: Required and Optional. These correspond exactly to the required and optional attributes for the EnumerateTokens and SpFindBestToken methods in the SAPI 5 ISpObjectTokenCategory interface. The selected voice follows exactly the same rules as the latter of these two functions. That is, it selects a voice where all the required attributes are present, and where more optional attributes are present than with the other installed voices (if several voices have equal numbers of optional attributes, one is selected at random).

For more details, see Object Tokens and Registry Settings in the Microsoft Speech SDK v5.1 Reference.

In addition, the attributes of the current voice are always added as optional attributes when the Voice tag is used. This means that a voice that is more similar to the current voice will be selected over one that is less similar.

If no voice is found that matches all of the required attributes, no voice change will occur.
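The selection rule described above can be sketched in Python as follows. This is a simplified model for illustration only, not the SAPI implementation: voices are represented as plain dictionaries with one value per attribute, the function names are hypothetical, and the matching ignores details such as multi-valued attributes.

```python
import random


def parse_attributes(spec):
    """Parse 'Gender=Female;Age!=Child' into (name, value, must_match) tuples."""
    conditions = []
    for part in (p.strip() for p in spec.split(";")):
        if "=" not in part:
            continue  # skip empty or malformed entries
        if "!=" in part:
            name, value = part.split("!=", 1)
            conditions.append((name.strip(), value.strip(), False))
        else:
            name, value = part.split("=", 1)
            conditions.append((name.strip(), value.strip(), True))
    return conditions


def satisfies(voice, condition):
    """Check one condition against a voice's attribute dictionary."""
    name, value, must_match = condition
    matches = voice.get(name, "").lower() == value.lower()
    return matches if must_match else not matches


def select_voice(voices, current, required="", optional="", rng=random):
    """Return the best match, or the current voice if nothing qualifies."""
    required_conditions = parse_attributes(required)
    # The current voice's attributes always count as extra optional attributes.
    optional_conditions = parse_attributes(optional) + [
        (name, value, True) for name, value in current.items()
    ]
    candidates = [v for v in voices
                  if all(satisfies(v, c) for c in required_conditions)]
    if not candidates:
        return current  # no voice change occurs

    def score(voice):
        return sum(satisfies(voice, c) for c in optional_conditions)

    best = max(score(v) for v in candidates)
    # Ties between equally good voices are broken at random.
    return rng.choice([v for v in candidates if score(v) == best])
```

With a current female adult voice and installed teen voices of both genders, requiring "Age=Teen" in this model picks the female teen, because she shares the Gender attribute with the current voice.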

Syntax:

<voice required="type of info.=info.">Input text</voice>

<voice optional="type of info.=info.">Input text</voice>

Examples:

<voice required="Gender=Female;Age!=Child">A female non-child speaks this sentence, if one exists. </voice>

<voice required="Age=Teen">A teen speaks this sentence—if a female, non-child teen is present, she will be selected over a male teen, for example. </voice>
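When the attribute values come from application data, it can help to generate the tag rather than concatenate strings by hand. The following Python sketch builds a Voice tag in the form shown in the examples above. The function name voice_tag and its argument layout are hypothetical, and the sketch assumes that the SAPI XML parser accepts standard XML entity escapes in the spoken text.

```python
from xml.sax.saxutils import escape, quoteattr


def voice_tag(text, required=None, optional=None):
    """Wrap text in a SAPI 5 Voice tag.

    required and optional are lists of (name, operator, value) tuples,
    where operator is "=" or "!=".
    """
    def join(conditions):
        return ";".join(f"{name}{op}{value}" for name, op, value in conditions)

    attrs = []
    if required:
        attrs.append(f"required={quoteattr(join(required))}")
    if optional:
        attrs.append(f"optional={quoteattr(join(optional))}")
    opening = " ".join(["<voice"] + attrs) + ">"
    # Escape &, < and > so the spoken text cannot be mistaken for markup.
    return f"{opening}{escape(text)}</voice>"


tag = voice_tag(
    "A female non-child speaks this sentence, if one exists.",
    required=[("Gender", "=", "Female"), ("Age", "!=", "Child")],
)
```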

SAPI 5 lexicons

SAPI 5 provides lexicons so that users and applications can specify pronunciation and part of speech information for particular words. As such, SAPI compliant text-to-speech engines like Vocalizer need to support these lexicons to guarantee uniformity of pronunciation and part of speech information.

There are two types of lexicons in SAPI: user lexicons and application lexicons.

User lexicons: Each user who logs in to a computer has a user lexicon. Initially, this lexicon is empty; words can be added either programmatically, or by using an engine’s add/remove words UI component. (For example, the sample application Dictation Pad provides an Add/Remove Words dialog.)
Application lexicons: Applications can create and ship their own lexicons of specialized words. These lexicons are fixed and cannot be edited.

Detailed information on how to use the Microsoft SAPI 5 lexicons can be found in the Microsoft Speech SDK v5.1 Reference chapter entitled “ISpLexicon Interface”.

The Vocalizer SAPI 5 engine can be configured to automatically load Vocalizer proprietary user dictionaries that tune Vocalizer pronunciations, as well as Vocalizer rulesets and ActivePrompt databases. This mechanism provides a better alternative to the Microsoft-defined SAPI 5 lexicon solution for two reasons: Vocalizer user dictionaries can specify native Vocalizer phoneme strings for more accurate pronunciations, and they are stored in files, which are easier to manage than registry-based Microsoft SAPI 5 lexicons.

To configure default Vocalizer user dictionaries, modify the <default_dictionaries> element within the Vocalizer SAPI 5 configuration file (install_path\config\tts_config.xml in the Vocalizer installation directory).
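Before editing the configuration file, it can be useful to check what the <default_dictionaries> element currently contains. The following Python sketch lists its child elements without assuming anything about their schema, which this section does not describe. The function names are hypothetical, and the element is matched by local name in case the file uses XML namespaces.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def default_dictionary_entries(config_xml):
    """Return (tag, attributes, text) for each child of <default_dictionaries>.

    Returns None if the element is not present in the document.
    """
    root = ET.fromstring(config_xml)
    for element in root.iter():
        if local_name(element.tag) == "default_dictionaries":
            return [
                (local_name(child.tag), dict(child.attrib),
                 (child.text or "").strip())
                for child in element
            ]
    return None
```

To use it, read the configuration file in binary mode and pass the bytes to default_dictionary_entries, so that the XML parser handles the file's declared encoding.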

Using SSML via SAPI

Vocalizer supports W3C SSML markup for text submitted via SAPI. However, the application must use special methods to do so, and the available methods depend on the installed version of the SAPI components (sapi.dll). Nuance does not redistribute these components; they are a standard part of Windows. Contact Microsoft for more information.

SAPI 5.2 and earlier do not have native support for SSML markup. By default, they try to interpret the SSML as SAPI 5 XML tags, strip out the markup, and speak the resulting text. This behavior results in incorrect text-to-speech output. To work around this, specify the SPF_IS_NOT_XML flag when calling the ISpVoice::Speak method. While this flag may seem counter-intuitive, it blocks the SAPI XML parser, allowing the SSML markup to be passed to Vocalizer as-is. As long as the SSML markup is encoded as little-endian UTF-16 and has a valid SSML header, Vocalizer’s SAPI integration auto-detects the SSML markup and parses it with Vocalizer’s internal SSML parser (with the SSML conformance described in Vocalizer SSML support).

SAPI 5.3 and later have native support for SSML markup, where the SAPI library parses the SSML markup itself, then passes parsed SSML fragments to Vocalizer for synthesis. (This behavior is similar to how all versions of SAPI handle SAPI 5 XML tags.) However, this conversion loses some SSML information, and SAPI 5.3 currently has bugs that block proper handling of some SSML constructs, so SSML conformance is lower than with Vocalizer’s built-in SSML parser. Thus, the SPF_IS_NOT_XML method described for SAPI 5.2 above is recommended; this method continues to work with SAPI 5.3.
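The SPF_IS_NOT_XML approach can be sketched in Python through the SAPI automation object, where the flag is exposed as SVSFIsNotXML with the same value (16). The sketch assumes a Windows machine with the pywin32 package installed. The function names are hypothetical, and the header produced here (an XML declaration followed by a speak root element in the W3C SSML namespace) is an assumption about what counts as a valid SSML header, so check it against the Vocalizer SSML support documentation.

```python
from xml.sax.saxutils import escape

SVSF_IS_NOT_XML = 16  # automation name SVSFIsNotXML; same value as SPF_IS_NOT_XML


def build_ssml(text, lang="en-US"):
    """Wrap plain text in a minimal SSML 1.0 document."""
    return (
        '<?xml version="1.0"?>'
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{escape(text)}</speak>'
    )


def speak_ssml(ssml):
    """Send SSML to the engine without SAPI parsing it (Windows, pywin32)."""
    import win32com.client  # imported lazily so build_ssml works anywhere

    voice = win32com.client.Dispatch("SAPI.SpVoice")
    # Strings cross the COM boundary as UTF-16 on Windows, so no manual
    # re-encoding is done here.
    voice.Speak(ssml, SVSF_IS_NOT_XML)
```

A call such as speak_ssml(build_ssml("Hello world")) then hands the markup to the engine untouched by the SAPI XML parser.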
