Microsoft SAPI 5 compliance

Vocalizer provides support for the Microsoft Speech API (SAPI), version 5. For information on the functions documented below, see the Microsoft documentation for the Microsoft Speech SDK 5.1. For information on Vocalizer support for Microsoft SAPI 5 phoneme sets, see the Vocalizer user guide for each Vocalizer language.

SAPI 5 text-to-speech engine interface

Vocalizer implements the SAPI 5 Text-To-Speech Engine interface. Applications do not directly call this interface; they instead call into SAPI 5 components that are implemented on top of this interface.

The following table lists the SAPI 5 Text-To-Speech Engine interfaces and methods, describing which of these are supported by Vocalizer.

Interface Function name Availability
ISpTTSEngine Speak Supported
ISpTTSEngine GetOutputFormat Supported
ISpTTSEngineSite ISpEventSink Supported
ISpTTSEngineSite GetActions Supported
ISpTTSEngineSite Write Supported
ISpTTSEngineSite GetRate Supported
ISpTTSEngineSite GetVolume Supported
ISpTTSEngineSite GetSkipInfo Supported
ISpTTSEngineSite CompleteSkip Supported

SAPI 5 text-to-speech application interface

Applications directly call this interface (ISpVoice), which is implemented by SAPI 5 components. Those Microsoft components then call into the Vocalizer SAPI 5 Text-To-Speech Engine implementation.

ISpVoice is the only public SAPI 5 interface for application access to the Text-To-Speech engine. The ISpVoice interface enables an application to perform text synthesis operations. Applications can speak text strings and text files, or play audio files through this interface. All of these can be done synchronously or asynchronously.

Applications can choose a specific TTS voice using ISpVoice::SetVoice. The state of the voice (for example, rate, pitch, and volume), can be modified using SAPI XML tags that are embedded into the spoken text. Some attributes, like rate and volume, can also be changed in real time using SAPI API methods such as ISpVoice::SetRate and ISpVoice::SetVolume. Voices can be set to different priorities using ISpVoice::SetPriority.

ISpVoice inherits from the ISpEventSource interface. An ISpVoice object forwards events back to the application when the corresponding audio data has been rendered to the output device.
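As an illustration of the calls described above, the following Python sketch drives the SAPI automation object SAPI.SpVoice, which exposes ISpVoice functionality to scripting languages (its Voice, Rate, and Volume properties correspond to SetVoice, SetRate, and SetVolume). It assumes Windows with the pywin32 package installed. The helper names and the "Vocalizer" description filter are illustrative assumptions, not part of SAPI or Vocalizer.

```python
import sys

SAPI_RATE_RANGE = (-10, 10)   # range accepted by ISpVoice::SetRate
SAPI_VOLUME_RANGE = (0, 100)  # range accepted by ISpVoice::SetVolume


def clamp(value, bounds):
    """Restrict value to the inclusive (low, high) range."""
    low, high = bounds
    return max(low, min(high, value))


def apply_settings(voice, rate=0, volume=100):
    """Set rate and volume on any object exposing SpVoice-style properties."""
    voice.Rate = clamp(rate, SAPI_RATE_RANGE)
    voice.Volume = clamp(volume, SAPI_VOLUME_RANGE)
    return voice


def find_voice_token(voice, name_fragment):
    """Return the first installed voice whose description contains name_fragment."""
    tokens = voice.GetVoices()
    for i in range(tokens.Count):
        token = tokens.Item(i)
        if name_fragment.lower() in token.GetDescription().lower():
            return token
    return None


def main():
    import win32com.client  # pywin32; only available on Windows

    voice = win32com.client.Dispatch("SAPI.SpVoice")
    # "Vocalizer" is a hypothetical filter; check the descriptions that
    # GetVoices() actually reports on your system.
    token = find_voice_token(voice, "Vocalizer")
    if token is not None:
        voice.Voice = token               # counterpart of ISpVoice::SetVoice
    apply_settings(voice, rate=2, volume=80)
    voice.Speak("Hello from SAPI 5.")     # synchronous by default


if __name__ == "__main__" and sys.platform == "win32":
    main()
```

Clamping before assignment keeps out-of-range values from reaching SAPI; whether you prefer to clamp or to reject such values is a design choice for the application.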

The following table lists the ISpVoice methods and indicates which of them are supported when the Vocalizer engine is used. It also includes notes on Vocalizer-specific extensions or limitations. For a description of each member function, see the Microsoft Speech SDK v5.1 Reference chapter on Text-To-Speech interfaces (ISpVoice).

ISpVoice function name Availability Vocalizer-specific notes
SetOutput Supported Vocalizer only supports 8 kHz and 22 kHz voices. If the application chooses another frequency, the Microsoft SAPI 5 layer uses conversion software installed on the PC, which might degrade speech quality and performance. If other frequencies are required, speak to your Nuance sales representative to see if a different Vocalizer product variant would be a better match for your application.
GetOutputObjectToken Supported See ISpVoice::SetOutput.
GetOutputStream Supported  
Pause Supported  
Resume Supported  
SetVoice Supported  
GetVoice Supported  
Speak Supported  
SpeakStream Supported  
GetStatus Supported  
Skip Not supported
SetPriority Supported  
GetPriority Supported  
SetAlertBoundary Supported  
GetAlertBoundary Supported  
SetRate Supported  
GetRate Supported  
SetVolume Supported SAPI enforces its own default volume. This means that the baseline.xml configuration file does not support specifying the default rate and volume, and that the Vocalizer default volume when used via SAPI is 100 (the maximum volume) instead of 80.
GetVolume Supported See ISpVoice::SetVolume note.
WaitUntilDone Supported  
SetSyncSpeakTimeout Supported  
GetSyncSpeakTimeout Supported  
SpeakCompleteEvent Supported  
IsUISupported Not supported This member function is not supported by Vocalizer as Vocalizer does not provide a custom SAPI 5 Control Panel User Interface.
DisplayUI Not supported This member function is not supported by Vocalizer as Vocalizer does not provide a custom SAPI 5 Control Panel User Interface.
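The SetOutput note above suggests requesting an output format that matches the voice's native sampling rate, so that SAPI does not need to convert the audio. The following Python sketch shows one way to do that through the SAPI automation objects (SpFileStream plays the role of the output passed to ISpVoice::SetOutput). It assumes Windows with pywin32, 16-bit mono output, and that a "22 kHz" voice corresponds to SAPI's 22.05 kHz formats; the function names are illustrative.

```python
# SpeechAudioFormatType values from the SAPI 5.1 automation type library.
SAFT_8KHZ_16BIT_MONO = 6
SAFT_22KHZ_16BIT_MONO = 22
SSFM_CREATE_FOR_WRITE = 3  # SpeechStreamFileMode.SSFMCreateForWrite

# Assumption: the voice's native rate is 8000 Hz or 22050 Hz.
_FORMAT_BY_RATE = {
    8000: SAFT_8KHZ_16BIT_MONO,
    22050: SAFT_22KHZ_16BIT_MONO,
}


def format_for_sample_rate(sample_rate_hz):
    """Map a native voice sampling rate to a SAPI audio format constant."""
    try:
        return _FORMAT_BY_RATE[sample_rate_hz]
    except KeyError:
        raise ValueError(
            f"{sample_rate_hz} Hz would require sample rate conversion"
        ) from None


def speak_to_wav(text, path, sample_rate_hz):
    """Synthesize text into a WAV file at the voice's native rate (Windows only)."""
    import win32com.client  # pywin32

    voice = win32com.client.Dispatch("SAPI.SpVoice")
    stream = win32com.client.Dispatch("SAPI.SpFileStream")
    stream.Format.Type = format_for_sample_rate(sample_rate_hz)
    stream.Open(path, SSFM_CREATE_FOR_WRITE)
    try:
        voice.AudioOutputStream = stream
        voice.Speak(text)
    finally:
        stream.Close()
```

For example, speak_to_wav("Hello", "hello.wav", 22050) would write a 22.05 kHz file, while an unsupported rate such as 16000 raises ValueError instead of silently relying on conversion.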

SAPI 5 XML tag support

The following SAPI 5 Text-To-Speech XML tags can be embedded in the input text to change the Text-To-Speech output. For each XML tag, you will find the following information:

  • Description: Gives a description of the XML tag.
  • Syntax: Displays the syntax of the XML tag.
  • Comments: Gives remarks that are specific to Vocalizer’s support of the XML tag.
  • Example: Shows how to use the XML tag.

Note: Incorrectly specified SAPI XML control tags are ignored and treated as white space. Also note that SAPI XML control tags that are not described in the following table are not supported.

Tip: To learn about the use and syntax of SAPI 5 XML tags, see the Text-To-Speech Interface chapter of the Microsoft Speech SDK v5.1 Reference.
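As a rough illustration of embedding tags in input text, the following Python sketch wraps text in the standard SAPI 5 rate, pitch, and volume tags and escapes characters that would otherwise be read as markup. The function name is illustrative, and the sketch assumes that Vocalizer supports these three tags with the attributes shown; verify each tag against the Vocalizer tag descriptions before relying on it. Speaking the result requires Windows with pywin32.

```python
import sys
from xml.sax.saxutils import escape

SVSF_IS_XML = 8  # SpeechVoiceSpeakFlags.SVSFIsXML (SPF_IS_XML in the C++ API)


def sapi_xml(text, rate=None, pitch=None, volume=None):
    """Wrap text in SAPI 5 XML tags; omitted settings add no tag."""
    body = escape(text)  # turn &, <, > into entities so they are spoken as text
    if pitch is not None:
        body = f'<pitch absmiddle="{int(pitch)}">{body}</pitch>'
    if rate is not None:
        body = f'<rate absspeed="{int(rate)}">{body}</rate>'
    if volume is not None:
        body = f'<volume level="{int(volume)}">{body}</volume>'
    return body


def main():
    import win32com.client  # pywin32; only available on Windows

    voice = win32com.client.Dispatch("SAPI.SpVoice")
    markup = sapi_xml("Your balance is 5 < 10.", rate=1, pitch=-2, volume=80)
    voice.Speak(markup, SVSF_IS_XML)


if __name__ == "__main__" and sys.platform == "win32":
    main()
```

Escaping matters here because incorrectly specified tags are ignored: an unescaped "<" in the text could cause part of the sentence to be dropped.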

SAPI 5 lexicons

SAPI 5 provides lexicons so that users and applications can specify pronunciation and part-of-speech information for particular words. SAPI-compliant text-to-speech engines like Vocalizer therefore need to support these lexicons to guarantee uniformity of pronunciation and part-of-speech information.

There are two types of lexicons in SAPI: user lexicons and application lexicons.

  • User lexicons: Each user who logs in to a computer has a user lexicon. Initially, this lexicon is empty; words can be added either programmatically, or by using an engine’s add/remove words UI component. (For example, the sample application Dictation Pad provides an Add/Remove Words dialog.)
  • Application lexicons: Applications can create and ship their own lexicons of specialized words. These lexicons are fixed and cannot be edited.

Detailed information on how to use the Microsoft SAPI 5 lexicons can be found in the Microsoft Speech SDK v5.1 Reference chapter entitled “ISpLexicon Interface”.

The Vocalizer SAPI 5 engine can be configured to automatically load Vocalizer proprietary user dictionaries that tune Vocalizer pronunciations, as well as Vocalizer rulesets and ActivePrompt databases. This mechanism is a better alternative to the Microsoft-defined SAPI 5 lexicon solution for two reasons: Vocalizer user dictionaries can specify native Vocalizer phoneme strings for more accurate pronunciations, and they are stored in files, which are easier to manage than registry-based Microsoft SAPI 5 lexicons.

To configure default Vocalizer user dictionaries, modify the <default_dictionaries> element within the Vocalizer SAPI 5 configuration file (install_path\config\tts_config.xml, where install_path is the Vocalizer installation directory).
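Before editing the configuration file, it can help to inspect what the <default_dictionaries> element currently contains. The following Python sketch locates that element and lists its children. It makes no assumption about the child element names or attributes, which are defined by the Vocalizer configuration format and not reproduced here; the helper names are illustrative, and the file path is taken from the command line.

```python
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix of the form {uri}name, if present."""
    return tag.rsplit("}", 1)[-1]


def find_default_dictionaries(root):
    """Return the first <default_dictionaries> element under root, or None."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and local_name(elem.tag) == "default_dictionaries":
            return elem
    return None


def describe_children(elem):
    """List (name, attributes, text) for each direct child of elem."""
    return [
        (local_name(child.tag), dict(child.attrib), (child.text or "").strip())
        for child in elem
    ]


# Usage: python inspect_config.py <path to tts_config.xml>
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1].endswith(".xml"):
    section = find_default_dictionaries(ET.parse(sys.argv[1]).getroot())
    if section is None:
        print("No <default_dictionaries> element found")
    else:
        for entry in describe_children(section):
            print(entry)
```

The script only reads the file, so it is safe to run against a live installation.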

Using SSML via SAPI

Vocalizer supports W3C SSML markup for text submitted via SAPI. However, the application must use special methods to do so, and the available methods depend on the installed version of the SAPI components (sapi.dll). Nuance does not redistribute these components; they are a standard part of Windows. Contact Microsoft for more information.

SAPI 5.2 and earlier do not have native support for SSML markup. By default, they try to interpret the SSML as SAPI 5 XML tags, strip out the markup, and speak the resulting text. This behavior results in incorrect text-to-speech output. To work around this, specify the SPF_IS_NOT_XML flag when calling the ISpVoice::Speak method. While this flag may seem counter-intuitive, it blocks the SAPI XML parser, allowing the SSML markup to be passed to Vocalizer as-is. As long as the SSML markup is encoded as little-endian UTF-16 and has a valid SSML header, Vocalizer’s SAPI integration auto-detects the SSML markup and parses it with Vocalizer’s internal SSML parser (with the SSML conformance described in Vocalizer SSML support).

SAPI 5.3 and later have native support for SSML markup, where the SAPI library parses the SSML markup itself, then passes parsed SSML fragments to Vocalizer for synthesis. (This behavior is similar to how all versions of SAPI handle SAPI 5 XML tags.) However, this conversion loses some SSML information, and SAPI 5.3 currently has bugs that block proper handling of some SSML constructs, so SSML conformance is lower than with Vocalizer’s built-in SSML parser. Thus, the SPF_IS_NOT_XML method described for SAPI 5.2 above is recommended; this method continues to work with SAPI 5.3.
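The SPF_IS_NOT_XML approach can be sketched as follows in Python, using the SAPI automation object, where the equivalent flag is SVSFIsNotXML. The sketch assumes Windows with pywin32 and assumes that a standard SSML 1.0 root element with an XML declaration counts as a valid SSML header; the exact header requirements are given in the Vocalizer SSML documentation. Strings passed through COM automation are marshalled as UTF-16 on Windows, which matches the little-endian UTF-16 requirement. The helper names are illustrative.

```python
import sys
from xml.sax.saxutils import escape

SVSF_IS_NOT_XML = 16  # SpeechVoiceSpeakFlags.SVSFIsNotXML, same value as SPF_IS_NOT_XML


def build_ssml(text, lang="en-US"):
    """Wrap plain text in a minimal SSML 1.0 document (assumed header form)."""
    return (
        '<?xml version="1.0"?>'
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{escape(text)}</speak>'
    )


def to_utf16le(ssml):
    """Encode SSML as little-endian UTF-16, e.g. for inspection or logging."""
    return ssml.encode("utf-16-le")


def main():
    import win32com.client  # pywin32; only available on Windows

    voice = win32com.client.Dispatch("SAPI.SpVoice")
    ssml = build_ssml("Hello from SSML.")
    # The flag stops SAPI from parsing the markup, so the SSML reaches
    # the engine unchanged.
    voice.Speak(ssml, SVSF_IS_NOT_XML)


if __name__ == "__main__" and sys.platform == "win32":
    main()
```

Without the flag, the same call would let SAPI interpret the markup itself, which is the behavior the text above recommends avoiding.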