Minimizing latency

This topic provides suggestions to help optimize performance and minimize latency.

Operating system restrictions for Vocalizer

Each instance of Vocalizer requires a number of file handles, which it uses to access the voice database, among other things. Some operating systems, such as Microsoft Windows, impose a default limit on the number of file handles per process. If you have a very large number of Vocalizer instances, or if the application uses many file handles of its own, the application can run out of file handles. To avoid this problem, Nuance recommends that you set the maximum number of file handles to an appropriate value. For Microsoft Windows, you can set this value by having your application issue the _setmaxstdio C-runtime call. See your compiler or operating system manual.

On Unix, you can increase the number of file handles with the shell's limit (csh) or ulimit (sh, bash) command. Refer to the man pages or to the compiler manual.
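
As a rough illustration of the advice above, the following C sketch raises the limit from inside the application. The helper name raise_file_handle_limit is hypothetical and is not part of Vocalizer. On Windows it relies on the _setmaxstdio and _getmaxstdio C-runtime calls. On Unix it assumes the POSIX getrlimit and setrlimit calls, which are a programmatic counterpart to ulimit and an assumption of this sketch rather than something this topic prescribes.

```c
#include <limits.h>
#include <stdio.h>

#ifndef _WIN32
#include <sys/resource.h>

/* Convert an rlim_t to long, saturating at LONG_MAX for "unlimited". */
static long clamp_to_long(rlim_t v)
{
    return (v == RLIM_INFINITY || v > (rlim_t)LONG_MAX) ? LONG_MAX : (long)v;
}
#endif

/*
 * Hypothetical helper: try to make sure the process may keep at least
 * `wanted` files open. Returns the limit in effect afterwards (which may
 * be lower than `wanted` if the system caps it), or -1 on failure.
 */
long raise_file_handle_limit(long wanted)
{
    if (wanted <= 0)
        return -1;
#ifdef _WIN32
    /* Microsoft C runtime: limit on simultaneously open stream-level files. */
    int current = _getmaxstdio();
    if (current >= wanted)
        return current;
    return _setmaxstdio((int)wanted);   /* new maximum, or -1 on failure */
#else
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < (rlim_t)wanted) {
        /* The soft limit can be raised only as far as the hard limit. */
        rlim_t target = (rlim_t)wanted;
        rl.rlim_cur = (target < rl.rlim_max) ? target : rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
    }
    return clamp_to_long(rl.rlim_cur);
#endif
}
```

An application would call this once at startup, before creating any Vocalizer instances, and compare the returned value with the number of handles it expects to need. The C runtime and the operating system each impose their own upper bounds, so the call can fail or return less than requested.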

Optimal audio buffer size

The audio buffer size is an important factor in minimizing latency (time to first audio) and avoiding underruns. Larger buffers use the CPU more efficiently, but they increase latency and the risk of underruns. A good starting point is a buffer big enough to hold half a second of audio, rounded up to the next multiple of 1024 bytes (1 KB):

  • 4096 bytes for an 8 kHz voice with µ-law or A-law audio output
  • 8192 bytes for an 8 kHz voice with linear 16-bit PCM audio output
  • 22528 bytes for a 22 kHz voice
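
The sizes listed above follow from simple arithmetic, which the C sketch below reproduces. The function name suggested_buffer_bytes is hypothetical. The sketch also assumes that the 22 kHz voice produces linear 16-bit PCM at 22050 samples per second, which this topic does not state but which matches the 22528-byte figure.

```c
#include <stddef.h>

/*
 * Hypothetical helper: bytes needed for half a second of audio,
 * rounded up to the next multiple of 1024.
 *   sample_rate_hz   - samples per second (for example 8000)
 *   bytes_per_sample - 1 for mu-law or A-law, 2 for linear 16-bit PCM
 */
size_t suggested_buffer_bytes(unsigned sample_rate_hz, unsigned bytes_per_sample)
{
    /* Bytes in half a second, rounding an odd byte count upward. */
    size_t half_second = ((size_t)sample_rate_hz * bytes_per_sample + 1) / 2;

    /* Round up to the next multiple of 1024 bytes. */
    return (half_second + 1023) / 1024 * 1024;
}
```

For example, 8000 samples per second at 1 byte per sample gives 4000 bytes for half a second, which rounds up to 4096. Treat the result as a starting point and tune it against measured latency and underruns.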

With the native API, the application provides this buffer and specifies it through the return value of the Destination callback function (type TTS_DEST_CB).

For SAPI 5-based applications, the buffer size is determined by SAPI 5 and cannot be changed.

Limiting delays when internet fetching is used

When content such as input texts, user dictionaries, rulesets, and ActivePrompt databases is located on a web server, fetching it for the first time can cause delays. The internet fetch library uses a configurable cache, so the download time is minimal when all of the following are true:

  • The cache is configured well (big enough, with a reasonable cache entry expiration time)
  • The web server is configured to support caching of all the data (it specifies HTTP/1.1 caching parameters such as max-age)
  • The cache has been warmed up

To warm up the cache, the application can perform a number of dummy speak requests. For input texts, the content is already cached by the time the first audio packet is delivered, so the application can stop each warm-up synthesis request after the first audio packet to speed up the process.

Audio content specified via the SSML <audio> tag is always fetched on message (normally sentence) boundaries, but not necessarily before the first audio packet is delivered, so a warm-up request that references such audio should not be stopped after the first audio packet. To place user dictionaries, rulesets, and ActivePrompt databases in the cache without keeping them in RAM, load them and then unload them. If RAM usage is not a problem, load them as soon as possible and leave them loaded.