Performance testing

This topic describes the methodology used to characterize the performance and scalability of Vocalizer. It contains an overview of the method chosen, a description of the tool that Nuance uses to run tests, and details about specific metrics in the test.

Actual test results for any given Vocalizer release, operating system, and voice are provided in the Release Notes for each voice.

Ultimately, the model Nuance chose to measure Vocalizer performance, and the criteria for interpreting the test results, are somewhat arbitrary compared to the deployment context of any actual application. Nuance has chosen a representative approach, but each deployment might have different requirements.

Testing methodology

Because TTS is used in such a wide variety of situations, it is impractical to model and test every scenario. There is a spectrum of possible approaches. At one end, a model could simulate an application that requires TTS output only a small percentage of the time—say, 15%. Results from this model are useful only when sizing a similar application. At the other end, a model could keep all engine instances synthesizing all of the time, even when they run faster than real time. Results from this method are also dubious, because that level of aggressive use is atypical.

Nuance tests Vocalizer with a method that lies in the middle. The model simulates an application that has TTS output playing on all output channels 100% of the time. This is different from having all engines synthesizing 100% of the time. For example, in the Nuance model, if the testing application requests 10 seconds of audio from Vocalizer and receives all of the audio data within one second, it waits another 9 seconds before sending the next request. This model is probably also atypical—few applications require that much TTS—but the results offer a more accurate basis for sizing a deployment than either end of the spectrum.
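
The following minimal sketch illustrates this pacing model. The helper name and the fixed 10-second/1-second figures are illustrative only; this is neither the Vocalizer API nor the actual test application code.

    /* Pacing sketch: one output channel that keeps audio playing
     * 100% of the time, but never requests faster than real time.
     * All names are illustrative; this is not the Vocalizer API. */
    #include <stdio.h>
    #include <unistd.h>

    /* Stub standing in for a synthesis request: pretend Vocalizer
     * delivers 10 s of audio within 1 s of wall-clock time. */
    static void synthesize_stub(double *audio_s, double *elapsed_s)
    {
        sleep(1);            /* time spent receiving the audio data */
        *audio_s = 10.0;     /* real-time duration of that audio */
        *elapsed_s = 1.0;
    }

    int main(void)
    {
        for (int req = 0; req < 3; req++) {
            double audio_s, elapsed_s;
            synthesize_stub(&audio_s, &elapsed_s);
            /* Idle out the remainder of the audio's real-time duration
             * before issuing the next request on this channel. */
            if (audio_s > elapsed_s)
                sleep((unsigned)(audio_s - elapsed_s));
            printf("request %d: %.0f s of audio in %.0f s, idled %.0f s\n",
                   req, audio_s, elapsed_s, audio_s - elapsed_s);
        }
        return 0;
    }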

These tests assume no specific tuning of any operating system to improve application performance. For all measurements, Vocalizer runs in-process within the test application, and uses the native Vocalizer API.

Test application and scenario

The Vocalizer SDK contains a set of sample applications, including the nvnload utility, the same testing application Nuance uses to test Vocalizer performance. You can use this tool to test performance on your specific configuration (regardless of how you build your application or configure Vocalizer). For usage details, including command-line options, see VOCALIZER_SDK\api\demos\nvnload\README.txt.

The application submits text to Vocalizer over a configurable number of engine instances. Command-line options specify the initial and final number of engines as well as the increment. For example, a test run may begin with 40 engines and progress to 70 in increments of five. At each step, a configurable number of speak requests (for example, 100) is submitted per engine. The test application’s output is a table of result data, with each row showing the results of one incremental run: various statistics describing the responsiveness and speed of Vocalizer, a few of which are described in the sections below. All of the measurements, except CPU utilization and memory usage, are made on the application side of the Vocalizer API.
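
As a rough sketch of that incremental structure (the names and the run_step body are hypothetical, not the actual nvnload code):

    /* Incremental load loop: step the engine count from a first to a
     * last value, submitting a fixed number of speak requests per
     * engine at each step. All names are hypothetical. */
    #include <stdio.h>

    static void run_step(int engines, int requests_per_engine)
    {
        /* ... open `engines` instances, submit the speak requests,
         * gather latency/underflow statistics, print one result row ... */
        printf("step: engines=%d, requests/engine=%d\n",
               engines, requests_per_engine);
    }

    int main(void)
    {
        int first = 40, last = 70, step = 5;  /* example from the text */
        for (int n = first; n <= last; n += step)
            run_step(n, 100);                 /* one result row per step */
        return 0;
    }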

Vocalizer’s performance depends on the text being synthesized, including its content, complexity, individual utterance length, presence of unusual strings, abbreviations, and so on. Sample texts are taken from a selection of newspaper articles, with each heading and paragraph forming a separate speak request that is sent to Vocalizer in a random order. The Vocalizer SDK supplies the input texts that Nuance uses for testing: files named loadTest_American_English.txt, loadTest_French.txt, and so on for the different languages, stored in the test_data subdirectory within the Vocalizer installation.

Performance statistics

Nuance uses several metrics to evaluate test results. These help when estimating the tasks and loads that Vocalizer can handle in a given configuration. The output table from the testing application has several columns of results from each test. This description focuses on the most important metrics.

  • Latency (time-to-first-audio): Applications with performance concerns start with the most obvious criterion: once the application asks Vocalizer to synthesize text, how long will it be until the application receives the first audio data? This time span is called latency, defined as the time between the application’s call to TtsProcessEx and the arrival of the first corresponding audio packet in the callback function.

    Although Vocalizer uses streaming audio to minimize latency, the TTS engine must process whole phrases at a time to obtain optimal rhythm and intonation. This causes a processing delay, and the size of this delay depends heavily on the characteristics of the requested text, specifically the length and, to a lesser degree, the complexity of the first phrase. For example, an input that begins with a one-word phrase such as “Hello” has shorter latency than an input that begins with a 30-word sentence.

  • Audio buffer underflow: Once the first audio data is delivered and audio output (to a caller or PC user) can begin, the concern is whether subsequent audio data will arrive fast enough to avoid gaps in the audio. Such a gap is called an underflow: an audio buffer underflow is the dropout that occurs when the application has played all of the audio it received from Vocalizer and sits idle waiting for the next audio data packet.

    The audio buffer underflow rate expresses underflows as a percentage of the audio buffers delivered. For example, assume that each audio packet passed to the callback function contains 512 ms of audio data. An underflow rate of 1% then translates to a potential gap every 100 audio buffers, or every 51.2 seconds of audio. A rate of 0.01% equals a gap on average once every 10,000 buffers, or roughly every 85 minutes. A sketch of how an application can track latency and underflows follows this list.
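
The sketch below shows how an application might track both metrics on its side of the API. The callback prototype and all names are illustrative; they are not the actual Vocalizer API.

    /* Per-request bookkeeping for time-to-first-audio (latency) and
     * audio buffer underflows. The callback prototype and all names
     * are illustrative, not the Vocalizer API. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    typedef struct {
        uint64_t speak_ms;       /* timestamp of the synthesis request */
        uint64_t deadline_ms;    /* when the player runs out of audio */
        uint64_t latency_ms;     /* time-to-first-audio */
        unsigned buffers;        /* audio packets received so far */
        unsigned underflows;     /* packets that arrived past the deadline */
        bool     first_seen;
    } RequestStats;

    static uint64_t now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000u + (uint64_t)(ts.tv_nsec / 1000000);
    }

    /* Invoked once per audio packet; packet_ms is the packet's audio
     * duration (for example, 512 ms as in the text above). */
    static void on_audio_packet(RequestStats *s, unsigned packet_ms)
    {
        uint64_t t = now_ms();
        if (!s->first_seen) {
            s->first_seen = true;
            s->latency_ms = t - s->speak_ms;  /* time-to-first-audio */
            s->deadline_ms = t;               /* playback starts now */
        } else if (t > s->deadline_ms) {
            s->underflows++;  /* player went idle before this packet */
        }
        s->deadline_ms += packet_ms;  /* this packet buys packet_ms more */
        s->buffers++;
        /* underflow rate (%) = 100.0 * underflows / buffers */
    }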

Resource consumption

The testing application does not measure two important metrics needed for proper sizing estimates: CPU use and memory use. These must be monitored with external tools such as Windows Performance Monitor or the Unix ps command.

  • CPU utilization is the percentage of the total CPU time spent processing user or system calls on behalf of Vocalizer. As more TTS ports are opened and used for processing, CPU utilization increases approximately linearly. When CPU utilization exceeds 85%, performance degrades rapidly as further ports are opened.
  • Memory use: Vocalizer’s base memory requirement varies per voice, and is usually around 10–20 MB; see the Release Notes for each voice for detailed sizing data on its base memory usage. In addition, each engine instance (as created by calling TtsOpen) requires an incremental amount of memory, usually about 500 KB. The formula for total memory requirements is therefore: base + (# engines * 500 KB). For 50 instances of a voice with a 15 MB base, that works out to 40 MB, as in the sketch after this list.
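
A trivial sketch of that calculation (the per-voice base figure comes from the voice’s Release Notes; the function name is hypothetical):

    /* Total memory estimate: base + (# engines * 500 KB). */
    #include <stdio.h>

    static double total_memory_mb(double base_mb, int engines)
    {
        return base_mb + engines * 0.5;  /* ~500 KB (0.5 MB) per instance */
    }

    int main(void)
    {
        /* Example from the text: 15 MB base, 50 instances -> 40 MB. */
        printf("%.0f MB\n", total_memory_mb(15.0, 50));
        return 0;
    }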

Determining the supported number of channels

Vocalizer supports about 50 parallel channels on a single host. The true maximum depends on hardware, configuration, and your license. When your hosts stay within these guidelines, you can expect the following performance:

  • Average latency <= 250 ms
  • Audio buffer underflow rate <= 0.04%
  • CPU utilization <= 85%

These thresholds may not suit your requirements, and you may wish to rerun tests with different criteria.
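
For example, each result row could be screened against these (or your own) criteria with a check along these lines; the structure and field names are hypothetical:

    /* Screen one result row against the sizing thresholds above.
     * Field names are hypothetical; adjust the limits as needed. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        double avg_latency_ms;  /* average time-to-first-audio */
        double underflow_pct;   /* audio buffer underflow rate */
        double cpu_pct;         /* host CPU utilization */
    } StepResult;

    static bool within_guidelines(const StepResult *r)
    {
        return r->avg_latency_ms <= 250.0 &&
               r->underflow_pct  <= 0.04  &&
               r->cpu_pct        <= 85.0;
    }

    int main(void)
    {
        StepResult r = { 210.0, 0.02, 78.0 };  /* sample numbers */
        printf("channel count %s\n",
               within_guidelines(&r) ? "within guidelines" : "over capacity");
        return 0;
    }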