Developing with Vocalizer

Text-to-speech (TTS) is the process of converting ordinary text, for example from a file or the command line, into synthesized speech that a user can hear and understand. TTS lets a computer convey information by audio in situations where visual communication is inadequate or impossible, and can thus add value to a product.

Text-to-speech primer

In general, TTS provides a valuable and flexible alternative to digital audio recordings in the following cases:

  • The application does not know in advance what it will need to speak.
  • The information varies too much to record and store all the possibilities.
  • Disk storage is insufficient to store recordings.
  • Professional recordings are too expensive.

Vocalizer also supports a mixed approach: an application can combine digital audio recordings with TTS, or use TTS in which the generated audio comes from concatenating digital audio recordings.
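
The mixed approach can be pictured as a planning step that picks a recording where one exists and falls back to TTS for everything else. The following Python sketch is illustrative only: the Segment type, the RECORDINGS catalogue, the file names, and the plan_prompt function are all hypothetical and are not part of the Vocalizer API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str   # "recording" or "tts"
    value: str  # audio file name for a recording, or text to synthesize


# Hypothetical catalogue of pre-recorded prompts (phrase -> audio file).
RECORDINGS = {
    "your balance is": "balance_intro.wav",
    "thank you for calling": "thanks.wav",
}


def plan_prompt(parts):
    """Use a recording when one exists for a part; otherwise fall back to TTS."""
    plan = []
    for part in parts:
        text = part.strip()
        key = text.lower()
        if key in RECORDINGS:
            plan.append(Segment("recording", RECORDINGS[key]))
        else:
            plan.append(Segment("tts", text))
    return plan


plan = plan_prompt([
    "Your balance is",
    "42 dollars and 17 cents",  # varies per caller, so it cannot be pre-recorded
    "Thank you for calling",
])
for segment in plan:
    print(segment.kind, segment.value)
```

Here the fixed phrases play from recordings while the variable amount goes to TTS, which matches the first two cases in the list above: content that is not known in advance or that varies too much to record.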

Terminology

There are several terms that describe the processing and analysis that occurs during TTS conversion:

  • Grapheme: Text consists of a sequence of graphemes. A grapheme is the most elementary building block in the written form of a language. For example, a grapheme can be an alphabetic character, a grouping of characters that represent a single sound, a digit, punctuation, or a symbol such as the dollar sign ($).
  • Phoneme: A phonetic representation consists of a sequence of phonemes. A phoneme is the most elementary building block in the sound system of a language. In essence, a phoneme constitutes a family of sound variants that a language treats as being the “same.” The concept of phonemes makes it possible to establish patterns of organization in the indefinitely large range of sounds heard in a language. Typically, a specific language contains approximately 50 different phonemes.

    Nuance has established its own specification for the representation of phonemes: the L&H+ phonetic alphabet. This alphabet associates each phoneme with a sequence of one or more characters. The phonemes of the supported languages, with their associated L&H+ representations, are described in the appropriate Vocalizer Language Supplement.

  • Morpheme: A morpheme is a linguistic unit of semantic meaning. A morpheme can be a single word, like “map”, or a prefix or suffix like “un-” or “-able”. Morphological analysis breaks down a sentence into its component morphemes and examines how they form words.
  • Orthography: Orthography is the set of rules that define how to write a language. A phonemic orthography is a writing system where each grapheme corresponds to a single phoneme in the language. For example, the L&H+ phonetic alphabet is a phonemic orthography.
  • Prosody: Prosody is the rhythm, intonation, and stresses used in a spoken language. All these factors can be important in conveying meaning in spoken language.

    For example, the word “record” can be a noun or a verb, depending on which syllable receives the stress. The inflection given to a word can also indicate whether it is a statement or a question, and different parts of a sentence may need different inflections. In tonal languages like Mandarin Chinese, it is particularly important to use the correct prosody, as incorrect prosody can completely change the meaning of a word or phrase.

  • Syntax: The syntax of a language consists of the rules that define how words are used to construct a correct sentence. For example, the syntax may define the appropriate subject-verb agreement, or the proper word order in a sentence.
  • Viseme: A viseme is a generic facial image associated with a sound. It is the visual equivalent of a phoneme in spoken language. You can use visemes to mimic the facial expressions and movements associated with Vocalizer speech. For example, you can synchronize TTS audio with an avatar that appears to speak the audio from the TTS engine. Your customers benefit when they see and hear speech at the same time, and hearing-impaired customers can often read the avatar's lips even if they cannot hear the audio.
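
To make the relationship between phonemes and visemes concrete, the following Python sketch turns a timed phoneme sequence into a viseme timeline that an avatar could follow. It is a minimal illustration under stated assumptions: the phoneme symbols, the viseme names, and the timing values are invented for this example. They are not the L&H+ alphabet and not the viseme set or event format that Vocalizer reports.

```python
# Illustrative phoneme symbols and viseme classes only.
# NOT the L&H+ alphabet and NOT a Vocalizer viseme set.
PHONEME_TO_VISEME = {
    "m": "lips_closed", "b": "lips_closed", "p": "lips_closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "a": "open_wide", "o": "rounded", "u": "rounded",
}


def viseme_timeline(timed_phonemes):
    """Map (phoneme, start_ms, end_ms) tuples to merged (viseme, start_ms, end_ms) runs."""
    timeline = []
    for phoneme, start, end in timed_phonemes:
        # Unknown phonemes fall back to a neutral mouth shape.
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        if timeline and timeline[-1][0] == viseme and timeline[-1][2] == start:
            # Same mouth shape continues without a gap: extend the previous run.
            timeline[-1] = (viseme, timeline[-1][1], end)
        else:
            timeline.append((viseme, start, end))
    return timeline


timeline = viseme_timeline([
    ("m", 0, 80), ("a", 80, 200), ("p", 200, 260), ("b", 260, 330),
])
for viseme, start, end in timeline:
    print(f"{start:4d}-{end:4d} ms  {viseme}")
```

Several phonemes share one viseme in this table because sounds such as "p", "b", and "m" look alike on the lips, so adjacent phonemes with the same mouth shape are merged into a single run.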

How TTS works

The goal of TTS conversion is to take the input text and produce audio that conveys the intended meaning as accurately and naturally as possible. This conversion takes place in three general stages of processing: linguistic, phonetic, and acoustic.

  1. Ordinary text is entered into the system, and linguistic processing converts it into a phonetic representation.
  2. Phonetic processing calculates the speech parameters from this representation.
  3. Acoustic processing uses these parameters to generate a synthetic speech signal.
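
The three stages above can be sketched as a toy pipeline in Python. This is a deliberately simplified illustration, not how Vocalizer is implemented: the LEXICON and PARAMS tables, the phoneme symbols, and the sine-wave "synthesis" are all assumptions made for the example. A real engine performs far richer text analysis and does not generate speech from plain sine tones.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

# Toy lexicon: word -> phoneme symbols (illustrative, not L&H+).
LEXICON = {"hi": ["h", "ai"], "mom": ["m", "o", "m"]}

# Toy speech parameters per phoneme: (duration in ms, pitch in Hz).
# A pitch of 0 stands for an unvoiced sound, rendered here as silence.
PARAMS = {"h": (50, 0.0), "ai": (200, 220.0), "m": (80, 180.0), "o": (150, 200.0)}


def linguistic(text):
    """Stage 1: convert text into a phonetic representation."""
    phonemes = []
    for word in text.lower().split():
        # Words missing from the toy lexicon are simply skipped.
        phonemes.extend(LEXICON.get(word.strip(".,!?"), []))
    return phonemes


def phonetic(phonemes):
    """Stage 2: calculate speech parameters from the phonetic representation."""
    return [PARAMS[p] for p in phonemes]


def acoustic(params):
    """Stage 3: generate a (very crude) synthetic signal from the parameters."""
    samples = []
    for duration_ms, pitch in params:
        count = duration_ms * SAMPLE_RATE // 1000
        for i in range(count):
            samples.append(math.sin(2 * math.pi * pitch * i / SAMPLE_RATE))
    return samples


phonemes = linguistic("Hi, mom!")
params = phonetic(phonemes)
signal = acoustic(params)
print(phonemes, len(signal))
```

Each function consumes only the output of the previous stage, which mirrors the hand-off from text to phonetic representation to speech parameters to signal described in the steps.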