Developing with Vocalizer
Text-to-speech (TTS) refers to the process of converting normal text from a file or command line into synthesized speech that a user can hear and understand. TTS allows a computer to communicate audio information in situations where visual communication is inadequate or impossible, and can thus add extra value to a product.
Text-to-speech primer
In general, TTS provides a valuable and flexible alternative to digital audio recordings in the following cases:
- The application does not know in advance what it will need to speak.
- The information varies too much to record and store all the possibilities.
- Disk storage is insufficient to store recordings.
- Professional recordings are too expensive.
Vocalizer also supports applications that mix digital audio recordings with TTS, as well as TTS in which the generated audio is produced by concatenating digital audio recordings.
Terminology
There are several terms that describe the processing and analysis that occurs during TTS conversion:
- Grapheme: Text consists of a sequence of graphemes. A grapheme is the most elementary building block in the written form of a language. For example, a grapheme can be an alphabetic character, a grouping of characters that represent a single sound, a digit, punctuation, or a symbol such as the dollar sign ($).
- Phoneme: A phonetic representation consists of a sequence of phonemes. A phoneme is the most elementary building block in the sound system of a language. In essence a phoneme constitutes a family of sound variants, which a language treats as being the “same.” The concept of phonemes makes it possible to establish patterns of organization in the indefinitely large range of sounds heard in a language. Typically, a specific language contains approximately 50 different phonemes.
Nuance has established its own specification for the representation of phonemes: the L&H+ phonetic alphabet. This alphabet associates each phoneme with a sequence of one or more characters. The phonemes of the supported languages, with their associated L&H+ representations, are described in the appropriate Vocalizer Language Supplement.
- Morpheme: A morpheme is a linguistic unit of semantic meaning. A morpheme can be a single word, like “map”, or a prefix or suffix like “un-” or “-able”. Morphological analysis breaks down a sentence into its component morphemes and examines how they form words.
- Orthography: Orthography is the set of rules that define how to write a language. A phonemic orthography is a writing system where each grapheme corresponds to a single phoneme in the language. For example, the L&H+ phonetic alphabet is a phonemic orthography.
- Prosody: Prosody is the rhythm, intonation, and stresses used in a spoken language. All these factors can be important in conveying meaning in spoken language.
For example, the word “record” can be a noun or a verb, depending on which syllable receives the stress. The inflection given to a word can also indicate whether it is a statement or a question, and different parts of a sentence may need different inflections. In tonal languages like Mandarin Chinese, it is particularly important to use the correct prosody, as incorrect prosody can completely change the meaning of a word or phrase.
- Syntax: The syntax of a language consists of the rules that define how words are used to construct a correct sentence. For example, the syntax may define the appropriate subject-verb agreement, or the proper word order in a sentence.
- Viseme: A generic facial image associated with a sound. It is the visual equivalent of a linguistic phoneme of spoken language. You can use visemes to mimic the facial expressions and movements associated with Vocalizer speech. For example, you can synchronize TTS audio with an avatar that appears to speak the audio from the TTS engine. Your customers benefit when they see and hear speech at the same time, and hearing-impaired customers can often read the avatar's lips even if they cannot hear the audio.
How TTS works
The goal of TTS conversion is to take the input text and produce audio that conveys the intended meaning as accurately and naturally as possible. This conversion takes place in three general stages of processing: linguistic, phonetic, and acoustic.
- First, ordinary text is entered into the system. Linguistic processing converts text into a phonetic representation.
- Phonetic processing calculates the speech parameters from this representation.
- Acoustic processing uses these parameters to generate a synthetic speech signal.
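The three stages above can be pictured as a simple pipeline. The following is a minimal sketch only, not Vocalizer's implementation or API: the function names, the toy lexicon, the placeholder phoneme symbols (which are not L&H+), and the constant duration and pitch values are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechParameters:
    phoneme: str
    duration_ms: int
    pitch_hz: float


# Hypothetical toy lexicon; the phoneme symbols are placeholders, not L&H+.
TOY_LEXICON = {"hi": ["h", "ay"], "there": ["dh", "eh", "r"]}


def linguistic_stage(text: str) -> List[str]:
    """Convert text into a phoneme sequence (dictionary lookup only)."""
    phonemes: List[str] = []
    for word in text.lower().split():
        # Words missing from the toy lexicon are silently skipped.
        phonemes.extend(TOY_LEXICON.get(word.strip(".,!?"), []))
    return phonemes


def phonetic_stage(phonemes: List[str]) -> List[SpeechParameters]:
    """Attach speech parameters to each phoneme (constant values here)."""
    return [SpeechParameters(p, duration_ms=80, pitch_hz=120.0) for p in phonemes]


def acoustic_stage(params: List[SpeechParameters], sample_rate: int = 8000) -> List[float]:
    """Generate a signal: one plain sine tone per phoneme as a stand-in."""
    signal: List[float] = []
    for p in params:
        n_samples = p.duration_ms * sample_rate // 1000
        for i in range(n_samples):
            signal.append(math.sin(2 * math.pi * p.pitch_hz * i / sample_rate))
    return signal


if __name__ == "__main__":
    samples = acoustic_stage(phonetic_stage(linguistic_stage("Hi there.")))
    print(len(samples), "samples generated")
```

Each stage consumes only the output of the previous one, which is the property the sketch is meant to show; a real system replaces every function body with far richer processing.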

The linguistic processing of a TTS system performs several tasks:
- Text preprocessing
- Text normalization
- Orthographic-to-phonetic conversion (that is, grapheme-to-phoneme conversion and stress assignment)
- Lexical and morphological analysis, syntactic analysis, and, to a lesser extent, semantic analysis.
Text preprocessing breaks the input text into individual sentences. For specific application domains, additional intelligence can be built into a text preprocessing module.
Vocalizer can read aloud any written text, even if it contains various abbreviations, dates, currency indicators, time indicators, addresses, telephone numbers, bank account numbers, and punctuation marks such as quotation marks, parentheses, apostrophes, and so on.
For example, to solve the abbreviation problem, use an abbreviation dictionary. When abbreviations do not occur in the dictionary, Vocalizer pronounces them as single words or spells them out, depending on the structure of the abbreviation.
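As a rough illustration of this lookup-then-fallback behavior, consider the sketch below. The dictionary entries and the vowel-based heuristic for deciding between spelling out and reading as a word are assumptions made for this example; the rules Vocalizer actually applies to the structure of an abbreviation are not described here.

```python
# Hypothetical abbreviation dictionary; the entries are illustrative only.
ABBREVIATIONS = {"dr.": "doctor", "approx.": "approximately", "etc.": "et cetera"}

VOWELS = set("aeiouy")


def expand_abbreviation(token: str) -> str:
    """Expand a token using the dictionary, else fall back on its structure."""
    key = token.lower()
    if key in ABBREVIATIONS:
        return ABBREVIATIONS[key]
    letters = [c for c in token if c.isalpha()]
    # Assumed heuristic: a token without vowels cannot be read as a word,
    # so it is spelled out letter by letter.
    if not any(c.lower() in VOWELS for c in letters):
        return " ".join(c.upper() for c in letters)
    # Otherwise it is passed on to be pronounced as a single word.
    return "".join(letters).lower()


if __name__ == "__main__":
    for tok in ["Dr.", "TTS", "NATO"]:
        print(tok, "->", expand_abbreviation(tok))
```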
Vocalizer normalizes digits according to the syntactic and semantic context in which they appear. In English (as in Dutch and German), digit strings such as 1991 are pronounced differently depending on whether they represent a number or a year. This is not the case in Spanish or French. However, in Spanish the conversion of digit strings also needs lexical information, because the pronunciation of the digit string sometimes changes depending on the gender of the noun, or on the abbreviation that follows the string.
To handle text normalization, TTS systems use orthographic knowledge, often expressed as context-dependent linguistic rules, in combination with dictionary lookup.
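The number-versus-year distinction for English digit strings can be sketched with a single context-dependent rule. This is an illustrative toy, not Vocalizer's normalization logic: the cue-word list, the function names, and the restriction to values up to four digits are assumptions.

```python
from typing import List

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical cue words: a preceding cue marks the digit string as a year.
YEAR_CUES = {"in", "since", "year"}


def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def cardinal(n: int) -> str:
    """Spell out 0..9999 as a cardinal number."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def year(n: int) -> str:
    """Read 1100..1999 in two-digit pairs; otherwise fall back to cardinal."""
    if not 1100 <= n <= 1999:
        return cardinal(n)
    high, low = divmod(n, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def normalize_digits(tokens: List[str]) -> List[str]:
    """Replace short digit strings with words, choosing the reading by context."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.isdecimal() and len(tok) <= 4:
            prev = tokens[i - 1].lower() if i > 0 else ""
            out.append(year(int(tok)) if prev in YEAR_CUES else cardinal(int(tok)))
        else:
            out.append(tok)
    return out
```

With these assumptions, `normalize_digits("born in 1991".split())` reads the digits as "nineteen ninety-one", while `normalize_digits("costs 1991 dollars".split())` reads them as "one thousand nine hundred ninety-one".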
Orthographic-to-phonetic conversion is one of the main tasks of the linguistic processing stage of TTS. Vocalizer possesses a large amount of pronunciation knowledge to perform this task, which includes grapheme-to-phoneme conversion, syllabification, and stress assignment.
There are many different methods of orthographic-to-phonetic conversion:
- Consulting dictionaries containing full word forms or morphemes.
- Using a set of pronunciation rules.
- Using techniques such as neural nets or classification trees.
Vocalizer uses a hybrid strategy that combines word dictionaries, morpheme dictionaries, and pronunciation rules. Although the same strategy can be used for the development of all language versions, each language has its own particularities.
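One way to picture a hybrid strategy of this kind is as a cascade: try the word dictionary first, then a decomposition into known morphemes, and finally letter-level rules. The sketch below makes that ordering an assumption, since the actual order and interaction of the knowledge sources in Vocalizer are not specified here. The dictionaries, the one-letter "rules", and the phoneme symbols (placeholders, not L&H+) are likewise invented for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical knowledge sources with placeholder phoneme symbols.
WORD_DICT = {"the": ["dh", "ax"]}
MORPHEME_DICT = {
    "un": ["ah", "n"],
    "lock": ["l", "aa", "k"],
    "able": ["ax", "b", "ax", "l"],
}
LETTER_RULES = {"z": "z", "i": "ih", "p": "p"}


def split_morphemes(word: str) -> Optional[List[str]]:
    """Cover the word with known morphemes (longest match first), or return None."""
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in MORPHEME_DICT:
            tail = split_morphemes(word[end:])
            if tail is not None:
                return [head] + tail
    return None


def to_phonemes(word: str) -> Tuple[List[str], str]:
    """Return the phonemes and the name of the knowledge source that produced them."""
    w = word.lower()
    if w in WORD_DICT:
        return list(WORD_DICT[w]), "word dictionary"
    morphemes = split_morphemes(w)
    if morphemes:
        return [p for m in morphemes for p in MORPHEME_DICT[m]], "morpheme dictionary"
    # Last resort: one toy rule per letter; unknown letters pass through unchanged.
    return [LETTER_RULES.get(ch, ch) for ch in w], "rules"


if __name__ == "__main__":
    for word in ["the", "unlockable", "zip"]:
        print(word, to_phonemes(word))
```

Placing the dictionaries ahead of the rules lets listed exceptions override general rules, which is a common reason for combining the approaches.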
Vocalizer performs lexical, morphological, and syntactic analysis to solve pronunciation ambiguities. Lexical analysis converts the input characters into categorized blocks of text, such as sentences, words, and special types like dates and numbers. Morphological analysis breaks down a sentence into its component morphemes (such as root words, prefixes, and suffixes) and examines how they form words. Syntactic analysis determines the grammatical structure of the text.
For example, the English word “record” is stressed on the second syllable as a verb (re-CORD) and on the first syllable as a noun (REC-ord). In French, the character string “président” is pronounced differently depending on its part of speech (noun or verb).
Lexical, morphological, and syntactic information also creates a correct prosodic pattern for each sentence. For example, important syntactic boundaries entail intonational changes and vowel lengthening.
Vocalizer tags isolated words with their parts of speech using a combination of morphological rules and dictionary look-up. For example, particular word endings help predict the part of speech of words.
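A combination of dictionary lookup and word-ending rules can be sketched as follows. The tag names, dictionary entries, suffix list, and the default guess are assumptions chosen for this example and do not reflect Vocalizer's actual rule set.

```python
# Hypothetical part-of-speech dictionary and suffix rules (illustrative only).
POS_DICT = {"the": "DET", "of": "PREP", "is": "VERB"}
SUFFIX_RULES = [
    ("ness", "NOUN"),
    ("tion", "NOUN"),
    ("ous", "ADJ"),
    ("ing", "VERB"),
    ("ly", "ADV"),
]


def tag_word(word: str) -> str:
    """Tag an isolated word: dictionary first, then word endings, then a default."""
    w = word.lower()
    if w in POS_DICT:
        return POS_DICT[w]
    for suffix, tag in SUFFIX_RULES:
        # Require some stem before the suffix so short words are not mis-tagged.
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return tag
    return "NOUN"  # assumed default for unknown words


if __name__ == "__main__":
    for word in ["The", "quickly", "happiness", "famous", "Vocalizer"]:
        print(word, tag_word(word))
```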
The syntactic analysis can be performed with different parsing techniques. Some of these techniques are developed within the field of Natural Language Processing (NLP) and adapted to the special needs of TTS synthesis. For example, parsing techniques for TTS need to meet real-time requirements much more strictly than those for NLP applications such as text translation.
Vocalizer does not perform a full syntactic analysis; that is, it does not construct a full syntax tree, but rather performs phrase-level parsing. For instance, Vocalizer can use context-dependent rules to solve part-of-speech ambiguities and divide a sentence into word groups and prosodic phrases.

The phonetic module performs two main tasks: segmental synthesis and creation of good prosodic patterns.
Vocalizer’s segmental synthesis module is responsible for the synthesis of the spectral characteristics of synthetic speech. The segmental synthesis module also handles amplitude (loudness).
Vocalizer creates good prosodic characteristics to ensure intelligible and natural sounding speech. To synthesize prosody, Vocalizer assigns a correct duration to each phoneme, and produces an intonation contour.
Intonation modeling has to take several important principles into account:
- Each sentence contains one or more important or dominant words.
- In many languages, an important word is marked by means of an intonation accent realized as a pitch movement on the lexically accented syllable of the important word.
- Intonation is not only used to emphasize words but also to mark the sentence type (for example, declarative versus interrogative, WH-questions versus yes/no-questions) and to mark important syntactic boundaries (for example, with phrase final continuation rises).
- In tone languages such as Chinese, Vocalizer conveys word meanings and grammatical contrasts by variations in pitch. In pitch-accent languages such as Swedish and Japanese, Vocalizer pronounces a particular syllable in a word with a certain tone. This is in contrast to languages such as English where each word has a fixed lexical stress position.
Vocalizer’s language-specific intonation module models the perceptually relevant intonation effects of the target language. It takes into account the number, location, and stress level of the important words, the location of the major syntactic boundaries, and the sentence type.
Assigning a correct duration to each phoneme is essential. Phoneme durations are influenced by several factors. A partial list includes:
- Phonetic context
- Stress level
- Position within the word
- Syntactic structure of the sentence
- Opposition between content and function words
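A simple way to see how such factors might combine is a multiplicative model: start from a base duration per phoneme and scale it by one factor per condition. The sketch below covers only three of the listed factors, and every number in it (base durations and scaling factors) is invented for illustration; it is not Vocalizer's duration model.

```python
# Hypothetical base durations in milliseconds (placeholder phoneme symbols).
BASE_MS = {"ae": 120, "t": 60, "s": 90}
DEFAULT_MS = 80

# Assumed scaling factors; real systems derive these from recorded speech data.
STRESS_FACTOR = 1.2
PHRASE_FINAL_FACTOR = 1.5   # lengthening before a major syntactic boundary
FUNCTION_WORD_FACTOR = 0.8  # function words tend to be reduced


def phoneme_duration_ms(phoneme: str,
                        stressed: bool = False,
                        phrase_final: bool = False,
                        in_function_word: bool = False) -> int:
    """Scale a base duration by one factor per active condition."""
    duration = float(BASE_MS.get(phoneme, DEFAULT_MS))
    if stressed:
        duration *= STRESS_FACTOR
    if phrase_final:
        duration *= PHRASE_FINAL_FACTOR
    if in_function_word:
        duration *= FUNCTION_WORD_FACTOR
    return round(duration)


if __name__ == "__main__":
    print(phoneme_duration_ms("ae"))
    print(phoneme_duration_ms("ae", stressed=True, phrase_final=True))
```

The multiplicative form is chosen here only because it keeps the factors independent and easy to read; other formulations are equally possible.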

The last part of a TTS conversion is the acoustic processing where Vocalizer converts the speech data into a speech signal. The chosen synthesis model needs to allow the independent manipulation of spectral characteristics, phoneme duration, and intonation.