Speech synthesis systems require accurate phoneme segmentation, as this segmentation information is explicitly modelled with Hidden Markov Models (duration modelling) and is regenerated during the synthesis phase. Any errors in the segmentation may lead to loss of timbre, glitches and other artifacts in the synthesized voice. Conventionally, segmentation is carried out as a three-step process: (1) flat-start initialization of monophone HMMs, (2) embedded reestimation and (3) forced Viterbi alignment. A fundamental drawback of this approach is that the model does not represent boundaries explicitly, because HMM training does not use proximity to boundary positions as a criterion for optimality.
Phone transitions are not always clearly distinguishable, owing to coarticulation in continuous speech. Syllable boundaries, on the other hand, are more or less distinct, owing to the syllable being the fundamental unit of speech production and cognition. The acoustic energy between syllables is significantly lower than in the middle of a syllable. Since syllable boundaries are characterised by low energy, short-time energy (STE) can be used as a cue to locate them, but it cannot be applied directly because of local fluctuations. The STE function, when smoothed by group delay processing, can be used to detect syllable boundaries.
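The energy cue described above can be illustrated with a minimal sketch. It computes the STE per frame and picks local minima as boundary candidates. Note the assumptions: the smoothing here is a plain moving average used only as a stand-in, it is not the group delay processing referred to in the text, and the function names, frame length, hop and smoothing width are all hypothetical choices.

```python
import numpy as np

def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame of `frame_len` samples."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

def smooth(x, width):
    """Moving-average smoothing (a stand-in, NOT group delay processing)."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def local_minima(x):
    """Indices of interior frames that are lower than their neighbours."""
    return [
        i for i in range(1, len(x) - 1)
        if x[i] < x[i - 1] and x[i] <= x[i + 1]
    ]
```

Applied to a signal with two energy humps separated by a dip, the smoothed STE would be expected to show a local minimum in the frame nearest the dip, which serves as a syllable boundary candidate.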
Syllable boundary correction with group delay: If the syllable does not end with a nasal or fricative, or if it is not followed by a nasal, fricative, affricate or semi-vowel, the HMM boundary of that syllable is moved to the nearby low-energy region indicated by the high-resolution group delay function.
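The boundary move itself can be sketched as a nearest-candidate snap. This is only an illustration under stated assumptions: the candidate times are taken to come from the group delay function, the phone-class eligibility check is left to the caller, and the `max_shift` tolerance that defines "nearby" is a hypothetical parameter that the text does not specify.

```python
def snap_boundary(hmm_time, candidates, max_shift):
    """Move an HMM boundary to the closest candidate boundary.

    hmm_time   : boundary time from forced alignment (seconds)
    candidates : candidate boundary times (assumed to come from the
                 group delay function)
    max_shift  : hypothetical tolerance; the boundary is left unchanged
                 if no candidate lies within this distance
    """
    if not candidates:
        return hmm_time
    nearest = min(candidates, key=lambda c: abs(c - hmm_time))
    if abs(nearest - hmm_time) <= max_shift:
        return nearest
    return hmm_time
```

Leaving the boundary untouched when no candidate is close enough is a conservative design choice, so that a missed detection does not drag the boundary far from the HMM estimate.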
Syllable boundary correction with spectral flux: Syllable boundaries for certain classes of phonemes, such as fricatives and affricates, can be characterized by an abrupt change in spectral energy. Sub-band spectral flux, the Euclidean distance between two successive frames of FFT coefficients computed in 4 different sub-bands, is used to capture such acoustic landmarks in the spectrogram. These acoustic landmarks can then be used for correcting the HMM boundaries of these classes of phonemes.
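A minimal sketch of sub-band spectral flux follows. Several details are assumptions not given in the text: the flux is computed on magnitude spectra, a Hann window is applied to each frame, and the spectrum is split into 4 bands of roughly equal width. The function name and parameters are hypothetical.

```python
import numpy as np

def subband_spectral_flux(signal, frame_len, hop, n_bands=4):
    """Euclidean distance between successive magnitude spectra, per band.

    Returns an array of shape (n_frames - 1, n_bands).
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed window choice
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # Equal-width bands are an assumption; the text only says 4 sub-bands.
    bands = np.array_split(mag, n_bands, axis=1)
    return np.stack(
        [np.linalg.norm(np.diff(band, axis=0), axis=1) for band in bands],
        axis=1,
    )
```

Peaks in the per-band flux (or in the sum across bands) would then mark candidate landmarks where the spectrum changes abruptly, such as at the onset of a fricative.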
After the syllable boundaries are corrected, the syllables are spliced and the models within the syllables are reestimated, followed by a forced Viterbi alignment at the syllable level. This boundary correction and reestimation is done as a two-pass procedure. The final alignment is obtained by concatenating these syllable splices to form an utterance-level alignment. This phoneme alignment is more accurate than that produced by the flat-start alignment procedure. The accurate alignment is then used for initializing phone models in the Hidden Markov model based speech synthesis system (HTS) framework.
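The final concatenation step can be sketched as follows. The data layout is a hypothetical one chosen for illustration: each syllable splice carries its start time within the utterance and a list of phone segments whose times are relative to the splice.

```python
def concatenate_alignments(syllables):
    """Build an utterance-level phone alignment from syllable splices.

    syllables : list of (syllable_start, phones), where phones is a list
                of (label, rel_start, rel_end) relative to the splice
                (hypothetical layout, times in milliseconds)
    """
    utterance = []
    for syllable_start, phones in syllables:
        for label, rel_start, rel_end in phones:
            # Shift splice-relative times onto the utterance timeline.
            utterance.append(
                (label, syllable_start + rel_start, syllable_start + rel_end)
            )
    return utterance
```

For example, a splice starting at 180 ms that contains a phone spanning 0 to 60 ms contributes a segment from 180 to 240 ms in the utterance-level alignment.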