TRAMES, 2007, 11(61/56), 2, 284–298


Modelling speech temporal structure for Estonian text-to-speech synthesis: feature selection



Meelis Mihkla


Institute of the Estonian Language, Tallinn


Abstract. The article discusses the principles of selecting features for modelling the temporal structure of Estonian speech, based on different types of read-aloud texts, with a view to text-to-speech (TTS) synthesis. Feature selection is known to depend on general factors governing the temporal structure of speech as well as on language-specific aspects. The durational model of Estonian stands out in that its input includes some foot-bound features (foot quantity degree, number of feet in the word). In addition to the traditional descriptors of sound context and hierarchical position, the prediction of Estonian segmental durations requires information on some morphological, syntactic and lexical features of the word, such as word form, part of sentence, and part of speech. In the prediction of pauses in the speech flow, the relevant features are the distance from the beginning of the sentence and from the previous pause, the length and quantity degree of the preceding foot, and the occurrence of a punctuation mark or conjunction. Although expert opinions were used in feature selection, statistical methods should be applied to test the optimal vector of argument features.


Keywords: feature selection, speech timing, segmental durations, pauses, text-to-speech synthesis, feature significance, statistical modelling
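
The pause-prediction features listed in the abstract can be pictured as one input record per word boundary. The following is only an illustrative sketch, not the article's implementation: the class name, the field names, the choice of units (words for distances, phonemes for foot length) and the numeric encoding are all assumptions made for the example.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass(frozen=True)
class PauseFeatures:
    """Hypothetical feature record for one candidate pause position.

    Field names and units are illustrative assumptions; the abstract
    names the features but does not specify how they are encoded.
    """
    dist_from_sentence_start: int  # assumed unit: words
    dist_from_prev_pause: int      # assumed unit: words
    prev_foot_length: int          # assumed unit: phonemes
    prev_foot_quantity: int        # quantity degree of the preceding foot
    punctuation_follows: bool      # punctuation mark at this position
    conjunction_follows: bool      # conjunction at this position

    def __post_init__(self) -> None:
        # Estonian distinguishes three quantity degrees.
        if self.prev_foot_quantity not in (1, 2, 3):
            raise ValueError("quantity degree must be 1, 2 or 3")


def to_vector(features: PauseFeatures) -> List[float]:
    """Flatten the record into a numeric vector (booleans become 0.0/1.0)."""
    return [float(value) for value in astuple(features)]


example = PauseFeatures(
    dist_from_sentence_start=5,
    dist_from_prev_pause=3,
    prev_foot_length=4,
    prev_foot_quantity=2,
    punctuation_follows=True,
    conjunction_follows=False,
)
print(to_vector(example))
```

A vector of this kind could then be passed to whichever statistical model is used to test the significance of each feature; the choice of model is left open here.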



