Krishna Sankar recently developed Mozhi, a Malayalam-English bilingual text-to-speech system. Check out the web demo page, play around with your choice of words, and listen to the natural speech it produces. Krishna has set a future goal of understanding the emotional content of the text and reading it out accordingly, which is expected to make the application suitable for audio books. Generating audio for an arbitrary speaker from very few training samples is another area he plans to work on.
How was Mozhi built?
Neural-network-based text-to-speech synthesis began producing impressive results with Tacotron2, which came out in 2017. This was followed by rapid improvements in speech quality and efficiency through works like GlowTTS, FastSpeech2, FastPitch, and SpeedySpeech, to name a few. The Malayalam text-to-speech synthesis used in Mozhi was built by experimenting with these deep learning approaches and customising them to Malayalam.
The datasets came from IndicTTS and OpenSLR. A phonetic representation of the raw text from the datasets was generated using the open source project Mlphon. The phonetic representation and the corresponding cleaned audio served as input for training the model. Given that typical Malayalam usage also contains sporadic English, the model was trained in a bilingual fashion, supporting both Malayalam and English. Support for multiple speakers was achieved by conditioning the audio synthesis on a speaker embedding. The system can also read out Arabic numerals in Malayalam, using the number spell-out feature of Mlmorph.
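The speaker-embedding idea above can be illustrated with a small sketch. This is not Mozhi's actual code; the table sizes, dimensions, and speaker names are all hypothetical. The point is only that each speaker gets a learned vector which is attached to every phoneme frame, so one model can synthesise multiple voices:

```python
# Hypothetical sketch of speaker-conditioned synthesis input -- not Mozhi's code.
import numpy as np

PHONEME_DIM = 8   # toy embedding sizes, chosen for illustration
SPEAKER_DIM = 4

rng = np.random.default_rng(0)

# Toy phoneme inventory; in Mozhi the phoneme sequence comes from Mlphon.
phoneme_table = {p: rng.normal(size=PHONEME_DIM) for p in ["m", "o", "zh", "i"]}
# One learned embedding per speaker in the multi-speaker dataset.
speaker_table = {s: rng.normal(size=SPEAKER_DIM) for s in ["spk_f1", "spk_m1"]}

def condition_on_speaker(phonemes, speaker):
    """Concatenate the speaker embedding onto every phoneme frame,
    so the decoder's output can vary per speaker."""
    spk = speaker_table[speaker]
    frames = [np.concatenate([phoneme_table[p], spk]) for p in phonemes]
    return np.stack(frames)

x = condition_on_speaker(["m", "o", "zh", "i"], "spk_f1")
print(x.shape)  # (4, 12): 4 phonemes, 8 phoneme dims + 4 speaker dims
```

In a real model the two tables would be trainable embedding layers and the concatenated frames would feed the acoustic decoder, but the conditioning mechanism is the same.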
The TTS model is deployed on Amazon web servers for use by content creators on a pay-per-use model. Details about using the APIs are available at https://mozhi.me/api.
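For the documented API, a client would typically send the text to synthesise as a JSON payload. The endpoint path, parameter names, and speaker identifiers below are purely hypothetical; the actual request format is documented at https://mozhi.me/api:

```python
# Hypothetical client sketch; the real request/response format may differ --
# consult https://mozhi.me/api before using.
import json

def build_tts_request(text, speaker, lang="ml"):
    """Assemble a JSON payload for a hypothetical synthesis endpoint."""
    return json.dumps({"text": text, "speaker": speaker, "lang": lang})

payload = build_tts_request("നമസ്കാരം", speaker="female_1")
print(payload)
# The payload would then be POSTed to the service, e.g.:
# requests.post("https://mozhi.me/api/...", data=payload, headers=...)
```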