What is a phonetic lexicon?
A pronunciation dictionary, or phonetic lexicon, is a list of words paired with their pronunciations, each described as a sequence of phonemes. It is an essential component in training and decoding speech-to-text (STT) and text-to-speech (TTS) systems. A pronunciation dictionary differs slightly from a plain phonetic transcription: it must contain delimiters between phonemes, and a space is the usual default.
Sample entries in a Malayalam pronunciation dictionary:
ഒന്ന് o n̪ n̪ ə
രണ്ട് ɾ a ɳ ʈ ə
മൂന്ന് m uː n̪ n̪ ə
നാല് n̪ aː l ə
അഞ്ച് a ɲ t͡ʃ ə
എന്നാൽ e n n aː l
എന്നാൽ e n̪ n̪ aː l
ഐഎസ്ആർഒ ai̯ e s ə aː r o
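To make the entry format concrete, here is a minimal sketch of how such entries could be read into memory. It assumes that each entry sits on its own line, that the first whitespace-separated token is the word, and that the remaining tokens are phonemes. The function name parse_lexicon is illustrative and not part of any library.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and words without a pronunciation
        word, phonemes = parts[0], parts[1:]
        lexicon[word].append(phonemes)
    return dict(lexicon)


# A few of the sample entries shown above.
SAMPLE = [
    "ഒന്ന് o n̪ n̪ ə",
    "എന്നാൽ e n n aː l",
    "എന്നാൽ e n̪ n̪ aː l",
]

lexicon = parse_lexicon(SAMPLE)
for word, pronunciations in lexicon.items():
    print(word, len(pronunciations))
```

Keeping a list of pronunciations per word, rather than a single string, preserves entries such as എന്നാൽ that appear more than once with different pronunciations.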
Are phonetic lexicons available for other languages?
Ready-to-use, machine-readable pronunciation dictionaries are available for many world languages. CMUDict is an open-source, machine-readable pronunciation lexicon for North American English that contains over 134k words and their pronunciations. Similar efforts for other languages include GlobalPhone, which provides pronunciation lexicons for 20 world languages; the LC-STAR Phonetic Lexica for 13 languages; an Arabic speech recognition pronunciation lexicon with two million pronunciation entries for 526k Modern Standard Arabic words; an ASR-oriented Indian English pronunciation lexicon; and a manually curated Bangla phonetic lexicon of 65k lexical entries prepared for TTS, to mention a few.
How to create a Malayalam pronunciation dictionary?
Using Mlphon, it is easy to create a pronunciation dictionary. The PhoneticAnalyser class in Mlphon has a method to analyse a valid Malayalam word and return the list of phonemes in the word, along with articulatory and orthographic feature information. A pronunciation dictionary can be created by eliminating the feature information from the analysis result. The utility function phonemize in Mlphon performs this task. It can return the list of phonemes with user-defined delimiters at syllable and phoneme boundaries.
If you want to create a pronunciation dictionary for a set of words, use the following snippet, createlexicon.py. It takes an input file of Malayalam words, one word per line, and generates a lexicon with a space after each phoneme and a period after each syllable boundary (Line #21). As the sample lexicon shows, words like എന്നാൽ can have two valid pronunciations, both of which are provided by Mlphon. However, Mlphon rejects English abbreviations like ഐഎസ്ആർഒ, because they contain word-medial independent vowels, which are invalid as per Malayalam script grammar.
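The overall loop can be sketched without depending on Mlphon itself. The example below is not createlexicon.py. It is a minimal illustration in which build_lexicon, toy_g2p and the tab separator are all hypothetical, and toy_g2p is a small lookup table standing in for a real grapheme-to-phoneme call. It shows the two behaviours described above: a word with two valid pronunciations yields two entries, and a word the analyser cannot handle yields none.

```python
from typing import Callable, Iterable, List

# Stand-in pronunciations copied from the sample lexicon; a real script
# would obtain these from a grapheme-to-phoneme tool instead.
TOY_PRONUNCIATIONS = {
    "ഒന്ന്": ["o n̪ n̪ ə"],
    "എന്നാൽ": ["e n n aː l", "e n̪ n̪ aː l"],
}
SAMPLE_WORDS = list(TOY_PRONUNCIATIONS) + ["", "unknown"]


def toy_g2p(word: str) -> List[str]:
    """Hypothetical G2P stub: returns every known pronunciation, or []."""
    return TOY_PRONUNCIATIONS.get(word, [])


def build_lexicon(words: Iterable[str], g2p: Callable[[str], List[str]]) -> List[str]:
    """Return one 'word<TAB>pronunciation' line per valid pronunciation."""
    entries = []
    for raw in words:
        word = raw.strip()
        if not word:
            continue  # ignore blank lines in the word list
        for pronunciation in g2p(word):
            entries.append(f"{word}\t{pronunciation}")
    return entries


for entry in build_lexicon(SAMPLE_WORDS, toy_g2p):
    print(entry)
```

Passing the grapheme-to-phoneme step in as a callable keeps the file-handling logic separate from the analyser, so the stub can be swapped for a real tool without changing the loop.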
If your list of words contains English abbreviations, the next code snippet, createlexicon_expanded.py, is suggested. It splits such words at the positions of word-medial independent vowels before passing the parts to Mlphon for analysis.
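One way the splitting step could work is sketched below. This is an illustration of the idea rather than the code in createlexicon_expanded.py. The function name split_at_medial_vowels is hypothetical, and the set of independent vowels is assumed to be the Unicode range U+0D05 to U+0D14 plus U+0D60 and U+0D61.

```python
# Malayalam independent vowels (assumed set): U+0D05..U+0D14, U+0D60, U+0D61.
INDEPENDENT_VOWELS = {chr(cp) for cp in range(0x0D05, 0x0D15)} | {"\u0D60", "\u0D61"}


def split_at_medial_vowels(word):
    """Split a word before every independent vowel that is not word-initial."""
    parts = []
    current = ""
    for ch in word:
        # An independent vowel after other characters starts a new part.
        if ch in INDEPENDENT_VOWELS and current:
            parts.append(current)
            current = ""
        current += ch
    if current:
        parts.append(current)
    return parts


print(split_at_medial_vowels("ഐഎസ്ആർഒ"))
print(split_at_medial_vowels("കേരളം"))
```

With standard Unicode encoding, ഐഎസ്ആർഒ should split into four parts (ഐ, എസ്, ആർ, ഒ), each of which begins with an independent vowel and can be analysed on its own, while an ordinary word such as കേരളം is returned unchanged as a single part.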
This Python notebook exemplifies the usage of these code snippets.
Is there a ready to use Malayalam Pronunciation Dictionary?
Yes. If you do not want to create one on your own, we have published a set of pronunciation dictionaries covering different categories of words, as described in the table below.
|Category|Number of Lexical Entries|Description|
|---|---|---|
|Common Words|100,000|Most common 100k word forms in decreasing order of frequency|
|Verbs|3,895|Malayalam verbs in citation form in alphabetic order|
|Nouns|59,763|Malayalam nouns in alphabetic order|
|Proper Nouns|6,751|Common person names, place names and brand names in alphabetic order|
|Foreign Words|4,350|Sanskrit and English borrowed words|
The entries in the common words pronunciation lexicon are extracted from a general-domain text corpus of 167 million types covering fields such as business, entertainment, sports and technology, as described in the Indic NLP dataset. The remaining categories are curated word lists from the Malayalam morphology analyser, Mlmorph.
Citing this work
If you want to use these code snippets in your project, you are free to do so, citing the original work as:
K. Manohar, A. R. Jayan and R. Rajan, "Mlphon: A Multifunctional Grapheme-Phoneme Conversion Tool Using Finite State Transducers," in IEEE Access, vol. 10, pp. 97555-97575, 2022, doi: 10.1109/ACCESS.2022.3204403.
Applications Using Mlphon
- Mozhi Malayalam-English code switched TTS
- Malayalam Automatic Speech Recognition
- Web Demo of Malayalam Phonetic Analysis