Malayalam Phonetic Analyser: Version 1.0.0

Edit (September 20, 2022): A detailed report on this work is now available as a journal article.

In the previous post, I shared a work-in-progress version of a finite state transducer based Malayalam phonetic analyser. A phonetic analyser analyses the written form of a text and returns the phonetic characteristics of its grapheme sequence.

Understanding the phonetic characteristics of a word is helpful in many computational linguistics problems. For instance, converting a word into its phonetic representation is a necessary step in a text-to-speech (TTS) system. The phonetic representation also helps in transliterating the word to a different script. It is equally useful if the phonetic representation can be converted back to the grapheme sequence.

The first version of the mlphon project is now released. It is packaged as a Python library on PyPI. You can install it with

pip install mlphon

It has built-in methods for bidirectional grapheme to phoneme conversion, IPA mapping and a syllablizer. These three functions have command line tools as well. Try them out for yourself.

Examples

Syllablizer

$ mlsyllablizer

For the input

സഫലമീയാത്ര

the output would be

<BoS>സ<EoS><BoS>ഫ<EoS><BoS>ല<EoS><BoS>മീ<EoS><BoS>യാ<EoS><BoS>ത്ര<EoS>

['സ', 'ഫ', 'ല', 'മീ', 'യാ', 'ത്ര']

<BoS> indicates the beginning of a syllable and <EoS> the end of a syllable.
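If you only have the tagged string, the syllable list can be recovered with a few lines of Python. This is a minimal sketch that relies only on the <BoS>/<EoS> format shown above; the helper name syllables_from_tags is hypothetical and not part of mlphon.

```python
import re
from typing import List

# Matches the text enclosed by one <BoS> ... <EoS> pair.
SYLLABLE = re.compile(r"<BoS>(.*?)<EoS>")


def syllables_from_tags(tagged: str) -> List[str]:
    """Return the syllables enclosed by <BoS>/<EoS> tags, in order."""
    return SYLLABLE.findall(tagged)


tagged = "<BoS>സ<EoS><BoS>ഫ<EoS><BoS>ല<EoS><BoS>മീ<EoS><BoS>യാ<EoS><BoS>ത്ര<EoS>"
syllables = syllables_from_tags(tagged)
print(syllables)
# Joining the syllables gives back the original word.
print("".join(syllables))
```

The non-greedy match keeps each syllable separate even though the tags follow each other without any delimiter.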

G2P analysis and synthesis

$ mlg2p -a

Give the input

കാവ്യ

It will give you the result of g2p analysis as:

<BoS>k<plosive><voiceless><unaspirated><velar>aː<v_sign><EoS><BoS>ʋ<approximant><labiodental><virama>j<glide><palatal>a<schwa><EoS>

The details of each phoneme are given in angle brackets. The operation is bidirectional. You can retrieve the graphemes from the analysis string.
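To work with the analysis string programmatically, it can be split into syllables and then into phonemes with their tags. The sketch below assumes only the format of the sample output above; parse_analysis is a hypothetical helper, not an mlphon function, and attaching every tag to the phoneme that precedes it (so that <virama> lands on ʋ) is a simplifying assumption.

```python
import re

# Text enclosed by one <BoS> ... <EoS> pair.
SYLLABLE = re.compile(r"<BoS>(.*?)<EoS>")
# A run of non-tag characters followed by one or more <tags>.
PHONEME = re.compile(r"([^<>]+)((?:<[^<>]+>)+)")
# The name inside a single <tag>.
TAG = re.compile(r"<([^<>]+)>")


def parse_analysis(analysis):
    """Return a list of syllables; each syllable is a list of (phoneme, [tags]) pairs."""
    result = []
    for syllable in SYLLABLE.findall(analysis):
        phonemes = [(ipa, TAG.findall(tags)) for ipa, tags in PHONEME.findall(syllable)]
        result.append(phonemes)
    return result


analysis = (
    "<BoS>k<plosive><voiceless><unaspirated><velar>aː<v_sign><EoS>"
    "<BoS>ʋ<approximant><labiodental><virama>j<glide><palatal>a<schwa><EoS>"
)
for syllable in parse_analysis(analysis):
    print(syllable)
```

A nested list keeps the syllable boundaries, which would be lost if the phonemes were flattened into a single sequence.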

IPA analysis and synthesis

If the phonetic detailing is not relevant to you, a minimal mapping of the graphemes to IPA can be obtained by

$ mlipa -a

For the input

കൽക്കണ്ടം

the output would be

kal<chil>kkaɳʈam<anuswara>

Certain tags like <chil>, <anuswara>, <visaraga> are retained so that bidirectional analysis and generation are unambiguously possible.
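If you need a plain IPA string for display, the retained tags can be stripped afterwards. This is a minimal sketch with a hypothetical helper, strip_tags, that assumes tags always have the <name> form shown above. Note that the stripped string can no longer be converted back to graphemes unambiguously.

```python
import re

# Any <tag> in angle brackets, e.g. <chil> or <anuswara>.
TAG = re.compile(r"<[^<>]+>")


def strip_tags(ipa_analysis):
    """Remove all angle-bracket tags, leaving only the IPA symbols."""
    return TAG.sub("", ipa_analysis)


print(strip_tags("kal<chil>kkaɳʈam<anuswara>"))  # kalkkaɳʈam
```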

More details on its usage are available in the PyPI documentation as well as in the README of the mlphon repository.

I will post progress updates here. For a quick web demo of what mlphon does, check out https://phon.smc.org.in/

Thanks for reading 😀.

