Kavya Manohar

From Benchmarks to Benches

The Journey of Building AI for Indian Courtrooms

Posted on October 29, 2025

I haven’t posted much about my job at Adalat AI. The team I’m part of builds dictation systems for Indian courtrooms, and initially I thought it would be a straightforward extension of my PhD research. I was wrong. Unlike academic work, where success means beating SOTA on benchmark datasets, the reality at Adalat AI demanded something more: a deep understanding of India’s diverse courtroom dynamics, legal workflows, and the intricate linguistic landscape where regional languages and English constantly intertwine. [Read More]

The Ghost of ASCII Past

In Indian Language Computing

Posted on May 17, 2025

Indian language computing has evolved from ASCII-based font encoding to Unicode standardization. This article explains how text was represented in Indian languages before Unicode, the problems with ASCII-based fonts, and why Unicode became necessary. It covers various input methods developed for typing Indian languages and demonstrates how Unicode solved the compatibility issues between different systems. [Read More]
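As a small illustration of what Unicode representation means in practice, the sketch below lists the codepoints behind a short Devanagari string using Python's standard `unicodedata` module. The helper name `describe` is hypothetical and not taken from the article; it only shows that a syllable which looks like one unit is stored as a sequence of standard codepoints rather than font-specific bytes.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (codepoint, Unicode character name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# "\u0915\u093f" is the Hindi syllable "ki": the consonant KA followed by
# the dependent vowel sign I. It renders as one cluster but is two codepoints.
syllable = "\u0915\u093f"

for codepoint, name in describe(syllable):
    print(codepoint, name)
# U+0915 DEVANAGARI LETTER KA
# U+093F DEVANAGARI VOWEL SIGN I
```

Because every system agrees on these codepoints, the same text can be exchanged and searched regardless of which font is used to display it.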

Unicode Input Methods Fonts ASCII

Malayalam: Life and Praxis - Seminar Series at Tirur

Posted on February 20, 2025

A three-day National Seminar, “Malayalam: Life and Praxis”, was organized by the Tirur Regional Centre of Sree Sankaracharya University of Sanskrit during February 18-20, 2025 as a tribute to Dr. Sushama L., Professor of Malayalam Linguistics, who is retiring from her teaching career this academic year. Dr. Sushama currently serves as the Vice Chancellor of Thunchath Ezhuthachan Malayalam University. I was invited to deliver a session on “കമ്പ്യൂട്ടർ മനസ്സിലാക്കുന്ന മലയാളഭാഷ” (“The Malayalam Language That Computers Understand”). [Read More]

seminar malayalam

Recap 2024

Posted on January 1, 2025

I write this not for the world, but for me. In the first quarter of 2024, I struggled a bit to decide whether I should switch back to a full-time faculty position or continue in my current semi-academic research position, and I finally decided to stick with the latter. As the year ends, I gladly realize that the decision saved me from the mundaneness of many academic and administrative chores. [Read More]

personal

EMNLP 2024

Posted on November 13, 2024

Empirical Methods in Natural Language Processing (EMNLP) is one of the world's premier conference venues for computational linguistics. I attended the conference and presented a paper, representing the Virtual Resource Centre for Language Computing (VRCLC), the language computing centre at Kerala Digital University. Many multinational companies compete to build the best artificial intelligence models suited to the English language. Even when they claim that some of these AI models are multilingual, efforts to ensure accuracy in those other languages are often missing. Building language computing and speech AI models for non-English languages is difficult for many reasons. [Read More]

unicode multilingual nlp conferences emnlp2024

Tapasam Seminar 2024

Posted on October 3, 2024

The Tapasam Seminar, organized by the Association for Comparative Studies (താരതമ്യപഠനസംഘം) on October 1 and 2, was held at Sree Sankaracharya University of Sanskrit. The talk I presented at this seminar, on the topic “Malayalam in Unicode: Some Linguistic and Cultural Reflections”, is shared here.

seminar malayalam

Wav2Vec2-BERT+LM: Transcribing Speech and Evaluating Models using Huggingface Transformers

Posted on August 20, 2024

What is Wav2Vec2-BERT? Wav2Vec2-BERT is a successor of the popular Wav2Vec2 model, a pre-trained model for Automatic Speech Recognition (ASR). Wav2Vec2-BERT is a 580M-parameter audio model that has been pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. Following the basic architecture of Wav2Vec2, with increased pretraining data and slightly different training objectives, various models (XLSR, XLS-R and MMS) with pretrained checkpoints were released. The Wav2Vec2-BERT pretrained model was introduced in the SeamlessM4T paper by Meta in August 2023. [Read More]

malayalam speech recognition Transformer

Indian Languages and Text Normalization: Part 1

Posted on May 6, 2024

This is a two-part article. The first part covers how the normalization routine in the popular ASR engine Whisper removes essential characters, such as vowel signs in Indian languages, while evaluating performance. The second part (yet to be written) will cover various existing libraries and the approaches needed to perform proper normalization in Indian languages. Text normalization in natural language processing (NLP) refers to the conversion of different written forms of text to one standardised form. [Read More]
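To make the problem concrete, here is a minimal sketch of how a normalizer that discards Unicode combining marks damages Malayalam text. It is a simplified stand-in written for illustration, not Whisper's actual normalization code, and the function name `strip_marks` is hypothetical. It assumes only that the normalizer filters out characters whose Unicode general category starts with "M", which is the category Malayalam vowel signs fall under.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop every character whose Unicode general category starts with 'M'.

    Hypothetical helper: it mimics a mark-stripping normalizer, which is
    harmless for plain English but destructive for Indic scripts.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if not unicodedata.category(ch).startswith("M")
    )


# "Malayalam" written in Malayalam script: ma, la, ya, vowel sign AA, lla, anusvara.
word = "\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"
stripped = strip_marks(word)

print(len(word), len(stripped))  # 6 4
```

The vowel sign and the anusvara are both dropped, so the "normalized" word is no longer the original word, and any error rate computed on such text does not reflect what the model actually transcribed.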

Multilingual Normalization Indian Languages Malayalam Whisper Normalization

Call me Dr. Kavya 🤩

Posted on March 5, 2024

I was awarded a doctoral degree by APJ Abdul Kalam Technological University, Kerala, India.

You can read my thesis ‘Linguistic challenges in Malayalam speech recognition: Analysis and solutions’ here.

Pics from the graduation ceremony hosted by College of Engineering Trivandrum and APJ Abdul Kalam Technological University.

phd

Live Dictation: Malayalam speech to text using subword tokens

Posted on November 19, 2023

The research carried out as part of my PhD was centred around the linguistic challenges in Malayalam speech recognition. One of the biggest challenges in recognizing speech in morphologically complex languages is deciding how granular the text tokens should be. In the classical architecture of Automatic Speech Recognition (ASR) with word tokens, the acoustic model identifies fundamental sound units, the pronunciation lexicon maps sounds to words, and the language model predicts word sequences to convert speech to text. [Read More]

malayalam demo speech to text asr open source subword tokens