Monday, September 8, 2025

TIME100 AI 2025: Mitesh Khapra—The Only Indian Who Dared to Do

In one way or another, nearly every Indian startup working on voice technology for the country’s many languages relies on the datasets of Mitesh Khapra and his team. Khapra, an associate professor of computer science at the Indian Institute of Technology Madras, recognized early on that “the reason Indian language technology is behind English is because we do not have enough data for Indian languages,” he says.

While Western models may perform well on highly represented languages like Hindi and Bengali, they are weaker on underrepresented languages. To close the gap, Khapra’s research lab AI4Bharat led a project that took researchers to almost 500 of India’s 700 districts, recording thousands of hours of voices from people with diverse educational and socioeconomic backgrounds to capture all 22 of India’s official languages.

Co-founded in 2019, AI4Bharat became an official partner for the Indian government’s Bhashini program, which uses AI to assist citizens in accessing digital services in their own languages; AI4Bharat supplies 80% of the data with its open-source dataset. It welcomes other developers utilizing its data for their models, too. Khapra says that if big tech companies use its data to make their models better at Hindi or Marathi, “it benefits the country at large.”

AI4Bharat’s AI models have been deployed in the Indian Supreme Court to translate official documents, and to create a voice bot that regional farmers can call to report issues with their government subsidy payments. The company’s latest project involves partnering with Sarvam AI—a startup launched by two other AI4Bharat co-founders—to build India’s first foundation model for the Indian government. Even if the model initially underperforms its Western counterparts, Khapra sees its creation as essential for the country’s sovereignty. “Unless we learn that skill, we will always be in a perpetually dependent position,” he says.

Already Khapra sees his work helping to reshape his nation’s academic research. He says that “15 years back, an average PhD student in India working on language technology would end up working on English problems,” but adds that “with these datasets available, I see a shift: now Indian students are working on Indian problems.”

Professor Mitesh M. Khapra as featured in TIME Magazine’s TIME100 AI 2025

Mitesh M. Khapra

Mitesh M. Khapra is an Assistant Professor in the Department of Computer Science and Engineering at IIT Madras. His research covers Deep Learning, Multimodal Multilingual Processing, Dialog Systems, and Question Answering. He holds master's and Ph.D. degrees from IIT Bombay. He worked for over 4.5 years at IBM Research and has published over 25 papers. He was a recipient of the IBM PhD Fellowship and the Microsoft Rising Star Award. He is also a recipient of the Google Faculty Research Award, 2018.

He is also a co-founder of One Fourth Labs, whose mission is to provide high-quality AI education at affordable prices and so build a workforce capable of creating AI solutions of societal and commercial value in India. Specialties: Deep Learning, Natural Language Processing, Conversation Systems

He can be contacted on https://www.linkedin.com/in/mitesh-khapra-3bb3032/?originalSubdomain=in

One Fourth Labs

We are a technology startup founded by IIT Madras faculty members Mitesh Khapra and Pratyush Kumar. We are incubated at the IIT Madras Research Park.

We educate and research on Artificial Intelligence, specifically Deep Learning. We are also driven by contributing to nation building.

What is in a name?

One Fourth Labs is named after turiya, a philosophical construct from the Mandukya Upanishad which hypothesises that there is one background that underlies and transcends the three common states of consciousness: waking, dreaming, and deep sleep. Hence, the fourth that is the one. One Fourth.

Thinking

Artificial Intelligence is rapidly evolving. The pace of development and scale of impact of AI puts it in the unique position where education, research, and social good are converging rapidly. One Fourth Labs is founded to situate itself at the cusp of this convergence.

At the forefront, we develop learning content on AI. By 2020, we will have a course stack that spans from data science to machine learning to deep learning. We believe that a coherent treatment of these topics, one that combines mathematical insight and practical skill in equal measure, will bring much value.

We recognise that capacity building is just the start. To ensure the right employment opportunities are generated, our business organisations – big and small – need support in adopting AI at the right time and at the right scale. We work closely with organisations in this major transformation.

Finally, we have to recognise that AI has the potential to create impact at a scale disproportionate to the design and build efforts. Given India’s many severe challenges in particular, we need to pool together people and processes to create AI solutions that solve real challenges.

AI4Bharat: Building AI for India!

AI4Bharat, a research lab at IIT Madras, is dedicated to advancing AI technology for Indian languages through open-source contributions. Over the years, the lab has developed and released a wide range of datasets, tools, and state-of-the-art models. The focus areas of the lab include transliteration, natural language understanding, generation, translation, automatic speech recognition, and speech synthesis. AI4Bharat’s work is recognized globally, with publications in top-tier conferences and deployments in real-world use cases, making a significant impact across academia, industry, and government sectors.

Cutting-edge work across areas

Large Language Models: AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata, trained on extensive, diverse datasets like IndicCorpora and Sangraha.

Machine Translation: Our machine translation models, including IndicTransv2, are built on large-scale datasets mined from the web and carefully curated human translations, catering to all 22 Indian languages and competing with commercial models as validated on multiple benchmarks.

Transliteration: AI4Bharat’s transliteration models, like IndicXlit, are optimized for converting text between scripts of Indian languages and English, leveraging large-scale datasets such as Aksharantar.

Automatic Speech Recognition: Our ASR models, including IndicWav2Vec and IndicWhisper, are trained on rich datasets like Kathbath, Shrutilipi and IndicVoices, covering multiple Indian languages.

Text to Speech: AI4Bharat’s TTS efforts, exemplified by AI4BTTS, focus on creating natural-sounding synthetic voices for Indian languages using a mix of web-crawled data and carefully curated datasets like Rasa.

Optical Character Recognition: We are in the early stages of developing models and datasets for advancing Document Layout Parsing and OCR technologies to support the wide range of Indian scripts.

Pioneering Data Collection

Early on in our journey, we recognized that advancing Indian technology necessitates large-scale datasets. Thus, building and collecting extensive datasets across multiple verticals has become a critical endeavor at AI4Bharat. Thanks to generous grants from MeitY, we are spearheading pioneering efforts in data collection as part of the Data Management Unit of Bhashini. Our nationwide initiative aims to gather 15,000 hours of transcribed data from over 400 districts, encompassing all 22 scheduled languages of India. In parallel, our in-house team of over 100 translators is diligently creating a parallel corpus with 2.2 million translation pairs across 22 languages. To produce studio-quality data for expressive TTS systems, we have established recording studios in our lab, where professional voice artists contribute their expertise. Additionally, our annotators are meticulously labeling pages for Document Layout Parsing, accommodating the diverse scripts of India. To accelerate the development of Indic Large Language Models (LLMs), we are focused on building pipelines for curating and synthetically generating pre-training data, collecting contextually grounded prompts, and creating evaluation datasets that reflect India’s rich linguistic tapestry. Collecting and annotating data at this scale demands standardization of processes and tools. To meet this challenge, AI4Bharat has invested in developing various open-source data collection and annotation tools, aiming to enhance these efforts not only within India but also in multilingual regions across the globe.

Sources:

  1. https://time.com/
  2. https://www.onefourthlabs.com/
  3. https://ai4bharat.iitm.ac.in/

 
