India has begun to deliver more services digitally and to build language datasets through Bhashini, an AI-led language translation system.

The system is building an open-source database in local languages for developing AI tools. It includes a crowdsourcing initiative through which people can contribute sentences in various languages, validate audio or text transcribed by others, translate texts, and label images. Tens of thousands of Indians have contributed to Bhashini.

“The government is pushing very strongly to create datasets to train large language models in Indian languages, and these are already in use in translation tools for education, tourism and in the courts,” said Pushpak Bhattacharyya, head of the Computation for Indian Language Technology Lab in Mumbai.

“But there are many challenges: Indian languages mainly have an oral tradition, electronic records are not plentiful, and there is a lot of code mixing. Also, to collect data in less common languages is hard, and requires a special effort.”

That’s not all. For a few weeks this year, villagers in Karnataka read dozens of sentences in their native Kannada into an app as part of a project to build India’s first AI-based chatbot for tuberculosis.

Kannada is one of the country’s 22 official languages and one of 121 languages spoken by at least 10,000 people in India. But few of these 121 languages are covered by natural language processing (NLP), the branch of artificial intelligence that enables computers to understand text and spoken words. As a result, hundreds of millions of Indians are excluded from useful information and economic opportunities.

“For AI tools to work for everyone, they need to also cater to people who don’t speak English or French or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India. The quickest and most convenient way to close this gap is to build layers on top of generative AI models such as ChatGPT and Llama.

The villagers in Karnataka are among thousands of speakers of various Indian languages generating speech data for the tech firm Karya, which is building datasets for firms such as Microsoft and Google to use in AI models for healthcare, education, and other services.