BENGALURU: For a few weeks this year, villagers in the southwestern Indian state of Karnataka read out dozens of sentences in their native Kannada language into an app as part of a project to build the country's first AI-based chatbot for tuberculosis.
There are more than 40 million native Kannada speakers in India. Kannada is one of the country's 22 official languages and one of more than 121 languages spoken by 10,000 people or more in the world's most populous nation.
But few of these languages are covered by natural language processing (NLP), the branch of artificial intelligence that enables computers to understand text and spoken words.
Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities.
"For AI tools to work for everyone, they need to also cater to people who don't speak English or French or Spanish," said Kalika Bali, principal researcher at Microsoft Research India.
"But if we had to collect as much data in Indian languages as went into a large language model like GPT, we'd be waiting another 10 years. So what we can do is create layers on top of generative AI models such as ChatGPT or Llama," Bali told the Thomson Reuters Foundation.
The villagers in Karnataka are among thousands of speakers of different Indian languages generating speech data for tech firm Karya, which is building datasets for firms such as Microsoft and Google to use in AI models for education, healthcare and other services.
The Indian government, which aims to deliver more services digitally, is also building language datasets through Bhashini, an AI-led language translation system that is creating open-source datasets in local languages for use in AI tools.
The platform includes a crowdsourcing initiative for people to contribute sentences in various languages, validate audio or text transcribed by others, translate texts and label images.
Tens of thousands of Indians have contributed to Bhashini.
"The government is pushing very strongly to create datasets to train large language models in Indian languages, and these are already in use in translation tools for education, tourism and in the courts," said Pushpak Bhattacharyya, head of the Computation for Indian Language Technology Lab in Mumbai.
"But there are many challenges: Indian languages mainly have an oral tradition, electronic records are not plentiful, and there is a lot of code mixing. Also, to collect data in less common languages is hard, and requires a special effort."
ECONOMIC VALUE
Of the more than 7,000 living languages in the world, fewer than 100 are captured in major NLP systems, with English the most advanced.
ChatGPT - whose launch last year triggered a wave of interest in generative AI - is trained primarily on English. Google's Bard is limited to English, and of the nine languages that Amazon's Alexa can respond to, only three are non-European: Arabic, Hindi and Japanese.
Governments and startups are trying to bridge this gap.
Grassroots organisation Masakhane aims to strengthen NLP research in African languages, while in the United Arab Emirates, a new large language model called Jais can power generative AI applications in Arabic.
For a country like India, crowdsourcing is an effective way to collect speech and language data, said Bali, who was named among the 100 most influential people in AI by Time magazine in September.
"Crowdsourcing also helps to capture linguistic, cultural and socio-economic nuances," said Bali.
"But there has to be awareness of gender, ethnic and socio-economic bias, and it has to be done ethically, by educating the workers, paying them, and making a specific effort to collect smaller languages," she said. "Otherwise it doesn't scale."
With the rapid growth of AI, there is demand for languages "we haven't even heard of", including from academics looking to preserve them, said Karya co-founder Safiya Husain.
Karya works with non-profit organisations to identify workers who are below the poverty line, or with an annual income of less than $325, and pays them about $5 an hour to generate data - well above the minimum wage in India.
Workers own a part of the data they generate so they can earn royalties, and there is potential to build AI products for the community with that data, in areas such as healthcare and farming, Husain said.
"We see huge potential for adding economic value with speech data - an hour of Odia speech data used to cost about $3-$4, now it's $40," she said, referring to the language of eastern Odisha state.
VILLAGE VOICE
Fewer than 11% of India's 1.4 billion people speak English. Much of the population is not comfortable reading and writing, so several AI models focus on speech and speech recognition.
Google-funded Project Vaani, or voice, is collecting speech data of about 1 million Indians and open-sourcing it for use in automatic speech recognition and speech-to-speech translation.
Bengaluru-based EkStep Foundation's AI-based translation tools are used at the Supreme Court in India and Bangladesh, while the government-backed AI4Bharat centre has launched Jugalbandi, an AI-based chatbot that can answer questions on welfare schemes in several Indian languages.
The bot, named after a duet where two musicians riff off each other, uses language models from AI4Bharat and reasoning models from Microsoft, and can be accessed on WhatsApp, which is used by about 500 million people in India.
Gram Vaani, or voice of the village, a social enterprise that works with farmers, also uses AI-based chatbots to respond to questions on welfare benefits.
"Automatic speech recognition technologies are helping to mitigate language barriers and provide outreach at the grassroots level," said Shubhmoy Kumar Garg, a product lead at Gram Vaani.
"They will help empower communities which need them the most."
For Swarnalata Nayak in Raghurajpur district in Odisha, the growing demand for speech data in her native Odia has also meant a much-needed additional income from her work for Karya.
"I do the work at night, when I am free. I can provide for my family through talking on the phone," she said.
Reuters
Fri Dec 08 2023
AI tools are largely in English, European languages. - REUTERS/Filepic