Maithili Tools and Technologies
Over 75 crore people in India use the internet. However, a significant proportion of them use it only for low-friction verticals such as entertainment, news, messaging, and social media. Several barriers prevent these users from engaging with more verticals, availing more services, and truly enjoying what the internet has to offer. One of the biggest is that, although India is a country of great linguistic diversity, much of the information online (and therefore access to the internet, e-governance, e-commerce, e-banking etc.) is available mostly in English and so cannot be used by the majority of the population. Unfamiliarity with English means that Indian-language users hesitate to use these services, as they do not completely understand what is being offered to them. They would feel safer using services in their own language.
Similarly, ‘text information’ (e.g. textbooks, information on government schemes, crop advisories etc.) is mostly available only in English. Lacking English reading and writing skills, a large percentage of the population cannot tap the complete benefits of the internet revolution or avail the full benefits of government schemes.
Content creation poses another major problem, as most users face friction when typing and viewing Indian-language text. The Internet today is thus severely deficient in content for Indian languages. In fact, 53% of non-Internet users in India state that they would start using the Internet if it had content available in their mother tongues. The lack of local-language content is particularly stark in the multimedia domain, which consists of videos, podcasts and digital assistants.
Therefore, in India’s new and emerging digital ecosystems, we must find novel ways to engage with citizens who communicate in varied languages and dialects. The objective is to make information available to the people in their native language in order to be a “truly” connected nation.
On the one hand, over the last decade, the enormous penetration of the Internet in the country means that a major chunk of this multilingual population is actively seeking to consume and interact with online content in their own languages. With rising smartphone adoption, cheap mobile data, expanding Wi-Fi services in villages and improving digital literacy, India has an unprecedented opportunity to create a blueprint for building the Internet for local languages.
On the other hand, over the last decade, automatic speech recognition, text-to-speech synthesis, machine translation and optical character recognition (OCR) technologies have improved dramatically worldwide, thanks to data-driven Artificial Intelligence (AI), specifically deep learning.
We too have an opportunity to leverage AI and deep learning technologies. However, the key factor in building solutions using deep learning is access to multilingual data.
Progress in Indian language technologies has been hampered by the lack of large, high-quality datasets to train state-of-the-art AI models. Curated and validated contributions are urgently required so that they can go into an open source repository, where they can be harnessed and put to use by the entire ecosystem.
Bhashini, also known as the National Language Translation Mission (NLTM), has been envisaged to take advantage of this opportunity. Bhashini aims to build a distinct Indian language technology platform, related services and products, by leveraging the power of artificial intelligence.
It intends to build advanced open source datasets and models in language technologies. These will cover the areas of Input Tools, Machine Translation (MT), Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and Optical Character Recognition (OCR). In addition, the project will focus on building datasets and models for natural language understanding (NLU) tasks in Indian languages, such as Named Entity Recognition (NER), Sentiment Analysis, Question Answering (QA) and Summarization.
The mission of Bhashini is to: “[Harness natural language technologies to] create a knowledge-based society by transcending the language barriers and providing content and services to citizens, in their own language, both in the form of speech and text.”
Objectives
- To substantially increase the content in Indian languages on the Internet in domains of public interest, particularly governance and policy, science and technology, etc.
- To create and nurture an ecosystem in which start-ups and Central/State government agencies work together to develop and deploy innovative products and services in Indian languages
- To build a high-quality speech-to-speech machine translation (SSMT) system for major Indian languages
This would result in the creation of a knowledge-based society where information is freely and readily available.
Some specific examples of possible uses are:
- A translation system to effectively translate government documents from English/Hindi to Maithili or books in Maithili to Bangla, Odia or Tamil, while minimizing the turnaround time and human effort involved.
- A system to evaluate the oral reading fluency (ORF) for primary school children in their mother tongue (Maithili)
- A voice-based mobile payment application to allow all Maithils to make secure payments using Maithili
- Speech-based access to agricultural commodity prices and weather information for Maithil farmers, with information retrieved from the Mandi System sent through voice SMS
- An OCR system to convert scanned Maithili books to regular texts for further use and processing
- An application to summarize texts written in Maithili
- An application for ‘text to speech’ in Maithili, for audiobooks or for visually impaired persons
- Application for ‘speech to text’ or digital assistants in Maithili
- A large number of tools such as taggers, spell checkers, grammar checkers, sorting utilities etc.
Aripana Foundation is working on the project in collaboration with AI4Bharat, IIT Madras, to build high-quality datasets in Maithili. These will be used to train state-of-the-art AI models and to develop automatic speech recognition, text-to-speech synthesis, machine translation, optical character recognition (OCR) and related technologies in Maithili.
This, in turn, will enable the development of various applications for the public and private sectors, NGOs and end users.
Our team working on the project comprises 10 English<->Maithili translators, 5 Hindi<->Maithili translators, 5 typists and proofreaders, 5 transcribers, and 20 voice-data recorders, spread over 10 districts of North Bihar as well as cities such as Delhi, Bangalore, Ranchi and Chennai.