Published 12:57 IST, September 23rd 2020
IIT-Madras, AI4Bharat develop AI models to process texts in 11 Indian languages
IIT-Madras has developed artificial intelligence (AI) models and datasets in association with AI4Bharat that can process texts in 11 major Indian languages.
The Indian Institute of Technology Madras (IIT-M), in association with AI4Bharat, has developed artificial intelligence (AI) models and datasets that can process text in 11 major Indian languages from the Indo-Aryan and Dravidian families. The languages covered are Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. Indian English is also supported, bringing the total to 12 languages.
The tools will help computers process text in Indian languages, which in turn will help learners, industry, and start-ups work and innovate more efficiently, said one of the professors involved in the project. Such tools already exist for English but are lacking for Indian languages, and this project aims to fill that gap. To address the problem, IIT-M and AI4Bharat released IndicNLPSuite, a collection of resources and models for Indian languages that includes IndicCorp, IndicFT, IndicBERT, and IndicGLUE.
How does it work?
The monolingual corpora contain a total of 8.9 billion tokens across the 11 Indian languages and Indian English, primarily sourced from news crawls. The word embeddings are based on FastText, which makes them well suited to the morphological complexity of Indian languages. The pre-trained language models are based on ALBERT, which was chosen because it is compact and therefore easier to use in downstream tasks.
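For readers curious why FastText-style embeddings cope well with rich morphology, the short Python sketch below illustrates the general idea of character n-grams: a word is broken into overlapping fragments, so inflected forms of the same root share many fragments. This is an illustration only, not code from IndicNLPSuite. The function name char_ngrams and the 3-to-6 length range are assumptions chosen for the example, and it splits on Unicode code points, which is a simplification for Indic scripts.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return FastText-style character n-grams for a single word.

    Hypothetical helper for illustration; not part of IndicNLPSuite.
    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from word-internal fragments.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# With n fixed at 3, "where" yields five trigrams.
print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Two inflected forms of one root share several fragments, which is
# why a subword model can relate them even if one form is rare.
shared = set(char_ngrams("playing")) & set(char_ngrams("played"))
print(sorted(shared))
```

In a subword model of this kind, a word's vector is built from the vectors of its fragments, so a rare or unseen inflection can still receive a sensible representation from the fragments it shares with more common forms.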
"Lastly, the IndicGLUE benchmark for Indian language NLU contains datasets for the following tasks: Article Genre Classification, Headline Prediction, Named Entity Recognition, Cross-lingual Sentence Retrieval, Wikipedia Section-Title Prediction, and Cloze-style Multiple-choice QA," said the researchers in their study published on the AI4Bharat website.
(Image Credit: AI4Bharat/Website)
Updated 12:56 IST, September 23rd 2020