IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

Date: 13th May 2021

Time: 02:00 PM

Venue: Online

Details

In this work, we seek to improve the state of Indic NLP by building IndicNLPSuite, a collection of core datasets, models and benchmarks for Indic languages. As part of IndicNLPSuite, we introduce NLP resources for 11 major Indian languages. These resources include: (a) IndicCorp: large-scale sentence-level monolingual corpora, (b) IndicFT: pre-trained word embeddings, (c) IndicBERT: a pre-trained language model, and (d) IndicGLUE: a benchmark containing multiple NLU evaluation datasets. IndicCorp is mined using webcorpus, a scalable software pipeline for mining text from the web. IndicCorp contains a total of 8.8 billion tokens across all 11 languages and Indian English, an average 9-fold increase over the previously largest corpora for Indic languages. The word embeddings are based on FastText and hence are suitable for handling the morphological complexity of Indian languages. The pre-trained language models are based on the ALBERT model and hence are compact and efficient to use. Lastly, we compile the IndicGLUE benchmark for Indian language NLU by creating datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple-choice QA, Winograd NLI, and COPA. We also include publicly available datasets for some Indic languages for tasks such as Named Entity Recognition, Cross-lingual Sentence Retrieval, and Paraphrase Detection. Our embeddings significantly outperform existing pre-trained embeddings on 4 out of 5 tasks, whereas our language model, with 10x fewer parameters, achieves performance on par with or better than other models. We also perform analysis and provide strategies for building language-specific language models depending on the resource size of the language. We hope that the availability of these tools, datasets, and models will accelerate Indic NLP research, which has the potential to impact more than a billion people.
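
The abstract's point that FastText-based embeddings suit morphologically rich languages can be illustrated with a minimal sketch of character n-gram extraction, the subword idea FastText builds on. This is not the IndicFT code: the function names `char_ngrams` and `shared_ngrams` are hypothetical, the n-gram range of 3 to 6 is an assumed setting, and the Hindi word pair is only an illustrative example.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers.

    Illustrative helper (hypothetical name); the 3..6 range is an assumed setting.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_ngrams(a, b, min_n=3, max_n=6):
    """Return the set of n-grams that two word forms have in common."""
    return set(char_ngrams(a, min_n, max_n)) & set(char_ngrams(b, min_n, max_n))


if __name__ == "__main__":
    # Two inflected forms of the same Hindi noun (singular and plural of "book").
    singular, plural = "किताब", "किताबें"
    common = shared_ngrams(singular, plural)
    print(f"{len(common)} shared n-grams:", sorted(common))
```

Because inflected forms share many subword units, a model that composes word vectors from n-gram vectors can relate a rare or unseen form to its more frequent variants.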

Speakers

Divyanshu Kakwani (CS18S005)

Department of Computer Science and Engineering