Summary of L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages, by Aishwarya Mirashi et al.

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

by Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi

First submitted to arXiv on: 4 Jan 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary This paper introduces L3Cube-IndicNews, a multilingual text classification corpus for Indian regional languages. It covers news headlines and articles in 10 prominent Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each language's data spans multiple classes of news articles. The corpus is split into three sub-datasets by text length: Short Headlines Classification (SHC), Long Document Classification (LDC), and Long Paragraph Classification (LPC). Labels are consistent across the three, which enables length-based analysis. The authors evaluate each dataset using four models, including monolingual BERT, Indic Sentence BERT, and IndicBERT. The work expands the pool of available text classification datasets, and the label overlap among languages also makes cross-lingual analysis possible.
Low	GrooveSquid.com (original content)	Low Difficulty Summary This paper creates a large collection of news articles in many Indian languages. The goal is to help machines understand these languages better. The authors collected news headlines and articles in 10 major Indian languages, labeled them by news category, and grouped them into three sets by length: short headlines, long paragraphs, and long documents. They tested four language models on these sets to see how well each one works. This project helps us learn more about these languages and can be used for tasks such as understanding news in many languages.
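
The point about label overlap enabling cross-lingual analysis can be illustrated with a small sketch. The label names below are hypothetical placeholders rather than the actual L3Cube-IndicNews categories, and the helper functions are illustrative assumptions, not part of the paper's code. The idea is simply that only labels shared between languages can be compared directly.

```python
# Hypothetical label sets for illustration only; the real
# L3Cube-IndicNews categories for each language may differ.
LABELS = {
    "hindi": {"sports", "entertainment", "business", "politics"},
    "marathi": {"sports", "entertainment", "politics", "lifestyle"},
    "tamil": {"sports", "business", "politics", "technology"},
}


def shared_labels(labels_by_language):
    """Return the labels that appear in every language's label set."""
    label_sets = list(labels_by_language.values())
    if not label_sets:
        return set()
    return set.intersection(*label_sets)


def pairwise_overlap(labels_by_language):
    """Map each (language, language) pair to the labels they share."""
    languages = sorted(labels_by_language)
    return {
        (a, b): labels_by_language[a] & labels_by_language[b]
        for i, a in enumerate(languages)
        for b in languages[i + 1:]
    }


if __name__ == "__main__":
    # Labels usable for a comparison across all three languages.
    print(sorted(shared_labels(LABELS)))
    # Labels usable for each two-language comparison.
    for pair, labels in pairwise_overlap(LABELS).items():
        print(pair, sorted(labels))
```

Under these assumed label sets, a model trained on one language could only be compared against another language on the classes both share, which is why overlapping labels matter for this kind of analysis.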

Keywords

* Artificial intelligence * BERT * Classification * Text classification
