Summary of L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages, by Aishwarya Mirashi et al.


L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

by Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi

First submitted to arXiv on: 4 Jan 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty Written by Summary
High Paper authors High Difficulty Summary
Read the original abstract here
Medium GrooveSquid.com (original content) Medium Difficulty Summary
This paper introduces L3Cube-IndicNews, a multilingual text classification corpus focused on Indian regional languages. The corpus consists of news headlines and articles in 10 prominent Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each language covers multiple classes of news articles. The corpus is organized into three datasets that differ in document length: Short Headlines Classification (SHC) for headlines, Long Paragraph Classification (LPC) for sub-articles, and Long Document Classification (LDC) for full articles. Because labels are consistent across the three datasets, the effect of text length on classification can be analyzed directly. The authors evaluate each dataset using four models, including monolingual BERT, Indic Sentence BERT, and IndicBERT. This work expands the pool of available text classification datasets, and the label overlap among languages also makes cross-lingual analysis possible.
Low GrooveSquid.com (original content) Low Difficulty Summary
This paper creates a big collection of news articles in many Indian languages. The goal is to help machines understand these languages better. The authors collected news headlines and articles in 10 important Indian languages and labeled each one with its news category. They built three versions of the data based on text length: short headlines, long paragraphs, and full-length documents. They tested four language models on the data to see how well they work. This project helps us learn more about these languages and can be used for things like understanding news in many languages.
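
To make the three length-based datasets in the summaries above more concrete, here is a minimal Python sketch of how one labeled news article could yield an SHC, an LPC, and an LDC example that all share the same label. It is an illustration only, not the authors' pipeline: the `NewsArticle` type, the `length_based_views` function, and the choice of the first paragraph as the "sub-article" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class NewsArticle:
    """Hypothetical record for one labeled news article."""
    headline: str
    body: str   # full article text; paragraphs assumed to be separated by blank lines
    label: str  # news category, e.g. "sports"


def length_based_views(article: NewsArticle) -> dict[str, tuple[str, str]]:
    """Return one (text, label) pair per dataset type, all sharing the same label."""
    paragraphs = [p.strip() for p in article.body.split("\n\n") if p.strip()]
    # Assumption: the sub-article used for LPC is simply the first paragraph.
    sub_article = paragraphs[0] if paragraphs else ""
    return {
        "SHC": (article.headline, article.label),  # short headline
        "LPC": (sub_article, article.label),       # long paragraph (sub-article)
        "LDC": (article.body, article.label),      # long document (full article)
    }


if __name__ == "__main__":
    example = NewsArticle(
        headline="Team wins final",
        body="First paragraph of the report.\n\nSecond paragraph of the report.",
        label="sports",
    )
    for name, (text, label) in length_based_views(example).items():
        print(name, label, len(text))
```

Because every view carries the same label, a classifier's accuracy on headlines, paragraphs, and full documents can be compared directly, which is the kind of length-based analysis the summaries describe.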

Keywords

* Artificial intelligence  * BERT  * Classification  * Text classification