Summary of IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language, by Lucky Susanto et al.


IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

by Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya

First submitted to arXiv on: 27 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This research paper introduces IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset that focuses on texts targeting vulnerable groups in Indonesia during the presidential election. The dataset comprises 43,692 entries annotated by 19 diverse individuals. It is meant to address the urgent need for effective detection mechanisms after a ten-fold increase in the online hate speech ratio over the past two years. The authors establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. They also demonstrate that incorporating demographic information can enhance the zero-shot performance of the large language model gpt-3.5-turbo.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper creates a special kind of dataset to help identify online hate speech in Indonesia. It is a big collection of labeled texts that computers can use to learn how to spot mean and hurtful language. The goal is to make it easier to detect hate speech, especially against vulnerable groups such as Shia, LGBTQ, and ethnic minority communities. The researchers tested models on this dataset and found that one model in particular did a pretty good job (macro-F1 score of 0.78). They also showed that giving a model demographic information can make it work better in some cases.
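
Both summaries quote a macro-F1 score without saying what the metric measures. The sketch below is a minimal illustration, assuming the standard definition of macro-F1 (the unweighted mean of the per-class F1 scores). The function names and the toy labels are hypothetical and are not taken from IndoToxic2024 or from the authors' code.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 score when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when the class never appears
    # in either the labels or the predictions.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy binary task (1 = hate speech, 0 = not hate speech); made-up labels.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")                   # accuracy = 0.750
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")   # macro-F1 = 0.667
```

In this toy example the classifier scores well on the common class and poorly on the rare one, so macro-F1 comes out lower than accuracy. That sensitivity to the minority class is a common reason for preferring macro-F1 on imbalanced tasks such as hate speech detection.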

Keywords

» Artificial intelligence  » BERT  » Classification  » F1 score  » GPT  » Large language model  » Zero-shot