
Summary of MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts, by Dominik Macko et al.


MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts

by Dominik Macko, Jakub Kopal, Robert Moro, Ivan Srba

First submitted to arXiv on: 18 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty: High
Written by: Paper authors
High Difficulty Summary: Read the original abstract here

Summary difficulty: Medium
Written by: GrooveSquid.com (original content)
Medium Difficulty Summary: Recent advances in Large Language Models (LLMs) allow them to generate high-quality multilingual texts that are indistinguishable from authentic human-written ones. However, most research in machine-generated text detection has focused on the English language and on longer texts such as news articles, scientific papers, or student essays. Social-media texts are shorter and more informal, often featuring grammatical errors, emoticons, and hashtags, and the ability of existing methods to detect such texts has not been well studied. To address this gap, the authors propose MultiSocial, the first multilingual (22 languages) and multi-platform (5 social-media platforms) dataset for benchmarking machine-generated text detection in the social-media domain. The dataset contains 472,097 texts, of which approximately 58k are human-written and about the same amount is generated by each of 7 multilingual LLMs. The authors use this benchmark to compare existing detection methods in both zero-shot and fine-tuned forms. The results indicate that detectors can be fine-tuned successfully on social-media texts and that the choice of platform used for training matters.

Summary difficulty: Low
Written by: GrooveSquid.com (original content)
Low Difficulty Summary: Detecting machine-generated text is a challenging task, especially on social media, where texts are short and informal. Most current methods are designed for longer texts like news articles or scientific papers, which does not reflect the way people communicate on social media. To fill this gap, researchers have created a new dataset called MultiSocial that contains over 472,000 texts from 5 different social-media platforms in 22 languages. The dataset includes both human-written and machine-generated texts, which can be used to test how well detection methods work. The results show that detectors can be fine-tuned on social-media texts and that the platform chosen for training affects their performance. This suggests that differences between platforms influence how well machine-generated text can be detected.
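
The dataset composition quoted in the summaries above can be sanity-checked with simple arithmetic. The following is a minimal Python sketch that uses only the figures stated above (472,097 texts in total, 7 LLMs plus one human-written group). The variable names are illustrative, and the even split across sources is an assumption, since the summary only says "about the same amount".

```python
# Figures taken from the summary above.
TOTAL_TEXTS = 472_097
NUM_LLMS = 7

# One human-written group plus one machine-generated group per LLM.
num_sources = NUM_LLMS + 1

# Assumption: texts are split roughly evenly across all sources.
avg_per_source = TOTAL_TEXTS / num_sources

# Share of machine-generated texts under that even-split assumption.
machine_share = NUM_LLMS / num_sources

print(f"sources: {num_sources}")
print(f"average texts per source: {avg_per_source:,.0f}")
print(f"machine-generated share: {machine_share:.1%}")
```

Under this even-split assumption, each source contributes roughly 59k texts, which is in line with the "approximately 58k" human-written figure, and about seven in every eight texts are machine-generated.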

Keywords

» Artificial intelligence  » Fine tuning  » Zero shot