Summary of Cross-lingual Named Entity Corpus For Slavic Languages, by Jakub Piskorski et al.

Cross-lingual Named Entity Corpus for Slavic Languages

by Jakub Piskorski, Michał Marcińczuk, Roman Yangarber

First submitted to arXiv on: 30 Mar 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary A novel corpus is introduced, featuring manual annotations of named entities across six Slavic languages. This effort builds upon a series of shared tasks conducted within the Workshops on Slavic Natural Language Processing from 2017 to 2023. The corpus comprises 5,017 documents covering seven topics, annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. Two train-tune dataset splits are provided: single topic out and cross-topics. Benchmarks are set using transformer-based neural networks built on pre-trained multilingual models: XLM-RoBERTa-large for named entity recognition and categorization, and mT5-large for lemmatization and linking.
Low	GrooveSquid.com (original content)	Low Difficulty Summary This paper presents a collection of documents in six Slavic languages in which named entities have been labeled by hand. It grew out of a series of shared tasks run at workshops between 2017 and 2023. The corpus has 5,017 documents covering seven topics, annotated with five types of named entities. Each entity is described by its category, its base form (lemma), and an identifier that is shared across languages. Two ways to split the data are presented: holding out a single topic, or mixing documents across topics. To set baseline results on the annotated data, the researchers used transformer models that were pre-trained on many languages.
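
To make the description above more concrete, here is a minimal Python sketch of how an annotated entity (category, lemma, cross-lingual identifier) and a "single topic out" split could be represented. It is only an illustration: the class names, field names, topic names, language codes, the `LOC` label, and the identifier format are assumptions made for this example and are not taken from the paper or its released data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityMention:
    surface: str    # text as it appears in the document
    category: str   # one of the corpus's five entity classes (label is hypothetical)
    lemma: str      # base form of the mention
    entity_id: str  # cross-lingual identifier (format is hypothetical)


@dataclass
class Document:
    doc_id: str
    language: str   # illustrative language code
    topic: str      # illustrative topic name
    mentions: list[EntityMention]


def single_topic_out(docs, held_out_topic):
    """Hold out every document on one topic and keep the rest for training."""
    train = [d for d in docs if d.topic != held_out_topic]
    held_out = [d for d in docs if d.topic == held_out_topic]
    return train, held_out


# Toy data: the same real-world entity mentioned in two languages
# shares one cross-lingual identifier.
docs = [
    Document("d1", "pl", "topic_a",
             [EntityMention("Warszawie", "LOC", "Warszawa", "LOC-Warsaw")]),
    Document("d2", "cs", "topic_a",
             [EntityMention("Varšavě", "LOC", "Varšava", "LOC-Warsaw")]),
    Document("d3", "pl", "topic_b", []),
]

train, held_out = single_topic_out(docs, "topic_b")
```

In this sketch the two mentions have different surface forms and lemmas but the same `entity_id`, which is the role the cross-lingual identifier plays in the summary. How the held-out topic is actually used in the paper's splits is not specified here, so the function simply returns both parts.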

Keywords

* Artificial intelligence
* Lemmatization
* Named entity recognition
* Natural language processing
* Transformer
