Summary of Cross-lingual Named Entity Corpus for Slavic Languages, by Jakub Piskorski et al.
Cross-lingual Named Entity Corpus for Slavic Languages
by Jakub Piskorski, Michał Marcińczuk, Roman Yangarber
First submitted to arXiv on: 30 Mar 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract (not reproduced here) |
Medium | GrooveSquid.com (original content) | A new corpus is introduced with manual annotations of named entities in six Slavic languages. It builds on a series of shared tasks held at the Workshops on Slavic Natural Language Processing from 2017 to 2023. The corpus comprises 5,017 documents on seven topics, annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. Two train-tune dataset splits are provided: single topic out and cross-topics. Benchmarks are set with transformer-based neural networks built on pre-trained multilingual models: XLM-RoBERTa-large for named entity recognition and categorization, and mT5-large for lemmatization and linking. |
Low | GrooveSquid.com (original content) | This paper presents a collection of texts in six Slavic languages in which named entities (names of specific things, such as people or places) have been labeled by hand. It grew out of shared tasks that many teams took part in over several years. The corpus has about 5,000 documents on seven topics, labeled with five types of named entities. Each entity comes with its category, its base form (lemma), and an identifier that links it to the same entity in other languages. The data can be split in two ways: single topic out and cross-topics. To provide benchmark results on the annotated data, the researchers used multilingual transformer models. |
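
To make the entity description and the single-topic-out split more concrete, here is a minimal Python sketch. It is not the corpus's actual file format or the authors' code. The class and field names, the label `LOC`, the identifier `LOC-Warsaw`, the topic names, and the example documents are all hypothetical. It also assumes that "single topic out" means holding out all documents of one topic for tuning and training on the rest.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    mention: str           # surface form as it appears in the text
    category: str          # one of the five entity classes (label name assumed)
    lemma: str             # base form of the mention
    cross_lingual_id: str  # identifier shared by the same entity across languages


@dataclass(frozen=True)
class Document:
    doc_id: str
    language: str
    topic: str
    entities: Tuple[Entity, ...] = ()


def single_topic_out(
    docs: List[Document], held_out_topic: str
) -> Tuple[List[Document], List[Document]]:
    """Assumed reading of 'single topic out': tune on one topic, train on the rest."""
    train = [d for d in docs if d.topic != held_out_topic]
    tune = [d for d in docs if d.topic == held_out_topic]
    return train, tune


# Hypothetical documents: the same city mentioned in two languages
# receives different lemmas but one shared cross-lingual identifier.
docs = [
    Document("d1", "pl", "topic_a",
             (Entity("Warszawie", "LOC", "Warszawa", "LOC-Warsaw"),)),
    Document("d2", "cs", "topic_a",
             (Entity("Varšavě", "LOC", "Varšava", "LOC-Warsaw"),)),
    Document("d3", "pl", "topic_b"),
]

train, tune = single_topic_out(docs, "topic_b")
```

Under these assumptions, `train` holds the two `topic_a` documents and `tune` holds the single `topic_b` document. The shared `cross_lingual_id` is what would let a linking model treat the Polish and Czech mentions as the same entity.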
Keywords
» Artificial intelligence » Lemmatization » Named entity recognition » Natural language processing » Transformer