Summary of MERA: A Comprehensive LLM Evaluation in Russian, by Alena Fenogenova et al.
MERA: A Comprehensive LLM Evaluation in Russian
by Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, Sergei Markov
First submitted to arXiv on: 9 Jan 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper introduces MERA (Multimodal Evaluation of Russian-language Architectures), a new instruction benchmark for evaluating foundation models oriented towards the Russian language. MERA is designed as a black-box test to prevent data leakage and includes 21 evaluation tasks for generative models across 11 skill domains. The authors propose an evaluation methodology, an open-source code base, and a leaderboard with a submission system for MERA assessment. They evaluate open language models as baselines and find that they still fall far short of human-level performance (a rough sketch of such a black-box evaluation loop is given after the table).
Low | GrooveSquid.com (original content) | This paper makes a new benchmark to test AI language models. It's like a big quiz that checks how well these models can do certain tasks, like generating text. The benchmark is special because it uses real Russian-language instructions, not made-up ones. This helps us understand how well the models work and what they're good at. The authors also compare the models to human-level performance and find that there's still a lot of room for improvement.
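The medium summary describes MERA as a black-box instruction benchmark: the model is treated as an opaque text-in/text-out system, and its outputs are scored against held-out references. The sketch below is only a generic illustration of that kind of loop; the `Task` structure, the `evaluate_black_box` helper, and the exact-match scoring are assumptions made for the example and are not taken from the actual MERA code base or paper.

```python
# Hypothetical sketch of a black-box benchmark loop in the spirit described
# above: the harness only sees the model's text outputs, and the references
# are kept on the evaluation side so test answers never leak to the model.
# All names and the exact-match metric are illustrative assumptions, not the
# actual MERA implementation.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str                 # e.g. one of the 21 evaluation tasks
    skill_domain: str         # e.g. one of the 11 skill domains
    prompts: List[str]        # instruction-style inputs shown to the model
    references: List[str]     # gold answers, held out from the model


def evaluate_black_box(
    generate: Callable[[str], str],   # the model as an opaque text-in/text-out function
    tasks: List[Task],
) -> Dict[str, float]:
    """Score a generative model on each task with simple exact-match accuracy."""
    scores: Dict[str, float] = {}
    for task in tasks:
        hits = 0
        for prompt, reference in zip(task.prompts, task.references):
            prediction = generate(prompt).strip().lower()
            hits += int(prediction == reference.strip().lower())
        scores[task.name] = hits / max(len(task.prompts), 1)
    return scores


if __name__ == "__main__":
    # Toy usage: a trivial "model" evaluated on a single toy task.
    toy_task = Task(
        name="toy_arithmetic",
        skill_domain="math",
        prompts=["2 + 2 = ?"],
        references=["4"],
    )
    print(evaluate_black_box(lambda prompt: "4", [toy_task]))
```

In a leaderboard setting like the one the summary describes, the `generate` callable would be replaced by the submitted model's outputs, and the per-task scores would be aggregated into the public ranking; real benchmarks typically use task-specific metrics rather than plain exact match.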