
Summary of MERA: A Comprehensive LLM Evaluation in Russian, by Alena Fenogenova et al.


MERA: A Comprehensive LLM Evaluation in Russian

by Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, Sergei Markov

First submitted to arxiv on: 9 Jan 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper introduces a new instruction benchmark for evaluating foundation models oriented towards the Russian language. The Multimodal Evaluation of Russian-language Architectures (MERA) is designed as a black-box test to prevent data leakage and includes 21 evaluation tasks for generative models across 11 skill domains. The authors provide an evaluation methodology, an open-source code base, and a leaderboard with a submission system for MERA assessment (a rough sketch of such a submission flow appears after the summaries). They evaluate open language models as baselines and find that they still fall far short of human-level performance.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper creates a new benchmark to test AI language models. It's like a big quiz that checks how well these models can handle certain tasks, like following instructions and generating text. The benchmark is special because it uses real Russian-language instructions, not made-up ones. This helps us understand how well the models work and what they're good at. The authors also compare the models to human-level performance and find that there's still a lot of room for improvement.

Keywords

* Artificial intelligence