Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation
by Zhaokun Jiang, Qianxi Lv, Ziyin Zhang, Lei Lei
First submitted to arXiv on: 10 Jan 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The study compares large language models like ChatGPT to neural machine translation (NMT) systems, investigating how well automated metrics align with human evaluation in assessing translation quality. Four automated metrics are used for the automatic assessment (a metric sketch follows this table), while the human evaluation applies a detailed error typology and six rubrics. The results show that automated metrics converge with human evaluation when measuring formal fidelity, but diverge when evaluating semantic and pragmatic fidelity, highlighting the continued importance of human judgment in evaluating advanced translation tools. |
| Low | GrooveSquid.com (original content) | This study compares big language models like ChatGPT to special machines that translate languages (NMT). The researchers want to know whether the computer's ways of judging translation quality match what humans think. They use four computer scoring methods and a special rating system for people. The results show that computers and humans agree on how well the translations get the form right, but disagree on how well they capture the meaning and tone. |
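To make "automated metrics" concrete, below is a minimal sketch of corpus-level translation scoring with the sacrebleu library. The summaries above do not name the paper's four metrics, so the choice of BLEU and chrF here, and the toy sentence pairs, are illustrative assumptions rather than the authors' actual setup.

```python
# Minimal sketch: scoring machine-translation output with automated metrics.
# Requires the sacrebleu package (pip install sacrebleu). BLEU and chrF are
# chosen only for illustration -- the paper's four metrics are not named in
# the summaries above.
import sacrebleu

# Hypothetical example data: system outputs and one reference per sentence.
hypotheses = [
    "The cat sits on the mat.",
    "She went to the market yesterday.",
]
references = [
    "The cat is sitting on the mat.",
    "She went to the market yesterday.",
]

# sacrebleu expects a list of reference streams, hence the extra nesting:
# each inner list is one full set of references aligned with the hypotheses.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.2f}")
print(f"chrF: {chrf.score:.2f}")
```

Scores like these capture surface overlap with a reference, which is why, per the findings above, they track human judgments of formal fidelity more closely than judgments of semantic and pragmatic fidelity, where annotators apply an error typology and rubric scores that no single automatic number reproduces.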
Keywords
» Artificial intelligence » Translation