Summary of Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness, by Wenxuan Wang
Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness
by Wenxuan Wang
First submitted to arXiv on: 31 Aug 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This PhD thesis examines the reliability of large language models (LLMs), such as ChatGPT, which have become increasingly popular due to their conversational abilities. Despite their capabilities, LLMs often produce content containing factual errors, bias, and toxicity, which can negatively affect a wide range of applications. To address this, the thesis evaluates the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. It introduces several evaluation frameworks: FactChecker and LogicAsker for testing factual knowledge and logical reasoning, respectively; two red-teaming methods for assessing non-toxicity; and BiasAsker and XCulturalBench for measuring social bias and cultural bias, respectively. |
| Low | GrooveSquid.com (original content) | This PhD thesis looks at the reliability of big language models. These models can have problems like giving out wrong facts, being biased, or saying mean things. This is a big deal because so many people use these models every day. The researchers want to figure out how to make sure these models are correct, nice, and fair. They do this by testing the models in different ways. First, they test if the models know the right facts and can reason logically. Then, they check if the models say mean or unfair things. Finally, they look at whether the models have biases against certain groups of people. |
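
To make the testing idea concrete, here is a minimal, illustrative sketch of triple-based factual probing in the spirit of the correctness testing described above: known facts are turned into questions, the model is queried, and its answers are checked against the expected entity. Everything in the sketch (the `TRIPLES` data, the `TEMPLATES` dictionary, and the `ask_model` callable) is a hypothetical stand-in, not the thesis's actual FactChecker implementation.

```python
# Illustrative sketch only: a crude triple-based factual-correctness probe.
# All names below (TRIPLES, TEMPLATES, ask_model) are hypothetical stand-ins,
# not the thesis's actual code or data.
from typing import Callable, List, Tuple

# Toy knowledge triples (subject, relation, expected object) used to build questions.
TRIPLES: List[Tuple[str, str, str]] = [
    ("Paris", "capital_of", "France"),
    ("Mount Everest", "located_in", "Nepal"),
]

# One question template per relation type.
TEMPLATES = {
    "capital_of": "Which country is {subject} the capital of?",
    "located_in": "Which country is {subject} located in?",
}


def evaluate(ask_model: Callable[[str], str]) -> float:
    """Ask the model one question per triple and count answers that
    mention the expected object (a deliberately simple correctness check)."""
    correct = 0
    for subject, relation, expected in TRIPLES:
        question = TEMPLATES[relation].format(subject=subject)
        answer = ask_model(question)
        if expected.lower() in answer.lower():
            correct += 1
    return correct / len(TRIPLES)


if __name__ == "__main__":
    # A canned stand-in "model" so the sketch runs end to end without any API.
    canned = {
        "Which country is Paris the capital of?": "Paris is the capital of France.",
        "Which country is Mount Everest located in?": "It lies in Nepal, near the border with China.",
    }
    accuracy = evaluate(lambda q: canned.get(q, ""))
    print(f"factual accuracy on toy probes: {accuracy:.2f}")
```

A real evaluation along these lines would draw facts from a large knowledge base, use many more question templates, and score answers more robustly than simple substring matching.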
Keywords
* Artificial intelligence
* Natural language processing