Summary of Chemsafetybench: Benchmarking Llm Safety on Chemistry Domain, by Haochen Zhao et al.
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
by Haochen Zhao, Xiangru Tang, Ziran Yang, Xiao Han, Xuanzhi Feng, Yueqing Fan, Senhao Cheng, Di Jin, Yilun Zhao, Arman Cohan, Mark Gerstein
First submitted to arxiv on: 23 Nov 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The advancement of large language models (LLMs) in scientific research assistance has been remarkable, but concerns about their accuracy and safety have grown. To address these issues in chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench comprises three tasks: querying chemical properties, assessing the legality of chemical uses, and describing synthesis methods, each requiring increasing chemical knowledge. Our dataset has over 30K samples across various chemical materials. We use handcrafted templates and advanced jailbreaking scenarios to enhance task diversity. Our automated evaluation framework assesses the safety, accuracy, and appropriateness of LLM responses. Experimental results with state-of-the-art LLMs reveal strengths and vulnerabilities, underscoring the need for robust safety measures. ChemSafetyBench aims to be a pivotal tool in developing safer AI technologies in chemistry. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Imagine using artificial intelligence (AI) to help you with science research. Sounds great, but what if the AI gives you incorrect or even dangerous information? To fix this problem in chemistry, we created ChemSafetyBench, a special test to see how well AI models can answer questions safely and correctly. We made three types of tasks: asking about chemical properties, checking if chemicals are allowed to be used, and explaining how to make new chemicals. Our dataset has thousands of examples from different areas of chemistry. We also came up with special ways to make the tasks more interesting and challenging. A computer program evaluates how well AI models do on these tasks, looking at safety, accuracy, and appropriateness. By testing state-of-the-art AI models, we found some good points but also some weaknesses that need to be fixed. ChemSafetyBench can help us create safer AI tools for chemistry research. |