Summary of SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal, by Tinghao Xie et al.
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal by Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo…
Holistic Evaluation for Interleaved Text-and-Image Generation by Minqian Liu, Zhiyang Xu, Zihao Lin, Trevor Ashby, Joy…
Does GPT Really Get It? A Hierarchical Scale to Quantify Human vs AI’s Understanding of…
Identifying User Goals from UI Trajectories by Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi…
CryptoGPT: a 7B model rivaling GPT-4 in the task of analyzing and classifying real-time financial…
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding by Alessandro Suglia, Claudio Greco,…
SPL: A Socratic Playground for Learning Powered by Large Language Model by Liang Zhang, Jionghao Lin,…
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning by Bingchen…
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI by Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng…
Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba by…