Summary of Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, by Rohin Manvi et al.
Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation by Rohin Manvi,…
Grounding Large Language Models In Embodied Environment With Imperfect World Models by Haolan Liu, Jishen Zhao. First…
GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model…
CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of…
DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life by Yu Ying Chiu, Liwei…
CodeJudge: Evaluating Code Generation with Large Language Models by Weixi Tong, Tianyi Zhang. First submitted to arXiv…
Can LLMs Reliably Simulate Human Learner Actions? A Simulation Authoring Framework for Open-Ended Learning Environments by…
MARPLE: A Benchmark for Long-Horizon Inference by Emily Jin, Zhuoyi Huang, Jan-Philipp Fränken, Weiyu Liu, Hannah…
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester by Maya Pavlova, Erik Brinkman, Krithika…
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs by Hong Li, Nanxi Li,…