Summary of Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations, by Yanda Chen et al.
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
by Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown
First submitted to arXiv on: 17 Jul 2023
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper's arXiv page |
Medium | GrooveSquid.com (original content) | Large language models (LLMs) can provide natural language explanations for their decisions, which raises the question of whether these explanations actually help humans build mental models of how the models process different inputs. To investigate this, the paper proposes evaluating the “counterfactual simulatability” of explanations: whether an explanation enables humans to accurately predict the model’s outputs on diverse counterfactual versions of the input being explained. For instance, if a model answers “yes” to the question “Can eagles fly?” with the explanation that all birds can fly, a human would infer from this explanation that the model would also answer “yes” to the counterfactual question “Can penguins fly?”. If the explanation is precise, the model’s actual answers should match these inferences. This evaluation offers insight into how well LLM explanations help humans understand model behavior (a small illustrative sketch of the idea appears below the table). |
Low | GrooveSquid.com (original content) | Large language models can explain how they make decisions. But do these explanations really show how the models work? Can they help people guess what a model will say? To answer these questions, the paper checks whether an explanation can be used to predict what the model would say if it were asked different, related questions. For example, if a model says “yes” to “Can eagles fly?” and explains that all birds can fly, then people should be able to use that explanation to guess what the model would say about penguins flying. If the explanation is good, the model’s answer should match what people expect. |
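The following is a minimal, hypothetical Python sketch of the counterfactual simulatability idea described in the medium summary: infer answers from an explanation, compare them with the model's actual answers on counterfactual inputs, and report the fraction that match. The function names, the `simulate_from_explanation` helper, and the toy data are illustrative assumptions, not the paper's actual code or metrics.

```python
# Hypothetical sketch of counterfactual simulatability, not the authors' implementation.
from typing import Callable, Iterable


def simulation_precision(
    explanation: str,
    counterfactuals: Iterable[str],
    model_answer: Callable[[str], str],
    simulate_from_explanation: Callable[[str, str], str],
) -> float:
    """Fraction of counterfactual inputs on which the answer a reader would
    infer from the explanation matches the model's actual answer."""
    matches, total = 0, 0
    for question in counterfactuals:
        inferred = simulate_from_explanation(explanation, question)  # what a human would guess
        actual = model_answer(question)                              # what the model really says
        matches += int(inferred == actual)
        total += 1
    return matches / total if total else 0.0


# Toy usage with the eagle/penguin example from the summaries.
if __name__ == "__main__":
    explanation = "All birds can fly."

    # Stand-ins for the model and for a human reading the explanation.
    def model_answer(q: str) -> str:
        return "no" if "penguin" in q.lower() else "yes"

    def simulate_from_explanation(expl: str, q: str) -> str:
        # A reader who takes "all birds can fly" literally answers "yes" for any bird.
        return "yes"

    score = simulation_precision(
        explanation,
        ["Can penguins fly?", "Can sparrows fly?"],
        model_answer,
        simulate_from_explanation,
    )
    print(f"simulation precision: {score:.2f}")  # 0.50: the explanation misleads on penguins
```

In this toy run the explanation leads a reader astray on the penguin counterfactual, so the score is 0.5; a more precise explanation would yield answers that match the model's on more counterfactual inputs.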