Summary of Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations, by Yanda Chen et al.
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
by Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown
First submitted to arXiv on: 17 Jul 2023
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper's arXiv page |
Medium | GrooveSquid.com (original content) | Large language models (LLMs) can provide natural language explanations for their decisions, which raises the question of whether these explanations actually help humans build mental models of how the models process different inputs. To investigate this, the paper proposes evaluating the “counterfactual simulatability” of explanations: whether an explanation enables humans to accurately predict the model’s outputs on diverse counterfactual versions of the input being explained. For instance, if a model answers “yes” to the question “Can eagles fly?” with the explanation that all birds can fly, a human would infer from this explanation that the model would also answer “yes” to the counterfactual question “Can penguins fly?”. If the explanation is precise, the model’s actual answers should match these inferences. This evaluation offers insight into how well LLM explanations help humans understand model behavior (a small illustrative sketch of the idea appears below the table). |
Low | GrooveSquid.com (original content) | Large language models can explain how they make decisions. But do these explanations really show how the models work? Can they help people guess what a model will say? To answer these questions, the paper checks whether an explanation can be used to predict what the model would say if it were asked different, related questions. For example, if a model says “yes” to “Can eagles fly?” and explains that all birds can fly, then people should be able to use that explanation to guess what the model would say about penguins flying. If the explanation is good, the model’s answer should match what people expect. |
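The following is a minimal, hypothetical Python sketch of the counterfactual simulatability idea described in the medium summary: infer answers from an explanation, compare them with the model's actual answers on counterfactual inputs, and report the fraction that match. The function names, the `simulate_from_explanation` helper, and the toy data are illustrative assumptions, not the paper's actual code or metrics.

```python
# Hypothetical sketch of counterfactual simulatability, not the authors' implementation.
from typing import Callable, Iterable


def simulation_precision(
    explanation: str,
    counterfactuals: Iterable[str],
    model_answer: Callable[[str], str],
    simulate_from_explanation: Callable[[str, str], str],
) -> float:
    """Fraction of counterfactual inputs on which the answer a reader would
    infer from the explanation matches the model's actual answer."""
    matches, total = 0, 0
    for question in counterfactuals:
        inferred = simulate_from_explanation(explanation, question)  # what a human would guess
        actual = model_answer(question)                              # what the model really says
        matches += int(inferred == actual)
        total += 1
    return matches / total if total else 0.0


# Toy usage with the eagle/penguin example from the summaries.
if __name__ == "__main__":
    explanation = "All birds can fly."

    # Stand-ins for the model and for a human reading the explanation.
    def model_answer(q: str) -> str:
        return "no" if "penguin" in q.lower() else "yes"

    def simulate_from_explanation(expl: str, q: str) -> str:
        # A reader who takes "all birds can fly" literally answers "yes" for any bird.
        return "yes"

    score = simulation_precision(
        explanation,
        ["Can penguins fly?", "Can sparrows fly?"],
        model_answer,
        simulate_from_explanation,
    )
    print(f"simulation precision: {score:.2f}")  # 0.50: the explanation misleads on penguins
```

In this toy run the explanation leads a reader astray on the penguin counterfactual, so the score is 0.5; a more precise explanation would yield answers that match the model's on more counterfactual inputs.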