
Summary of Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations, by Yanda Chen et al.


Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

by Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown

First submitted to arXiv on: 17 Jul 2023

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Large language models (LLMs) can mimic human reasoning by producing natural language explanations for their decisions. This raises the question of whether those explanations actually help humans build mental models of how an LLM processes different inputs. To investigate this, we propose evaluating the "counterfactual simulatability" of these explanations. Specifically, we examine whether an explanation enables humans to accurately predict the model's outputs on diverse counterfactual versions of the input being explained. For instance, if a model answers "yes" to the question "Can eagles fly?" with the explanation that all birds can fly, then humans would infer from this explanation that the model would also answer "yes" to the counterfactual input "Can penguins fly?". If the explanation is precise, the model's actual answers on such counterfactuals should match what humans infer from it (a minimal code sketch of this check appears below these summaries). We believe this evaluation provides valuable insight into how well LLMs help humans understand their decision-making process.

Low Difficulty Summary (original content by GrooveSquid.com)
Large language models try to explain how they make decisions. But do these models really explain themselves? Can their explanations help people figure out how they work? To answer these questions, we look at whether these explanations can be used to predict what the model would say if given different information. For example, if a model says "yes" to the question "Can eagles fly?" and explains it's because all birds can fly, then people should be able to use that explanation to guess what the model would say about penguins flying. If the explanation is good, the model's actual answer should match what people expect.

Keywords

* Artificial intelligence