


Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization

by Shitong Duan, Xiaoyuan Yi, Peng Zhang, Yan Liu, Zheng Liu, Tun Lu, Xing Xie, Ning Gu

First submitted to arXiv on: 6 Mar 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper and is written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This research paper proposes a novel approach to aligning large language models (LLMs) with human preferences in order to mitigate potential social risks. Existing alignment methods rely heavily on high-quality positive-negative training pairs, in which the positive responses can be noisy and hard to distinguish from the negative ones. Instead, the authors introduce Distributional Dispreference Optimization (D^2O), a method that maximizes the discrepancy between human-annotated dispreferred responses and non-negative responses generated by the LLM itself. By eschewing harmful information without incorporating noisy positive samples, D^2O effectively reduces harmfulness while preserving helpfulness. The paper shows that D^2O can be regarded as learning a distributional preference model reflecting human dispreference against negative responses, which theoretically upper-bounds the instance-level DPO objective. Experimental results show that this approach achieves comparable generation quality and surpasses strong baselines in producing less harmful and more informative responses, with better training stability and faster convergence. (A rough, illustrative code sketch of this idea follows the summaries below.)
Low Difficulty Summary (written by GrooveSquid.com, original content)
This research aims to make sure large language models (a type of AI) do what humans want them to do, rather than causing harm. Right now, these AI models are very good at generating helpful answers, but they can also sometimes produce harmful ones that we don’t want. The authors of this paper came up with a new way to teach AI models to avoid producing those harmful responses. They did it by using examples of what humans don’t like, rather than trying to match examples of what humans do like. This approach worked well and produced better results than other methods. The research shows that negative examples alone can be used to train AI models to be more helpful and less likely to cause harm.
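
To make the idea in the medium-difficulty summary more concrete, here is a minimal, hypothetical PyTorch-style sketch of a DPO-like loss that uses only human-labeled negative (dispreferred) responses, contrasting them with responses the model generates itself. The function name, tensor shapes, and the exact form of the loss are illustrative assumptions, not the paper's method; the actual D^2O objective is defined in the paper and may differ.

# Illustrative sketch only: a DPO-style loss built from negatives.
# The "preferred" side is a batch of K self-generated responses; the other
# side is a human-annotated dispreferred response. All names and shapes
# here are assumptions for illustration, not the paper's exact D^2O loss.
import torch
import torch.nn.functional as F

def dispreference_loss(
    policy_logp_generated: torch.Tensor,   # (B, K) log p_theta of K self-generated responses
    ref_logp_generated: torch.Tensor,      # (B, K) log p_ref of the same responses
    policy_logp_negative: torch.Tensor,    # (B,)   log p_theta of the human-labeled negative response
    ref_logp_negative: torch.Tensor,       # (B,)   log p_ref of the same negative response
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards, as in DPO: beta * (log p_theta - log p_ref).
    reward_generated = beta * (policy_logp_generated - ref_logp_generated)  # (B, K)
    reward_negative = beta * (policy_logp_negative - ref_logp_negative)     # (B,)

    # Average over the K self-generated responses as a rough distribution-level
    # anchor, then push the policy to prefer them over the dispreferred response.
    margin = reward_generated.mean(dim=1) - reward_negative                 # (B,)
    return -F.logsigmoid(margin).mean()

# Tiny usage example with random numbers standing in for sequence log-probs.
if __name__ == "__main__":
    B, K = 4, 8
    loss = dispreference_loss(
        torch.randn(B, K), torch.randn(B, K), torch.randn(B), torch.randn(B)
    )
    print(loss.item())

Averaging the implicit reward over the K self-generated responses is one simple way to approximate the distribution-level comparison the summary describes, while the negative responses are the only human annotations the loss consumes.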

Keywords

» Artificial intelligence  » Optimization