Summary of Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding, by Bram Willemsen et al.


Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding

by Bram Willemsen, Gabriel Skantze

First submitted to arXiv on: 9 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
See the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper proposes a method for referring expression generation (REG) in visually grounded dialogue. The approach is a two-stage process. First, it models REG as a next-token prediction task conditioned on the preceding linguistic context and an image representation of the referent. Second, it uses discourse-aware comprehension guiding to rerank candidate referring expressions (REs) based on their discriminatory power. The results show that the proposed method is effective in producing discriminative REs: reranked expressions achieve higher text-image retrieval accuracy than expressions produced with greedy decoding.

Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper tries to make computers better at describing things in pictures during a conversation. The authors do this with a special kind of language model that looks at both the words used so far and the picture being described. The goal is to come up with phrases (called referring expressions) that accurately point to the correct thing in the picture. The paper shows that the approach works well: the phrases it produces make it easier to pick out which thing is being talked about.
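The generate-and-rerank idea in the medium difficulty summary can be illustrated with a small sketch. This is not the authors' implementation: the candidate expressions, their log-probabilities, the image tags, and the word-overlap scorer `overlap_score` are all hypothetical stand-ins for a learned REG model and a learned comprehension model. The toy scorer also ignores dialogue context, so the discourse-aware part of the method is not modeled here.

```python
import math
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[str, Sequence[str]], float]


def overlap_score(expression: str, image_tags: Sequence[str]) -> float:
    """Hypothetical comprehension scorer: number of words shared by the
    expression and the tags of an image (stand-in for a learned model)."""
    words = set(expression.lower().split())
    return float(len(words & set(image_tags)))


def target_probability(expression: str, target: str,
                       images: Dict[str, Sequence[str]],
                       score: Scorer) -> float:
    """Softmax probability that the scorer retrieves the target image."""
    scores = {name: score(expression, tags) for name, tags in images.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(s - max_score) for name, s in scores.items()}
    return exps[target] / sum(exps.values())


def rerank(candidates: List[str], target: str,
           images: Dict[str, Sequence[str]], score: Scorer) -> List[str]:
    """Order candidates by how well they single out the target image."""
    return sorted(
        candidates,
        key=lambda c: target_probability(c, target, images, score),
        reverse=True,
    )


# Stage 1 stand-in: candidate expressions with made-up generator log-probs.
CANDIDATES = {
    "the dog": -0.5,
    "the brown dog on the sofa": -2.1,
    "the dog outside": -1.7,
}

# Made-up image descriptions; "img_a" is the referent, the others distract.
IMAGES = {
    "img_a": ["brown", "dog", "sofa"],
    "img_b": ["white", "dog", "grass", "outside"],
    "img_c": ["black", "dog", "beach", "outside"],
}

# Greedy-style choice: the single most likely expression under the generator.
greedy_choice = max(CANDIDATES, key=CANDIDATES.get)

# Stage 2: rerank all candidates by their discriminatory power.
reranked = rerank(list(CANDIDATES), "img_a", IMAGES, overlap_score)

if __name__ == "__main__":
    print("greedy:  ", greedy_choice)
    print("reranked:", reranked[0])
```

In this toy setup the most likely expression, "the dog", fits all three images equally well, while reranking prefers "the brown dog on the sofa" because it separates the referent from the distractors.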

Keywords

» Artificial intelligence  » Discourse  » Language model  » Token