Summary of Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding, by Bram Willemsen et al.
Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding
by Bram Willemsen, Gabriel Skantze
First submitted to arXiv on: 9 Sep 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes a two-stage method for generating referring expressions (REs) in visually grounded dialogue. First, referring expression generation (REG) is modeled as a next-token prediction task conditioned on the preceding linguistic context and an image representation of the referent. Second, discourse-aware comprehension guiding is used to rerank candidate REs based on their discriminative power (see the sketch below this table). The results show that the method produces discriminative REs, improving text-image retrieval accuracy compared to greedy decoding. |
Low | GrooveSquid.com (original content) | The paper tries to make computers better at understanding what people are talking about when they describe pictures. It uses a language model that looks at both the words being used and the picture being described, and comes up with phrases (called referring expressions) that point to the correct thing in the picture. The paper shows that this approach works well, making it easier for computers to pick out the right object. |
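
The generate-and-rerank process described in the medium summary can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the `generator` and `comprehender` objects, their method names, and the scoring details are assumptions made here for clarity.

```python
# Illustrative sketch of a two-stage generate-and-rerank approach to REG.
# All object and method names below are hypothetical placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str          # a candidate referring expression
    gen_score: float   # generator log-probability for the candidate


def generate_candidates(generator, dialogue_context: str, referent_image,
                        num_candidates: int = 5) -> List[Candidate]:
    """Stage 1: sample candidate REs from a model conditioned on the
    preceding dialogue and an image representation of the referent."""
    return generator.sample(dialogue_context, referent_image, n=num_candidates)


def rerank_by_comprehension(comprehender, candidates: List[Candidate],
                            dialogue_context: str, scene_images) -> List[Candidate]:
    """Stage 2: score each candidate by how reliably a discourse-aware
    comprehension model retrieves the intended referent (assumed to be at
    index 0 of scene_images here), then sort candidates by that score."""
    def discriminative_score(c: Candidate) -> float:
        # Probability mass the comprehension model assigns to the true
        # referent when reading the candidate RE in its dialogue context.
        probs = comprehender.retrieval_probs(dialogue_context + " " + c.text,
                                             scene_images)
        return probs[0]

    return sorted(candidates, key=discriminative_score, reverse=True)
```

Under these assumptions, the top-ranked candidate is the RE that best lets a listener-like comprehension model pick out the intended referent, which is the intuition behind comparing against plain greedy decoding.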
Keywords
» Artificial intelligence » Discourse » Language model » Token