Summary of Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering, by Zhixuan Shen et al.


Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

by Zhixuan Shen, Haonan Luo, Sijia Li, Tianrui Li

First submitted to arXiv on: 14 Mar 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary
Written by: Paper authors
Read the original abstract here

Medium Difficulty Summary
Written by: GrooveSquid.com (original content)
This paper proposes a multimodal adversarial training approach with spatial awareness for Scene-Text Visual Question Answering (ST-VQA), the task of understanding scene text in images and answering questions about that text. The approach targets the overfitting that results from relying heavily on Optical Character Recognition (OCR) systems. Its Adversarial OCR Enhancement (AOE) module applies adversarial training in the embedding space to make the representation of OCR text more fault-tolerant. A Spatial-Aware Self-Attention (SASA) mechanism captures the spatial relationships among OCR tokens. The authors report significant performance improvements on both the ST-VQA and TextVQA datasets.

Low Difficulty Summary
Written by: GrooveSquid.com (original content)
This paper helps computers better understand images that contain text, such as signs or menus. Computers currently struggle with this because they rely too much on special software that tries to read the text. When that software misreads the text, the computer’s answers can be wrong. The method proposed in this paper combines two techniques to improve accuracy: “adversarial training,” which improves the computer’s reading skills, and extra attention to where words are located in the image. Together, these help the computer give better answers to questions about what it sees.
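
The idea of adversarial training in the embedding space, mentioned in the medium difficulty summary, can be illustrated with a small sketch. This is not the paper's AOE module: the linear scorer, the logistic loss, the gradient-direction step of size `epsilon`, the toy values, and all function names below are assumptions chosen only to show the general pattern of perturbing an embedding in the direction that increases the loss and then training on both the clean and the perturbed version.

```python
import math


def logistic_loss(w, x, y):
    """Binary logistic loss of a linear scorer w.x for a label y in {+1, -1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))


def loss_grad_wrt_embedding(w, x, y):
    """Analytic gradient of the logistic loss with respect to the embedding x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def perturb_embedding(x, grad, epsilon):
    """Move x by epsilon along the L2-normalized gradient (loss-increasing step)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(x)  # no useful direction, leave the embedding unchanged
    return [xi + epsilon * g / norm for xi, g in zip(x, grad)]


# Hypothetical toy values: one 3-dimensional "OCR token embedding" and a fixed scorer.
w = [0.5, -1.0, 2.0]
ocr_embedding = [1.0, 0.2, 0.3]
label = 1

clean_loss = logistic_loss(w, ocr_embedding, label)
grad = loss_grad_wrt_embedding(w, ocr_embedding, label)
adv_embedding = perturb_embedding(ocr_embedding, grad, epsilon=0.1)
adv_loss = logistic_loss(w, adv_embedding, label)

# Assumed objective: sum of clean and adversarial losses (a common choice,
# not confirmed here as the paper's objective).
training_loss = clean_loss + adv_loss

print("clean loss:", clean_loss)
print("adversarial loss:", adv_loss)
print("training loss:", training_loss)
```

Because the perturbation follows the loss gradient, the perturbed embedding is harder for the scorer than the clean one. Minimizing a loss that includes it pushes a model toward representations that tolerate small errors in the OCR input.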

Keywords

» Artificial intelligence  » Attention  » Embedding space  » Overfitting  » Question answering  » Self attention