

Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

by Jinwoo Ahn, Junhyeok Park, Min-Jun Kim, Kang-Hyeon Kim, So-Yeong Sohn, Yun-Ji Lee, Du-Seong Chang, Yu-Jung Heo, Eun-Sol Kim

First submitted to arxiv on: 10 Jun 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.
Medium Difficulty Summary (written by GrooveSquid.com, original content)
The HYU MLLAB KT Team presents its solution to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge, which aims at human-level multimodal understanding by solving complex visio-linguistic puzzles designed for children in the 6-8 age group. The team proposes two main ideas. First, they exploit the reasoning ability of a large-scale language model (LLM) by grounding visual cues (images) in the text modality: highly detailed captions describing the image context serve as input to the LLM. Second, they employ a segmentation model, SAM, to detect objects of various sizes and capture geometric visual patterns, feeding this information into the LLM as well. The team achieved an option selection accuracy (Oacc) of 29.5 on the test set and a weighted option selection accuracy (WOSA) of 27.1 on the challenge set.
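The two ideas above amount to assembling a text-only prompt for the LLM from an image caption and a list of detected objects. A minimal sketch of that prompt-building step is below; the captioning model, the SAM-based detector, and the LLM call itself are omitted, and the `build_puzzle_prompt` function and the detection dictionary layout are illustrative assumptions, not the authors' actual code.

```python
# Sketch of the grounding step: fold a detailed image caption and
# SAM-style object detections into a single textual prompt so the
# LLM can reason about the puzzle without seeing the image directly.
# The function name and the detection dict format are hypothetical.

def build_puzzle_prompt(caption, detections, question, options):
    """Compose an LLM prompt from visual cues expressed as text.

    caption: detailed text description of the puzzle image
    detections: list of {"label": str, "bbox": (x1, y1, x2, y2)} dicts
    question: the puzzle question text
    options: the answer options to choose from
    """
    object_lines = "\n".join(
        f"- {d['label']} at bbox {d['bbox']}" for d in detections
    )
    return (
        "Image description:\n"
        f"{caption}\n\n"
        "Detected objects (from a segmentation model such as SAM):\n"
        f"{object_lines}\n\n"
        f"Puzzle question: {question}\n"
        "Options: " + ", ".join(options) + "\n"
        "Answer with the single best option."
    )
```

In this scheme the LLM never receives pixels; everything it needs, including the geometric patterns surfaced by the detector, arrives as structured text.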
Low Difficulty Summary (written by GrooveSquid.com, original content)
The SMART-101 challenge tries to get computers to understand complex puzzles that combine pictures and language, just like kids do. To solve this problem, the researchers came up with two ideas. First, they use a special computer program called a large-scale language model (LLM) that can reason and make connections between what it reads about an image and what the puzzle asks. They describe the image in lots of detail to help the LLM understand what's going on. Second, they use another program that can find different objects and patterns in images, like shapes and where they sit. This helps identify important things in the picture that might otherwise be hard for the LLM to notice.

Keywords

» Artificial intelligence  » Grounding  » Language model  » Object detection  » SAM