
Summary of Giraffe: Design Choices For Extending the Context Length Of Visual Language Models, by Mukai Li et al.


GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models

by Mukai Li, Lei Li, Shansan Gong, Qi Liu

First submitted to arXiv on: 17 Dec 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary
Written by: Paper authors
See the paper's original abstract.

Medium Difficulty Summary
Written by: GrooveSquid.com (original content)
The paper aims to enhance the long-range modeling capabilities of Visual Language Models (VLMs) while preserving their performance in short-context scenarios. To achieve this, the authors analyze data sources and length distributions and propose a data recipe, ETVLM, that balances performance across both scenarios. They also examine existing position-extension methods, identify their limitations, and develop an enhanced approach, M-RoPE++. In addition, they discuss how to make use of the extended context window and propose hybrid-resolution training. Building on the Qwen-VL series of models, the authors introduce Giraffe, a VLM whose context is effectively extended to a length of 128K. Evaluated on long-context VLM benchmarks such as VideoMME and Visual Haystacks, Giraffe achieves state-of-the-art performance among open-source models and is competitive with the commercial model GPT-4V.

Low Difficulty Summary
Written by: GrooveSquid.com (original content)
Giraffe is a new Visual Language Model that can understand very long videos or long sequences of images. The authors improved it by using a special recipe for the training data and a new way to extend the model's context. They tested Giraffe on many different datasets and found that it performed well, even compared to commercial models.
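
The medium-difficulty summary mentions "existing position-extension methods" without describing any. As a rough illustration only, the sketch below shows linear position interpolation on rotary position embeddings (RoPE), one widely used existing technique of that kind. It is not M-RoPE++ and is not taken from the Giraffe paper; the function names, the base of 10000, and the example lengths are all assumptions chosen for illustration.

```python
import math


def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position under rotary position embedding.

    scale > 1 applies linear position interpolation: the position index is
    divided by scale, so a longer sequence maps back into the position
    range seen during training. (Illustrative helper, not the paper's code.)
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    scaled_position = position / scale
    return [scaled_position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vector, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the RoPE angles."""
    angles = rope_angles(position, len(vector), base, scale)
    rotated = []
    for i, theta in enumerate(angles):
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated


# Hypothetical lengths, chosen only to make the arithmetic concrete.
trained_length = 32_000
target_length = 128_000
scale = target_length / trained_length  # 4.0

# With interpolation, the last position of the longer sequence lands
# inside the range of positions covered by the shorter trained length.
last_angle = rope_angles(target_length - 1, dim=8, scale=scale)[0]
print(last_angle < trained_length)
```

The design trade-off in this kind of scheme is that squeezing positions into the trained range avoids unseen position values but reduces the resolution between neighbouring positions, which is one reason such methods are usually paired with some fine-tuning at the longer length.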

Keywords

» Artificial intelligence  » GPT  » Language model