Summary of Giraffe: Design Choices for Extending the Context Length of Visual Language Models, by Mukai Li et al.
GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models
by Mukai Li, Lei Li, Shansan Gong, Qi Liu
First submitted to arXiv on: 17 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The paper aims to strengthen the long-range modeling capabilities of Visual Language Models (VLMs) while preserving their performance in short-context scenarios. To this end, the authors analyze data sources and length distributions and propose a data recipe, ETVLM, that balances performance across both settings. They also examine existing position-extension methods, identify their limitations, and develop an enhanced approach, M-RoPE++. In addition, they discuss how best to use the extended context window and propose hybrid-resolution training. Building on the Qwen-VL series, the authors introduce Giraffe, a VLM whose context length is effectively extended to 128K. On long-context VLM benchmarks such as VideoMME and Visual Haystacks, Giraffe achieves state-of-the-art performance among open-source models and is competitive with the commercial model GPT-4V. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Giraffe is a new Visual Language Model that can handle very long inputs, such as long videos or many images at once. The authors improved it by using a special recipe for the training data and a new way to extend the model’s context. They tested Giraffe on long-context benchmarks and found that it performed well, even compared to commercial models. |
Keywords
» Artificial intelligence » GPT » Language model