Summary of Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion, by Jiuhai Chen et al.


Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

by Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao

First submitted to arXiv on 5 Dec 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
We present Florence-VL, a family of multimodal large language models (MLLMs) that leverage enriched visual representations from the Florence-2 generative vision foundation model. Unlike CLIP-style vision transformers, Florence-2 captures diverse levels and aspects of visual features, making it more versatile for various downstream tasks. Our novel feature-fusion architecture and innovative training recipe integrate Florence-2’s visual features into pretrained LLMs like Phi 3.5 and Llama 3 using “depth-breadth fusion” (DBFusion). We train our models on diverse open-source datasets, including high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization show that Florence-VL outperforms popular vision encoders in vision-language alignment, where enriched depth and breadth play crucial roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, and knowledge-intensive tasks.
Low Difficulty Summary (written by GrooveSquid.com, original content)
We’ve developed a new kind of AI model called Florence-VL that can understand both text and images. It differs from other models in that it captures many different aspects of what’s in an image, which makes it better at a variety of tasks. We also came up with a new way to combine these visual features with language models like Phi 3.5 and Llama 3. Our model was trained on lots of different images and captions, and we tested it on many kinds of tasks, including recognizing objects, answering questions, and understanding charts. The results show that Florence-VL outperforms other models on these tasks, which matters for applications like chatbots and virtual assistants.
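The core architectural idea described above is “depth-breadth fusion”: visual features taken from several encoder depths (layers) and several task prompts (breadths) are combined and fed into a pretrained LLM. A minimal sketch of that pattern, assuming channel-wise concatenation followed by a linear projection into the LLM’s embedding space (the class name, dimensions, and fusion details here are illustrative, not the paper’s actual implementation):

```python
import torch
import torch.nn as nn


class DBFusionSketch(nn.Module):
    """Hypothetical sketch of depth-breadth fusion: concatenate feature maps
    from multiple vision-encoder layers / task prompts along the channel
    dimension, then project into the language model's embedding space."""

    def __init__(self, num_sources: int, vis_dim: int, llm_dim: int):
        super().__init__()
        # One projection over the concatenated channels of all sources.
        self.proj = nn.Linear(num_sources * vis_dim, llm_dim)

    def forward(self, features: list) -> torch.Tensor:
        # features: list of (batch, tokens, vis_dim) tensors, one per
        # encoder depth or task prompt.
        fused = torch.cat(features, dim=-1)  # (batch, tokens, num_sources * vis_dim)
        return self.proj(fused)              # (batch, tokens, llm_dim)


# Usage: fuse three feature sources of width 8 into a 16-dim LLM space.
fusion = DBFusionSketch(num_sources=3, vis_dim=8, llm_dim=16)
visual_tokens = fusion([torch.randn(2, 4, 8) for _ in range(3)])
```

The design choice sketched here (concatenation plus projection, rather than averaging) preserves information from every depth and breadth before the LLM sees it, which is consistent with the abstract’s claim that both enriched depth and breadth contribute to vision-language alignment.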

Keywords

» Artificial intelligence  » Alignment  » Hallucination  » Instruction tuning  » Llama