Summary of MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning, by Haotian Zhang et al.
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning by Haotian Zhang, Mingfei Gao, Zhe Gan,…
Visual Prompting in Multimodal Large Language Models: A Survey by Junda Wu, Zhehao Zhang, Yu Xia,…
Transformer with Controlled Attention for Synchronous Motion Captioning by Karim Radouane, Sylvie Ranwez, Julien Lagarde, Andon…
What Makes a Maze Look Like a Maze? by Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum,…
Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling by…
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding by Yunze Man, Shuhong Zheng, Zhipeng…
DSTI at LLMs4OL 2024 Task A: Intrinsic versus extrinsic knowledge for type classification by Hanna Abi…
A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training by Bilal Faye, Binta Sow, Hanane…
Neural Reward Machines by Elena Umili, Francesco Argenziano, Roberto Capobianco. First submitted to arXiv on: 16 Aug…
Infusing Environmental Captions for Long-Form Video Language Grounding by Hyogun Lee, Soyeon Hong, Mujeen Sung, Jinwoo…