Summary of LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture, by Xidong Wang et al.
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture, by Xidong Wang, Dingjie…