Summary of Grounding Is All You Need? Dual Temporal Grounding For Video Dialog, by You Qin et al.
Grounding is All You Need? Dual Temporal Grounding for Video Dialog by You Qin, Wei Ji,…
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents by Boyu Gou,…
TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens by…
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models by Haibo Wang, Zhiyang Xu, Yu…
Adaptive Masking Enhances Visual Grounding by Sen Jia, Lei Li. First submitted to arXiv on: 4 Oct…
From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning by Haodong Xie, Rahul…
Choices are More Important than Efforts: LLM Enables Efficient Multi-Agent Exploration by Yun Qu, Boyuan Wang,…
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning by Weitai Kang, Haifeng Huang, Yuzhang…
Learning to Ground Existentially Quantified Goals by Martin Funkquist, Simon Ståhlberg, Hector Geffner. First submitted to arXiv…
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering by Jiacong Wang, Bohong…