Summary of Grounding Is All You Need? Dual Temporal Grounding For Video Dialog, by You Qin et al.
Grounding is All You Need? Dual Temporal Grounding for Video Dialog by You Qin, Wei Ji,…
TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens by…
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents by Boyu Gou,…
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models by Haibo Wang, Zhiyang Xu, Yu…
Adaptive Masking Enhances Visual Grounding by Sen Jia, Lei Li. First submitted to arXiv on: 4 Oct…
Choices are More Important than Efforts: LLM Enables Efficient Multi-Agent Exploration by Yun Qu, Boyuan Wang,…
From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning by Haodong Xie, Rahul…
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning by Weitai Kang, Haifeng Huang, Yuzhang…
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering by Jiacong Wang, Bohong…
Learning to Ground Existentially Quantified Goals by Martin Funkquist, Simon Ståhlberg, Hector Geffner. First submitted to arXiv…