Summary of See Then Tell: Enhancing Key Information Extraction with Vision Grounding, by Shuhang Liu et al.
See then Tell: Enhancing Key Information Extraction with Vision Grounding by Shuhang Liu, Zhenrong Zhang, Pengfei…
Grounding 3D Scene Affordance From Egocentric Interactions by Cuiyu Liu, Wei Zhai, Yuhang Yang, Hongchen Luo,…
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion by Ming Dai, Lingfeng Yang,…
LTNtorch: PyTorch Implementation of Logic Tensor Networks by Tommaso Carraro, Luciano Serafini, Fabio Aiolli. First submitted to…
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension by Ting Liu, Zunnan Xu, Yue…
Multi-Document Grounded Multi-Turn Synthetic Dialog Generation by Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo,…
Question-Answering Dense Video Events by Hangyu Qin, Junbin Xiao, Angela Yao. First submitted to arXiv on: 6…
Improving Apple Object Detection with Occlusion-Enhanced Distillation by Liang Geng. First submitted to arXiv on: 3 Sep…
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents by Segev Shlomov, Ben Wiesel, Aviad Sela,…
Unlocking the Wisdom of Large Language Models: An Introduction to The Path to Artificial General…