Summary of LinVT: Empower Your Image-level Large Language Model to Understand Videos, by Lishuai Gao et al.
LinVT: Empower Your Image-level Large Language Model to Understand Videos, by Lishuai Gao, Yujie Zhong, Yingsen…
Leveraging Multimodal Protein Representations to Predict Protein Melting Temperatures, by Daiheng Zhang, Yan Zeng, Xinyu Hong,…
Enhancing CLIP Conceptual Embedding through Knowledge Distillation, by Kuei-Chun Kao. First submitted to arXiv on: 4 Dec…
Learning on One Mode: Addressing Multi-modality in Offline Reinforcement Learning, by Mianchu Wang, Yue Jin, Giovanni…
WxC-Bench: A Novel Dataset for Weather and Climate Downstream Tasks, by Rajat Shinde, Christopher E. Phillips,…
Visual Error Patterns in Multi-Modal AI: A Statistical Approach, by Ching-Yi Wang. First submitted to arXiv on:…
ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering? by Pragati Shuddhodhan Meshram,…
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos, by Tiantian Geng, Jinrui Zhang, Qingni…
CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives, by Armin Saghafian, Amirmohammad Izadi, Negin Hashemi Dijujin,…
MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks, by Yiming Wu, Wei Ji, Kecheng Zheng,…