Summary of CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion, by Shoubin Yu et al.
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion by Shoubin Yu, Jaehong Yoon, Mohit…
StableMask: Refining Causal Masking in Decoder-only Transformer by Qingyu Yin, Xuzheng He, Xiang Zhuang, Yu Zhao,…
Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping by Qinliang Lin, Cheng Luo, Zenghao Niu,…
Position: Stop Making Unscientific AGI Performance Claims by Patrick Altmeyer, Andrew M. Demetriou, Antony Bartlett, Cynthia…
Toward Human-AI Alignment in Large-Scale Multi-Player Games by Sugandha Sharma, Guy Davidson, Khimya Khetarpal, Anssi Kanervisto,…
UniMem: Towards a Unified View of Long-Context Large Language Models by Junjie Fang, Likai Tang, Hongzhe…
Faster Inference of Integer SWIN Transformer by Removing the GELU Activation by Mohammadreza Tayaranian, Seyyed Hasan…
Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation by Rozhan Ahmadi, Shohreh Kasaei. First submitted to…
Arabic Tweet Act: A Weighted Ensemble Pre-Trained Transformer Model for Classifying Arabic Speech Acts on…
Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks by Savas Yildirim. First submitted to arXiv on: 30…