Summary of Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training, by Longtian Qiu et al.
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training, by Longtian Qiu, Shan Ning, Xuming…
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment, by Ziping Ma, Furong Xu, Jian…
Object-oriented backdoor attack against image captioning, by Meiling Li, Nan Zhong, Xinpeng Zhang, Zhenxing Qian, Sheng…
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features, by Van-Quang Nguyen, Masanori Suganuma,…
ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers, by Chao Fan,…
ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning, by Taewhan Kim, Soeun Lee, Si-Woo Kim,…
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy, by Priyaranjan Pattnayak, Hitesh Laxmichand Patel,…
GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning, by Teja…
Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models, by Zijun…
Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models, by Sri Harsha Dumpala, David Arps, Sageev…