Summary of HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models, by Shiding Zhu et al.
HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models by Shiding Zhu, Wenhui Dong,…
A4-Unet: Deformable Multi-Scale Attention Network for Brain Tumor Segmentation by Ruoxin Wang, Tianyi Tang, Haiming Du,…
Automatic Tongue Delineation from MRI Images with a Convolutional Neural Network Approach by Karyna Isaieva, Yves…
Parametric-ControlNet: Multimodal Control in Foundation Models for Precise Engineering Design Synthesis by Rui Zhou, Yanxia Zhang,…
Using Images to Find Context-Independent Word Representations in Vector Space by Harsh Kumar. First submitted to arXiv…
Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning by Neale Ratzlaff, Man Luo, Xin…
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation by Liao Qu, Huichao Zhang, Yiheng Liu,…
[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster by…
MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost by Sen Xing, Muyan…
StableAnimator: High-Quality Identity-Preserving Human Image Animation by Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi…