Summary of ShowUI: One Vision-Language-Action Model for GUI Visual Agent, by Kevin Qinghong Lin et al.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent, by Kevin Qinghong Lin, Linjie Li, Difei Gao,…
HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator, by Fan Yang, Ru Zhen, Jianing Wang, Yanhao…
freePruner: A Training-free Approach for Large Multimodal Model Acceleration, by Bingxin Xu, Yuzhang Shang, Yunhao Ge,…
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy, by Te Yang, Jian Jia, Xiangyu…
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models, by Yixin Dong, Charlie F.…
FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification, by Zhengrui Guo, Conghao Xiong,…
FoPru: Focal Pruning for Efficient Large Vision-Language Models, by Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting…
LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement, by Siwen…
CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs, by Zhehan Kan, Ce Zhang,…
Do LLMs Understand Ambiguity in Text? A Case Study in Open-world Question Answering, by Aryan Keluskar,…