Summary of Efficient LLM Inference Using Dynamic Input Pruning and Cache-Aware Masking, by Marco Federici et al.
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking, by Marco Federici, Davide Belli, Mart…