Summary of Beyond KV Caching: Shared Attention for Efficient LLMs, by Bingli Liao and Danilo Vasconcellos Vargas
Beyond KV Caching: Shared Attention for Efficient LLMs, by Bingli Liao, Danilo Vasconcellos Vargas. First submitted to…
LookupViT: Compressing visual information to a limited number of tokens, by Rajat Koner, Gagan Jain, Prateek…
Mamba-PTQ: Outlier Channels in Recurrent Large Language Models, by Alessandro Pierro, Steven Abreu. First submitted to arXiv…
Analyzing the Generalization and Reliability of Steering Vectors, by Daniel Tan, David Chanin, Aengus Lynch, Dimitrios…
UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs, by Shenyang…
When can transformers compositionally generalize in-context?, by Seijin Kobayashi, Simon Schug, Yassir Akram, Florian Redhardt, Johannes…
Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale, by Ayush Kaushal, Tejas Vaidhya, Arnab…
Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors, by Matt Gorbett,…
Enhancing Split Computing and Early Exit Applications through Predefined Sparsity, by Luigi Capogrosso, Enrico Fraccaroli, Giulio…
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation, by Branden Butler, Sixing Yu, Arya Mazaheri, Ali…