Summary of Quadratic Gating Functions in Mixture of Experts: A Statistical Insight, by Pedram Akbarian et al.
Quadratic Gating Functions in Mixture of Experts: A Statistical Insight, by Pedram Akbarian, Huy Nguyen, Xing…
Mimetic Initialization Helps State Space Models Learn to Recall, by Asher Trockman, Hrayr Harutyunyan, J. Zico…
3DS: Decomposed Difficulty Data Selection’s Case Study on LLM Medical Domain Adaptation, by Hongxin Ding, Yue…
Towards Better Multi-head Attention via Channel-wise Sample Permutation, by Shen Yuan, Hongteng Xu. First submitted to arXiv…
A few-shot Label Unlearning in Vertical Federated Learning, by Hanlin Gu, Hong Xi Tae, Chee Seng…
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis, by Weronika…
When Attention Sink Emerges in Language Models: An Empirical View, by Xiangming Gu, Tianyu Pang, Chao…
Artificial Intelligence-Based Triaging of Cutaneous Melanocytic Lesions, by Ruben T. Lucassen, Nikolas Stathonikos, Gerben E. Breimer,…
LoLCATs: On Low-Rank Linearizing of Large Language Models, by Michael Zhang, Simran Arora, Rahul Chalamala, Alan…
Learning Linear Attention in Polynomial Time, by Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum,…