Summary of Compute Or Load KV Cache? Why Not Both?, by Shuowei Jin et al.
Compute Or Load KV Cache? Why Not Both? by Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z.…
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference by Jing Xiong, Jianghan Shen, Fanghua…
LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences by Zhenxiao Fu, Fan Chen, Shan Zhou,…
DecTrain: Deciding When to Train a Monocular Depth DNN Online by Zih-Sing Fu, Soumya Sudhakar, Sertac…
DANA: Domain-Aware Neurosymbolic Agents for Consistency and Accuracy by Vinh Luong, Sang Dinh, Shruti Raghavan, William…
Selective Attention Improves Transformer by Yaniv Leviathan, Matan Kalman, Yossi Matias. First submitted to arXiv on: 3…
Large Language Models as Markov Chains by Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas…
Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR by Hainan Xu, Travis M. Bartley, Vladimir Bataev,…
Stochastic variance-reduced Gaussian variational inference on the Bures-Wasserstein manifold by Hoang Phuc Hau Luu, Hanlin Yu,…
CTARR: A fast and robust method for identifying anatomical regions on CT images via atlas…