Summary of Zyda-2: a 5 Trillion Token High-quality Dataset, by Yury Tokpanov et al.
Zyda-2: a 5 Trillion Token High-Quality Dataset, by Yury Tokpanov, Paolo Glorioso, Quentin Anthony, Beren Millidge. First…
LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration, by Yukun Cao, Zengyi Gao, Zhiyang Li,…
SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding, by Ryan Sun, Tianyi Zhou, Xun Chen, Lichao Sun. First…
Improving Multi-Domain Task-Oriented Dialogue System with Offline Reinforcement Learning, by Dharmendra Prajapat, Durga Toshniwal. First submitted to…
LLM Generated Distribution-Based Prediction of US Electoral Results, Part I, by Caleb Bradshaw, Caelen Miller, Sean…
Wave Network: An Ultra-Small Language Model, by Xin Zhang, Victor S. Sheng. First submitted to arxiv on: 4…
VQ-Map: Bird’s-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization, by Yiwei Zhang, Jin…
RAGViz: Diagnose and Visualize Retrieval-Augmented Generation, by Tevin Wang, Jingyuan He, Chenyan Xiong. First submitted to arxiv…
Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning, by John…
Length-Induced Embedding Collapse in Transformer-based Models, by Yuqi Zhou, Sunhao Dai, Zhanshuo Cao, Xiao Zhang, Jun…