DeepSeek-V3 Technical Report
by DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R.J. Chen, R.L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S.S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W.L. Xiao, Wangding Zeng
First submitted to arXiv on: 27 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | A new language model, DeepSeek-V3, is presented. It uses a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, of which 37 billion are activated for each token. The model incorporates the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It also employs an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective for improved performance. Pre-trained on 14.8 trillion diverse tokens, the model is then refined with Supervised Fine-Tuning and Reinforcement Learning. Comprehensive evaluations show that DeepSeek-V3 outperforms other open-source models and matches the performance of leading closed-source models, while requiring only 2.788 million GPU hours for its full training. The training process was remarkably stable, with no irrecoverable loss spikes or rollbacks. (A rough sketch of the MoE routing idea appears after this table.) |
| Low | GrooveSquid.com (original content) | DeepSeek-V3 is a new language model that's really good at understanding text. It uses a special type of architecture called Mixture-of-Experts (MoE) and has a lot of parameters so it can learn from many different kinds of text. The model was trained on 14.8 trillion tokens (small pieces of text) and then fine-tuned to do even better. When tested, DeepSeek-V3 did very well compared to other models and took surprisingly little computing time to train. What's cool is that the training process didn't get stuck or have any big problems. |
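To make the medium summary's architecture description a bit more concrete, below is a minimal, hypothetical sketch of top-k expert routing with a bias-based load-balancing adjustment, in the spirit of the auxiliary-loss-free strategy mentioned above. The expert count, update rate, and function names are illustrative assumptions, not DeepSeek-V3's actual implementation.

```python
# Illustrative sketch only -- not DeepSeek-V3's real code. The expert count,
# bias update rule, and names are assumptions used to show the general idea:
# top-k MoE routing with a bias term adjusted for load balance instead of an
# auxiliary balancing loss.
import numpy as np

NUM_EXPERTS = 8    # small value for illustration; the real model uses far more
TOP_K = 2          # experts activated per token (illustrative)
BIAS_STEP = 1e-3   # assumed bias update rate

def route_tokens(scores: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Select TOP_K experts per token; the bias only shifts which experts win."""
    return np.argsort(-(scores + bias), axis=1)[:, :TOP_K]

def rebalance(bias: np.ndarray, chosen: np.ndarray) -> np.ndarray:
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = np.bincount(chosen.ravel(), minlength=bias.size)
    return bias - BIAS_STEP * np.sign(load - load.mean())

rng = np.random.default_rng(0)
bias = np.zeros(NUM_EXPERTS)
scores = rng.normal(size=(16, NUM_EXPERTS))  # toy batch: 16 tokens, one score per expert
chosen = route_tokens(scores, bias)
bias = rebalance(bias, chosen)
print("experts chosen per token:", chosen.shape[1], "| new bias:", np.round(bias, 4))
```

Only `TOP_K` of the `NUM_EXPERTS` experts run for each token, which is the same sparsity idea behind activating only 37 billion of the model's 671 billion parameters per token.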
Keywords
* Artificial intelligence
* Attention
* Fine tuning
* Inference
* Language model
* Mixture of experts
* Reinforcement learning
* Supervised
* Token