National Key Laboratory of Parallel and Distributed Computing, College of Computer, National University of Defense Technology, Changsha 410073, China
E-mail: tangyu14@nudt.edu.cn
‡Corresponding author
Received: 2023-10-17
Revised: 2024-03-31
Published online: 2025-03-17
Published in print: 2025-03
Yu TANG, Linbo QIAO, Lujia YIN, et al. Training large-scale language models with limited GPU memory: a survey[J]. Frontiers of information technology & electronic engineering, 2025, 26(3): 309-331. DOI: 10.1631/FITEE.2300710.
Large-scale models have gained significant attention in a wide range of fields, such as computer vision and natural language processing, due to their effectiveness across various applications. However, a notable hurdle in training these large-scale models is the limited memory capacity of graphics processing units (GPUs). In this paper, we present a comprehensive survey focused on training large-scale models with limited GPU memory. The exploration commences by scrutinizing the factors that contribute to the consumption of GPU memory during the training process, namely model parameters, model states, and model activations. Following this analysis, we present an in-depth overview of the relevant research work that addresses these aspects individually. Finally, the paper concludes by presenting an outlook on the future of memory optimization in training large-scale language models, emphasizing the necessity for continued research and innovation in this area. This survey serves as a valuable resource for researchers and practitioners keen on comprehending the challenges and advancements in training large-scale language models with limited GPU memory.
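To make the three memory components named in the abstract concrete, the following is a minimal back-of-the-envelope sketch (not taken from the paper) that estimates the per-GPU footprint of mixed-precision Adam training. The byte counts per parameter (2-byte fp16 weights and gradients plus 4-byte fp32 master weights, momentum, and variance, about 16 bytes in total) follow the commonly cited accounting for this setup; the `estimate_training_memory_gib` helper and the activation term are illustrative assumptions, since activation memory depends heavily on the model architecture, sequence length, and batch size.

```python
# Back-of-the-envelope GPU memory estimate for mixed-precision Adam training.
# Hypothetical helper for illustration only; real footprints also depend on
# framework overheads, memory fragmentation, and temporary buffers.

GiB = 1024 ** 3


def estimate_training_memory_gib(num_params: float, activation_bytes: float = 0.0) -> dict:
    """Rough memory breakdown, assuming no parallelism, sharding, or offloading.

    num_params       -- total number of model parameters
    activation_bytes -- measured or estimated activation footprint in bytes
                        (highly model- and batch-size-dependent)
    """
    params_fp16 = 2 * num_params          # fp16 model parameters
    grads_fp16 = 2 * num_params           # fp16 gradients
    # Adam "model states": fp32 master weights + momentum + variance
    optimizer_fp32 = (4 + 4 + 4) * num_params
    return {
        "parameters (GiB)": params_fp16 / GiB,
        "gradients (GiB)": grads_fp16 / GiB,
        "optimizer states (GiB)": optimizer_fp32 / GiB,
        "activations (GiB)": activation_bytes / GiB,
        "total (GiB)": (params_fp16 + grads_fp16 + optimizer_fp32
                        + activation_bytes) / GiB,
    }


if __name__ == "__main__":
    # Example: a 175-billion-parameter model needs on the order of 2.8 TB for
    # parameters, gradients, and optimizer states alone -- far beyond any
    # single GPU's memory, before activations are even counted.
    for name, gib in estimate_training_memory_gib(175e9).items():
        print(f"{name}: {gib:,.1f}")
```

Even this crude estimate shows why the survey treats parameters, model states, and activations separately: the optimizer states dominate the static footprint, while activations scale with the batch size and are the natural target for recomputation and compression techniques.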