National Key Laboratory of Parallel and Distributed Computing, College of Computer, National University of Defense Technology, Changsha 410073, China
E-mail: tangyu14@nudt.edu.cn
‡Corresponding author
Received: 2023-10-17
Revised: 2024-03-31
Published online: 2025-03-17
Published in print: 2025-03
Yu TANG, Linbo QIAO, Lujia YIN, et al. Training large-scale language models with limited GPU memory: a survey[J]. Frontiers of information technology & electronic engineering, 2025, 26(3): 309-331. DOI: 10.1631/FITEE.2300710.
Large-scale models have gained significant attention in a wide range of fields, such as computer vision and natural language processing, due to their effectiveness across various applications. However, a notable hurdle in training these large-scale models is the limited memory capacity of graphics processing units (GPUs). In this paper, we present a comprehensive survey focused on training large-scale models with limited GPU memory. The exploration commences by scrutinizing the factors that contribute to the consumption of GPU memory during the training process, namely model parameters, model states, and model activations. Following this analysis, we present an in-depth overview of the relevant research work that addresses these aspects individually. Finally, the paper concludes by presenting an outlook on the future of memory optimization in training large-scale language models, emphasizing the necessity for continued research and innovation in this area. This survey serves as a valuable resource for researchers and practitioners keen on comprehending the challenges and advancements in training large-scale language models with limited GPU memory.
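To make the three memory components named in the abstract concrete, the following is a minimal back-of-the-envelope sketch (not taken from the paper) that estimates the per-GPU footprint of mixed-precision Adam training. The byte counts per parameter (2-byte fp16 weights and gradients plus 4-byte fp32 master weights, momentum, and variance, about 16 bytes in total) follow the commonly cited accounting for this setup; the `estimate_training_memory_gib` helper and the activation term are illustrative assumptions, since activation memory depends heavily on the model architecture, sequence length, and batch size.

```python
# Back-of-the-envelope GPU memory estimate for mixed-precision Adam training.
# Hypothetical helper for illustration only; real footprints also depend on
# framework overheads, memory fragmentation, and temporary buffers.

GiB = 1024 ** 3


def estimate_training_memory_gib(num_params: float, activation_bytes: float = 0.0) -> dict:
    """Rough memory breakdown, assuming no parallelism, sharding, or offloading.

    num_params       -- total number of model parameters
    activation_bytes -- measured or estimated activation footprint in bytes
                        (highly model- and batch-size-dependent)
    """
    params_fp16 = 2 * num_params          # fp16 model parameters
    grads_fp16 = 2 * num_params           # fp16 gradients
    # Adam "model states": fp32 master weights + momentum + variance
    optimizer_fp32 = (4 + 4 + 4) * num_params
    return {
        "parameters (GiB)": params_fp16 / GiB,
        "gradients (GiB)": grads_fp16 / GiB,
        "optimizer states (GiB)": optimizer_fp32 / GiB,
        "activations (GiB)": activation_bytes / GiB,
        "total (GiB)": (params_fp16 + grads_fp16 + optimizer_fp32
                        + activation_bytes) / GiB,
    }


if __name__ == "__main__":
    # Example: a 175-billion-parameter model needs on the order of 2.8 TB for
    # parameters, gradients, and optimizer states alone -- far beyond any
    # single GPU's memory, before activations are even counted.
    for name, gib in estimate_training_memory_gib(175e9).items():
        print(f"{name}: {gib:,.1f}")
```

Even this crude estimate shows why the survey treats parameters, model states, and activations separately: the optimizer states dominate the static footprint, while activations scale with the batch size and are the natural target for recomputation and compression techniques.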