National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
‡ Corresponding author
Received: 17 July 2024
Revised: 23 February 2025
Published online: 02 April 2025
Published: May 2025
Peng LIANG, Linbo QIAO, Yanqi SHI, et al. Memory-efficient tensor parallelism for long-sequence Transformer training[J]. Frontiers of information technology & electronic engineering, 2025, 26(5): 770-787. DOI: 10.1631/FITEE.2400602.
Transformer-based models such as large language models (LLMs) have attracted significant attention in recent years due to their superior performance. Long input sequences are essential for industrial LLMs to provide better user services. However, memory consumption grows quadratically with sequence length, posing challenges for scaling up long-sequence training. Current parallelism methods produce duplicated tensors during execution, leaving room for improving memory efficiency. Additionally, tensor parallelism (TP) cannot achieve effective overlap between computation and communication. To address these weaknesses, we propose a general parallelism method called memory-efficient tensor parallelism (METP), designed for the computation of two consecutive matrix multiplications and a possible function between them (O = f(AB)C), which is the kernel computation component in Transformer training. METP distributes the subtasks of computing O to multiple devices and uses send/recv instead of collective communication to exchange the submatrices needed to finish the computation, avoiding the production of duplicated tensors. We also apply the double buffering technique to achieve better overlap between computation and communication, and present the theoretical condition for full overlap to help guide long-sequence Transformer training. Suppose the parallel degree is p; through theoretical analysis, we prove that METP incurs O(1/p³) memory overhead when attention is computed without FlashAttention, and saves at least 41.7% memory compared with TP when FlashAttention is used to compute multi-head self-attention. Our experimental results demonstrate that METP can increase the sequence length by 2.38–2.99 times compared with other methods on eight A100 graphics processing units (GPUs).
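The following is a minimal, illustrative sketch of the kind of block-wise send/recv schedule described above; it is not the authors' METP implementation. It assumes a torch.distributed process group is already initialized, that f is applied elementwise (e.g., GELU; the softmax used in attention needs additional online rescaling that is omitted here), that A is split by rows, B by columns, and C by matching rows into equal-sized blocks across p ranks, and that the (B, C) blocks circulate around a ring so each rank i accumulates its output row block as O_i = Σ_j f(A_i B_j) C_j. The helper name metp_like_fabc is ours, purely for illustration.

# Illustrative sketch only (our assumptions, not the paper's code): block-wise
# computation of O = f(AB)C with point-to-point send/recv and double buffering.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def metp_like_fabc(a_block, b_block, c_block, f=F.gelu):
    """Each rank holds A_i (a row block of A), B_j (a column block of B), and
    C_j (the matching row block of C); returns this rank's row block of f(AB)C."""
    rank, world = dist.get_rank(), dist.get_world_size()
    nxt, prv = (rank + 1) % world, (rank - 1) % world

    # Two sets of communication buffers enable double buffering.
    cur_b, cur_c = b_block.clone(), c_block.clone()
    buf_b, buf_c = torch.empty_like(b_block), torch.empty_like(c_block)

    o_block = torch.zeros(a_block.shape[0], c_block.shape[1],
                          device=a_block.device, dtype=a_block.dtype)
    for step in range(world):
        reqs = []
        if step + 1 < world:
            # Post asynchronous send/recv of the next (B_j, C_j) pair so the
            # ring exchange overlaps with the local computation below.
            reqs += [dist.isend(cur_b, nxt), dist.isend(cur_c, nxt)]
            reqs += [dist.irecv(buf_b, prv), dist.irecv(buf_c, prv)]

        # Local subtask: one block term f(A_i B_j) C_j, accumulated into O_i.
        o_block += f(a_block @ cur_b) @ cur_c

        for r in reqs:
            r.wait()
        if step + 1 < world:
            cur_b, buf_b = buf_b, cur_b  # swap buffers for the next step
            cur_c, buf_c = buf_c, cur_c
    return o_block

Double buffering here means the isend/irecv of the next submatrices is posted before the local matrix multiplications, so communication can proceed while the current block is processed; the two buffer sets are simply swapped at the end of each step, and only small submatrices, rather than full duplicated tensors, are ever resident on a device.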