FastCheck: fast checkpointing and recovery for DNN training via parallel transmission and compression
Regular Papers | Updated: 2026-02-11
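FastCheck: a method for fast checkpoint saving and recovery in deep neural network training based on parallel transmission and customized compression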
“Training large-scale deep neural networks (DNNs) is prone to software and hardware failures, with critical failures often requiring full-machine reboots that substantially prolong training. In this paper, we propose FastCheck, a checkpoint–recovery framework that accelerates checkpointing and recovery through parallel transmission and tailored compression.”
Yun TENG, Dawei SUN, Shipeng HU, et al. FastCheck: fast checkpointing and recovery for DNN training via parallel transmission and compression[J]. ENGINEERING Information Technology & Electronic Engineering, 2026, 27(2): 1-13. DOI: 10.1631/ENG.ITEE.2025.0034.