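Scalability and efficiency challenges for the exascale supercomputing system: practice of a parallel supporting environment on the Sunway exascale prototype system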

何晓斌; 陈鑫; 郭恒; 刘鑫; 陈德训; 杨雨灵; 高洁; 冯赟龙; 陈龙得; 刁晓娜; 陈左宁

doi:10.1631/FITEE.2200412

Regular Article | Updated: 2023-01-28

- Scalability and efficiency challenges for the exascale supercomputing system: practice of a parallel supporting environment on the Sunway exascale prototype system
- Frontiers of Information Technology & Electronic Engineering, 2023, Vol. 24, No. 1, pp. 41-58
- Affiliations:
  National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
- Author bio:
  Xiaobin HE received his BE degree from the Harbin Institute of Technology, Harbin, China, in 2006, and his MS degree from Shanghai Jiao Tong University, Shanghai, China, in 2009. He is currently an associate researcher at the National Research Center of Parallel Computer Engineering and Technology, Beijing, China. His main research interests include high-performance computing and distributed storage systems.
  Xin CHEN received his BE degree from the National Digital Switching System Engineering & Technological Research Center (NDSC), Zhengzhou, China, in 2016, and his MS degree from NDSC in 2018. He is a research assistant at the National Research Center of Parallel Computer Engineering and Technology, Beijing, China. His research activities focus on high-performance parallel computation and applications.
  Xin LIU received her PhD degree from PLA Information Engineering University, Zhengzhou, China, in 2006. She is currently a research fellow at the National Research Center of Parallel Computer Engineering and Technology, Beijing, China. She is a designer of the scientific and engineering application platform of the Sunway TaihuLight System, responsible for large-scale parallel algorithm research and application software development. Her research interests include parallel algorithms and parallel application software.
  Dexun CHEN received his PhD degree from Tsinghua University, Beijing, China, in 2021. He is currently a research fellow at the National Research Center of Parallel Computer Engineering and Technology, Beijing, China. His research interests include high-performance computing and parallel application software.
- Funds:
  Key R&D Program of Zhejiang Province, China (2022C01250); National Key R&D Program of China (2019YFA0709402)
- DOI: 10.1631/FITEE.2200412
  CLC number: TP302
- Print publication date: 2023-01-0, Received: 2022-09-25, Accepted: 2022-11-29

Xiaobin HE, Xin CHEN, Heng GUO, et al. Scalability and efficiency challenges for the exascale supercomputing system: practice of a parallel supporting environment on the Sunway exascale prototype system [J]. Frontiers of Information Technology & Electronic Engineering, 2023, 24(1): 41-58. DOI: 10.1631/FITEE.2200412.

Abstract

With the continuous improvement of supercomputer performance and the integration of artificial intelligence with traditional scientific computing, the parallel scale of applications is gradually increasing, from millions to tens of millions of computing cores, which poses great challenges for achieving high scalability and efficiency of parallel applications on super-large-scale systems. Taking the Sunway exascale prototype system as an example, in this paper we first analyze the challenges of high scalability and high efficiency for parallel applications in the exascale era. To overcome these challenges, the optimization technologies used in the parallel supporting environment software on the Sunway exascale prototype system are highlighted, including the parallel operating system, input/output (I/O) optimization technology, ultra-large-scale parallel debugging technology, 10-million-core parallel algorithms, and mixed-precision methods. The parallel operating system and I/O optimization technology mainly support large-scale system scaling, while the ultra-large-scale parallel debugging technology, 10-million-core parallel algorithms, and mixed-precision methods mainly enhance the efficiency of large-scale applications. Finally, the important results achieved by applications running on the Sunway exascale prototype system are introduced, verifying the effectiveness of the parallel supporting environment design.

Keywords

Parallel computing; Sunway; Ultra-large-scale; Supercomputer
