A perceptual and predictive batch-processing memory scheduling strategy for a CPU-GPU heterogeneous system

Juan Fang; Sheng Lin; Huijing Yang; Yixiang Xu; Xing Su

doi:10.1631/FITEE.2200449

Regular article | Updated: 2023-07-24

- Frontiers of Information Technology & Electronic Engineering, 2023, Vol. 24, No. 7, pp. 994-1006
- Affiliations：
  
  Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
- Author bio：
  
  ‡Corresponding author
  lins@emails.bjut.edu.cn
  yangkx@emails.bjut.edu.cn
  xuyx@emails.bjut.edu.cn
  suxing@bjut.edu.cn
- Funds：
  
  National Natural Science Foundation of China (62276011; 61202076); Natural Science Foundation of Beijing, China (4192007)
- DOI: 10.1631/FITEE.2200449
  CLC number: TP391.9
- Print publication date: 2023-07-0,
  Received: 2022-10-11,
  Accepted: 2023-01-04

Juan Fang, Sheng Lin, Huijing Yang, et al. A perceptual and predictive batch-processing memory scheduling strategy for a CPU-GPU heterogeneous system [J]. Frontiers of Information Technology & Electronic Engineering, 2023, 24(7): 994-1006. DOI: 10.1631/FITEE.2200449.

Abstract

When multiple central processing unit (CPU) cores and integrated graphics processing units (GPUs) share off-chip main memory, CPU and GPU applications compete for this critical memory resource, and the resulting contention degrades overall system performance. We describe the competition for shared-memory resources in a CPU-GPU heterogeneous multi-core architecture and propose a shared-memory request scheduling strategy based on perceptual and predictive batch processing. By sensing the CPU and GPU memory request conditions in the request buffer, the proposed scheduling strategy estimates the GPU latency tolerance and reduces mutual interference between the CPU and GPU by processing CPU or GPU memory requests in batches. According to the simulation results, the scheduling strategy improves CPU performance by 8.53% and reduces mutual interference by 10.38% with low hardware complexity.
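The batching idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how GPU latency tolerance is estimated, so the warp-based proxy `estimate_gpu_tolerance`, the `threshold` and `batch_size` parameters, and the `Request` type are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    source: str  # "CPU" or "GPU"
    addr: int


def estimate_gpu_tolerance(ready_warps: int, total_warps: int) -> float:
    """Hypothetical proxy: share of GPU warps that can still issue work."""
    return ready_warps / total_warps if total_warps else 0.0


def next_batch(buffer, ready_warps, total_warps, batch_size=4, threshold=0.5):
    """Choose one source and drain up to batch_size of its requests.

    Assumed policy (not taken from the paper): serve CPU requests while
    the GPU looks latency tolerant, otherwise serve GPU requests.
    """
    has_cpu = any(r.source == "CPU" for r in buffer)
    has_gpu = any(r.source == "GPU" for r in buffer)
    if not (has_cpu or has_gpu):
        return []

    tolerant = estimate_gpu_tolerance(ready_warps, total_warps) >= threshold
    chosen = "CPU" if has_cpu and (tolerant or not has_gpu) else "GPU"

    # Split the buffer into the batch and the remainder, keeping arrival order.
    batch, rest = [], []
    for req in buffer:
        if req.source == chosen and len(batch) < batch_size:
            batch.append(req)
        else:
            rest.append(req)
    buffer[:] = rest
    return batch


if __name__ == "__main__":
    buf = [Request("GPU", 0), Request("CPU", 1), Request("GPU", 2), Request("CPU", 3)]
    print(next_batch(buf, ready_warps=6, total_warps=8, batch_size=2))
    print(next_batch(buf, ready_warps=1, total_warps=8))
```

Serving requests from one source at a time keeps CPU and GPU accesses from interleaving within a batch, which is the kind of mutual interference the strategy aims to reduce. A real memory controller would also account for DRAM bank and row-buffer state, which this sketch leaves out.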

Keywords

CPU-GPU heterogeneous; Multi-core; Unified memory; Access scheduling
