
Compiler-based GPGPU Application Optimization Considering Supervised Learning and GPU Resource Utilization

Title
Compiler-based GPGPU Application Optimization Considering Supervised Learning and GPU Resource Utilization
Other Titles
지도 학습 기법 및 GPU 자원 활용도를 고려한 컴파일러 기반 GPGPU 응용 최적화
Author
유용승
Alternative Author(s)
Yongseung Yu
Advisor(s)
Yongjun Park
Issue Date
2024. 2
Publisher
Graduate School of Hanyang University (한양대학교 대학원)
Degree
Doctor
Abstract
Graphics Processing Units (GPUs) have emerged as highly effective accelerators for massively data-parallel applications. To achieve optimal performance in GPGPU (General-Purpose GPU) computing environments, GPU compilers must balance the usage of multiple resources. A GPU is composed of multiple core clusters known as Streaming Multiprocessors (SMs), together with global memory resources, including an L2 cache and global memory, that are shared by all core clusters. Maximizing GPU performance therefore requires considering both intra-core (intra-SM) and inter-core resource utilization. GPU compilers generally allocate the maximum intra- and inter-core resources in pursuit of the best GPGPU performance; however, they frequently fail to balance these resources because of excessive or inefficient allocation for the target environment. To achieve the best GPGPU performance, it is therefore important to balance GPU resources optimally for the target application and GPU.

In this thesis, we introduce two optimization systems that achieve better GPGPU performance through compiler-based intra- and inter-core resource balancing.

First, we introduce the CTA-limiter system, which balances intra-core GPU resources through code instrumentation. Modern GPUs are highly successful accelerators thanks to the exceptional performance gains afforded by the CUDA and OpenCL programming models. For optimal performance, programmers typically aim to maximize the number of thread blocks in their target programs, and GPUs likewise strive to allocate the maximum number of thread blocks to their cores. However, numerous recent studies have shown that assigning the maximum number of thread blocks to GPU cores does not always yield optimal performance, which makes determining the appropriate number of thread blocks for each GPU core a significant challenge. Despite these studies, most existing architectural techniques cannot be implemented directly on current GPU hardware. We address this issue with the CTA-limiter, a real-time system that adjusts the number of thread blocks running concurrently on each GPU core. The CTA-limiter dynamically decreases the number of concurrent thread blocks in targeted CUDA workloads by adding extra shared memory allocations, and compares the execution time of each configuration with the previous one to automatically detect the ideal number of co-running thread blocks per GPU core.
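To make the mechanism concrete, the following CUDA sketch (an illustrative example with assumed names such as vec_add and padding_bytes, not the thesis's actual instrumentation code) pads a kernel launch with unused dynamic shared memory so that fewer thread blocks fit on each SM, sweeping several candidate limits and reporting the resulting occupancy; a real tuner would time each configuration and keep the fastest.

    #include <algorithm>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Example kernel; the extra dynamic shared memory is never used,
    // only its size matters for limiting how many blocks fit on an SM.
    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        extern __shared__ char pad[];
        (void)pad;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // Shared-memory padding (bytes per block) so that at most `ctasPerSM`
    // blocks fit on one SM, capped by the per-block shared memory limit.
    static size_t padding_bytes(const cudaDeviceProp& prop, int ctasPerSM) {
        return std::min(prop.sharedMemPerMultiprocessor / ctasPerSM,
                        prop.sharedMemPerBlock);
    }

    int main() {
        const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
        float *a, *b, *c;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&c, n * sizeof(float));

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Sweep candidate CTA limits; timing of each run is omitted for brevity.
        for (int limit = 1; limit <= 8; ++limit) {
            size_t smem = padding_bytes(prop, limit);
            int active = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&active, vec_add,
                                                          threads, smem);
            vec_add<<<blocks, threads, smem>>>(a, b, c, n);
            cudaDeviceSynchronize();
            printf("requested limit %d -> %d active blocks/SM\n", limit, active);
        }
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Because the hardware schedules only as many blocks on an SM as its shared-memory budget allows, enlarging the per-block allocation is a portable way to cap concurrent CTAs without any architectural modification.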
Second, we propose the CUTLASS-tailor system, a supervised-learning-based method that optimizes tiling-based GEMM to balance both inter-core and intra-core GPU resources. GEneral Matrix Multiplication (GEMM) is a pivotal computation kernel in deep neural networks. CUTLASS is an open-source, tiling-based linear algebra template library for the CUDA environment that provides highly optimized tiling-based GEMM implementations. However, achieving optimal performance with CUTLASS GEMM is difficult unless the tiling configuration is carefully selected, because performance varies with several factors, including tile size and shape as well as the targeted GPU architecture. Identifying the most effective tile parameters is therefore a challenging problem in attaining optimal performance for a tiling-based GEMM. To address this issue, we propose CUTLASS-tailor, an end-to-end framework that employs a neural network model to predict the optimal tile parameters for the target CUTLASS GEMM operation and the underlying GPU. We trained the prediction model on a synthetic dataset covering input matrix combinations of various sizes and structures. Furthermore, to create a universal model that covers different GPUs, we incorporated the number of GPU cores and the shared memory size as GPU hardware features in the input of the CUTLASS-tailor network.

With the CTA-limiter system, we attained significant performance enhancements, with average improvements of 30%, 40%, and 44% on a GTX 960, GTX 1050, and GTX 1080 Ti, respectively. With the CUTLASS-tailor system, we achieved up to a 1.94× GEMM performance improvement over cuBLAS on an NVIDIA Titan Xp GPU and a mean model-inference speedup of up to 1.21× over the baseline on an NVIDIA RTX 3090 GPU. With these significant performance improvements, both systems have demonstrated their practicality as solutions for achieving high-quality balancing of both intra- and inter-core GPU resources.
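For readers unfamiliar with CUTLASS, the sketch below illustrates what a tile parameter is in this context. It is an illustrative CUTLASS 2.x GEMM instantiation in which the FP32 SIMT data types and the 128x128x8 threadblock / 32x64x8 warp tile shapes are example choices, not configurations selected by CUTLASS-tailor; each such instantiation fixes the tiling at compile time, and CUTLASS-tailor's task is to predict which instantiation to use for a given problem and GPU.

    #include <cutlass/gemm/device/gemm.h>

    // One candidate tiling: 128x128x8 threadblock tiles split into 32x64x8 warp tiles.
    using GemmTile128x128x8 = cutlass::gemm::device::Gemm<
        float, cutlass::layout::RowMajor,       // A: element type and layout
        float, cutlass::layout::RowMajor,       // B
        float, cutlass::layout::RowMajor,       // C
        float,                                  // accumulator type
        cutlass::arch::OpClassSimt,             // run on CUDA cores (FP32 SIMT)
        cutlass::arch::Sm70,                    // target architecture tag
        cutlass::gemm::GemmShape<128, 128, 8>,  // threadblock tile (M, N, K)
        cutlass::gemm::GemmShape<32, 64, 8>,    // warp tile
        cutlass::gemm::GemmShape<1, 1, 1>>;     // instruction shape (SIMT)

    // Launch C = alpha * A * B + beta * C with the tiling fixed above.
    cutlass::Status run_gemm(int M, int N, int K,
                             const float* A, const float* B, float* C,
                             float alpha = 1.0f, float beta = 0.0f) {
        GemmTile128x128x8 gemm_op;
        GemmTile128x128x8::Arguments args(
            {M, N, K},
            {A, K},          // A and its leading dimension (row-major)
            {B, N},
            {C, N},
            {C, N},          // D (output) aliases C here
            {alpha, beta});  // epilogue: linear combination
        return gemm_op(args);
    }

Each tiling trades intra-core resources (shared memory, registers, and occupancy per SM) against inter-core parallelism (how many threadblock tiles the problem yields across SMs), which is why the best choice shifts with the matrix shapes and the target GPU.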
URI
http://hanyang.dcollection.net/common/orgView/200000730627
https://repository.hanyang.ac.kr/handle/20.500.11754/188391
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE(컴퓨터·소프트웨어학과) > Theses (Ph.D.)
Files in This Item:
There are no files associated with this item.