OpenCL 기반 FPGA 연산장치활용 및 성능향상 기법

Title
OpenCL 기반 FPGA 연산장치활용 및 성능향상 기법
Other Titles
Effective Compute Unit Utilization Technique for FPGA using OpenCL
Author
김지찬
Alternative Author(s)
Kim, Ji Chan
Advisor(s)
정기석
Issue Date
2016-02
Publisher
Hanyang University
Degree
Master
Abstract
Today, high-performance heterogeneous platforms are evolving to include not only conventional heterogeneous multicore processors such as the central processing unit (CPU) and the graphics processing unit (GPU), but also reconfigurable hardware such as the field programmable gate array (FPGA). The Open Computing Language (OpenCL) was proposed as a standard computing framework for using these heterogeneous processors efficiently. In this thesis, an FPGA is defined as the target accelerator, and signal processing algorithms are parallelized using the OpenCL framework. Through the C-based OpenCL C language, OpenCL provides a standard environment in which software programmers can use an FPGA as an accelerator. Software developers can design hardware without learning or mastering a hardware description language (HDL). When hardware designed in OpenCL C is ported to an FPGA, the high design flexibility of the FPGA can be exploited to respond to changes in market trends and development direction. This thesis proposes techniques for fully utilizing FPGA hardware resources by applying high-level synthesis (HLS) with OpenCL, and verifies the resulting performance improvement through experiments. A Low Density Parity Check (LDPC) decoder and a convolution filter were designed for the FPGA. The proposed methods are a technique that applies the work-group size dynamically during kernel execution and, to improve the filtering performance of the convolution filter, pipelining, a local memory allocation method, and loop unrolling. As a result, LDPC decoding performance increased by up to 25%, the execution time of the convolution filter improved by a factor of 12, and performance per unit of power consumption, the most important metric in embedded environments, improved by up to 31 times. The experimental results show that the proposed techniques contribute to improving the performance of applications designed for FPGAs using OpenCL.
Today, modern heterogeneous platforms often employ a field programmable gate array (FPGA) device in addition to a central processing unit (CPU) and a graphics processing unit (GPU). To fully utilize these heterogeneous processors, the Open Computing Language (OpenCL) has been developed. In this thesis, a set of effective compute unit utilization techniques for FPGAs is proposed. OpenCL provides a programmer-friendly solution for software programmers, making it easy to design hardware without having to learn a hardware description language (HDL). By directly synthesizing a circuit for an FPGA from the OpenCL C language, programmers can implement hardware easily. In this thesis, optimized kernel pipelines are implemented using an OpenCL framework for FPGAs, and the proposed methods are applied. The test designs used in the experiments are a Low Density Parity Check (LDPC) decoder and a convolution filter. The proposed optimization methods are a dynamic work-group size method, pipelining, local memory allocation, and loop unrolling. The proposed OpenCL implementation of an image filter is optimized for an FPGA device by utilizing effective data transfers. To improve performance, an effective local memory allocation scheme is proposed for a convolution kernel, and a loop-unrolling method is applied to increase the local memory allocation efficiency. With the proposed methods, LDPC decoding performance is increased by up to 25% and the average local memory access latency is improved significantly. In addition, the proposed filter kernel shows better performance per watt than a functionally equivalent GPU implementation by efficiently utilizing hardware resources on the target FPGA.
URI
https://repository.hanyang.ac.kr/handle/20.500.11754/126367
http://hanyang.dcollection.net/common/orgView/200000428114
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > ELECTRONICS AND COMPUTER ENGINEERING(전자컴퓨터통신공학과) > Theses (Master)
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.