
Optimizing Deep Learning Model Inference using Efficient Model Partitioning on Edge Devices

Author
원준호
Alternative Author(s)
WOHN JUN HO
Advisor(s)
서지원
Issue Date
2024. 2
Publisher
한양대학교 대학원 (Graduate School, Hanyang University)
Degree
Master
Abstract
As deep learning models advance, so does interest in applying them in everyday life. Edge devices, used in a wide range of applications such as autonomous vehicles, smart home appliances, and the Internet of Things (IoT), connect to a network and serve as entry points that either send data elsewhere for computation or operate without a network connection. Because edge devices mainly process real-time data, the inference speed of deep learning models is one of the most important factors.

Existing approaches fall into two categories: the on-device method, which optimizes the model for the edge device and performs all computation locally, and the cloud method, in which the edge device collects data, transmits it over the network to a computation server that performs the actual computation, and receives the result back. However, the on-device method often suffers from the limited computational performance of edge devices, while the cloud method suffers from server overload and privacy concerns. To compensate, collaborative inference has been proposed, which partitions the deep learning model and performs computation on both the server and the edge device. As deep learning models have grown increasingly complex, however, partitioning them by hand has become very difficult.

This paper introduces a framework that compiles deep learning models with TVM, a deep learning compiler, optimizes various models, converts them into Intermediate Representation (IR) graphs, and partitions the models automatically by analyzing the connectivity of those graphs. In experiments that distributed the partitioned model across an NVIDIA Jetson Xavier NX board as the edge device and a GTX 1060 desktop as the server, partitioned execution of ResNet101 and ResNet152 was faster than edge-only execution at partition ratios of 40%~60% and 10%~33%, respectively, and there were partition points at which server overload could also be reduced. In addition, to reduce the data-transfer overhead, the dominant bottleneck in collaborative inference between server and edge, the framework automatically identifies the points in the model where the intermediate data size shrinks, which can serve as an additional criterion for partitioning. Using this data-minimization point together with the automatic partitioning method, the amount of data transmitted and received was reduced by up to 33% for the basic partition and up to 82% for the quantized partition on U-Net, ResNet50, ResNet101, and ResNet152.
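The data-minimization criterion described in the abstract can be sketched with TVM's Python API. The snippet below is an illustrative sketch only, not the thesis' actual framework: it imports a traced torchvision ResNet-50 into Relay IR, infers tensor types, and ranks operator outputs by their byte size, i.e., candidate split points where the edge-to-server transfer would be smallest. The model choice, input shape, and the helper name find_min_transfer_points are assumptions made for illustration.

    import numpy as np
    import torch
    import torchvision
    import tvm
    from tvm import relay


    def tensor_bytes(ttype):
        # Byte size of a Relay TensorType; assumes static shapes.
        shape = [int(dim) for dim in ttype.shape]
        return int(np.prod(shape)) * np.dtype(ttype.dtype).itemsize


    def find_min_transfer_points(mod, top_k=5):
        # Walk the type-annotated Relay IR and rank operator outputs by size.
        # Small outputs are candidate partition points, since they minimize the
        # intermediate data sent from the edge device to the server.
        mod = relay.transform.InferType()(mod)
        candidates = []

        def visit(node):
            if isinstance(node, relay.Call) and isinstance(node.checked_type, relay.TensorType):
                op_name = node.op.name if isinstance(node.op, tvm.ir.Op) else str(node.op)
                candidates.append((tensor_bytes(node.checked_type), op_name))

        relay.analysis.post_order_visit(mod["main"], visit)
        return sorted(candidates)[:top_k]


    # Example: import a traced torchvision ResNet-50 into Relay (illustrative model).
    model = torchvision.models.resnet50(weights=None).eval()
    example = torch.randn(1, 3, 224, 224)
    scripted = torch.jit.trace(model, example)
    mod, params = relay.frontend.from_pytorch(scripted, [("input0", list(example.shape))])

    for size, op_name in find_min_transfer_points(mod):
        print(f"{size:>10} bytes  after {op_name}")

A complete partitioner would additionally weigh each candidate against the compute split it implies for the edge device and the server and, as in the quantized partition reported above, could compress the intermediate activation (for example, float32 to uint8) before transmission.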
URI
http://hanyang.dcollection.net/common/orgView/200000726139
https://repository.hanyang.ac.kr/handle/20.500.11754/188388
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE(컴퓨터·소프트웨어학과) > Theses (Master)
Files in This Item:
There are no files associated with this item.