
Effective Approaches to Distributed Deep Learning: Methods, Analyses, and Evaluation

Issue Date: 2021. 8
The recent success of deep learning in various fields is underpinned by large models and datasets. Although advances in computational accelerators such as GPUs have dramatically increased training speed, training a large model on a large amount of data still requires much time and massive computing power. To speed up training, distributed training with data parallelism has been proposed and widely applied. In data-parallel distributed training, the entire training dataset is split into partitions, each of which is stored at a worker, while all workers hold a model with the same architecture. At each iteration, a worker node trains its local model on its partition of the training data, and then the training results (gradients or parameters) from all workers are aggregated through network communication. Existing work on improving the performance of distributed training falls into two categories: (1) efficiently managing the intra-/inter-node overhead that arises during distributed training, and (2) extending the scale of distributed training itself. While the studies in both directions are generally effective in accelerating distributed training, it is critical to pay close attention to model accuracy, since model accuracy and these performance improvements are in a trade-off relationship. Motivated by this, in this dissertation we aim to improve the performance of distributed deep learning while maintaining model accuracy. Toward this goal, we first conduct an in-depth analysis of existing distributed training algorithms to understand their key ideas, strengths, and weaknesses. We evaluate seven well-known distributed training algorithms in terms of various aspects, such as model accuracy, scalability with an increasing number of workers, hyperparameter sensitivity, and the effects of other optimization techniques.
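The data-parallel step described above (local gradient computation followed by aggregation and an identical update at every worker) can be sketched as follows. This is a minimal in-process simulation with hypothetical function names, using gradient averaging in place of a real network all-reduce; it is not the dissertation's implementation:

```python
import numpy as np

def data_parallel_step(w, data_shards, grad_fn, lr=0.1):
    """One synchronous data-parallel step: each worker computes a
    gradient on its local data shard, the gradients are averaged
    (standing in for an all-reduce over the network), and every
    replica applies the same update so all models stay identical."""
    grads = [grad_fn(w, shard) for shard in data_shards]  # local compute
    avg_grad = np.mean(grads, axis=0)                     # aggregation
    return w - lr * avg_grad                              # identical update

def grad_fn(w, shard):
    """Toy least-squares gradient on one shard (X, y)."""
    X, y = shard
    return 2 * X.T @ (X @ w - y) / len(y)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so one synchronous data-parallel step matches one single-machine step on the whole dataset.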
Through comprehensive evaluation and analysis, we provide several interesting findings that can be useful in both industry and academia. Second, from the perspective of direction (1), we identify the deficiencies of existing distributed training algorithms and propose a novel centralized training algorithm, named ALADDIN, that successfully resolves the problems of existing work. We also provide a theoretical analysis of the convergence of ALADDIN, which is comparable to those of existing state-of-the-art work. Through comprehensive evaluation with popular DNN models and datasets, we demonstrate that ALADDIN outperforms state-of-the-art algorithms in terms of convergence rate, scalability, and robustness to heterogeneous environments. Lastly, from the perspective of direction (2), we analyze and identify the limitations of existing learning rate (LR) scaling techniques in large-batch training. Based on this understanding, we propose a novel layer-wise strategy for LR scaling, named LEARNER. We also identify a problem with heuristic warm-up methods and propose a new layer-wise warm-up method to address it. Via extensive evaluation with very large batch sizes, we demonstrate that LEARNER can successfully train models with a much larger batch size than the state-of-the-art LR scaling method.
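As background for layer-wise LR scaling, the core idea of prior layer-wise schemes such as LARS can be sketched as follows: each layer's step is rescaled by the ratio of its weight norm to its gradient norm, so that update magnitudes stay proportional to weight magnitudes in very large-batch training. This is a generic illustration with hypothetical names, not the proposed LEARNER method:

```python
import numpy as np

def layerwise_scaled_update(params, grads, base_lr, eps=1e-8):
    """LARS-style layer-wise LR scaling: for each layer, compute a
    trust ratio ||w|| / ||g|| and use it to rescale the base learning
    rate, so layers with small weights relative to their gradients
    take proportionally smaller steps."""
    new_params = []
    for w, g in zip(params, grads):
        trust = np.linalg.norm(w) / (np.linalg.norm(g) + eps)
        new_params.append(w - base_lr * trust * g)
    return new_params
```

Note that a single global LR applies the same step scale to every layer; the layer-wise trust ratio is what lets large-batch training tolerate the large global LR that naive LR scaling rules would otherwise require.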
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE(컴퓨터·소프트웨어학과) > Theses (Ph.D.)
Files in This Item:
There are no files associated with this item.

