
Analysis of High-Dimensional Data with Outliers Using the EWKO Algorithm

Title
Analysis of High-Dimensional Data with Outliers Using the EWKO Algorithm
Other Titles
EWKO Algorithm for High-Dimensional Data with Outlier
Author
김원희
Alternative Author(s)
Kim, Won Hee
Advisor(s)
차경준
Issue Date
2020-02
Publisher
한양대학교
Degree
Master
Abstract
In this paper, we propose the EWKO (Entropy Weighted K-means with Outlier) algorithm for clustering high-dimensional data. Analyzing high-dimensional data raises two main problems: the presence of noise variables and the presence of outliers. The EWKO algorithm addresses both by applying weights to the variables and to the observations for each cluster.
Among the K-Means algorithms that apply variable weights, we use the EWKM (Entropy Weighting K-Means) algorithm, so that the variable weights differ for each cluster. The observation weights are based on the LOF (Local Outlier Factor), as applied in the WRSK (Weighted Robust and Sparse K-Means) algorithm. LOF is an indicator of how isolated an observation is from its surrounding observations. Because the EWKO algorithm applies different variable weights for each cluster, the impact of noise variables is reduced and the impact of important variables is increased within each cluster. In addition, the observation weights make the clustering less sensitive to outliers and help prevent unbalanced clustering. Simulations showed that the weights were correctly applied to the observations. An observation whose weight is less than 0.5 is classified as an outlier; on this basis, F1-Score and MCC (Matthews Correlation Coefficient) confirmed that the EWKO algorithm is also suitable for anomaly detection. In addition, two real datasets were analyzed. Performance was compared using Entropy and F-Measure, and the results show that the EWKO algorithm performs better than the K-Means, EWKM, and WRSK algorithms.
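The outlier rule and the two detection metrics named in the abstract can be illustrated with a minimal sketch. It only shows how observations with weight below 0.5 would be flagged and how F1-Score and MCC follow from the resulting confusion counts. The function names, the example weights, and the ground-truth labels are hypothetical and are not taken from the thesis; how EWKO actually computes the observation weights is not shown here.

```python
import math

def flag_outliers(weights, threshold=0.5):
    """Flag an observation as an outlier when its weight is below the threshold."""
    return [w < threshold for w in weights]

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating 'outlier' as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical observation weights and true outlier labels (illustration only).
example_weights = [0.9, 0.8, 0.2, 0.7, 0.4, 0.95, 0.3, 0.85]
example_truth = [False, False, True, False, False, True, True, False]

predicted = flag_outliers(example_weights)
tp, tn, fp, fn = confusion_counts(example_truth, predicted)
print("F1 :", round(f1_score(tp, fp, fn), 3))
print("MCC:", round(mcc(tp, tn, fp, fn), 3))
```

MCC uses all four confusion counts, so it is a useful companion to F1 when outliers are a small fraction of the observations.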
URI
https://repository.hanyang.ac.kr/handle/20.500.11754/123625
http://hanyang.dcollection.net/common/orgView/200000436859
Appears in Collections:
GRADUATE SCHOOL[S] > APPLIED STATISTICS > Theses (Master)
Files in This Item:
There are no files associated with this item.
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.