In the field of data mining, the decision tree, one of the principal classification techniques, has been studied and applied across a wide range of domains. Prior research has focused on improving the classification performance of these algorithms through simplification and through pruning, the removal of unnecessary branches.
In this study, we focus on the objects to be classified, each of which carries a class label, and propose a new algorithm that uses as its attribute-selection criterion the four possible cases that can arise when two objects are compared, so as to separate objects of different classes better than existing methods. Building on rough set theory, we propose three decision tree construction methods: EDT, a rough-set-based decision tree algorithm that considers the attribute values between objects; FDT, a rough-set-based decision tree algorithm for data with a target class; and IDT, a decision tree algorithm for decision systems with time-incremental entities.

Decision trees are often used in data mining and classification systems because they are easily interpreted, accurate, and fast. One important step in constructing a decision tree is the selection of node attributes so that the tree has a minimum number of branches. Even though the problem of constructing the smallest optimal decision tree for classification purposes is believed to be computationally complex, there has been no formal proof of this complexity, as far as we know. Since no computationally efficient classification algorithm for the optimal decision tree exists, previous studies have focused on finding heuristic classification algorithms, without evaluating their classification ability or the quality of knowledge generated from the resulting decision trees. In this regard, researchers have concentrated on add-on algorithms that simplify or prune the decision tree in an attempt to improve its classification performance.
In this paper, we present a new measure for selecting attributes as nodes of a decision tree at each level of expansion. We use a discernibility matrix to find the core attribute, and compare objects to select the attributes that contribute most to the classification.
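The discernibility matrix and core mentioned above are standard rough set constructions; a minimal sketch follows, using a toy decision table that is illustrative only and not taken from the paper:

```python
# Rough-set sketch: discernibility matrix and core attribute.
# Each row of the toy decision table: (condition attribute values, class).
table = [
    ({"a": 1, "b": 0, "c": 1}, "yes"),
    ({"a": 1, "b": 1, "c": 0}, "no"),
    ({"a": 0, "b": 0, "c": 1}, "no"),
]

def discernibility_matrix(table):
    """Entry (i, j): condition attributes whose values differ between
    objects i and j, recorded only when their decision classes differ."""
    n = len(table)
    matrix = {}
    for i in range(n):
        for j in range(i + 1, n):
            (xi, di), (xj, dj) = table[i], table[j]
            if di != dj:
                matrix[(i, j)] = {a for a in xi if xi[a] != xj[a]}
    return matrix

def core(matrix):
    """Attributes appearing alone in some entry are indispensable:
    no other attribute can discern that pair of objects."""
    return {next(iter(s)) for s in matrix.values() if len(s) == 1}

m = discernibility_matrix(table)
print(m)        # {(0, 1): {'b', 'c'}, (0, 2): {'a'}}
print(core(m))  # {'a'}
```

Here objects 0 and 2 differ only on attribute `a`, so `a` is a core attribute: dropping it would make the two differently-classed objects indiscernible.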
We have developed a classification contribution function. Previous studies on decision trees using rough sets relied on other concepts, such as entropy. While our work does use the concept of rough set theory, we also use our new measure, which considers attributes that are distinguishable between objects.
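The paper's exact classification contribution function is not reproduced here; as a hypothetical surrogate that captures the idea of rewarding attributes distinguishable between objects, one can score each attribute by the number of differently-classed object pairs it discerns:

```python
# Hypothetical surrogate for the classification contribution measure
# (the paper's actual formula may differ): score each attribute by how
# many object pairs with different classes it can tell apart.
from itertools import combinations

table = [
    ({"a": 1, "b": 0}, "yes"),
    ({"a": 1, "b": 1}, "no"),
    ({"a": 0, "b": 0}, "yes"),
]

def contribution(table, attr):
    return sum(
        1
        for (xi, di), (xj, dj) in combinations(table, 2)
        if di != dj and xi[attr] != xj[attr]
    )

scores = {a: contribution(table, a) for a in ("a", "b")}
best = max(scores, key=scores.get)  # attribute chosen as the next node
print(scores, best)  # {'a': 1, 'b': 2} b
```

Attribute `b` separates both differently-classed pairs while `a` separates only one, so `b` would be selected as the next tree node under this surrogate.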
We propose three methods for constructing decision trees.
The first is a new classification decision tree algorithm called EDT (entity attributes decision tree). It is based on core attributes and reduct elements, and uses our new measure, the classification contribution function, to select nodes. Experiments on UCI (University of California at Irvine) data showed that, in most cases, EDT is more effective and more accurate than the ID3, C4.5, and Rough set-ID3 algorithms.
For the second method, we build on the EDT algorithm but additionally assume that there is a focus class. Focus class analysis is important for target marketing, analysis of special customer requirements, fraud detection, and cases with unusual patterns. This second algorithm is called FDT (focused class decision tree). Because it exploits the focus class, FDT outperforms EDT. Experiments on UCI data showed that FDT performs better than the ID3, C4.5, and EDT algorithms.
The third method is a new decision tree construction method for incremental entity information systems that we call IDT (incremental entity decision tree). Although real-world databases are updated periodically and continually, traditional methods of selecting decision tree attributes ignore when each entity arrived and so are unfair to new entities. We assume that old entities eventually become useless, while new ones may give rise to implicitly valid patterns or rules. We therefore combine a time weight with the classification contribution function to calculate the relative importance of entities. Experiments on a real-world life insurance customer data set showed that IDT achieves high accuracy and can be used to predict the class of future objects.
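The exact time-weighting scheme is not given here; a minimal sketch, assuming an exponential decay by entity age (the half-life form and its parameter are assumptions, not the paper's actual formula), might look like:

```python
# Hypothetical time weight for IDT: older entities contribute less.
# The exponential half-life form is an illustrative assumption.

def time_weight(age, half_life=10.0):
    """Weight of an entity that is `age` periods old; the weight halves
    every `half_life` periods, so a brand-new entity gets weight 1.0."""
    return 0.5 ** (age / half_life)

# A time-weighted contribution would scale each entity's vote by its weight:
weights = [time_weight(a) for a in (0, 10, 20)]
print(weights)  # [1.0, 0.5, 0.25]
```

Multiplying each entity's contribution count by such a weight lets recent entities dominate attribute selection while old entities fade out gradually rather than being dropped outright.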
These three algorithms are unique in that they use rough set theory and classify entities based on the concept of maximally contributing attributes. The algorithms maintain the inherent advantages of rough set theory.