Text Classification; Class Imbalance; Multi-Class Text Data; Feature Selection
Issue Date
2019-04
Publisher
Korean Institute of Industrial Engineers (대한산업공학회)
Citation
Journal of the Korean Institute of Industrial Engineers (대한산업공학회지), v. 45, no. 2, pp. 93-100
Abstract
A text classification model trained on data whose class distribution is skewed toward a majority class typically assigns most documents to that class in order to maximize overall classification accuracy. This is called the class imbalance problem. This study proposes a feature selection method based on a simplified chi-square statistic that selects features within each class, with the aim of building a model that is robust to this problem. The proposed method is compared with typical feature selection methods on the Reuters-21578 data. The experiments show that the proposed method outperforms typical feature selection methods when used with naïve Bayes and support vector machine classifiers, which are robust to the class imbalance problem.
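The abstract does not give the exact statistic, so the following is only a minimal sketch of what per-class feature selection with a simplified chi-square score might look like. It assumes one commonly cited simplified form, s(t, c) = P(t, c)·P(¬t, ¬c) − P(t, ¬c)·P(¬t, c), computed one-vs-rest from document frequencies. The function names `simplified_chi_square` and `select_features_per_class`, the parameter `k`, and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def simplified_chi_square(docs, labels):
    """Score every (term, class) pair with an assumed simplified chi-square.

    docs   : list of token lists, one per document
    labels : list of class labels, aligned with docs
    Score  : P(t,c) * P(not t, not c) - P(t, not c) * P(not t, c)
    """
    n = len(docs)
    class_count = defaultdict(int)   # documents per class
    term_count = defaultdict(int)    # documents containing each term
    joint = defaultdict(int)         # documents of class c containing term t
    for tokens, label in zip(docs, labels):
        class_count[label] += 1
        for term in set(tokens):     # document frequency, not term frequency
            term_count[term] += 1
            joint[(term, label)] += 1

    scores = {}
    for cls, n_cls in class_count.items():
        for term, n_term in term_count.items():
            a = joint.get((term, cls), 0)  # in class, contains term
            b = n_term - a                 # outside class, contains term
            c = n_cls - a                  # in class, lacks term
            d = n - n_term - c             # outside class, lacks term
            scores[(term, cls)] = (a * d - b * c) / (n * n)
    return scores


def select_features_per_class(docs, labels, k):
    """Keep the k highest-scoring terms separately for each class."""
    per_class = defaultdict(list)
    for (term, cls), score in simplified_chi_square(docs, labels).items():
        per_class[cls].append((score, term))
    # Sort by descending score; break ties alphabetically for determinism.
    return {
        cls: [t for _, t in sorted(items, key=lambda x: (-x[0], x[1]))[:k]]
        for cls, items in per_class.items()
    }


# Toy imbalanced corpus (made up for illustration): three "crude", one "grain".
docs = [["oil", "price"], ["oil", "barrel"], ["oil", "price"], ["wheat", "crop"]]
labels = ["crude", "crude", "crude", "grain"]
selected = select_features_per_class(docs, labels, k=2)
```

Selecting the top terms separately for each class, rather than from one global ranking, gives the minority class its own share of features; this appears to be the motivation for per-class selection under class imbalance.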