The Journal of China Universities of Posts and Telecommunications (English edition), 2019, Vol. 26, Issue (2): 43-51. doi: 10.19682/j.cnki.1005-8885.2019.1005

• Artificial Intelligence •

Using boosting tree to learn imbalanced data

Yang Ridong, Zhang Shiyu, Li Lin, Wang Zhe, Zhou Yi   

  1. Zhongshan School of Medicine, Sun Yat-Sen University, Guangdong 510080, China
  2. College of Public Health, Xinjiang Medical University, Xinjiang 830001, China
  • Received: 2018-08-07 Revised: 2019-04-04 Online: 2019-04-30 Published: 2019-06-14
  • Corresponding author: Zhou Yi, E-mail: zhouyi@mail.sysu.edu.cn
  • Supported by:
    This work was supported by the National Key Research and Development Program of China (2018YFC0116902, 2016YFC0901602), the National Natural Science Foundation of China (NSFC) (61876194), the Joint Foundation for the NSFC and Guangdong Science Center for Big Data (U1611261), the Medical Scientific Research Foundation of Guangdong Province of China (C2017037), and the Science and Technology Program of Guangzhou (201604020016).

Abstract: Class imbalance, i.e. one class having far more samples than the others, is a persistent problem in machine learning. It biases the classifier toward the majority class, which degrades performance on the minority class. We propose an improved boosting tree (BT) algorithm for learning from imbalanced data, called cost BT. In each iteration of cost BT, only the weights of the misclassified minority class samples are increased. In addition, the error rate in the weight formula of the base classifier is replaced by 1 minus the F-measure. The performance of cost BT is compared with other known methods on 9 public data sets. The compared methods are the decision tree and random forest algorithms, each combined with sampling techniques such as the synthetic minority oversampling technique (SMOTE), Borderline-SMOTE, the adaptive synthetic sampling approach (ADASYN) and one-sided selection. Cost BT performed better than the compared methods in F-measure, G-mean and area under the curve (AUC). On 6 of the 9 data sets, cost BT outperformed the other published methods. Increasing only the weights of the misclassified minority class samples in each iteration raises the proportion of the minority class among the weighted samples, which improves the prediction performance of the base classifiers. Computing the weights of the base classifiers from the F-measure also benefits the ensemble decision.

Key words: machine learning, class imbalanced, BT, data sampling
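
The two modifications described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the standard AdaBoost weight alpha = 0.5 ln((1 - e) / e) with the error rate e replaced by 1 - F, giving alpha = 0.5 ln(F / (1 - F)). It further assumes labels of +1 (minority) and -1 (majority), an unweighted F-measure on the training set, one-feature decision stumps in place of full trees, and early stopping when F <= 0.5. All function names are hypothetical.

```python
import math

def f_measure(y_true, y_pred, positive=1):
    """Unweighted F-measure with the minority class as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def stump_predict(stump, X):
    """Predict sign where feature j >= threshold, and -sign elsewhere."""
    j, thr, sign = stump
    return [sign if row[j] >= thr else -sign for row in X]

def fit_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, sign) with the lowest weighted error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = stump_predict((j, thr, sign), X)
                err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
                if err < best_err:
                    best, best_err = (j, thr, sign), err
    return best

def fit_cost_bt(X, y, n_rounds=10, eps=1e-6):
    """Boosting loop with the two changes named in the abstract (assumed formulas)."""
    n = len(y)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        # Change 2 (assumed form): the error rate e is replaced by 1 - F.
        f = min(max(f_measure(y, pred), eps), 1 - eps)
        alpha = 0.5 * math.log(f / (1 - f))
        if alpha <= 0:  # assumption: drop base classifiers with F <= 0.5 and stop
            break
        ensemble.append((alpha, stump))
        # Change 1: only misclassified minority samples gain weight.
        w = [wi * math.exp(alpha) if (yi == 1 and pi != 1) else wi
             for wi, yi, pi in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict_cost_bt(ensemble, X):
    """Weighted vote of the base classifiers; ties go to the majority class."""
    scores = [0.0] * len(X)
    for alpha, stump in ensemble:
        for i, p in enumerate(stump_predict(stump, X)):
            scores[i] += alpha * p
    return [1 if s > 0 else -1 for s in scores]

# Toy imbalanced data: 3 minority samples (+1) among 8.
X_toy = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]]
y_toy = [-1, -1, -1, 1, -1, -1, 1, 1]
model = fit_cost_bt(X_toy, y_toy, n_rounds=5)
print(f_measure(y_toy, predict_cost_bt(model, X_toy)))
```

Because the weights are renormalised after each round, raising only the minority weights also lowers the relative weight of every other sample, which is how the minority class gains a larger share of the weighted training set.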
