Using boosting tree to learn imbalanced data

The Journal of China Universities of Posts and Telecommunications, 2019, Vol. 26, Issue (2): 43-51. doi: 10.19682/j.cnki.1005-8885.2019.1005

• Artificial Intelligence •

Yang Ridong, Zhang Shiyu, Li Lin, Wang Zhe, Zhou Yi

1. Zhongshan School of Medicine, Sun Yat-Sen University, Guangdong 510080, China
2. College of Public Health, Xinjiang Medical University, Xinjiang 830001, China

Received: 2018-08-07 Revised: 2019-04-04 Online: 2019-04-30 Published: 2019-06-14
Corresponding author: Zhou Yi, E-mail: zhouyi@mail.sysu.edu.cn
Supported by:
This work was supported by the National Key Research and Development Program of China (2018YFC0116902, 2016YFC0901602), the National Natural Science Foundation of China (NSFC) (61876194), the Joint Foundation for the NSFC and Guangdong Science Center for Big Data (U1611261), the Medical Scientific Research Foundation of Guangdong Province of China (C2017037), and the Science and Technology Program of Guangzhou (201604020016).

Abstract

Class imbalance, in which one class has far more samples than the others, is a persistent problem in machine learning. It biases the classifier toward the majority class and degrades its performance on the minority class. We propose an improved boosting tree (BT) algorithm for learning from imbalanced data, called cost BT. In each iteration of cost BT, only the weights of the misclassified minority class samples are increased, and the error rate in the weight formula of the base classifier is replaced by 1 minus the F-measure. The performance of cost BT is compared with that of other known methods on 9 public data sets. The compared methods are the decision tree and random forest algorithms, each combined with sampling techniques such as synthetic minority oversampling technique (SMOTE), Borderline-SMOTE, adaptive synthetic sampling approach (ADASYN) and one-sided selection. Cost BT performed better than the compared methods in F-measure, G-mean and area under the curve (AUC), and in 6 of the 9 data sets it outperformed the other published methods. Increasing only the weights of the misclassified minority class samples in each iteration raises the proportion of the minority class in the whole sample, which improves the prediction performance of the base classifiers. In addition, computing the weights of the base classifiers from the F-measure benefits the ensemble decision.
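The two modifications described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes an AdaBoost-style classifier weight with the error rate replaced by 1 minus the F-measure, alpha = 0.5 * ln(F / (1 - F)), and a multiplicative update exp(|alpha|) applied only to misclassified minority samples. The decision-stump base learner, the label coding (+1 minority, -1 majority) and all function names are likewise assumptions made for illustration.

```python
import math

def counts(y_true, y_pred):
    """Confusion counts with +1 as the minority (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p != 1)
    return tp, fp, fn, tn

def f_measure(y_true, y_pred):
    tp, fp, fn, _ = counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def g_mean(y_true, y_pred):
    tp, fp, fn, tn = counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        return 0.0
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

def stump_predict(stump, X):
    j, thr, sign = stump
    return [sign if row[j] > thr else -sign for row in X]

def fit_stump(X, y, w):
    """Exhaustive search for the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = stump_predict((j, thr, sign), X)
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def update_weights(w, y, pred, alpha):
    """Increase only the weights of misclassified minority samples.
    abs() keeps the update an increase even if F < 0.5 (sketch assumption)."""
    boost = math.exp(abs(alpha))
    w = [wi * boost if (t == 1 and p != 1) else wi
         for wi, t, p in zip(w, y, pred)]
    total = sum(w)
    return [wi / total for wi in w]

def fit_cost_bt(X, y, n_rounds=10, eps=1e-6):
    w = [1.0 / len(y)] * len(y)
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        # Assumed form: error rate replaced by 1 - F-measure, clipped for safety.
        e = min(max(1.0 - f_measure(y, pred), eps), 1.0 - eps)
        alpha = 0.5 * math.log((1.0 - e) / e)
        ensemble.append((alpha, stump))
        w = update_weights(w, y, pred, alpha)
    return ensemble

def predict(ensemble, X):
    scores = [0.0] * len(X)
    for alpha, stump in ensemble:
        for i, p in enumerate(stump_predict(stump, X)):
            scores[i] += alpha * p
    return [1 if s > 0 else -1 for s in scores]

# Toy imbalanced data: 8 majority samples, 2 minority samples.
X = [[float(v)] for v in range(10)]
y = [-1] * 8 + [1, 1]
model = fit_cost_bt(X, y, n_rounds=5)
y_hat = predict(model, X)
print(f_measure(y, y_hat), g_mean(y, y_hat))
```

Because the majority-class weights are never raised, normalisation shifts weight mass toward the minority class over the iterations, which is the effect the abstract attributes to the method. AUC is omitted here because it needs ranked scores rather than hard labels.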

Key words: machine learning, class imbalance, BT, data sampling

CLC number: TP18

Yang Ridong, Zhang Shiyu, Li Lin, Wang Zhe, Zhou Yi. Using boosting tree to learn imbalanced data[J]. The Journal of China Universities of Posts and Telecommunications, 2019, 26(2): 43-51.
