Using boosting tree to learn imbalanced data

doi:10.19682/j.cnki.1005-8885.2019.1005

The Journal of China Universities of Posts and Telecommunications, 2019, Vol. 26, Issue (2): 43-51.

• Artificial intelligence •

Yang Ridong, Zhang Shiyu, Li Lin, Wang Zhe, Zhou Yi

1. Zhongshan School of Medicine, Sun Yat-Sen University, Guangdong 510080, China
2. College of Public Health, Xinjiang Medical University, Xinjiang 830001, China

Received: 2018-08-07  Revised: 2019-04-04  Online: 2019-04-30  Published: 2019-06-14
Corresponding author: Zhou Yi, E-mail: zhouyi@mail.sysu.edu.cn
Supported by:
This work was supported by the National Key Research and Development Program of China (2018YFC0116902, 2016YFC0901602), the National Natural Science Foundation of China (NSFC) (61876194), the Joint Foundation for the NSFC and Guangdong Science Center for Big Data (U1611261), the Medical Scientific Research Foundation of Guangdong Province of China (C2017037), and the Science and Technology Program of Guangzhou (201604020016).

Abstract

In machine learning, class imbalance is a persistent problem: one class contains far more samples than the other classes. This imbalance biases the classifier toward the majority class, which degrades its performance on the minority class. We propose an improved boosting tree (BT) algorithm for learning from imbalanced data, called cost BT. In each iteration of cost BT, only the weights of the misclassified minority-class samples are increased. In addition, the error rate in the weight formula of the base classifier is replaced by 1 minus the F-measure. The performance of cost BT is compared with that of other known methods on 9 public data sets. The compared methods are the decision tree and random forest algorithms, each combined with sampling techniques such as the synthetic minority oversampling technique (SMOTE), Borderline-SMOTE, the adaptive synthetic sampling approach (ADASYN) and one-sided selection. Cost BT performed better than the compared methods in F-measure, G-mean and area under the curve (AUC). On 6 of the 9 data sets, cost BT outperformed the other published methods. Increasing only the weights of the misclassified minority-class samples in each iteration raises the proportion of the minority class among all samples, which improves the prediction performance of the base classifiers. In addition, computing the weights of the base classifiers from the F-measure benefits the ensemble decision.
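The two modifications described in the abstract can be illustrated with a minimal sketch of a single boosting iteration. This is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes the standard AdaBoost form for the base-classifier weight, alpha = 0.5 * ln((1 - e) / e), with the error rate e replaced by 1 - F, and it assumes a multiplicative update exp(alpha) followed by normalization. The function names, the use of a sample-weighted F-measure, and the convention that label 1 marks the minority class are all hypothetical.

```python
import math

def weighted_f_measure(y_true, y_pred, weights, minority=1):
    """F-measure of the minority class, computed with sample weights
    (using weights here is an assumption, not stated in the abstract)."""
    rows = list(zip(y_true, y_pred, weights))
    tp = sum(w for t, p, w in rows if t == minority and p == minority)
    fp = sum(w for t, p, w in rows if t != minority and p == minority)
    fn = sum(w for t, p, w in rows if t == minority and p != minority)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def classifier_weight(f, eps=1e-10):
    """AdaBoost-style weight with the error rate replaced by 1 - F
    (assumed form: 0.5 * ln(F / (1 - F)))."""
    f = min(max(f, eps), 1 - eps)  # keep the logarithm finite
    return 0.5 * math.log(f / (1 - f))

def update_weights(y_true, y_pred, weights, alpha, minority=1):
    """Increase only the weights of misclassified minority samples,
    then renormalize so the weights sum to 1."""
    scaled = [
        w * math.exp(alpha) if (t == minority and p != minority) else w
        for t, p, w in zip(y_true, y_pred, weights)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

# Toy data: two minority samples (label 1), four majority samples (label 0).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]   # the second minority sample is misclassified
weights = [1 / 6] * 6

f = weighted_f_measure(y_true, y_pred, weights)
alpha = classifier_weight(f)
new_weights = update_weights(y_true, y_pred, weights, alpha)

minority_share_before = sum(w for t, w in zip(y_true, weights) if t == 1)
minority_share_after = sum(w for t, w in zip(y_true, new_weights) if t == 1)
print(f, alpha, minority_share_before, minority_share_after)
```

In this toy run the weighted share of the minority class grows after the update, which is the effect the abstract attributes to cost BT. Under the assumed formula, alpha is positive only when F exceeds 0.5, mirroring the usual AdaBoost requirement that the base classifier's error rate be below 0.5.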

Key words: machine learning, class imbalance, BT, data sampling

CLC Number: TP18

Yang Ridong, Zhang Shiyu, Li Lin, Wang Zhe, Zhou Yi. Using boosting tree to learn imbalanced data[J]. The Journal of China Universities of Posts and Telecommunications, 2019, 26(2): 43-51.

