One-against-all-based Hellinger distance decision tree for multiclass imbalanced learning
1. School of Information Science and Engineering, Guilin University of Technology, Guilin 541004, China
2. Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541004, China
3. Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China
‡Corresponding author
Print publication date: 2022-02-0
Received: 2020-08-17
Accepted: 2021-02-14
Minggang DONG, Ming LIU, Chao JING, 2022. One-against-all-based Hellinger distance decision tree for multiclass imbalanced learning. Frontiers of Information Technology & Electronic Engineering, 23(2):278-290. doi:10.1631/FITEE.2000417
Since traditional machine learning methods are sensitive to skewed distributions and do not consider the characteristics of multiclass imbalance problems, the skewed distribution of multiclass data poses a major challenge to machine learning algorithms. To tackle this issue, we propose a new decision tree splitting criterion based on the one-against-all-based Hellinger distance (OAHD). Two crucial elements are included in OAHD. First, the one-against-all scheme is integrated into the computation of the Hellinger distance, thereby extending the Hellinger distance decision tree to cope with the multiclass imbalance problem. Second, a modified Gini index is designed for the multiclass imbalance problem, taking into account the distribution and the number of distinct classes. Moreover, we give theoretical proofs of the properties of OAHD, including its skew insensitivity and its ability to seek purer nodes in the decision tree. Finally, we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and the University of California, Irvine (UCI) repository. Experimental results show that OAHD significantly outperforms five other well-known decision trees in terms of precision, F-measure, and the multiclass area under the receiver operating characteristic curve (MAUC). In addition, the Friedman and Nemenyi tests statistically confirm the advantage of OAHD over the five other decision trees.
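For context, the binary Hellinger distance splitting criterion that OAHD builds on comes from Cieslak and Chawla (2008); the one-against-all combination shown after it is only a sketch of the extension the abstract describes, and the uniform average over classes is our assumption, not necessarily the paper's exact formula:

% Hellinger distance of a candidate split with p branches (binary classes):
% X_+ and X_- are the positive and negative samples at the node,
% X_{+j} and X_{-j} those routed to branch j (Cieslak and Chawla, 2008).
d_H(X_+, X_-) = \sqrt{\sum_{j=1}^{p}\left(\sqrt{\frac{|X_{+j}|}{|X_+|}}-\sqrt{\frac{|X_{-j}|}{|X_-|}}\right)^{2}}

% Sketch of a one-against-all extension for a C-class node: each class c is
% in turn treated as positive against the union of the remaining classes
% (the uniform average over c is an assumption):
d_{\mathrm{OAHD}} = \frac{1}{C}\sum_{c=1}^{C} d_H\Big(X_c,\ \bigcup_{k\neq c} X_k\Big)

The modified Gini index starts from the standard multiclass Gini impurity, Gini(t) = 1 - \sum_{c=1}^{C} p_c^2; the paper's specific adjustment for the class distribution and class count is not reproduced here.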
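A minimal Python sketch of this one-against-all Hellinger computation, assuming the per-class binary distances are simply averaged; hellinger_binary and oahd_split_value are illustrative names, not the paper's code, and the modified Gini index is omitted:

import numpy as np

def hellinger_binary(pos_counts, neg_counts):
    # Binary Hellinger distance between the per-branch distributions of the
    # positive and negative classes, as in HDDT (Cieslak and Chawla, 2008).
    # pos_counts[j] / neg_counts[j]: positives / negatives routed to branch j.
    pos_total = pos_counts.sum()
    neg_total = neg_counts.sum()
    if pos_total == 0 or neg_total == 0:
        return 0.0  # a one-class node offers no class separation
    return np.sqrt(np.sum(
        (np.sqrt(pos_counts / pos_total) - np.sqrt(neg_counts / neg_total)) ** 2
    ))

def oahd_split_value(y, branch_ids):
    # One-against-all Hellinger value of a candidate split.
    # y          : class label of each sample at the node
    # branch_ids : branch index each sample is routed to by the split
    # Each class is in turn treated as positive against the rest; the uniform
    # average below is an assumption (the paper may combine classes differently).
    classes = np.unique(y)
    branches = np.unique(branch_ids)
    distances = []
    for c in classes:
        pos = np.array([np.sum((y == c) & (branch_ids == b)) for b in branches],
                       dtype=float)
        neg = np.array([np.sum((y != c) & (branch_ids == b)) for b in branches],
                       dtype=float)
        distances.append(hellinger_binary(pos, neg))
    return float(np.mean(distances))

# Toy usage: a threshold split on one feature of a 3-class node.
y = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2])
x = np.array([0.1, 0.2, 0.3, 0.9, 0.4, 0.8, 0.7, 0.6, 0.95])
branch_ids = (x > 0.5).astype(int)  # branch 0: x <= 0.5, branch 1: x > 0.5
print(oahd_split_value(y, branch_ids))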
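The Friedman and Nemenyi tests used for the statistical comparison are standard; with N = 20 data sets and k = 6 classifiers (OAHD plus the five baselines), they take the following form:

% Friedman statistic over average ranks R_j (Friedman, 1937, 1940):
\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right]

% Iman-Davenport correction, distributed as F with (k-1, (k-1)(N-1))
% degrees of freedom (Iman and Davenport, 1980):
F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}

% Nemenyi post-hoc test: two classifiers differ significantly if their
% average ranks differ by at least the critical difference (Nemenyi, 1963):
\mathrm{CD} = q_\alpha \sqrt{\frac{k(k+1)}{6N}}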
Keywords: Decision trees; Multiclass imbalanced learning; Node splitting criterion; Hellinger distance; One-against-all scheme
Abdi L, Hashemi S, 2016. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng, 28(1):238-251. doi:10.1109/TKDE.2015.2458858
Akash PS, Kadir ME, Ali AA, et al., 2019. Inter-node Hellinger distance based decision tree. Proc 28th Int Joint Conf on Artificial Intelligence, p.1967-1973. doi:10.24963/ijcai.2019/272
Alcala-Fdez J, Fernandez A, Luengo J, et al., 2011. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Multi-valued Logic Soft Comput, 17(2-3):255-287.
Ali H, Salleh MNM, Saedudin R, et al., 2019. Imbalance class problems in data mining: a review. Indones J Elect Eng Comput Sci, 14(3):1560-1571. doi:10.11591/ijeecs.v14.i3.pp1552-1563
Anand R, Mehrotra K, Mohan CK, et al., 1995. Efficient classification for multiclass problems using modular neural networks. IEEE Trans Neur Netw, 6(1):117-124. doi:10.1109/72.363444
Asuncion A, 2007. UCI Machine Learning Repository. University of California, Irvine, USA. https://archive.ics.uci.edu/ml/index.php
Boonchuay K, Sinapiromsaran K, Lursinsap C, 2017. Decision tree induction based on minority entropy for the class imbalance problem. Patt Anal Appl, 20(3):769-782. doi:10.1007/s10044-016-0533-3
Bradley AP, 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Patt Recogn, 30(7):1145-1159. doi:10.1016/S0031-3203(96)00142-2
Breiman L, Friedman JH, Olshen RA, et al., 1984. Classification and regression trees. Biometrics, 40(3):874. doi:10.2307/2530946
Chandra B, Kothari R, Paul P, 2010. A new node splitting measure for decision tree construction. Patt Recogn, 43(8):2725-2731. doi:10.1016/j.patcog.2010.02.025
Cichocki A, Amari SI, 2010. Families of Alpha- Beta- and Gamma-divergences: flexible and robust measures of similarities. Entropy, 12(6):1532-1568. doi:10.3390/e12061532
Cieslak DA, Chawla NV, 2008. Learning decision trees for unbalanced data. In: Daelemans W, Goethals B, Morik K (Eds.), Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Germany, p.241-256. doi:10.1007/978-3-540-87479-9_34
Cieslak DA, Hoens TR, Chawla NV, et al., 2012. Hellinger distance decision trees are robust and skew-insensitive. Data Min Knowl Discov, 24(1):136-158. doi:10.1007/s10618-011-0222-1
Feng L, Wang HB, Jin B, et al., 2019. Learning a distance metric by balancing KL-divergence for imbalanced datasets. IEEE Trans Syst Man Cybern Syst, 49(12):2384-2395. doi:10.1109/TSMC.2018.2790914
Flach PA, 2003. The geometry of ROC space: understanding machine learning metrics through ROC isometrics. Proc 20th Int Conf on Machine Learning, p.194-201.
Friedman M, 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc, 32(200):675-701. doi:10.1080/01621459.1937.10503522
Friedman M, 1940. A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat, 11(1):86-92. doi:10.1214/aoms/1177731944
Hanley JA, McNeil BJ, 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29-36. doi:10.1148/radiology.143.1.7063747
He HB, Garcia EA, 2009. Learning from imbalanced data. IEEE Trans Knowl Data Eng, 21(9):1263-1284. doi:10.1109/TKDE.2008.239
Iman RL, Davenport JM, 1980. Approximations of the critical region of the Friedman statistic. Commun Stat Theory Methods, 9(6):571-595. doi:10.1080/03610928008827904
Kailath T, 1967. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans Commun Technol, 15(1):52-60. doi:10.1109/TCOM.1967.1089532
Kotsiantis SB, 2013. Decision trees: a recent overview. Artif Intell Rev, 39(4):261-283. doi:10.1007/s10462-011-9272-4
Liu W, Chawla S, Cieslak DA, et al., 2010. A robust decision tree algorithm for imbalanced data sets. Proc SIAM Int Conf on Data Mining, p.766-777. doi:10.1137/1.9781611972801.67
Nekooeimehr I, Lai-Yuen SK, 2016. Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Syst Appl, 46:405-416. doi:10.1016/j.eswa.2015.10.031
Nemenyi P, 1963. Distribution-Free Multiple Comparisons. PhD Thesis, Princeton University, Princeton, USA.
Osei-Bryson KM, 2014. Overview on decision tree induction. In: Osei-Bryson KM, Ngwenyama O (Eds.), Advances in Research Methods for Information Systems Research. Springer, Boston, USA, p.15-22. doi:10.1007/978-1-4614-9463-8_3
Quinlan JR, 1986. Induction of decision trees. Mach Learn, 1(1):81-106. doi:10.1007/BF00116251
Safavian SR, Landgrebe D, 1991. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cybern, 21(3):660-674. doi:10.1109/21.97458
Sharmin S, Shoyaib M, Ali AA, et al., 2019. Simultaneous feature selection and discretization based on mutual information. Patt Recogn, 91:162-174. doi:10.1016/j.patcog.2019.02.016
Su C, Cao J, 2019. Improving lazy decision tree for imbalanced classification by using skew-insensitive criteria. Appl Intell, 49(3):1127-1145. doi:10.1007/s10489-018-1314-z
Vilalta R, Oblinger D, 2000. A quantification of distance bias between evaluation metrics in classification. Proc 17th Int Conf on Machine Learning, p.1087-1094.
Wan ZQ, Jiang C, Fahad M, et al., 2020. Robot-assisted pedestrian regulation based on deep reinforcement learning. IEEE Trans Cybern, 50(4):1669-1682. doi:10.1109/TCYB.2018.2878977
Wu XD, Kumar V, Quinlan JR, et al., 2008. Top 10 algorithms in data mining. Knowl Inform Syst, 14(1):1-37. doi:10.1007/s10115-007-0114-2