School of Information, Central University of Finance and Economics, Beijing 100081, China
School of Science and Engineering, Tianjin University of Finance and Economics, Tianjin 300222, China
You-wei WANG, E-mail: ywwang15@126.com
Published in print: 2018-02
Received: 2016-11-30
Revised: 2018-02-08
You-wei WANG, Li-zhou FENG. A new feature selection method for handling redundant information in text classification[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(2): 221-234. DOI: 10.1631/FITEE.1601761.
Feature selection is an important approach to dimensionality reduction in text classification. Because feature subsets chosen by traditional methods often contain redundant information, we propose a new, simple feature selection method that effectively filters out redundant features. First, to measure the relationship between two words, the definitions of word-frequency-based relevance and correlative redundancy are introduced. Next, an optimal feature selection (OFS) method is chosen to obtain a feature subset FS1. Finally, to improve execution speed, the redundant features in FS1 are filtered using a predetermined threshold, and the results are stored in linked lists. Experiments are carried out on three datasets (WebKB, 20-Newsgroups, and Reuters-21578), with support vector machines and naïve Bayes as classifiers. The results show that the classification accuracy of the proposed method is generally higher than that of typical traditional methods (information gain, improved Gini index, and improved comprehensively measured feature selection) and of the OFS methods. Moreover, the proposed method runs faster than typical mutual information-based methods (improved and normalized mutual information-based feature selections, and multilabel feature selection based on maximum dependency and minimum redundancy) while maintaining classification accuracy. Statistical results validate the effectiveness of the proposed method in handling redundant information in text classification.
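The threshold-based filtering step can be illustrated with a minimal sketch. The abstract does not give the formula for word-frequency-based relevance, so the cosine similarity of term-frequency vectors is used here purely as a stand-in. The function names `cosine` and `filter_redundant`, the toy data, and the use of Python lists in place of linked lists are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two term-frequency vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_redundant(
    ranked: List[str],
    tf: Dict[str, Sequence[float]],
    threshold: float,
) -> Dict[str, List[str]]:
    """Greedy redundancy filter over a ranked feature subset.

    `ranked` plays the role of FS1: words already ordered best-first by some
    feature selection method. A word is kept unless its similarity to an
    already kept word reaches `threshold` (assumed stand-in criterion).
    Returns a mapping from each kept word to the words it made redundant.
    """
    kept: Dict[str, List[str]] = {}
    for word in ranked:
        for representative in kept:
            if cosine(tf[word], tf[representative]) >= threshold:
                kept[representative].append(word)  # record as redundant
                break
        else:
            kept[word] = []  # no similar kept word found, so keep this one
    return kept


# Toy term-frequency vectors over three documents (hypothetical data).
tf = {"price": [2, 0, 1], "cost": [4, 0, 2], "match": [0, 3, 0]}
result = filter_redundant(["price", "cost", "match"], tf, threshold=0.9)
print(result)
```

In this toy example "cost" has the same frequency profile as "price" up to scale, so it is grouped under "price", while "match" is kept as a separate feature. Each kept word is compared only against previously kept words, which is one plausible reason a threshold filter of this kind can be cheaper than recomputing pairwise mutual information over all candidates.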
Keywords: Feature selection; Dimensionality reduction; Text classification; Redundant features; Support vector machine; Naïve Bayes; Mutual information