Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Front. Comput. Sci. 2016, Vol. 10, Issue 6: 1082-1102. https://doi.org/10.1007/s11704-016-5203-5
RESEARCH ARTICLE
Impact of preprocessing on medical data classification
Sarab ALMUHAIDEB, Mohamed El Bachir MENAI
Computer Science Department, College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia
Abstract

The significance of the preprocessing stage in any data mining task is well known. Before attempting medical data classification, the characteristics of medical datasets, including noise, incompleteness, and the existence of multiple and possibly irrelevant features, need to be addressed. In this paper, we show that selecting the right combination of preprocessing methods has a considerable impact on the classification potential of a dataset. The preprocessing operations considered include the discretization of numeric attributes, the selection of attribute subset(s), and the handling of missing values. The classification is performed by an ant colony optimization algorithm as a case study. Experimental results on 25 real-world medical datasets show that a significant relative improvement in predictive accuracy, exceeding 60% in some cases, is obtained.
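
The three preprocessing operations named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which specific methods the paper uses, so everything here is an assumed stand-in: mean imputation for missing values, equal-width binning for discretization, and information-gain ranking for attribute selection. The function names, the toy data, and the formula for relative improvement (accuracy gain divided by baseline accuracy) are likewise hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2

def impute_mean(column):
    """Replace None entries in a numeric column with the mean of observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def equal_width_bins(column, k=3):
    """Map each numeric value to a bin index in 0..k-1 (equal-width intervals)."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 for constant columns
    return [min(int((v - lo) / width), k - 1) for v in column]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Reduction in class entropy obtained by splitting on a discrete column."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def select_top_features(columns, labels, m):
    """Return the names of the m discrete columns with the highest information gain."""
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    return ranked[:m]

def relative_improvement(acc_before, acc_after):
    """Assumed definition: percentage gain relative to the baseline accuracy."""
    return 100.0 * (acc_after - acc_before) / acc_before

# Toy data (hypothetical): one informative attribute with a missing value,
# and one constant attribute that carries no class information.
raw = {"age": [25, None, 47, 52, 38, 61], "noise": [1, 1, 1, 1, 1, 1]}
labels = ["neg", "neg", "pos", "pos", "neg", "pos"]

# Pipeline: handle missing values, then discretize, then select attributes.
discrete = {name: equal_width_bins(impute_mean(col), k=3)
            for name, col in raw.items()}
selected = select_top_features(discrete, labels, m=1)
```

In this sketch the order of operations (imputation, then discretization, then selection) is one possible combination; the abstract's point is that the choice of combination itself affects the classification potential of the dataset.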

Keywords: classification, ant colony optimization, medical data classification, preprocessing, feature subset selection, discretization
Corresponding Author(s): Sarab ALMUHAIDEB   
Just Accepted Date: 11 January 2016   Online First Date: 14 September 2016    Issue Date: 11 October 2016
Cite this article: Sarab ALMUHAIDEB, Mohamed El Bachir MENAI. Impact of preprocessing on medical data classification. Front. Comput. Sci., 2016, 10(6): 1082-1102.
URL:
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-5203-5
https://academic.hep.com.cn/fcs/EN/Y2016/V10/I6/1082