Performance measures in evaluating machine learning based bioinformatics predictors for classifications
Yasen Jiao, Pufeng Du
School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
Abstract
Background: Many existing bioinformatics predictors are based on machine learning. Before these predictors are applied in practical studies, their predictive performance should be well understood. Different studies apply different performance measures and different evaluation methods. Even for the same performance measure, different terms, nomenclature, or notation may appear in different contexts.
Results: We review the most commonly used performance measures and evaluation methods for bioinformatics predictors.
Conclusions: Correctly understanding and interpreting performance is important in bioinformatics, because it is the key to rigorously comparing different predictors and choosing the right one.
Author Summary
Correctly understanding and interpreting performance is important in bioinformatics, because it is the key to rigorously comparing different predictors and choosing the right one. We present a comprehensive review of the performance measures used to evaluate bioinformatics predictors, especially classification predictors.
Keywords
machine learning
performance measures
evaluation methods
Corresponding Author(s):
Pufeng Du
Online First Date: 23 November 2016
Issue Date: 01 December 2016