Quantitative Biology

ISSN 2095-4689

ISSN 2095-4697(Online)

CN 10-1028/TM

Postal Subscription Code 80-971

Quant. Biol.    2016, Vol. 4 Issue (4) : 320-330    https://doi.org/10.1007/s40484-016-0081-2
REVIEW
Performance measures in evaluating machine learning based bioinformatics predictors for classifications
Yasen Jiao, Pufeng Du
School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
Abstract

Background: Many existing bioinformatics predictors are based on machine learning technology. Before these predictors are applied in practical studies, their predictive performance should be well understood. Different studies apply different performance measures and different evaluation methods. Even for the same performance measure, different terms, nomenclatures or notations may appear in different contexts.

Results: We reviewed the most commonly used performance measures and evaluation methods for bioinformatics predictors.

Conclusions: Correctly understanding and interpreting predictive performance is important in bioinformatics, because it is the key to rigorously comparing different predictors and to choosing the right predictor.

Author Summary  Correctly understanding and interpreting predictive performance is important in bioinformatics, because it is the key to rigorously comparing different predictors and to choosing the right predictor. We present a comprehensive review of the performance measures used in evaluating bioinformatics predictors, especially classification predictors.
Keywords  machine learning, performance measures, evaluation methods
Corresponding Author(s): Pufeng Du   
Online First Date: 23 November 2016    Issue Date: 01 December 2016
 Cite this article:   
Yasen Jiao,Pufeng Du. Performance measures in evaluating machine learning based bioinformatics predictors for classifications[J]. Quant. Biol., 2016, 4(4): 320-330.
 URL:  
https://academic.hep.com.cn/qb/EN/10.1007/s40484-016-0081-2
https://academic.hep.com.cn/qb/EN/Y2016/V4/I4/320
Fig.1  A diagram of a machine learning system. S represents a biological system whose mechanism cannot be observed directly, so the details of f(x) are unknown. We feed a series of inputs x to S and obtain a series of outputs y. Using the pairs of x and y, the machine learning system LM can be trained and a function fe(x) can be established. For an x that neither S nor LM has seen before, we expect LM to produce a result ye that is as close to y as possible.
Fig.2  Taxonomy of evaluation methods. There are three different evaluation methods: the independent dataset test, the re-substituting test and cross validation. Cross validation can be further divided into two methods: leave-one-out cross validation and n-fold cross validation.
Fig.3  An illustration of the process of a 5-fold cross validation. The whole process contains five steps. Step I: the whole dataset is obtained. Step II: the whole dataset is randomly partitioned into five parts (A, B, C, D and E). Step III: five rounds of training and testing are carried out. In each round, one part is used as the testing dataset (shown shaded), while the remaining four parts form the training dataset, so that every part is tested exactly once. Step IV: the testing results are collected from the five rounds of training and testing. Step V: the testing results are pooled together to estimate the predictive performance.
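The five steps in the Fig.3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset is made up, and `train_threshold` is a hypothetical toy learner standing in for whatever predictor is being evaluated; neither comes from the article.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Step II: randomly partition the sample indices into k disjoint parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Part i takes every k-th shuffled index, so part sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def train_threshold(xs, ys):
    """Hypothetical toy learner: a threshold halfway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def cross_validate(xs, ys, k=5, seed=0):
    """Steps III-V: train k times, test each held-out part, pool the results."""
    pooled = [None] * len(xs)
    for test_part in k_fold_indices(len(xs), k, seed):
        held_out = set(test_part)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = train_threshold(train_x, train_y)
        for i in test_part:
            # Each sample is predicted by a model that never saw it in training.
            pooled[i] = predict(xs[i])
    return pooled


# Step I: a toy dataset (made-up values, for illustration only).
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 1.5, 1.6, 1.7, 1.8, 1.9]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pooled = cross_validate(xs, ys, k=5)
accuracy = sum(p == y for p, y in zip(pooled, ys)) / len(ys)
print(f"pooled 5-fold accuracy: {accuracy:.2f}")
```

Pooling the predictions before computing a measure, rather than averaging five per-fold values, follows Step V of the caption. Setting `k` to the number of samples would turn the same sketch into a leave-one-out cross validation.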
Fig.4  The confusion matrix in testing a predictor. All the testing samples are divided into four categories, according to the real labels and the prediction results. There are altogether eight basic counts: RP, RN, PP, PN, TP, TN, FP and FN. The relationships between these counts are marked on the figure.
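As a rough sketch of how the eight counts relate, the snippet below derives them from a list of real labels and a list of predicted labels. The figure itself is not reproduced here, so the readings RP/RN (real positives/negatives) and PP/PN (predicted positives/negatives) and the sums used are assumptions based on the usual confusion-matrix identities; the label data is made up.

```python
def confusion_counts(real, predicted):
    """Return the eight basic counts for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # true negatives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        # Assumed marginal relationships (standard identities):
        "RP": tp + fn,  # real positives
        "RN": tn + fp,  # real negatives
        "PP": tp + fp,  # predicted positives
        "PN": tn + fn,  # predicted negatives
    }


# Made-up testing results, for illustration only.
real      = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

counts = confusion_counts(real, predicted)
print(counts)
```

Both pairs of marginal counts sum to the total number of testing samples (RP + RN = PP + PN), which is a quick sanity check when counts are copied from different papers.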
Fig.5  An ROC curve. The horizontal axis is the false positive rate (FPR). The vertical axis is the sensitivity, also termed the true positive rate (TPR). The solid curve is the ROC curve. The dashed diagonal is called the line of no-discrimination. The closer the ROC curve is to the top left corner, the better the performance of the predictor. The area under the ROC curve (AUC) can be used as a performance measure.
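To make the construction of the curve concrete, the following sketch computes ROC points and the AUC from prediction scores. It assumes that a higher score means "more likely positive" and uses the trapezoidal rule for the area; the function names and the scores are hypothetical and not taken from the article.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample that shares this score,
        # so tied scores are treated as a single threshold.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(f"AUC = {area:.3f}")
```

In this sketch a predictor that gives every sample the same score produces only the points (0, 0) and (1, 1), which lie on the line of no-discrimination and give an AUC of 0.5, while a predictor that ranks all positives above all negatives gives an AUC of 1.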