Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation
Yawei Li, Yuan Luo
Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL 60611, USA

Abstract: Background: As next-generation DNA sequencing technology improves, collecting genetic data costs less, and more machine learning techniques can be applied to cancer analysis and diagnosis. Methods: We developed an ensemble machine learning system, the performance-weighted-voting model, for cancer type classification in 6,249 samples across 14 cancer types. The ensemble consists of five weak classifiers (logistic regression, SVM, random forest, XGBoost, and neural networks). We first used cross-validation to obtain predictions from the five classifiers. The weight of each classifier was then derived from its predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum, over the five classifiers, of each classifier's weight multiplied by its predicted probability. Results: Using the somatic mutation count of each gene as the input feature, the performance-weighted-voting model reached an overall accuracy of 71.46%, significantly higher than that of each of the five weak classifiers and of two other ensemble models, the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types a higher tumor mutational burden can improve overall accuracy. Conclusion: This study has important clinical significance for identifying the origin of cancer, especially for cancers whose primary site cannot be determined. In addition, our model offers a good strategy for applying ensemble systems to cancer type classification.
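The weighting step described in the Methods can be illustrated with a minimal sketch. The abstract does not give the exact regression setup, so the code below rests on assumptions: one scalar weight per classifier shared across all cancer types, fitted by ordinary least squares against one-hot labels using out-of-fold probabilities. The function names `fit_weights` and `weighted_vote` and the toy data are hypothetical and do not come from the paper.

```python
import numpy as np


def fit_weights(oof_probs, labels):
    """Least-squares weights, one per classifier (assumed formulation).

    oof_probs: array (K, N, C) of out-of-fold predicted probabilities
               from K classifiers for N samples and C cancer types.
    labels:    integer array (N,) of true class indices in [0, C).
    """
    K, N, C = oof_probs.shape
    onehot = np.eye(C)[labels]                 # (N, C) regression targets
    X = oof_probs.reshape(K, N * C).T          # (N*C, K) design matrix
    y = onehot.reshape(N * C)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
    return w


def weighted_vote(probs, w):
    """Combine (K, M, C) probabilities into (M, C) weighted scores."""
    return np.tensordot(w, probs, axes=1)


# Toy data only: five synthetic "classifiers" of decreasing quality.
rng = np.random.default_rng(0)
K, N, C = 5, 200, 14
labels = rng.integers(0, C, size=N)
onehot = np.eye(C)[labels]
oof = []
for noise in (0.5, 1.0, 1.5, 2.0, 3.0):
    p = onehot + noise * rng.random((N, C))
    oof.append(p / p.sum(axis=1, keepdims=True))  # rows sum to 1
oof = np.stack(oof)                               # (K, N, C)

w = fit_weights(oof, labels)
pred = weighted_vote(oof, w).argmax(axis=1)       # predicted cancer type
print("weights:", np.round(w, 3))
print("accuracy on the toy data:", (pred == labels).mean())
```

In this formulation the weights are unconstrained, so they may be negative and need not sum to one. The combined scores are therefore best used for ranking cancer types rather than read as calibrated probabilities. The toy accuracy is also measured on the same data used to fit the weights, so it is not an estimate of held-out performance.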
Key words:
cancer type classification
ensemble method
performance-weighted-voting model
linear regression
single-nucleotide polymorphism
Received: 2020-07-15
Published: 2020-12-24
Corresponding Author(s):
Yuan Luo