Quantitative Biology

ISSN 2095-4689

ISSN 2095-4697(Online)

CN 10-1028/TM

Postal distribution code 80-971

Quantitative Biology  2020, Vol. 8 Issue (4): 347-358   https://doi.org/10.1007/s40484-020-0226-1
Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation
Yawei Li, Yuan Luo
Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL 60611, USA
Abstract

Background: With improvements in next-generation DNA sequencing technology, genetic data can be collected at lower cost, and more machine learning techniques can be applied to cancer analysis and diagnosis.

Methods: We developed an ensemble machine learning system, the performance-weighted-voting model, for cancer type classification of 6,249 samples across 14 cancer types. The ensemble consists of five weak classifiers: logistic regression, support vector machine (SVM), random forest, XGBoost, and a neural network. We first used cross-validation to obtain the predictions of the five classifiers. The weight of each weak classifier was then derived from its predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum, over the five classifiers, of each classifier's weight multiplied by its predicted probability.
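The weighting step described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact regression setup, so the sketch assumes an unconstrained least-squares fit of one-hot labels on out-of-fold predicted probabilities. The function names (`solve_linear_system`, `fit_weights`, `weighted_vote`) and the toy data with two classifiers and three classes are hypothetical.

```python
# Sketch of performance-weighted voting. ASSUMPTION: weights come from an
# unconstrained least-squares fit of one-hot labels on out-of-fold
# probabilities; the abstract does not specify the exact regression setup.


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: classifier outputs are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_weights(oof_probs, labels):
    """Fit one weight per classifier via the normal equations.

    oof_probs[k][i][c] is the out-of-fold probability that classifier k
    assigns to class c for sample i; labels[i] is the true class index.
    """
    k = len(oof_probs)
    n_classes = len(oof_probs[0][0])
    gram = [[0.0] * k for _ in range(k)]
    rhs = [0.0] * k
    for i, label in enumerate(labels):
        for c in range(n_classes):
            target = 1.0 if c == label else 0.0
            feats = [oof_probs[j][i][c] for j in range(k)]
            for p in range(k):
                rhs[p] += feats[p] * target
                for q in range(k):
                    gram[p][q] += feats[p] * feats[q]
    return solve_linear_system(gram, rhs)


def weighted_vote(probs_per_classifier, weights):
    """Return (predicted class, per-class weighted scores) for one sample."""
    n_classes = len(probs_per_classifier[0])
    scores = [
        sum(w * p[c] for w, p in zip(weights, probs_per_classifier))
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=scores.__getitem__), scores


# Toy out-of-fold predictions (hypothetical): classifier A is sharper than B.
TRUE_LABELS = [0, 1, 2, 1]
OOF_A = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.2, 0.7, 0.1]]
OOF_B = [[0.4, 0.3, 0.3], [0.5, 0.3, 0.2], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3]]

weights = fit_weights([OOF_A, OOF_B], TRUE_LABELS)
prediction, scores = weighted_vote([[0.2, 0.6, 0.2], [0.5, 0.3, 0.2]], weights)
print(weights, prediction)
```

Because the fit in this sketch is unconstrained, a weak classifier can receive a negative weight and the weights need not sum to one. Whether the published model constrains or normalizes its weights is not stated in the abstract.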

Results: Using the somatic mutation count of each gene as the input feature, the overall accuracy of the performance-weighted-voting model reached 71.46%, significantly higher than that of each of the five weak classifiers and of two other ensemble models, the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types a higher tumor mutational burden can improve overall accuracy.
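For readers unfamiliar with the two baseline ensembles named in the Results, the sketch below contrasts hard voting (majority vote over each classifier's top class) with soft voting (argmax of the averaged probabilities). It follows the standard definitions of these two schemes. The tie-breaking rule (lowest class index wins) and the toy probabilities are assumptions made for illustration only.

```python
from collections import Counter


def hard_vote(probs_per_classifier):
    """Majority vote over each classifier's most probable class.

    ASSUMPTION: ties are broken in favor of the lowest class index.
    """
    votes = [max(range(len(p)), key=p.__getitem__) for p in probs_per_classifier]
    counts = Counter(votes)
    best = max(counts.values())
    return min(cls for cls, n in counts.items() if n == best)


def soft_vote(probs_per_classifier):
    """Argmax of the unweighted mean of the predicted probabilities."""
    n_clf = len(probs_per_classifier)
    n_classes = len(probs_per_classifier[0])
    mean = [sum(p[c] for p in probs_per_classifier) / n_clf for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Two classifiers weakly favor class 0; one strongly favors class 1.
sample = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
print(hard_vote(sample), soft_vote(sample))
```

In this toy case the two schemes disagree: hard voting counts two votes for class 0, while soft voting lets the one confident classifier pull the average toward class 1. A performance-weighted scheme differs from both in that it replaces the equal treatment of classifiers with fitted weights.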

Conclusion: This study has important clinical significance for identifying the origin of cancer, especially for cancers whose primary site cannot be determined. In addition, our model presents a good strategy for applying ensemble systems to cancer type classification.

Keywords: cancer type classification; ensemble method; performance-weighted-voting model; linear regression; single-nucleotide polymorphism
Received: 2020-07-15      Published: 2020-12-24
Corresponding Author(s): Yuan Luo   
Cite this article:
Yawei Li, Yuan Luo. Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation. Quant. Biol., 2020, 8(4): 347-358.
Link to this article:
https://academic.hep.com.cn/qb/CN/10.1007/s40484-020-0226-1
https://academic.hep.com.cn/qb/CN/Y2020/V8/I4/347
Fig.1  
Fig.2  
Classifier                     Accuracy
Logistic regression            68.67%ᵃ ± 1.21%ᵇ
SVM                            63.74% ± 0.72%
Random forest                  54.79% ± 1.64%
XGBoost                        62.89% ± 1.43%
Neural network                 68.07% ± 0.94%
Hard-voting                    69.06% ± 1.33%
Soft-voting                    69.66% ± 1.37%
Performance-weighted-voting    71.46% ± 1.02%
Tab.1  
Fig.3  
      BLCA  BRCA   GBM  HNSC  KIRC   LGG  LIHC  LUAD  LUSC  PRAD  SKCM  STAD  THCA  UCEC
BLCA    47     3     0    10     0     0     4     1     2     5     0     6     1     1
BRCA     1   106     2     8     1     1     9     2     0    19     0     2     6     5
GBM      0     2    36     2     0     6     0     0     0     2     2     0     0     0
HNSC     3     4     1    68     0     1     1     1    10     5     1     6     0     1
KIRC     2     3     1     0    50     0     1     0     0    12     1     0     0     1
LGG      0     1     9     1     0    78     0     1     0     0     0     0     1     0
LIHC     1     5     0     6     4     3    47     0     0     5     0     2     0     2
LUAD     1     4     1     9     1     3     1    55    11    12     0     3     2     0
LUSC     2     1     1     9     0     0     2     8    65     0     0     6     3     0
PRAD     0    10     2     1     0     2     0     0     0    80     0     3     6     0
SKCM     0     0     3     0     0     0     1     1     0     3    60     2     2     0
STAD     2     6     0     2     1     0     6     2     1     3     0    43     0     0
THCA     0     1     0     0     0     0     0     0     0     6     1     0    81     0
UCEC     3     8     0     1     0     0     0     1     1     2     0     2     0    70
Tab.2  
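As a worked check on the matrix in Tab.2, the sketch below computes the overall accuracy (diagonal sum divided by the grand total) and the share of each row that falls on the diagonal. The page does not state whether rows are true or predicted labels, so the per-row figure equals per-class recall only under the assumption that rows are true labels. The variable names are illustrative.

```python
# Counts transcribed from Tab.2; row and column order is the same.
CANCER_TYPES = ["BLCA", "BRCA", "GBM", "HNSC", "KIRC", "LGG", "LIHC",
                "LUAD", "LUSC", "PRAD", "SKCM", "STAD", "THCA", "UCEC"]
MATRIX = [
    [47, 3, 0, 10, 0, 0, 4, 1, 2, 5, 0, 6, 1, 1],
    [1, 106, 2, 8, 1, 1, 9, 2, 0, 19, 0, 2, 6, 5],
    [0, 2, 36, 2, 0, 6, 0, 0, 0, 2, 2, 0, 0, 0],
    [3, 4, 1, 68, 0, 1, 1, 1, 10, 5, 1, 6, 0, 1],
    [2, 3, 1, 0, 50, 0, 1, 0, 0, 12, 1, 0, 0, 1],
    [0, 1, 9, 1, 0, 78, 0, 1, 0, 0, 0, 0, 1, 0],
    [1, 5, 0, 6, 4, 3, 47, 0, 0, 5, 0, 2, 0, 2],
    [1, 4, 1, 9, 1, 3, 1, 55, 11, 12, 0, 3, 2, 0],
    [2, 1, 1, 9, 0, 0, 2, 8, 65, 0, 0, 6, 3, 0],
    [0, 10, 2, 1, 0, 2, 0, 0, 0, 80, 0, 3, 6, 0],
    [0, 0, 3, 0, 0, 0, 1, 1, 0, 3, 60, 2, 2, 0],
    [2, 6, 0, 2, 1, 0, 6, 2, 1, 3, 0, 43, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 6, 1, 0, 81, 0],
    [3, 8, 0, 1, 0, 0, 0, 1, 1, 2, 0, 2, 0, 70],
]

# Overall accuracy does not depend on the row/column orientation.
correct = sum(MATRIX[i][i] for i in range(len(CANCER_TYPES)))
total = sum(sum(row) for row in MATRIX)
overall_accuracy = correct / total

# Diagonal share per row (per-class recall IF rows are true labels).
diagonal_share = {
    name: row[i] / sum(row)
    for i, (name, row) in enumerate(zip(CANCER_TYPES, MATRIX))
}

print(f"{correct}/{total} = {overall_accuracy:.2%}")
for name, share in sorted(diagonal_share.items(), key=lambda kv: kv[1]):
    print(f"{name}: {share:.1%}")
```

The matrix sums to 1,250 samples with 886 on the diagonal, about 70.9%. This single matrix need not reproduce the 71.46% mean accuracy in Tab.1, which is reported with a standard deviation and so presumably averages over several runs or folds.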
Fig.4  
Fig.5  