Quantitative Biology

Quant. Biol. 2017, Vol. 5, Issue 4: 338-351. https://doi.org/10.1007/s40484-017-0121-6
RESEARCH ARTICLE
Variable importance-weighted Random Forests
Yiyi Liu 1, Hongyu Zhao 1,2
1. Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06511, USA
2. Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
Abstract

Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly due to its rigid feature selection and the increased correlation between trees in the forest.

Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node when building trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features.
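A minimal R sketch of this weighted sampling step (an illustration of the description above, not the authors' viRandomForests implementation): first obtain variable importance scores from a standard Random Forests fit, then draw the mtry candidate features for a node with probability proportional to those scores.

library(randomForest)
data(iris)
set.seed(1)
# Step 1: fit a standard Random Forests model to obtain variable importance scores.
rf0 <- randomForest(Species ~ ., data = iris, importance = TRUE)
vi  <- importance(rf0, type = 2)[, 1]    # mean decrease in Gini, one score per feature
# Step 2: convert the scores into sampling probabilities.
w <- pmax(vi, 0)
w <- w / sum(w)
# Step 3: at a given node, sample the candidate features with these probabilities
# instead of uniformly, then search for the best split among them.
mtry <- floor(sqrt(length(w)))
candidates <- sample(names(w), size = mtry, replace = FALSE, prob = w)
candidates

The best split is then chosen among these candidates exactly as in standard Random Forests; only the sampling distribution of the candidate features changes.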

Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method shows improved performance in most cases.

Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize the more informative features without completely ignoring the less informative ones, and hence achieves improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package “viRandomForests” based on the original R package “randomForest”; it can be freely downloaded from http://zhaocenter.org/software.
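The result tables below distinguish "normalized" and "unnormalized" viRF. Assuming that "normalized" refers to permutation importance scaled by its standard error (randomForest's scale = TRUE) and "unnormalized" to the raw permutation importance (scale = FALSE), an assumption not confirmed by this page, the two kinds of score and their conversion to sampling weights can be sketched as:

library(randomForest)
data(iris)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
vi_norm   <- importance(fit, type = 1, scale = TRUE)[, 1]   # assumed "normalized" variant
vi_unnorm <- importance(fit, type = 1, scale = FALSE)[, 1]  # assumed "unnormalized" variant
# Turn either score vector into sampling probabilities for the weighted feature draw.
to_prob <- function(v) { v <- pmax(v, 0); if (sum(v) == 0) v[] <- 1; v / sum(v) }
rbind(normalized = to_prob(vi_norm), unnormalized = to_prob(vi_unnorm))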

Author Summary  Random Forests is a powerful classification and regression method and is commonly used in genomic data analyses. However, when the number of features is very large and the signals are relatively weak, its performance tends to decline. In this article, we propose a modified Random Forests which, instead of considering each feature equally as the original Random Forests does, weights the features according to their informativeness (measured by the variable importance score). This new method, variable importance-weighted Random Forests (viRF), demonstrates improved performance in both simulation studies and real data analyses.
Keywords: Random Forests; variable importance score; classification; regression
Corresponding Author(s): Hongyu Zhao   
Online First Date: 06 November 2017    Issue Date: 04 December 2017
 Cite this article:   
Yiyi Liu, Hongyu Zhao. Variable importance-weighted Random Forests[J]. Quant. Biol., 2017, 5(4): 338-351.
 URL:  
https://academic.hep.com.cn/qb/EN/10.1007/s40484-017-0121-6
https://academic.hep.com.cn/qb/EN/Y2017/V5/I4/338
Figure: Tree structure used in regression Model 3.
Figure: MSE in regression simulation Model 1.
Figure: MSE in regression simulation Model 2.
Figure: MSE in regression simulation Model 3.
Drug | viRF, normalized | viRF, unnormalized | eRF, marginal test | feRF, normalized, recursive | feRF, unnormalized, recursive | feRF, normalized, nonrecursive | feRF, unnormalized, nonrecursive | RF | # cell lines
17-AAG 0.914 (0.007) 0.897 (0.012) 0.896 (0.013) 0.914 (0.012) 0.905 (0.016) 0.919 (0.016) 0.902 (0.011) 0.932 (0.007) 503
AEW541 0.303 (0.004) 0.305 (0.006) 0.297 (0.005) 0.302 (0.005) 0.300 (0.006) 0.304 (0.007) 0.300 (0.004) 0.305 (0.003) 503
AZD0530 0.544 (0.008) 0.553 (0.011) 0.546 (0.017) 0.547 (0.015) 0.544 (0.010) 0.543 (0.009) 0.543 (0.011) 0.540 (0.007) 504
AZD6244 0.879 (0.014) 0.852 (0.016) 0.860 (0.016) 0.866 (0.021) 0.861 (0.017) 0.864 (0.019) 0.859 (0.015) 0.935 (0.015) 503
Erlotinib 0.326 (0.004) 0.326 (0.006) 0.325 (0.004) 0.327 (0.005) 0.327 (0.005) 0.327 (0.005) 0.328 (0.006) 0.331 (0.003) 503
Irinotecan 0.746 (0.012) 0.669 (0.016) 0.789 (0.010) 0.685 (0.036) 0.698 (0.039) 0.679 (0.026) 0.698 (0.024) 0.810 (0.010) 317
L-685458 0.223 (0.004) 0.225 (0.005) 0.218 (0.004) 0.219 (0.005) 0.220 (0.006) 0.220 (0.006) 0.219 (0.005) 0.221 (0.004) 491
LBW242 0.469 (0.005) 0.477 (0.009) 0.472 (0.006) 0.472 (0.005) 0.471 (0.005) 0.471 (0.006) 0.473 (0.006) 0.470 (0.003) 503
Lapatinib 0.323 (0.004) 0.307 (0.008) 0.320 (0.005) 0.317 (0.010) 0.316 (0.009) 0.314 (0.008) 0.316 (0.009) 0.335 (0.003) 504
Nilotinib 0.489 (0.020) 0.488 (0.024) 0.469 (0.018) 0.493 (0.027) 0.497 (0.035) 0.495 (0.026) 0.490 (0.028) 0.490 (0.016) 420
Nutlin-3 0.198 (0.003) 0.200 (0.004) 0.197 (0.003) 0.199 (0.003) 0.199 (0.003) 0.199 (0.003) 0.199 (0.003) 0.198 (0.002) 504
PD-0325901 1.258 (0.018) 1.223 (0.023) 1.242 (0.022) 1.233 (0.024) 1.239 (0.024) 1.233 (0.018) 1.237 (0.029) 1.367 (0.020) 504
PD-0332991 0.295 (0.003) 0.298 (0.004) 0.291 (0.003) 0.294 (0.004) 0.294 (0.006) 0.294 (0.005) 0.296 (0.006) 0.294 (0.003) 434
PF2341066 0.278 (0.005) 0.265 (0.008) 0.280 (0.004) 0.276 (0.013) 0.279 (0.012) 0.276 (0.008) 0.276 (0.007) 0.289 (0.003) 504
PHA-665752 0.248 (0.002) 0.251 (0.002) 0.248 (0.002) 0.249 (0.003) 0.249 (0.003) 0.249 (0.002) 0.249 (0.002) 0.248 (0.002) 503
PLX4720 0.291 (0.004) 0.297 (0.004) 0.290 (0.004) 0.291 (0.005) 0.291 (0.004) 0.293 (0.004) 0.293 (0.007) 0.292 (0.005) 496
Paclitaxel 1.260 (0.013) 1.246 (0.017) 1.232 (0.011) 1.236 (0.014) 1.238 (0.022) 1.227 (0.022) 1.220 (0.017) 1.279 (0.01) 503
Panobinostat 0.387 (0.005) 0.384 (0.006) 0.391 (0.005) 0.391 (0.007) 0.389 (0.006) 0.388 (0.005) 0.388 (0.008) 0.393 (0.005) 500
RAF265 0.466 (0.008) 0.467 (0.009) 0.458 (0.009) 0.464 (0.009) 0.461 (0.010) 0.465 (0.009) 0.459 (0.010) 0.468 (0.006) 460
Sorafenib 0.217 (0.005) 0.221 (0.006) 0.214 (0.005) 0.216 (0.006) 0.216 (0.008) 0.217 (0.006) 0.217 (0.006) 0.217 (0.004) 503
TAE684 0.578 (0.009) 0.576 (0.012) 0.570 (0.008) 0.580 (0.013) 0.575 (0.010) 0.579 (0.010) 0.575 (0.012) 0.586 (0.007) 504
TKI258 0.284 (0.003) 0.279 (0.004) 0.284 (0.004) 0.285 (0.004) 0.284 (0.004) 0.284 (0.003) 0.283 (0.004) 0.287 (0.003) 504
Topotecan 0.996 (0.012) 0.931 (0.013) 1.061 (0.010) 1.001 (0.020) 0.997 (0.018) 0.997 (0.016) 1.000 (0.020) 1.082 (0.009) 504
ZD-6474 0.437 (0.006) 0.433 (0.006) 0.440 (0.009) 0.440 (0.009) 0.437 (0.008) 0.440 (0.007) 0.436 (0.007) 0.442 (0.005) 496
Tab.1  Cross-validation MSE in the CCLE data analysis (standard deviations in parentheses).
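For orientation, a generic sketch of the kind of cross-validated MSE summarized in Tab.1, here using k-fold cross-validation with standard Random Forests regression on simulated data (the paper's exact cross-validation scheme and the CCLE preprocessing are not reproduced, and whether the reported standard deviations are across folds or repeated runs is not stated here):

library(randomForest)
cv_mse <- function(X, y, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(X)))
  sapply(1:k, function(i) {
    fit  <- randomForest(X[folds != i, , drop = FALSE], y[folds != i])
    pred <- predict(fit, X[folds == i, , drop = FALSE])
    mean((pred - y[folds == i])^2)      # test MSE on the held-out fold
  })
}
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)
y <- X[, 1] + rnorm(100)
m <- cv_mse(X, y)
c(mean = mean(m), sd = sd(m))           # mean MSE and standard deviation across folds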
Figure: Tree structure used in classification Model 3. Noise is added by assigning each data point in a terminal node to the denoted class with probability 0.9 and to each of the other classes with probability 0.1/3 (a short sketch of this noise scheme follows the figure captions below).
Figure: Error rate in classification simulation Model 1.
Figure: Error rate in classification simulation Model 2.
Figure: Error rate in classification simulation Model 3.
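The label-noise scheme described for classification Model 3 above can be sketched as follows, assuming four classes so that each non-denoted class receives probability 0.1/3:

# Assign the denoted (true) class with probability 0.9 and each other class with 0.1/3.
add_label_noise <- function(true_class, classes = 1:4, p_keep = 0.9) {
  others <- setdiff(classes, true_class)
  probs  <- c(p_keep, rep((1 - p_keep) / length(others), length(others)))
  sample(c(true_class, others), size = 1, prob = probs)
}
set.seed(1)
truth <- rep(1:4, each = 5)
noisy <- sapply(truth, add_label_noise)
table(truth, noisy)                     # most labels kept, a few flipped at random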
Data set | Main task | Sample size | Number of features
Arcene [15] | Distinguish ovarian and prostate cancer vs. normal using mass-spectrometric data | Class 1 (cancer): 112; Class 2 (normal): 88 | 10,000
Pomeroy [16] | Distinguish central nervous system embryonal tumor subtypes using gene expression data | Class 1: 10; Class 2: 10; Class 3: 10; Class 4: 4; Class 5: 8 | 5,597
Singh [17] | Distinguish prostate cancer vs. normal using gene expression data | Class 1 (cancer): 52; Class 2 (normal): 50 | 6,033
Tab.2  Classification data sets summary.
Method | Arcene: Overall [Class 1, Class 2] | Pomeroy: Overall [Class 1, Class 2, Class 3, Class 4, Class 5] | Singh: Overall [Class 1, Class 2]
viRF, normalized 0.168 (0.025) [0.078 (0.013), 0.089 (0.014)] 0.261 (0.048) [0.021 (0.017), 0.025 (0.005), 0.012 (0.014), 0.063 (0.024), 0.139 (0.030)] 0.057 (0.008) [0.013 (0.005), 0.044 (0.006)]
viRF, unnormalized 0.175 (0.028) [0.082 (0.017), 0.093 (0.015)] 0.255 (0.048) [0.026 (0.019), 0.024 (0.008), 0.014 (0.016), 0.061 (0.025), 0.130 (0.032)] 0.082 (0.011) [0.032 (0.011), 0.05 (0.004)]
eRF, marginal test 0.178 (0.017) [0.080 (0.011), 0.098 (0.013)] 0.276 (0.051) [0.029 (0.021), 0.031 (0.017), 0.013 (0.014), 0.054 (0.024), 0.150 (0.023)] 0.063 (0.011) [0.016 (0.008), 0.047 (0.006)]
feRF, normalized, recursive 0.195 (0.033) [0.100 (0.022), 0.096 (0.017)] 0.345 (0.094) [0.067 (0.035), 0.058 (0.033), 0.038 (0.029), 0.058 (0.024), 0.124 (0.033)] 0.111 (0.024) [0.054 (0.020), 0.057 (0.011)]
feRF,unnormalized, recursive 0.211 (0.027) [0.101 (0.019), 0.110 (0.017)] 0.395 (0.102) [0.073 (0.037), 0.065 (0.046), 0.056 (0.035), 0.064 (0.019), 0.137 (0.037)] 0.106 (0.024) [0.050 (0.018), 0.056 (0.016)]
feRF, normalized, nonrecursive 0.203 (0.026) [0.103 (0.015), 0.101 (0.018)] 0.339 (0.090) [0.054 (0.042), 0.058 (0.036), 0.031 (0.030), 0.064 (0.027), 0.132 (0.037)] 0.114 (0.024) [0.060 (0.014), 0.054 (0.017)]
feRF,unnormalized, nonrecursive 0.213 (0.022) [0.103 (0.016), 0.110 (0.015)] 0.319 (0.075) [0.050 (0.033), 0.044 (0.025), 0.032 (0.032), 0.061 (0.021), 0.132 (0.031)] 0.102 (0.022) [0.046 (0.016), 0.056 (0.015)]
RF 0.180 (0.022) [0.082 (0.010), 0.099 (0.015)] 0.273 (0.043) [0.015 (0.014), 0.026 (0.007), 0.008 (0.012), 0.070 (0.020), 0.152 (0.026)] 0.091 (0.011) [0.027 (0.008), 0.064 (0.006)]
Tab.3  Cross-validation error rate in the cancer (subtype) classification analysis (standard deviations in parentheses).
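The bracketed per-class values in Tab.3 appear to sum to the overall error rate, suggesting that each bracketed number is the share of all samples that belong to that class and are misclassified. Under this reading (an assumption based only on the numbers above), the summary can be sketched as:

# Overall error rate plus each class's contribution to it.
error_summary <- function(truth, pred) {
  overall   <- mean(pred != truth)
  per_class <- sapply(levels(truth), function(cl) mean(pred != truth & truth == cl))
  list(overall = overall, per_class = per_class)   # per_class sums to overall
}
set.seed(1)
truth <- factor(sample(c("Class1", "Class2"), 100, replace = TRUE))
pred  <- truth
flip  <- sample(100, 10)
pred[flip] <- sample(levels(truth), 10, replace = TRUE)
error_summary(truth, pred)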
1 D. Hanahan and R. A. Weinberg (2011) Hallmarks of cancer: the next generation. Cell, 144, 646–674
https://doi.org/10.1016/j.cell.2011.02.013 pmid: 21376230
2 L. Breiman (2001) Random forests. Mach. Learn., 45, 5–32
https://doi.org/10.1023/A:1010933404324
3 D. S. Palmer, N. M. O’Boyle, R. C. Glen and J. B. Mitchell (2007) Random forest models to predict aqueous solubility. J. Chem. Inf. Model., 47, 150–158
https://doi.org/10.1021/ci060164k pmid: 17238260
4 P. Jiang, H. Wu, W. Wang, W. Ma, X. Sun and Z. Lu (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res., 35, W339–W344
5 J. W. Lee, J. B. Lee, M. Park and S. H. Song (2005) An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data Anal., 48, 869–885
6 B. A. Goldstein, E. C. Polley and F. B. Briggs (2011) Random forests for genetic association studies. Stat. Appl. Genet. Mol. Biol., 10, 32
https://doi.org/10.2202/1544-6115.1691 pmid: 22889876
7 D. Amaratunga, J. Cabrera and Y. S. Lee (2008) Enriched random forests. Bioinformatics, 24, 2010–2014
https://doi.org/10.1093/bioinformatics/btn356 pmid: 18650208
8 P. M. Granitto, C. Furlanello, F. Biasioli and F. Gasperi (2006) Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometr. Intell. Lab., 83, 83–90
https://doi.org/10.1016/j.chemolab.2006.01.007
9 V. Svetnik, A. Liaw, C. Tong and T. Wang (2004) Application of Breiman’s random forest to modeling structure-activity relationships of pharmaceutical molecules. Lect. Notes Comput. Sci., 3077, 334–343
https://doi.org/10.1007/978-3-540-25966-4_33
10 R. Díaz-Uriarte and S. A. de Andrés (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 3
https://doi.org/10.1186/1471-2105-7-3 pmid: 16398926
11 L. Breiman (2001) Statistical modeling: the two cultures. Stat. Sci., 16, 199–231
https://doi.org/10.1214/ss/1009213726
12 D. Amaratunga and J. Cabrera (2009) A conditional t suite of tests for identifying differentially expressed genes in a DNA microarray experiment with little replication. Stat. Biopharm. Res., 1, 26–38
https://doi.org/10.1198/sbr.2009.0003
13 G. Biau (2012) Analysis of a random forests model. J. Mach. Learn. Res., 13, 1063–1095
14 J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A. A. Margolin, S. Kim, C. J. Wilson, J. Lehár, G. V. Kryukov, D. Sonkin, et al. (2012) The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483, 603–607
https://doi.org/10.1038/nature11003 pmid: 22460905
15 I. Guyon, S. Gunn, A. Ben-Hur and G. Dror (2004) Result analysis of the NIPS 2003 feature selection challenge. In: Proceedings of the 17th International Conference on Neural Information Processing Systems (NIPS’04), pp. 545–552
16 S. L. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. Kim, L. C. Goumnerova, P. M. Black, C. Lau, et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436–442
https://doi.org/10.1038/415436a pmid: 11807556
17 D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1, 203–209
https://doi.org/10.1016/S1535-6108(02)00030-2 pmid: 12086878
Supplementary material: QB-07121-OF-ZHY_suppl_1