Quantitative Biology

Quant. Biol. 2017, Vol. 5, Issue 4: 338-351. https://doi.org/10.1007/s40484-017-0121-6
RESEARCH ARTICLE
Variable importance-weighted Random Forests
Yiyi Liu 1, Hongyu Zhao 1,2
1. Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06511, USA
2. Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
Abstract

Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly due to its rigid feature selection and the increased correlation between trees in the forest.

Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node when building trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features.
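A minimal R sketch of this weighted sampling step (an illustration of the description above, not the authors' viRandomForests implementation): first obtain variable importance scores from a standard Random Forests fit, then draw the mtry candidate features for a node with probability proportional to those scores.

library(randomForest)
data(iris)
set.seed(1)
# Step 1: fit a standard Random Forests model to obtain variable importance scores.
rf0 <- randomForest(Species ~ ., data = iris, importance = TRUE)
vi  <- importance(rf0, type = 2)[, 1]    # mean decrease in Gini, one score per feature
# Step 2: convert the scores into sampling probabilities.
w <- pmax(vi, 0)
w <- w / sum(w)
# Step 3: at a given node, sample the candidate features with these probabilities
# instead of uniformly, then search for the best split among them.
mtry <- floor(sqrt(length(w)))
candidates <- sample(names(w), size = mtry, replace = FALSE, prob = w)
candidates

The best split is then chosen among these candidates exactly as in standard Random Forests; only the sampling distribution of the candidate features changes.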

Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method shows improved performance in most cases.

Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize the more informative features without completely ignoring the less informative ones, and hence achieves improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package “viRandomForests” based on the original R package “randomForest”; it can be freely downloaded from http://zhaocenter.org/software.
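The result tables below distinguish "normalized" and "unnormalized" viRF. Assuming that "normalized" refers to permutation importance scaled by its standard error (randomForest's scale = TRUE) and "unnormalized" to the raw permutation importance (scale = FALSE), an assumption not confirmed by this page, the two kinds of score and their conversion to sampling weights can be sketched as:

library(randomForest)
data(iris)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
vi_norm   <- importance(fit, type = 1, scale = TRUE)[, 1]   # assumed "normalized" variant
vi_unnorm <- importance(fit, type = 1, scale = FALSE)[, 1]  # assumed "unnormalized" variant
# Turn either score vector into sampling probabilities for the weighted feature draw.
to_prob <- function(v) { v <- pmax(v, 0); if (sum(v) == 0) v[] <- 1; v / sum(v) }
rbind(normalized = to_prob(vi_norm), unnormalized = to_prob(vi_unnorm))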

Author Summary  Random Forests is a powerful classification and regression method and is commonly used in genomic data analyses. However, when the number of features is very large and the signals are relatively weak, its performance tends to decline. In this article, we propose a modified Random Forests which, instead of considering each feature equally as the original Random Forests does, weights the features according to their informativeness (measured by the variable importance score). This new method, variable importance-weighted Random Forests (viRF), demonstrates improved performance in both simulation studies and real data analyses.
Keywords: Random Forests; variable importance score; classification; regression
Corresponding Author(s): Hongyu Zhao   
Online First Date: 06 November 2017    Issue Date: 04 December 2017
 Cite this article:   
Yiyi Liu, Hongyu Zhao. Variable importance-weighted Random Forests[J]. Quant. Biol., 2017, 5(4): 338-351.
 URL:  
https://academic.hep.com.cn/qb/EN/10.1007/s40484-017-0121-6
https://academic.hep.com.cn/qb/EN/Y2017/V5/I4/338
Figure: Tree structure used in regression Model 3.
Figure: MSE in regression simulation Model 1.
Figure: MSE in regression simulation Model 2.
Figure: MSE in regression simulation Model 3.
Drug | viRF, normalized | viRF, unnormalized | eRF, marginal test | feRF, normalized, recursive | feRF, unnormalized, recursive | feRF, normalized, nonrecursive | feRF, unnormalized, nonrecursive | RF | # cell lines
17-AAG 0.914 (0.007) 0.897 (0.012) 0.896 (0.013) 0.914 (0.012) 0.905 (0.016) 0.919 (0.016) 0.902 (0.011) 0.932 (0.007) 503
AEW541 0.303 (0.004) 0.305 (0.006) 0.297 (0.005) 0.302 (0.005) 0.300 (0.006) 0.304 (0.007) 0.300 (0.004) 0.305 (0.003) 503
AZD0530 0.544 (0.008) 0.553 (0.011) 0.546 (0.017) 0.547 (0.015) 0.544 (0.010) 0.543 (0.009) 0.543 (0.011) 0.540 (0.007) 504
AZD6244 0.879 (0.014) 0.852 (0.016) 0.860 (0.016) 0.866 (0.021) 0.861 (0.017) 0.864 (0.019) 0.859 (0.015) 0.935 (0.015) 503
Erlotinib 0.326 (0.004) 0.326 (0.006) 0.325 (0.004) 0.327 (0.005) 0.327 (0.005) 0.327 (0.005) 0.328 (0.006) 0.331 (0.003) 503
Irinotecan 0.746 (0.012) 0.669 (0.016) 0.789 (0.010) 0.685 (0.036) 0.698 (0.039) 0.679 (0.026) 0.698 (0.024) 0.810 (0.010) 317
L-685458 0.223 (0.004) 0.225 (0.005) 0.218 (0.004) 0.219 (0.005) 0.220 (0.006) 0.220 (0.006) 0.219 (0.005) 0.221 (0.004) 491
LBW242 0.469 (0.005) 0.477 (0.009) 0.472 (0.006) 0.472 (0.005) 0.471 (0.005) 0.471 (0.006) 0.473 (0.006) 0.470 (0.003) 503
Lapatinib 0.323 (0.004) 0.307 (0.008) 0.320 (0.005) 0.317 (0.010) 0.316 (0.009) 0.314 (0.008) 0.316 (0.009) 0.335 (0.003) 504
Nilotinib 0.489 (0.020) 0.488 (0.024) 0.469 (0.018) 0.493 (0.027) 0.497 (0.035) 0.495 (0.026) 0.490 (0.028) 0.490 (0.016) 420
Nutlin-3 0.198 (0.003) 0.200 (0.004) 0.197 (0.003) 0.199 (0.003) 0.199 (0.003) 0.199 (0.003) 0.199 (0.003) 0.198 (0.002) 504
PD-0325901 1.258 (0.018) 1.223 (0.023) 1.242 (0.022) 1.233 (0.024) 1.239 (0.024) 1.233 (0.018) 1.237 (0.029) 1.367 (0.020) 504
PD-0332991 0.295 (0.003) 0.298 (0.004) 0.291 (0.003) 0.294 (0.004) 0.294 (0.006) 0.294 (0.005) 0.296 (0.006) 0.294 (0.003) 434
PF2341066 0.278 (0.005) 0.265 (0.008) 0.280 (0.004) 0.276 (0.013) 0.279 (0.012) 0.276 (0.008) 0.276 (0.007) 0.289 (0.003) 504
PHA-665752 0.248 (0.002) 0.251 (0.002) 0.248 (0.002) 0.249 (0.003) 0.249 (0.003) 0.249 (0.002) 0.249 (0.002) 0.248 (0.002) 503
PLX4720 0.291 (0.004) 0.297 (0.004) 0.290 (0.004) 0.291 (0.005) 0.291 (0.004) 0.293 (0.004) 0.293 (0.007) 0.292 (0.005) 496
Paclitaxel 1.260 (0.013) 1.246 (0.017) 1.232 (0.011) 1.236 (0.014) 1.238 (0.022) 1.227 (0.022) 1.220 (0.017) 1.279 (0.01) 503
Panobinostat 0.387 (0.005) 0.384 (0.006) 0.391 (0.005) 0.391 (0.007) 0.389 (0.006) 0.388 (0.005) 0.388 (0.008) 0.393 (0.005) 500
RAF265 0.466 (0.008) 0.467 (0.009) 0.458 (0.009) 0.464 (0.009) 0.461 (0.010) 0.465 (0.009) 0.459 (0.010) 0.468 (0.006) 460
Sorafenib 0.217 (0.005) 0.221 (0.006) 0.214 (0.005) 0.216 (0.006) 0.216 (0.008) 0.217 (0.006) 0.217 (0.006) 0.217 (0.004) 503
TAE684 0.578 (0.009) 0.576 (0.012) 0.570 (0.008) 0.580 (0.013) 0.575 (0.010) 0.579 (0.010) 0.575 (0.012) 0.586 (0.007) 504
TKI258 0.284 (0.003) 0.279 (0.004) 0.284 (0.004) 0.285 (0.004) 0.284 (0.004) 0.284 (0.003) 0.283 (0.004) 0.287 (0.003) 504
Topotecan 0.996 (0.012) 0.931 (0.013) 1.061 (0.010) 1.001 (0.020) 0.997 (0.018) 0.997 (0.016) 1.000 (0.020) 1.082 (0.009) 504
ZD-6474 0.437 (0.006) 0.433 (0.006) 0.440 (0.009) 0.440 (0.009) 0.437 (0.008) 0.440 (0.007) 0.436 (0.007) 0.442 (0.005) 496
Tab.1  Cross-validation MSE in the CCLE data analysis (standard deviations in parentheses).
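For orientation, a generic sketch of the kind of cross-validated MSE summarized in Tab.1, here using k-fold cross-validation with standard Random Forests regression on simulated data (the paper's exact cross-validation scheme and the CCLE preprocessing are not reproduced, and whether the reported standard deviations are across folds or repeated runs is not stated here):

library(randomForest)
cv_mse <- function(X, y, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(X)))
  sapply(1:k, function(i) {
    fit  <- randomForest(X[folds != i, , drop = FALSE], y[folds != i])
    pred <- predict(fit, X[folds == i, , drop = FALSE])
    mean((pred - y[folds == i])^2)      # test MSE on the held-out fold
  })
}
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)
y <- X[, 1] + rnorm(100)
m <- cv_mse(X, y)
c(mean = mean(m), sd = sd(m))           # mean MSE and standard deviation across folds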
Figure: Tree structure used in classification Model 3. Noise is added by assigning each data point in a terminal node to the denoted class with probability 0.9 and to each of the other classes with probability 0.1/3 (a short sketch of this noise scheme follows the figure captions below).
Figure: Error rate in classification simulation Model 1.
Figure: Error rate in classification simulation Model 2.
Figure: Error rate in classification simulation Model 3.
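The label-noise scheme described for classification Model 3 above can be sketched as follows, assuming four classes so that each non-denoted class receives probability 0.1/3:

# Assign the denoted (true) class with probability 0.9 and each other class with 0.1/3.
add_label_noise <- function(true_class, classes = 1:4, p_keep = 0.9) {
  others <- setdiff(classes, true_class)
  probs  <- c(p_keep, rep((1 - p_keep) / length(others), length(others)))
  sample(c(true_class, others), size = 1, prob = probs)
}
set.seed(1)
truth <- rep(1:4, each = 5)
noisy <- sapply(truth, add_label_noise)
table(truth, noisy)                     # most labels kept, a few flipped at random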
Data set | Main task | Sample size | Number of features
Arcene [15] | Distinguish ovarian and prostate cancer vs. normal using mass-spectrometric data | Class 1 (cancer): 112; Class 2 (normal): 88 | 10,000
Pomeroy [16] | Distinguish central nervous system embryonal tumor subtypes using gene expression data | Class 1: 10; Class 2: 10; Class 3: 10; Class 4: 4; Class 5: 8 | 5,597
Singh [17] | Distinguish prostate cancer vs. normal using gene expression data | Class 1 (cancer): 52; Class 2 (normal): 50 | 6,033
Tab.2  Classification data sets summary.
Method | Arcene: Overall [Class 1, Class 2] | Pomeroy: Overall [Class 1, Class 2, Class 3, Class 4, Class 5] | Singh: Overall [Class 1, Class 2]
viRF, normalized 0.168 (0.025) [0.078 (0.013), 0.089 (0.014)] 0.261 (0.048) [0.021 (0.017), 0.025 (0.005), 0.012 (0.014), 0.063 (0.024), 0.139 (0.030)] 0.057 (0.008) [0.013 (0.005), 0.044 (0.006)]
viRF, unnormalized 0.175 (0.028) [0.082 (0.017), 0.093 (0.015)] 0.255 (0.048) [0.026 (0.019), 0.024 (0.008), 0.014 (0.016), 0.061 (0.025), 0.130 (0.032)] 0.082 (0.011) [0.032 (0.011), 0.05 (0.004)]
eRF, marginal test 0.178 (0.017) [0.080 (0.011), 0.098 (0.013)] 0.276 (0.051) [0.029 (0.021), 0.031 (0.017), 0.013 (0.014), 0.054 (0.024), 0.150 (0.023)] 0.063 (0.011) [0.016 (0.008), 0.047 (0.006)]
feRF, normalized, recursive 0.195 (0.033) [0.100 (0.022), 0.096 (0.017)] 0.345 (0.094) [0.067 (0.035), 0.058 (0.033), 0.038 (0.029), 0.058 (0.024), 0.124 (0.033)] 0.111 (0.024) [0.054 (0.020), 0.057 (0.011)]
feRF,unnormalized, recursive 0.211 (0.027) [0.101 (0.019), 0.110 (0.017)] 0.395 (0.102) [0.073 (0.037), 0.065 (0.046), 0.056 (0.035), 0.064 (0.019), 0.137 (0.037)] 0.106 (0.024) [0.050 (0.018), 0.056 (0.016)]
feRF, normalized, nonrecursive 0.203 (0.026) [0.103 (0.015), 0.101 (0.018)] 0.339 (0.090) [0.054 (0.042), 0.058 (0.036), 0.031 (0.030), 0.064 (0.027), 0.132 (0.037)] 0.114 (0.024) [0.060 (0.014), 0.054 (0.017)]
feRF,unnormalized, nonrecursive 0.213 (0.022) [0.103 (0.016), 0.110 (0.015)] 0.319 (0.075) [0.050 (0.033), 0.044 (0.025), 0.032 (0.032), 0.061 (0.021), 0.132 (0.031)] 0.102 (0.022) [0.046 (0.016), 0.056 (0.015)]
RF 0.180 (0.022) [0.082 (0.010), 0.099 (0.015)] 0.273 (0.043) [0.015 (0.014), 0.026 (0.007), 0.008 (0.012), 0.070 (0.020), 0.152 (0.026)] 0.091 (0.011) [0.027 (0.008), 0.064 (0.006)]
Tab.3  Cross-validation error rate in the cancer (subtype) classification analysis (standard deviations in parentheses).
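The bracketed per-class values in Tab.3 appear to sum to the overall error rate, suggesting that each bracketed number is the share of all samples that belong to that class and are misclassified. Under this reading (an assumption based only on the numbers above), the summary can be sketched as:

# Overall error rate plus each class's contribution to it.
error_summary <- function(truth, pred) {
  overall   <- mean(pred != truth)
  per_class <- sapply(levels(truth), function(cl) mean(pred != truth & truth == cl))
  list(overall = overall, per_class = per_class)   # per_class sums to overall
}
set.seed(1)
truth <- factor(sample(c("Class1", "Class2"), 100, replace = TRUE))
pred  <- truth
flip  <- sample(100, 10)
pred[flip] <- sample(levels(truth), 10, replace = TRUE)
error_summary(truth, pred)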
1 D. Hanahan and R. A. Weinberg (2011) Hallmarks of cancer: the next generation. Cell, 144, 646–674
https://doi.org/10.1016/j.cell.2011.02.013 pmid: 21376230
2 L. Breiman (2001) Random forests. Mach. Learn., 45, 5–32
https://doi.org/10.1023/A:1010933404324
3 D. S. Palmer, N. M. O’Boyle, R. C. Glen and J. B. Mitchell (2007) Random forest models to predict aqueous solubility. J. Chem. Inf. Model., 47, 150–158
https://doi.org/10.1021/ci060164k pmid: 17238260
4 P. Jiang, H. Wu, W. Wang, W. Ma, X. Sun and Z. Lu (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res., 35, W339–W344
5 J. W. Lee, J. B. Lee, M. Park and S. H. Song (2005) An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data Anal., 48, 869–885
6 B. A. Goldstein, E. C. Polley and F. B. Briggs (2011) Random forests for genetic association studies. Stat. Appl. Genet. Mol. Biol., 10, 32
https://doi.org/10.2202/1544-6115.1691 pmid: 22889876
7 D. Amaratunga, J. Cabrera and Y. S. Lee (2008) Enriched random forests. Bioinformatics, 24, 2010–2014
https://doi.org/10.1093/bioinformatics/btn356 pmid: 18650208
8 P. M. Granitto, C. Furlanello, F. Biasioli and F. Gasperi (2006) Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometr. Intell. Lab., 83, 83–90
https://doi.org/10.1016/j.chemolab.2006.01.007
9 V. Svetnik, A. Liaw, C. Tong and T. Wang (2004) Application of Breiman’s random forest to modeling structure-activity relationships of pharmaceutical molecules. Lect. Notes Comput. Sci., 3077, 334–343
https://doi.org/10.1007/978-3-540-25966-4_33
10 R. Díaz-Uriarte and S. A. de Andrés (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 3
https://doi.org/10.1186/1471-2105-7-3 pmid: 16398926
11 L. Breiman (2001) Statistical modeling: the two cultures. Stat. Sci., 16, 199–231
https://doi.org/10.1214/ss/1009213726
12 D. Amaratunga and J. Cabrera (2009) A conditional t suite of tests for identifying differentially expressed genes in a DNA microarray experiment with little replication. Stat. Biopharm. Res., 1, 26–38
https://doi.org/10.1198/sbr.2009.0003
13 G. Biau (2012) Analysis of a random forests model. J. Mach. Learn. Res., 13, 1063–1095
14 J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A. A. Margolin, S. Kim, C. J. Wilson, J. Lehár, G. V. Kryukov, D. Sonkin, et al. (2012) The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483, 603–607
https://doi.org/10.1038/nature11003 pmid: 22460905
15 I. Guyon, S. Gunn, A. Ben-Hur and G. Dror (2004) Result analysis of the NIPS 2003 feature selection challenge. In: Proceedings of the 17th International Conference on Neural Information Processing Systems (NIPS’04), pp. 545–552
16 S. L. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. Kim, L. C. Goumnerova, P. M. Black, C. Lau, et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436–442
https://doi.org/10.1038/415436a pmid: 11807556
17 D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1, 203–209
https://doi.org/10.1016/S1535-6108(02)00030-2 pmid: 12086878
Supplementary material: QB-07121-OF-ZHY_suppl_1