Quantitative Biology

ISSN 2095-4689

ISSN 2095-4697(Online)

CN 10-1028/TM

Postal Subscription Code 80-971

Quant. Biol.    2017, Vol. 5 Issue (1) : 90-98    https://doi.org/10.1007/s40484-017-0096-3
RESEARCH ARTICLE
Construction of precise support vector machine based models for predicting promoter strength
Hailin Meng1, Yingfei Ma2, Guoqin Mai2, Yong Wang3, Chenli Liu1,2
1. Bioengineering Research Center, Guangzhou Institute of Advanced Technology, Chinese Academy of Sciences, Guangzhou 511458, China
2. Center for Synthetic Biology Engineering Research, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
3. Chinese Academy of Sciences Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China
Abstract

Background: Predicting the strength of a prokaryotic promoter from its sequence is of great importance both for fundamental research in the life sciences and for applied synthetic biology. Much progress has been made in building quantitative models for strength prediction; in particular, the introduction of machine learning methods such as the artificial neural network (ANN) has significantly improved prediction accuracy. As one of the most important machine learning methods, the support vector machine (SVM) is particularly powerful for learning from small-sample datasets and is thus expected to work well for this problem.

Methods: To confirm this, we constructed SVM-based models to quantitatively predict promoter strength. A library of 100 promoter sequences and their strength values was randomly divided into two datasets: a training set (≥10 sequences) for model training and a test set (≥10 sequences) for model testing.
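
The random division described in Methods can be sketched as follows. This is a minimal illustration and not the authors' code: the helper name `split_library`, the fixed seed, and the placeholder records are assumptions. Only the library size (100) and the lower bound of 10 sequences per set come from the abstract.

```python
import random

def split_library(records, n_train, seed=0):
    """Randomly split (sequence, strength) records into training and test sets.

    Hypothetical helper: both sets must hold at least 10 records,
    mirroring the bounds stated in the abstract.
    """
    n_test = len(records) - n_train
    if n_train < 10 or n_test < 10:
        raise ValueError("both sets need at least 10 sequences")
    rng = random.Random(seed)   # seeded for reproducibility (an assumption)
    shuffled = records[:]       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder library of 100 records; real data would pair each promoter
# sequence with its measured strength.
library = [(f"seq{i:03d}", i / 100.0) for i in range(100)]
train, test = split_library(library, n_train=90)
```

Seeding the generator makes a given split repeatable, which helps when comparing models trained on the same partition.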

Results: The results indicate that prediction performance increases with the size of the training set, and the best performance was achieved with 90 sequences. After optimization of the model parameters, a high-performance model was finally trained, with a high squared correlation coefficient for fitting both the training set (R² > 0.99) and the test set (R² > 0.98), both of which are better than those of the ANN obtained in our previous work.
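
The two performance measures used in this article, the squared correlation coefficient and the mean squared error (MSE), can be computed as in the sketch below. The function names are hypothetical, and reading "squared correlation coefficient" as the square of Pearson's r between measured and predicted strengths is an interpretation rather than something the abstract states.

```python
def mean_squared_error(measured, predicted):
    """Average of the squared differences between measured and predicted values."""
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)

def squared_correlation(measured, predicted):
    """Square of Pearson's correlation coefficient (assumed meaning of R^2 here)."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_m * var_p)
```

Note that a squared correlation of 1 only requires a perfect linear relation between predictions and measurements, so it is usually reported together with the MSE, which also penalizes systematic offsets.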

Conclusions: Our results demonstrate that SVM-based models can be employed for the quantitative prediction of promoter strength.

Author Summary  Machine learning models can learn from a given dataset and make predictions on unknown data. These technologies are widely used in artificial intelligence and have made tremendous progress, triggering the coming era of “Industry 4.0”. In the life sciences, the introduction of such methods has greatly promoted the development of the discipline, especially modeling in bioinformatics and systems biology. As a powerful machine learning method suitable for small-sample learning, the support vector machine (SVM) was introduced into the field of promoter strength prediction. The good performance of the SVM models demonstrates the promise of this method for predicting promoter strength.
Keywords: support vector machine model; quantitative prediction; promoter strength; machine learning
Corresponding Author(s): Yong Wang,Chenli Liu   
Online First Date: 15 February 2017    Issue Date: 22 March 2017
 Cite this article:   
Hailin Meng,Yingfei Ma,Guoqin Mai, et al. Construction of precise support vector machine based models for predicting promoter strength[J]. Quant. Biol., 2017, 5(1): 90-98.
 URL:  
https://academic.hep.com.cn/qb/EN/10.1007/s40484-017-0096-3
https://academic.hep.com.cn/qb/EN/Y2017/V5/I1/90
Fig.1  A sketch map of promoter strength prediction based on SVM models.
Fig.2  The prediction performance of SVM models as a function of the size of the training set. The size of the training set ranges from 10 to 90 sequences. Each size was independently and randomly sampled five times to train SVM models, and the maximum R² and minimum mean squared error (MSE) values for prediction of the test set were calculated.
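
The sampling scheme in this caption (five random draws per training-set size, keeping the maximum R² and minimum MSE) can be sketched as below. The function `best_scores_by_size` and its `evaluate` callback are hypothetical: `evaluate(train, test)` stands in for "train an SVM on `train` and return (R², MSE) on `test`". The step of 10 between sizes is also an assumption, since the caption gives only the range.

```python
import random

def best_scores_by_size(records, evaluate, sizes=range(10, 91, 10),
                        repeats=5, seed=0):
    """For each training-set size, draw `repeats` random splits and keep the
    maximum R^2 and minimum MSE returned by `evaluate(train, test)`."""
    rng = random.Random(seed)
    summary = {}
    for n_train in sizes:
        r2_values, mse_values = [], []
        for _ in range(repeats):
            shuffled = records[:]      # fresh copy for each independent draw
            rng.shuffle(shuffled)
            r2, mse = evaluate(shuffled[:n_train], shuffled[n_train:])
            r2_values.append(r2)
            mse_values.append(mse)
        # The best R^2 and best MSE may come from different draws.
        summary[n_train] = (max(r2_values), min(mse_values))
    return summary
```

Because the maximum and minimum are taken separately, the two reported values for a given size need not belong to the same trained model.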
Fig.3  Parameter optimization for model training. The balance factor C for the loss function and the width σ of the RBF kernel function were optimized under different precision errors ε to find the best parameters. MSE is shown as a function of log₂ C and log₂ σ. (A–D) Model training. (E–H) Model test.
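
The three quantities tuned in this figure (the balance factor C, the RBF kernel width, and the precision error) can be illustrated with the sketch below. It is not the authors' procedure: the function names and the grid ranges are assumptions, and the kernel is written with a width `sigma` as exp(-||x - y||² / (2 sigma²)), whereas some SVM packages parameterize the same kernel differently.

```python
import math
from itertools import product

def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel with width sigma (parameterization is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def epsilon_insensitive_loss(target, prediction, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(target - prediction) - epsilon)

def log2_grid(log2_c_values, log2_sigma_values, epsilons):
    """Yield every (C, sigma, epsilon) combination on a log2-spaced grid."""
    for eps, log2_c, log2_sigma in product(epsilons, log2_c_values,
                                           log2_sigma_values):
        yield 2.0 ** log2_c, 2.0 ** log2_sigma, eps

# Illustrative ranges only; the ranges actually searched are not given
# in the caption.
grid = list(log2_grid(range(-2, 9), range(-2, 9), [0.01, 0.1]))
```

In a grid search of this kind, each combination would be used to train a model, and the MSE on the training and test sets would be recorded at the corresponding grid point.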
Fig.4  The best model ‘OptModel’, trained with the optimized parameters (C = 128, σ = 24.25, ε = 0.01), can accurately predict the measured promoter strengths.

(A) and (B) The predicted relative strengths fit with the measured values using the training set and test set (Supplementary Dataset S1), respectively. (C) Prediction of test set values and comparison with target values (experimental data).

Fig.5  Fitting results comparison: SVM (blue) versus ANN (red). (A) Training. (B) Test. Data points sampled for training set and test set are different between SVM and ANN.
Fig.6  Effect of each single-base mutation on the sequence strength predicted by “OptModel” (A) and the difference between SVM and ANN (B). Red indicates a positive mutation and blue a negative one. A deeper color means a larger change in strength. The number in each box is the position of the base, while the subscript shows the mutation from the wild-type base to another one (e.g., A→G denotes A mutated to G).
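
A single-base mutation scan of the kind summarized in this figure can be sketched as below. The helpers are hypothetical: `predict` stands in for any trained model that maps a sequence to a strength, and the 1-based position numbering is an assumption, since the caption does not state the numbering convention.

```python
BASES = "ACGT"

def single_base_mutants(sequence):
    """Yield (position, wild_type_base, new_base, mutant_sequence) for every
    single-base substitution; positions are 1-based (an assumption)."""
    for i, wild in enumerate(sequence):
        for new in BASES:
            if new != wild:
                yield i + 1, wild, new, sequence[:i] + new + sequence[i + 1:]

def mutation_effects(sequence, predict):
    """Map labels such as '3:A>G' to the predicted change in strength
    relative to the unmutated sequence."""
    baseline = predict(sequence)
    return {
        f"{pos}:{wild}>{new}": predict(mutant) - baseline
        for pos, wild, new, mutant in single_base_mutants(sequence)
    }
```

A sequence of length L yields 3L mutants; positive values correspond to mutations predicted to strengthen the promoter and negative values to those predicted to weaken it.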