Quant. Biol.    2017, Vol. 5 Issue (1) : 90-98
Construction of precise support vector machine based models for predicting promoter strength
Hailin Meng1, Yingfei Ma2, Guoqin Mai2, Yong Wang3, Chenli Liu1,2
1. Bioengineering Research Center, Guangzhou Institute of Advanced Technology, Chinese Academy of Sciences, Guangzhou 511458, China
2. Center for Synthetic Biology Engineering Research, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
3. Chinese Academy of Sciences Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China
Background: Predicting the strength of a prokaryotic promoter from its sequence is important both for fundamental research in the life sciences and for applied synthetic biology. Considerable progress has been made in building quantitative models for strength prediction; in particular, the introduction of machine learning methods such as artificial neural networks (ANN) has significantly improved prediction accuracy. The support vector machine (SVM), one of the most important machine learning methods, is particularly powerful at learning from small-sample datasets and is therefore expected to work well on this problem.

Methods: To confirm this, we constructed SVM-based models to quantitatively predict promoter strength. A library of 100 promoter sequences with their strength values was randomly divided into two datasets: a training set (≥10 sequences) for model training and a test set (≥10 sequences) for model testing.
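
The random division described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_library`, the fixed seed, and the placeholder library of 100 records are assumptions introduced here, and the real sequences and strengths are not reproduced.

```python
import random

def split_library(records, n_train, seed=None):
    """Randomly split (sequence, strength) records into training and test sets.

    Both sets must keep at least 10 records, matching the constraint
    stated in the Methods.
    """
    if not 10 <= n_train <= len(records) - 10:
        raise ValueError("both sets need at least 10 sequences")
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical placeholder library: 100 (name, strength) pairs.
library = [(f"promoter_{i:03d}", i / 100) for i in range(100)]
train, test = split_library(library, n_train=90, seed=0)
```

Any training-set size from 10 to 90 is accepted, so the same helper could be reused to vary the size of the training set.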

Results: Prediction performance improved as the size of the training set increased, and the best performance was achieved with a training set of 90 sequences. After optimization of the model parameters, a high-performance model was obtained, with a high squared correlation coefficient for both the training set (R² > 0.99) and the test set (R² > 0.98); both values are better than those of the ANN model from our previous work.
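
For readers who want to reproduce the two metrics, the sketch below computes the mean squared error and the squared correlation coefficient. It assumes the latter means the squared Pearson correlation between measured and predicted values; the function names and the toy numbers are illustrative and are not taken from the paper.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def squared_correlation(y_true, y_pred):
    """Squared Pearson correlation coefficient (assumed definition of R2)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)

# Toy values only, not data from the study.
measured = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
r2 = squared_correlation(measured, predicted)
err = mse(measured, predicted)
```

Note that under this definition a prediction that is perfectly linear in the measured value scores 1 even when it is biased, which is why the MSE is worth reporting alongside it.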

Conclusions: Our results demonstrate that SVM-based models can be employed for the quantitative prediction of promoter strength.

Author Summary  Machine learning models can learn from a given dataset and make predictions on unknown data. These technologies are widely used in artificial intelligence and have made tremendous progress, ushering in the era of “Industry 4.0”. In the life sciences, the introduction of such methods has greatly advanced the discipline, especially modeling in bioinformatics and systems biology. As a powerful machine learning method suited to small-sample learning, the support vector machine (SVM) is introduced here into the field of promoter strength prediction. The good performance of the SVM models shows that this method is promising for the prediction of promoter strength.
Keywords: support vector machine model; quantitative prediction; promoter strength; machine learning
Corresponding Author(s): Yong Wang,Chenli Liu   
Online First Date: 15 February 2017    Issue Date: 22 March 2017
 Cite this article:   
Hailin Meng,Yingfei Ma,Guoqin Mai, et al. Construction of precise support vector machine based models for predicting promoter strength[J]. Quant. Biol., 2017, 5(1): 90-98.
Fig.1  A sketch map of promoter strength prediction based on SVM models.
Fig.2  The prediction performance of SVM models as a function of the size of the training set. The size of the training set ranges from 10 to 90 sequences. For each size, the training set was independently and randomly sampled five times to train SVM models, and the maximum R² and minimum mean squared error (MSE) values for prediction of the test set were calculated.
Fig.3  Parameter optimization for model training. The balance factor C of the loss function and the width σ of the RBF kernel function were optimized under different precision errors ε to find the best parameters. MSE is shown as a function of log₂C and log₂σ. (A-D) Model training. (E-H) Model test.
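
A grid search of this kind, over log2 of the balance factor C and log2 of the RBF kernel width at a fixed precision error, might be sketched as below. Everything here is an assumption made for illustration: the kernel form exp(-||x-y||^2 / (2*sigma^2)), the grid ranges, and `toy_score`, which is a placeholder for "train an SVM with these parameters and return its MSE" and does not reproduce the error surfaces in the figure.

```python
import math
from itertools import product

def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel; this parameterization of the width is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def grid_search(score, log2_c_values, log2_sigma_values, epsilon):
    """Return (error, C, sigma) with the lowest error on a log2-spaced grid."""
    best = None
    for log2_c, log2_sigma in product(log2_c_values, log2_sigma_values):
        c, sigma = 2.0 ** log2_c, 2.0 ** log2_sigma
        error = score(c, sigma, epsilon)
        if best is None or error < best[0]:
            best = (error, c, sigma)
    return best

def toy_score(c, sigma, epsilon):
    # Placeholder error surface with its minimum at log2 C = 3, log2 sigma = 1.
    return (math.log2(c) - 3) ** 2 + (math.log2(sigma) - 1) ** 2 + epsilon

best_error, best_c, best_sigma = grid_search(
    toy_score, range(-2, 11), range(-2, 11), epsilon=0.01
)
```

In practice `score` would wrap model training and evaluation, and the search would be repeated for each precision error of interest.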
Fig.4  The best model ‘OptModel’, trained with the optimized parameters (C = 128, σ = 24.25, ε = 0.01), accurately predicts the measured promoter strengths.

(A) and (B) Fit of the predicted relative strengths to the measured values for the training set and the test set (Supplementary Dataset S1), respectively. (C) Prediction of test set values and comparison with target values (experimental data).

Fig.5  Comparison of fitting results: SVM (blue) versus ANN (red). (A) Training. (B) Test. The data points sampled for the training set and test set differ between SVM and ANN.
Fig.6  Effect of each single-base mutation on the sequence strength predicted by “OptModel” (A) and the difference between SVM and ANN (B). Red indicates a positive mutation and blue a negative one. A deeper color means a more significant change in strength. The number in each box is the position of the base, and the subscript shows the mutation of the wild-type base to another one (e.g., A→G denotes A mutated to G).
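
A single-base mutation scan like the one summarized in this figure can be sketched as follows. The helper names, the 1-based position numbering, the four-letter example sequence, and `toy_predict` (a stand-in for a trained model such as OptModel) are all assumptions for illustration only.

```python
BASES = "ACGT"

def single_base_mutants(sequence):
    """Yield (position, wild_type_base, mutant_base, mutant_sequence).

    Positions are 1-based, which is an assumption about the figure's numbering.
    """
    for i, wild_type in enumerate(sequence):
        for base in BASES:
            if base != wild_type:
                yield i + 1, wild_type, base, sequence[:i] + base + sequence[i + 1:]

def mutation_effects(sequence, predict):
    """Map each (position, wild type, mutant) to the predicted change in strength."""
    baseline = predict(sequence)
    return {
        (pos, wt, mut): predict(mutant) - baseline
        for pos, wt, mut, mutant in single_base_mutants(sequence)
    }

def toy_predict(seq):
    # Placeholder predictor (fraction of A bases); not a promoter strength model.
    return seq.count("A") / len(seq)

effects = mutation_effects("ACGT", toy_predict)
```

Positive values correspond to mutations predicted to increase strength and negative values to mutations predicted to decrease it, with three entries per position.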
