|
|
Identification of cytokine via an improved genetic algorithm |
Xiangxiang ZENG,Sisi YUAN,Xianxian HUANG,Quan ZOU() |
Department of Computer Science, Xiamen University, Xiamen 361005, China |
|
|
Abstract With the explosive growth in the number of protein sequences generated in the postgenomic age, research into identifying cytokines from proteins and detecting their biochemical mechanisms becomes increasingly important. Unfortunately, the identification of cytokines from proteins is challenging due to a lack of understanding of the structure space provided by the proteins and the fact that only a small number of cytokines exists in massive proteins. In view of fact that a proteins sequence is conceptually similar to a mapping of words to meaning, n-gram, a type of probabilistic language model, is explored to extract features for proteins. The second challenge focused on in this work is genetic algorithms, a search heuristic that mimics the process of natural selection, that is utilized to develop a classifier for overcoming the protein imbalance problem to generate precise prediction of cytokines in proteins. Experiments carried on imbalanced proteins data set show that our methods outperform traditional algorithms in terms of the prediction ability.
|
Keywords
n-grams
genetic algorithm
cytokine identification
sampling
imbalanced data
|
Corresponding Author(s):
Quan ZOU
|
Issue Date: 07 September 2015
|
|
1 |
Zou Q, Li X, Jiang Y, Zhao Y, Wang G. BinMemPredict: a Web server and software for predicting membrane protein types. Current Proteomics, 2013, 10(1): 2―9
https://doi.org/10.2174/1570164611310010002
|
2 |
Yabuki Y, Muramatsu T, Hirokawa T, Mukai H, Suwa M. GRIFFIN: a system for predicting GPCR-G-protein coupling selectivity using a support vector machine and a hidden Markov model. Nucleic AcidsResearch, 2005, 33(suppl 2): W148―W153
https://doi.org/10.1093/nar/gki495
|
3 |
Nielsen H, Engelbrecht J, Brunak S, Heijne G V. A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. International Journal of Neural Systems, 1997, 8(5-6): 581―599
https://doi.org/10.1142/S0129065797000537
|
4 |
Altschul S F, Gish W, Miller W, Myers E W, Lipman D J. Basic local alignment search tool. Journal of Molecular Biology, 1990, 215(3): 403―410
https://doi.org/10.1016/S0022-2836(05)80360-2
|
5 |
Pearson W R. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics, 1991, 11(3): 635―650
https://doi.org/10.1016/0888-7543(91)90071-L
|
6 |
Huang N, Chen H, Sun Z. CTKPred: an SVM-based method for the prediction and classification of the cytokine superfamily. Protein Engineering Design and Selection, 2005, 18(8): 365―368
https://doi.org/10.1093/protein/gzi041
|
7 |
Liu B, Wang X, Lin L, Tang B, Dong Q, Wang X. Prediction of protein binding sites in protein structures using hidden Markov support vector machine. BMC bioinformatics, 2009, 10(1): 381
https://doi.org/10.1186/1471-2105-10-381
|
8 |
Lin C, Zou Y, Qin J, Liu X, Jiang Y, Ke C, Zou Q. Hierarchical classification of protein folds using a novel ensemble classifier. PloS one, 2013, 8(2): e56499
https://doi.org/10.1371/journal.pone.0056499
|
9 |
Zou Q, Chen W, Huang Y, Liu X, Jiang Y. Identifying multi-functional enzyme by hierarchical multi-label classifier. Journal of Computational and Theoretical Nanoscience, 2013, 10(4): 1038―1043
https://doi.org/10.1166/jctn.2013.2804
|
10 |
Chou K C, Shen H B. Recent advances in developing web-servers for predicting protein attributes. Natural Science, 2009, 1(2): 63―92
https://doi.org/10.4236/ns.2009.12011
|
11 |
Ganapathiraju M, Weisser D, Rosenfeld R, Carbonell J, Reddy R, Klein-Seetharaman J. Comparative n-gram analysis of whole-genome protein sequences. In: Proceedings of the 2nd International Conference on Human Language Technology Research. 2002, 76―81
https://doi.org/10.3115/1289189.1289259
|
12 |
Srinivasan S M, Vural S, King B R, Guda C. Mining for class-specific motifs in protein sequence classification. BMC Bioinformatics, 2013, 14(1): 96
https://doi.org/10.1186/1471-2105-14-96
|
13 |
Koza J R. Genetic Programming. MIT press, 1992
|
14 |
Sun Y, Kamel M S, Wong A K, Wang Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 2007, 40(12): 3358―3378
https://doi.org/10.1016/j.patcog.2007.04.009
|
15 |
Lewis D, Gale W. Training text classifiers by uncertainty sampling. In: Proceedings of the 14th ACM SIGIR Conference on Research and Development in Information Retrieval. 1994.
|
16 |
Kubat M, Holte R C, Matwin S. Machine learning for the detection of oil spills in satellite radar images. Machine learning, 1998, 30(2-3): 195―215
https://doi.org/10.1023/A:1007452223027
|
17 |
Fawcett T. An introduction to ROC analysis. Pattern recognition letters, 2006, 27(8): 861―874
https://doi.org/10.1016/j.patrec.2005.10.010
|
18 |
Provost F J, Fawcett T. Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. 1997, 97: 43―48
|
19 |
Bateman A, Coin L, Durbin R, Finn R D, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer E L L, Studholme D J, Yeats C, Eddy, S. R. The Pfam protein families database. Nucleic Acids Research, 2004, 32: D138―D141
https://doi.org/10.1093/nar/gkh121
|
[1] |
Supplementary Material-Highlights in 3-page ppt
|
Download
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|