Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2022, Vol. 16 Issue (5) : 165903    https://doi.org/10.1007/s11704-021-1015-3
RESEARCH ARTICLE
Towards a better prediction of subcellular location of long non-coding RNA
Zhao-Yue ZHANG, Zi-Jie SUN, Yu-He YANG, Hao LIN
Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
Abstract

The spatial distribution of long non-coding RNAs (lncRNAs) in the cell is closely related to their function. With the growth of publicly available subcellular location data, a number of computational methods have been developed to predict the subcellular localization of lncRNAs. Unfortunately, these methods suffer from the low discriminative power of redundant features or from overfitting caused by oversampling. To address these issues and improve prediction performance, we present a support vector machine-based approach that incorporates a mutual information algorithm and an incremental feature selection strategy. The resulting predictor achieves an overall accuracy of 91.60%. The highly automated web tool is available at lin-group.cn/server/iLoc-LncRNA(2.0)/website and should help researchers study lncRNA subcellular localization.
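
The combination of mutual-information ranking and incremental feature selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it ranks discrete features by plain mutual information with the class label rather than by mRMR, and it scores each top-k subset with a hypothetical stand-in (leave-one-out 1-nearest-neighbour accuracy) where a real pipeline would plug in the cross-validated accuracy of an SVM. All function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between a discrete feature and class labels."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_features(X, y):
    """Feature indices sorted by decreasing mutual information with y."""
    n_features = len(X[0])
    scores = [mutual_information([row[j] for row in X], y)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def loo_1nn_accuracy(X, y):
    """Stand-in scorer (assumption): leave-one-out 1-NN accuracy, Hamming distance.

    A real pipeline would return cross-validated SVM accuracy here.
    """
    correct = 0
    for i, xi in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum(a != b for a, b in zip(xi, X[j])))
        correct += y[nearest] == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked, score_fn):
    """Score the top-1, top-2, ... subsets; keep the first subset with the best score."""
    best_subset, best_score, curve = [], float("-inf"), []
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        Xk = [[row[j] for j in subset] for row in X]
        score = score_fn(Xk, y)
        curve.append(score)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score, curve


# Toy data: feature 0 tracks the label, feature 1 is noisy, feature 2 is constant.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1], [1, 0, 1]]
y = [0, 0, 1, 1, 0, 1]

ranked = rank_features(X, y)
best_subset, best_score, curve = incremental_feature_selection(
    X, y, ranked, loo_1nn_accuracy)
print("ranking:", ranked)
print("best subset:", best_subset, "score:", best_score)
```

The list `curve` holds one score per subset size, which is the kind of data an accuracy-versus-feature-count plot is drawn from.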

Keywords: lncRNA; subcellular localization; support vector machine; mutual information; Web server
Corresponding Author(s): Hao LIN   
Just Accepted Date: 12 April 2021   Issue Date: 24 December 2021
 Cite this article:   
Zhao-Yue ZHANG,Zi-Jie SUN,Yu-He YANG, et al. Towards a better prediction of subcellular location of long non-coding RNA[J]. Front. Comput. Sci., 2022, 16(5): 165903.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-021-1015-3
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I5/165903
Fig.1  (a) The workflow diagram of developing the iLoc-lncRNA(2.0); (b) species composition in the benchmark dataset; (c) length distribution of lncRNA sequences and the number of samples in each subcellular location
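| Method | Feature dimension | Subcellular location | Sn (%) | Sp (%) | Pre (%) | MCC | AUC | OA (%) |
|---|---|---|---|---|---|---|---|---|
| iLoc-lncRNA(2.0) a) | 1409 | Nucleus | 91.03 | 95.59 | 86.59 | 0.852 | 0.969 | 91.60 |
| | | Cytoplasm | 94.37 | 89.96 | 94.59 | 0.842 | 0.969 | |
| | | Ribosome | 83.72 | 99.01 | 85.71 | 0.837 | 0.986 | |
| | | Exosome | 66.67 | 99.36 | 83.33 | 0.735 | 0.949 | |
| Locate-R b) | 857 | Nucleus | 66.92 | 95.15 | / | 0.66 | 0.900 | 90.69 |
| | | Cytoplasm | 84.74 | 89.1 | / | 0.725 | 0.930 | |
| | | Ribosome | 100 | 98.37 | / | 0.97 | 1.000 | |
| | | Exosome | 100 | 99.17 | / | 0.978 | 1.000 | |
| lncLocation | / | Nucleus | 74.19 | / | 95.83 | / | / | 87.78 |
| | | Cytoplasm | 100 | / | 85 | / | / | |
| | | Ribosome | 55.56 | / | 100 | / | / | |
| | | Exosome | 33.33 | / | 100 | / | / | |
| iLoc-lncRNA c) | 4107 | Nucleus | 77.56 | 97.59 | / | 0.796 | / | 86.72 |
| | | Cytoplasm | 99.06 | 67.68 | / | 0.742 | / | |
| | | Ribosome | 46.51 | 99.83 | / | 0.652 | / | |
| | | Exosome | 16.67 | 1 | / | 0.4 | / | |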
Tab.1  The performance of the SVM-based subcellular location prediction model
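
For readers unfamiliar with the column headings in Tab.1, the sketch below computes Sn (sensitivity), Sp (specificity), Pre (precision), MCC (Matthews correlation coefficient) and OA (overall accuracy) from a multi-class confusion matrix. It assumes the standard one-vs-rest definitions, since the article's own formulas are not reproduced on this page. The function names and the 3-class confusion matrix are hypothetical and are not taken from the paper.

```python
import math


def one_vs_rest_metrics(conf, k):
    """Per-class metrics for class k.

    conf[i][j] is the number of samples of true class i predicted as class j.
    Assumes every denominator is non-zero.
    """
    total = sum(sum(row) for row in conf)
    tp = conf[k][k]
    fn = sum(conf[k]) - tp                  # class k samples predicted as other classes
    fp = sum(row[k] for row in conf) - tp   # other classes predicted as class k
    tn = total - tp - fn - fp
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Pre": tp / (tp + fp),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


def overall_accuracy(conf):
    """Fraction of all samples that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in conf)
    return sum(conf[i][i] for i in range(len(conf))) / total


# Hypothetical 3-class confusion matrix, for illustration only.
conf = [[8, 1, 1],
        [2, 7, 1],
        [0, 0, 10]]

print(one_vs_rest_metrics(conf, 0))
print(overall_accuracy(conf))
```

Under these definitions OA is a single number per model, which is consistent with the table reporting one OA value per method and the other metrics once per location.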
Fig.2  Incremental feature selection strategy accuracy curve for mRMR feature selection
Fig.3  ROC curves for the iLoc-lncRNA(2.0) for (a) nucleus, (b) cytosol, (c) ribosome, (d) exosome
Fig.4  Visualization of significant class-specific sequence motifs for (a) nucleus, (b) cytosol, (c) ribosome, (d) exosome by using DREME
Fig.5  The class-specific motifs distribution in four classes