Please wait a minute...
Quantitative Biology

ISSN 2095-4689

ISSN 2095-4697(Online)

CN 10-1028/TM

Postal Subscription Code 80-971

Quant. Biol.    2018, Vol. 6 Issue (2) : 142-154    https://doi.org/10.1007/s40484-018-0140-y
RESEARCH ARTICLE
MTMO: an efficient network-centric algorithm for subtree counting and enumeration
Guanghui Li1,2, Jiawei Luo1(), Zheng Xiao1, Cheng Liang1
1. College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
2. School of Information Engineering, East China Jiaotong University, Nanchang 330013, China
 Download: PDF(626 KB)   HTML
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

Background: The frequency of small subtrees in biological, social, and other types of networks could shed light into the structure, function, and evolution of such networks. However, counting all possible subtrees of a prescribed size can be computationally expensive because of their potentially large number even in small, sparse networks. Moreover, most of the existing algorithms for subtree counting belong to the subtree-centric approaches, which search for a specific single subtree type at a time, potentially taking more time by searching again on the same network.

Methods: In this paper, we propose a network-centric algorithm (MTMO) to efficiently count k-size subtrees. Our algorithm is based on the enumeration of all connected sets of k1 edges, incorporates a labeled rooted tree data structure in the enumeration process to reduce the number of isomorphism tests required, and uses an array-based indexing scheme to simplify the subtree counting method.

Results: The experiments on three representative undirected complex networks show that our algorithm is roughly an order of magnitude faster than existing subtree-centric approaches and base network-centric algorithm which does not use rooted tree, allowing for counting larger subtrees in larger networks than previously possible. We also show major differences between unicellular and multicellular organisms. In addition, our algorithm is applied to find network motifs based on pattern growth approach.

Conclusions: A network-centric algorithm which allows for a faster counting of non-induced subtrees is proposed. This enables us to count larger motif in larger networks than previously.

Keywords complex network      evolutionary systems biology      network motif discovery      subtree counting      subtree isomorphism     
Corresponding Author(s): Jiawei Luo   
Issue Date: 11 June 2018
 Cite this article:   
Guanghui Li,Jiawei Luo,Zheng Xiao, et al. MTMO: an efficient network-centric algorithm for subtree counting and enumeration[J]. Quant. Biol., 2018, 6(2): 142-154.
 URL:  
https://academic.hep.com.cn/qb/EN/10.1007/s40484-018-0140-y
https://academic.hep.com.cn/qb/EN/Y2018/V6/I2/142
Network No. of vertices No. of edges Average degree Maximum degree Source
YeastPPI 2361 6646 5.630 64 [25], [26]
Electronic 252 399 3.167 14 [27]
Dolphins 62 159 5.129 12 [28], [29]
Tab.1  Experimental Datasets
Network K Classes Occurrences Processing Times Comparison vs. NTMO
NTMO MTMO
YeastPPI 3 1 103504 0.08 0.00 nan
YeastPPI 4 2 2445294 3.75 0.14 26.79x
YeastPPI 5 3 70997199 161.32 3.00 53.77x
YeastPPI 6 6 2332901289 6145.17 93.40 65.79x
YeastPPI 7 11 83309369433 261864.73 3701.55 70.74x
YeastPPI 8 23 3156730829473 - 128677.50 -
Average Run Time Growth Ratio 42.65 31.74
Electronic 3 1 1161 0.00 0.00 nan
Electronic 4 2 4625 0.01 0.00 nan
Electronic 5 3 21627 0.06 0.00 nan
Electronic 6 6 110370 0.31 0.01 31.00x
Electronic 7 11 591201 2.03 0.06 33.83x
Electronic 8 23 3255764 12.96 0.25 51.84x
Electronic 9 47 18267470 79.76 1.37 58.22x
Electronic 10 106 103983466 515.54 7.40 69.67x
Electronic 11 235 598953356 3247.79 39.83 81.54x
Electronic 12 551 3483503787 21075.99 233.29 90.34x
Average Run Time Growth Ratio 6.19 5.38
Dolphins 3 1 923 0.00 0.00 nan
Dolphins 4 2 6884 0.01 0.00 nan
Dolphins 5 3 57434 0.14 0.00 nan
Dolphins 6 6 506955 1.36 0.04 34.00x
Dolphins 7 11 4616856 15.27 0.28 54.54x
Dolphins 8 23 42742064 169.75 2.51 67.63x
Dolphins 9 47 397400342 1811.70 22.71 79.78x
Dolphins 10 106 3670658836 19618.11 219.86 89.23x
Dolphins 11 235 33379840446 183064.13 1790.84 102.22x
Dolphins 12 551 296840871353 - 17257.83
Average Run Time Growth Ratio 10.98 8.75
Tab.2  Experimental results for MTMO vs. NRMO
Network K Classes Occurrences Processing times Comparison vs. MODA
MODA MTMO
Electronic 3 1 1161 0.047 0.00 nan
Electronic 4 2 4625 0.175 0.00 nan
Electronic 5 3 21627 0.997 0.015 66.47x
Electronic 6 6 110370 5.692 0.062 91.81x
Electronic 7 11 591201 34.983 0.312 112.13x
Electronic 8 23 3255764 219.930 1.607 136.86x
Electronic 9 47 18267470 1416.216 9.079 155.99x
Electronic 10 106 103983466 8644.806 50.684 170.56x
Electronic 11 235 598953356 57515.849 287.587 199.99x
Electronic 12 551 3483503787 MEM 1673.030 -
Average Run Time Growth Ratio 5.84 5.29
Dolphins 3 1 923 0.026 0.010 2.60x
Dolphins 4 2 6884 0.106 0.015 7.07x
Dolphins 5 3 57434 0.893 0.040 22.33x
Dolphins 6 6 506955 9.005 0.214 42.08x
Dolphins 7 11 4616856 84.713 1.862 45.50x
Dolphins 8 23 42742064 889.634 17.272 51.51x
Dolphins 9 47 397400342 10945.796 160.103 68.37x
Dolphins 10 106 3670658836 MEM 1532.602 -
Dolphins 11 235 33379840446 MEM 14884.071 -
Dolphins 12 551 296840871353 MEM 147292.440 -
Average Run Time Growth Ratio 9.13 7.33
Tab.3  Experimental results for MTMO vs. MODA
Network No. of vertices No. of edges Average degree Clustering coefficient Characteristic path length Diameter Power-law distribution N=αk-r
α r
S.cere 5137 22440 8.737 0.097 3.984 10 4014.5 1.707
E.coli 2955 11645 7.882 0.094 4.004 12 985.1 1.445
H.pylo 715 1364 3.815 0.017 4.135 9 352.2 1.651
C.eleg 2711 4042 2.982 0.022 4.837 14 601.8 1.607
Tab.4  Topological statistics in the PPI networks we studied
Fig.1  Normalized treelet distribution size 8 of the Yeast PPI network (Red), E. coli (Green), H. pylo (Blue) and C. eleg (Pink).
Fig.2  Shapes and labels for 3 and 4-node subgraphs in an undirected network.
Code Freq
(Real)
Mean-freq
(Random)
S-dev
(Random)
Z-score
M378 382545 382545.00 0.00 undefined
M3238 10245 3256.45 131.34 53.21
M48598 8002763 10047886.52 148466.57 –13.77
M44382 13027193 13027193.00 0.00 undefined
M427030 240724 64297.22 3166.95 55.71
M44958 1732944 806416.51 43209.95 21.44
M413278 234165 25512.19 2763.95 75.49
M431710 15822 680.79 131.12 115.48
Tab.5  The statistical properties of MIPS network generated by our algorithm (non-induced subgraphs)
Code Freq
(Real)
Mean-freq
(Random)
S-dev
(Random)
Z-score
M378 97.1700% 99.9900% 0.00038 –73.3400
M3238 2.8300% 0.0152% 0.00038 73.3400
M48598 27.0800% 38.6600% 0.00270 –42.8860
M44382 66.1600% 60.9000% 0.00301 17.4990
M427030 0.3055% 0.3849% 0.00019 –4.1973
M44958 5.5766% 0.0624% 0.00158 34.8470
M413278 0.7874% 0.0013% 4.22e-005 186.1800
M431710 0.0895% 5.71e-006% 5.29e-007 1690.1000
Tab.6  The statistical properties of MIPS network generated by ESU (induced subgraphs)
Fig.3  Example graph G1 and its 3-trees.
Fig.4  An instance of a network.
Fig.5  The implicit built trees rooted at vertex 1 of size 4 for network in Figure 4.
Fig.6  An example labeled rooted tree of enumerating 4-subtrees.
Fig.7  The steps of searching the rooted tree during enumerating a sample subtree.
Fig.8  List of treelets for k=5 with their canonical labelings.
k 3 4 5 6 7 8 9 10 11 12
Classes 1 2 3 6 11 23 47 106 235 551
Leaves 2 5 14 42 132 429 1430 4862 16796 58786
Tab.7  Number of each k-subtree non-isomorphic class, and the maximum number of corresponding labeled rooted tree leaves
k The minimum (in decimal) The maximum (in decimal) The length of array
3 52 52 1
4 212 216 5
5 852 920 69
6 3412 3696 285
7 13652 15472 1821
8 54612 61920 7309
9 218452 254432 35981
10 873812 1017792 143981
11 3495252 4130752 635501
12 13981012 16523136 2542125
Tab.8  The minimum and maximum of the canonical labeling for different k-subtrees, and the length of corresponding array
1 Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002) Network motifs: simple building blocks of complex networks. Science, 298, 824–827
https://doi.org/10.1126/science.298.5594.824 pmid: 12399590
2 Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F. and Sahinalp, S. C. (2008) Biomolecular network motif counting and discovery by color coding. Bioinformatics, 24, i241–i249
https://doi.org/10.1093/bioinformatics/btn163 pmid: 18586721
3 Huan, J., Wang, W. and Prins, J. (2003) Efficient mining of frequent subgraphs in the presence of isomorphism. In Proc. Third IEEE Int’l Conf. on Data Mining, pp. 549–552
4 Kuramochi, M. and Karypis, G. (2001) Frequent subgraph discovery. In Proc. First IEEE Int’l Conf. on Data Mining, pp. 313–320
5 Liang, C., Li, Y., Luo, J. and Zhang, Z. (2015) A novel motif-discovery algorithm to identify co-regulatory motifs in large transcription factor and microRNA co-regulatory networks in human. Bioinformatics, 31, 2348–2355
https://doi.org/10.1093/bioinformatics/btv159 pmid: 25788622
6 Chen, X., Yan, C. C., Zhang, X., Zhang, X., Dai, F., Yin, J. and Zhang, Y. (2016) Drug-target interaction prediction: databases, web servers and computational models. Brief. Bioinform., 17, 696–712
https://doi.org/10.1093/bib/bbv066 pmid: 26283676
7 Chen, X., Yan, C. C., Zhang, X. and You, Z. H. (2017) Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief. Bioinformatics, 18, 558–576
pmid: 27345524
8 Chen, X., Huang, Y. A., You, Z. H., Yan, G. Y. and Wang, X. S. (2017) A novel approach based on KATZ measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics, 33, 733–739
pmid: 28025197
9 Chen, X., Xie, D., Zhao, Q. and You, Z. H. (2017) MicroRNAs and complex diseases: from experimental results to computational models. Brief. Bioinform., doi: 10.1093/bib/bbx130
pmid: 29045685
10 Chen, X., Yan, C. C., Zhang, X., You, Z. H., Deng, L., Liu, Y., Zhang, Y. and Dai, Q. (2016) WBSMDA: within and between score for MiRNA-disease association prediction. Sci. Rep., 6, 21106
https://doi.org/10.1038/srep21106 pmid: 26880032
11 Dao, P., Schönhuth, A., Hormozdiari, F., Hajirasouliha, I., Sahinalp, S. C. and Este, M. (2009) Quantifying systemic evolutionary changes by color coding confidence-scored PPI networks. In 9th Int’l Workshop on Algorithms in Bioinformatics, pp. 37–48
12 Omidi, S., Schreiber, F. and Masoudi-Nejad, A. (2009) MODA: an efficient algorithm for network motif discovery in biological networks. Genes Genet. Syst., 84, 385–395
https://doi.org/10.1266/ggs.84.385 pmid: 20154426
13 Alon, N., Yuster, R. and Zwick, U. (1995) Color-coding. J. Assoc. Comput. Mach., 42, 844–856
https://doi.org/10.1145/210332.210337
14 Zhao, Z., Khan, M., Kumar, V. S. A. and Marathe, M. V. (2010) Subgraph enumeration in large social contact networks using parallel color coding and streaming. In Proc. IEEE 39th Int’l Conf. on Parallel Processing, pp. 594–603
15 Zhao, Z., Wang, G., Butt, A. R., Khan, M., Kumar, V. S. A. and Marathe, M. V. (2012) SAHAD: subgraph analysis in massive networks using Hadoop. In Proc. 26th Int’l. Parallel and Distributed Processing Symp., pp. 390–401
16 Slota, G. M. and Madduri, K. (2013) Fast approximate subgraph counting and enumeration. In Proc. IEEE 42nd Int’l Conf. on Parallel Processing, pp. 210–219
17 Slota, G. M. and Madduri, K. (2014) Complex network analysis using parallel approximate motif counting. In Proc. 28th Int’l. Parallel and Distributed Processing Symp., pp. 405–414
18 Slota, G. M. and Madduri, K. (2015) Parallel color-coding. Parallel Comput., 47, 51–69
https://doi.org/10.1016/j.parco.2015.02.004
19 Wernicke, S. (2006) Efficient detection of network motifs. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 3, 347–359
https://doi.org/10.1109/TCBB.2006.51 pmid: 17085844
20 Kashani, Z. R., Ahrabian, H., Elahi, E., Nowzari-Dalini, A., Ansari, E. S., Asadi, S., Mohammadi, S., Schreiber, F. and Masoudi-Nejad, A. (2009) Kavosh: a new algorithm for finding network motifs. BMC Bioinformatics, 10, 318
https://doi.org/10.1186/1471-2105-10-318 pmid: 19799800
21 Khakabimamaghani, S., Sharafuddin, I., Dichter, N., Koch, I. and Masoudi-Nejad, A. (2013) QuateXelero: an accelerated exact network motif detection algorithm. PLoS One, 8, e68073
https://doi.org/10.1371/journal.pone.0068073 pmid: 23874498
22 Paredes, P. and Ribeiro, P. (2013) Towards a faster network-centric subgraph census. In IEEE/ACM Int’l Conf. on Advances in Social Networks Analysis and Mining, pp. 264–271
23 Ferreira, R., Grossi, R. and Rizzi, R. (2011) Output-sensitive listing of bounded-size trees in undirected graphs. In Proc. ESA’11, pp. 275–286
24 Wasa, K. (2016) Enumeration of enumeration algorithms. arXiv:1605.05102
25 Bu, D., Zhao, Y., Cai, L., Xue, H., Zhu, X., Lu, H., Zhang, J., Sun, S., Ling, L., Zhang, N., et al. (2003) Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res., 31, 2443–2450
https://doi.org/10.1093/nar/gkg340 pmid: 12711690
26 Batagelj, V. and Mrvar, A. (2006) Pajek Datasets. Available:
27 ISCAS89 benchmark suite.
28 Lusseau, D., Schneider, K., Boisseau, O. J., Haase, P., Slooten, E. and Dawson, S. M. (2003) The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations. Behav. Ecol. Sociobiol., 54, 396–405
https://doi.org/10.1007/s00265-003-0651-y
29 Newman, M. (2009) Network Data. Available:
30 Xenarios, I., Salwínski, L., Duan, X. J., Higney, P., Kim, S. M. and Eisenberg, D. (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res., 30, 303–305
https://doi.org/10.1093/nar/30.1.303 pmid: 11752321
31 Mewes, H. W., Amid, C., Arnold, R., Frishman, D., Güldener, U., Mannhaupt, G., Münsterkötter, M., Pagel, P., Strack, N., Stümpflen, V., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32, D41– D44
https://doi.org/10.1093/nar/gkh092 pmid: 14681354
32 Wan, S. and Zou, Q. (2017) HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing. Algorithms Mol. Biol., 12, 25
https://doi.org/10.1186/s13015-017-0116-x pmid: 29026435
33 Zou, Q., Hu, Q., Guo, M. and Wang, G. (2015) HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics, 31, 2475–2481
https://doi.org/10.1093/bioinformatics/btv177 pmid: 25812743
34 Stinson, D. (1999) Combinatorial Algorithms: Generation, Enumeration, and Search, pp. 48–52. Boca Raton: CRC Press
35 Alamgir, Z. and Abbasi, S. (2007) Combinatorial algorithms for listing paths in minimal change order. In Proc. Fourth Conf. Combinatorial and Algorithmic Aspects of Networking, pp. 112–130
36 Hilton, P. and Pedersen, J. (1991) Catalan numbers, their generalization, and their uses. Math. Intell., 13, 64–75
https://doi.org/10.1007/BF03024089
37 Aho, A., Hopcroft, J. and Ullman, J. (1974) The Design and Analysis of Computer Algorithms, pp 84–85. New Jersey: Addison-Wesley Publishing Co.
38 Heubach, S. and Mansour, T. (2004) Compositions of n with parts in a set. Congr. Numer., 168, 127–143
[1] Ye Yuan, Xinying Ren, Zhen Xie, Xiaowo Wang. A quantitative understanding of microRNA- mediated competing endogenous RNA regulation[J]. Quant. Biol., 2016, 4(1): 47-57.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed