Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2018, Vol. 12 Issue (6) : 1220-1240    https://doi.org/10.1007/s11704-016-6194-y
RESEARCH ARTICLE
Distribution-free data density estimation in large-scale networks
Minqi ZHOU1,2, Rong ZHANG1,2(), Weining QIAN1, Aoying ZHOU1
1. Data Science and Engineering Institute, East China Normal University, Shanghai 200062, China
2. State Key Lab of Software Engineering,Wuhan University,Wuhan 430072, China
 Download: PDF(1331 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

Estimating the global data distribution in largescale networks is an important issue and yet to be well addressed. It can benefit many applications, especially in the cloud computing era, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate (number) generation, in this paper, we present a novel model called distribution-free data density estimation for large ring-based networks to achieve high estimation accuracy with low estimation cost regardless of the distribution models of the underlying data. This model generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. Armed with this estimation method, we can estimate data densities over both one-dimensional and multidimensional tuple sets, where each dimension could be either continuous or discrete as its domain. In large-scale networks, the key idea for distribution-free estimation is to sample a small subset of peers for estimating the global data distribution over the data domain. Algorithms on computing and sampling the global cumulative distribution function based on which the global data distribution is estimated are introduced with a detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in large ring-based networks.

Keywords distribution-free      data density estimation      random sampling     
Corresponding Author(s): Rong ZHANG   
Just Accepted Date: 28 June 2016   Online First Date: 20 December 2017    Issue Date: 04 December 2018
 Cite this article:   
Minqi ZHOU,Rong ZHANG,Weining QIAN, et al. Distribution-free data density estimation in large-scale networks[J]. Front. Comput. Sci., 2018, 12(6): 1220-1240.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-6194-y
https://academic.hep.com.cn/fcs/EN/Y2018/V12/I6/1220
1 Lakshman A, Malik P. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 2010, 44(2): 35–40
https://doi.org/10.1145/1773912.1773922
2 DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W. Dynamo: amazon’s highly available key-value store. ACM SIGOPS Review, 2007, 41(6): 205–220
https://doi.org/10.1145/1323293.1294281
3 Pitoura T, Triantafillou P. Load distribution fairness in P2P data management systems. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 396–405
4 Zhu Y W, Hu Y M. Towards efficient load balancing in structured P2P system. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium. 2004
5 Shu Y F, Ooi B C, Tan K L, Zhou A Y. Supporting multi-dimensional range queries in peer-to-peer systems. In: Proceedings of the 5th IEEE International Conference on Peer-to-Peer Computing. 2005, 173–180
6 Arai B, Das G, Gunopulos D, Kalogeraki V. Approximating aggregation queries in peer-to-peer networks. In: Proceedings of the 22nd IEEE International Conference on Data Engineering. 2006
https://doi.org/10.1109/ICDE.2006.23
7 Wang S Y, Ooi B C, Tung A K H, Xu L Z. Efficient skyline query processing on peer-to-peer networks. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 1126–1135
https://doi.org/10.1109/ICDE.2007.368971
8 Datta S, Bhaduri K, Giannella C, Wolff R, Kargupta H. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, 2006,10(4): 18–26
https://doi.org/10.1109/MIC.2006.74
9 Seshadri S. Probabilistic methods in query processing. Dissertation for the Doctoral Degree. University of Wisconsin, 1992
10 Matias Y, Vitter J S, Wang M. Wavelet-based histograms for selectivity estimation. ACM SIGMoD Record, 1998, 27(2): 448–459
https://doi.org/10.1145/276305.276344
11 Bonamente M. Theory of Probability. In: Statistics and Analysis of Scientific Data. Graduate Texts in Physics. New York: Springer, 2017
https://doi.org/10.1007/978-1-4939-6572-4_1
12 Eckhard R. Stan ulam, john von neumann, and the monte carlo method. Los Alamos Science, 1987, 15: 131–137
13 Zhou M Q, Shen H T, Zhou X F, Qian W N, Zhou A Y. Effective data density estimation in ring-based P2P networks. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 594–605
14 Christodoulakis S. Estimating record selectivities. Information Systems, 1983, 8(2): 105–115
https://doi.org/10.1016/0306-4379(83)90035-2
15 Chen C M, Roussopoulos N. Adaptive selectivity estimation using query feedback. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 1994, 161–172
https://doi.org/10.1145/191839.191874
16 Haas P J, Naughton J F, Seshadri S, Stokes L. Sampling based estimation of the number of distinct values of an attribute. In: Proceedings of the 21st International Conference on Very Large Data Bases. 1995, 311–322
17 Jagadish H V, Koudas N, Muthukrishnan S, Poosala V, Sevcik K C, Suel T. Optimal histograms with quality gurantees. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 275–286
18 King V, Lewis S, Saia J, Young M. Choosing a random peer. In: Proceedings of the 23rd annual ACM Symposium on Principles of Distributed Computing. 2004, 125–130
https://doi.org/10.1145/1011767.1011786
19 Bharambe A R, Agrawal M, Seshan S. Mercury: supporting scalable multi-attribute range queries. ACM SIGCOMM Computer Communication Review, 2004, 34(4): 353–366
https://doi.org/10.1145/1030194.1015507
20 Arai B, Lin S, Gunopulos D. Efficient data sampling in heterogeneous peer-to-peer networks. In: Proceedings of the 7th IEEE International Conference on Data Engineering. 2007, 23–32
https://doi.org/10.1109/ICDM.2007.71
21 Darlagiannis V, Mauthe A, Steinmetz R. Sampling cluster endurance for peer-to-peer based content distribution networks. Multimedia Systems, 2007, 13(1): 19–33
https://doi.org/10.1007/s00530-007-0078-9
22 Oguz B, Anantharam V, Norros I. Stable distributed P2P protocols based on random peer sampling. IEEE/ACMTransactions on Networking, 2015, 23(5): 1444–1456
https://doi.org/10.1109/TNET.2014.2331352
23 Jelasity M, Voulgaris S, Guerraoui R, Kermarrec A M, Van Steen M. Gossip-based peer sampling. ACM Transactions on Computer Systems, 2007, 25(3): 8
https://doi.org/10.1145/1275517.1275520
24 Wu S, Li J Z, Ooi B C, Tan K L. Just-in-time query retrieval over partially indexed data on structured P2P overlays. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2008, 279–290
https://doi.org/10.1145/1376616.1376647
25 Jagadish H V, Ooi B C, Vu Q H. Baton: a balanced tree structure for peer-to-peer networks. In: Proceedings of the 31st International Conference on Very Large Data Bases. 2005, 661–672
26 Kempe D, Dobra A, Gehrke J. Gossip-based computation of aggregate information. In: Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science. 2003, 482–491
https://doi.org/10.1109/SFCS.2003.1238221
27 Hu Y S, Chen H, Lou J G, Li J. Distributed density estimation using non-parametric statistics. In: Proceedings of the 27th IEEE International Conference on Distributed Computing Systems. 2007
https://doi.org/10.1109/ICDCS.2007.100
28 Zhou M Q, Qian W N, Gong X Q, Zhou A Y. Multi-dimensional data density estimation in P2P networks. Distributed and Parallel Databases, 2009, 26(2–3): 261
https://doi.org/10.1007/s10619-009-7045-8
29 Evans M, Hastings N, Peacock B. Statistical Distributions. New York: Wiley, 2000
30 Stoica I, Morris R, Karger D, Kaashoek M F, Balakrishnan H. Chord:a scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 2001, 31(4): 149–160
https://doi.org/10.1145/964723.383071
31 Rowstron A, Druschel P. Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems. In: Proceedings of IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing. 2001, 329–350
https://doi.org/10.1007/3-540-45518-3_18
32 Jagadish H V, Ooi B C, Vu Q H. BATON: a balanced tree structure for peer-to-peer networks. In: Proceedings of the 31st International Conference on Very Large Data Bases. 2005, 661–672
33 Ioannidis Y E, Poosala V. Balancing histogram optimality and practicality for query result size estimation. ACM SIGMOD Record, 1995, 24(2): 233–244
https://doi.org/10.1145/568271.223841
34 Deering S, Estrin D, Farinacci D, Jacobson V, Liu C G, Wei L. An architecture for wide-area multicast routing. ACM SIGCOMM Computer Communication Review, 1994, 24(4): 126–135
https://doi.org/10.1145/190809.190326
35 Matsumoto M, Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 1998, 8(1): 3–30
https://doi.org/10.1145/272991.272995
36 Wahba G. A polynomial algorithm for density estimation. The Annals of Mathematical Statistics, 1971, 42(6): 1870–1886
https://doi.org/10.1214/aoms/1177693053
37 Pfeiffer P E. Probability for Applications. New York: Springer-Verlag, 1990
https://doi.org/10.1007/978-1-4615-7676-1
38 Gray F. Pulse code communication. U.S. Patent 2,632,058, 1953-03-17
[1] Lijin WANG,Yilong YIN,Yiwen ZHONG. Cuckoo search with varied scaling factor[J]. Front. Comput. Sci., 2015, 9(4): 623-635.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed