|
|
A parallel computing framework for big data |
Guoliang CHEN1,2, Rui MAO1,2(), Kezhong LU1,2 |
1. Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen 518060, China 2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China |
|
|
Abstract Big data has received great attention in research and application. However, most of the current efforts focus on system and application to handle the challenges of “volume” and “velocity”, and not much has been done on the theoretical foundation and to handle the challenge of “variety”. Based on metric-space indexing and computationalcomplexity theory, we propose a parallel computing framework for big data. This framework consists of three components, i.e., universal representation of big data by abstracting various data types into metric space, partitioning of big data based on pair-wise distances in metric space, and parallel computing of big data with the NC-class computing theory.
|
Keywords
NC-computing
metric space
data partitioning
parallel computing
|
Corresponding Author(s):
Rui MAO
|
Just Accepted Date: 29 March 2016
Online First Date: 17 March 2017
Issue Date: 26 July 2017
|
|
1 |
ZhouB, Liu W, FanC . Big Data: Strategy, Technology and Practice. Beijing: Publishing House of Electronics Industry, 2013
|
2 |
TuZ. Big Data. Guilin: Guangxi Normal University Press, 2012
|
3 |
FanW, GeertsF, NevenF. Making queries tractable on big data with preprocessing: through the eyes of complexity theory. Proceedings of the VLDB Endowment, 2013, 6(9): 685–696
https://doi.org/10.14778/2536360.2536368
|
4 |
ChenG. Design and Analysis of Parallel Algorithms. Beijing: Higher Education Press, 2011
|
5 |
CookS A. Deterministic CFL’s are accepted simultaneously in polynomial time and log squared space. In: Proceedings of the 11th Annual ACM Symposium on Theory of Computing. 1979, 338–345
https://doi.org/10.1145/800135.804426
|
6 |
MaoR, XuH L, WuW B, Li J Q, LiY , LuM H. Overcoming the challenge of variety: big data abstraction, the next evolution of data management for AAL communication system. IEEE Communications Magazine, 2015, 53(1): 42–47
https://doi.org/10.1109/MCOM.2015.7010514
|
7 |
XiongJ, LuJ, TanF. Topology. Beijing:China Machine Press, 2013
|
8 |
ChávezE, Navarro G, Baeza-YatesR , MarroquínJ L. Searching in metric spaces. ACM Computing Surveys, 2001, 33(3): 273–321
https://doi.org/10.1145/502807.502808
|
9 |
ZezulaP, AmatoG, DohnalV, Batko M. Similarity Search: the Metric Space Approach. Springer Science & Business Media, 2006
|
10 |
MaoR, ZhangP H, LiX L, Liu X, LuM H . Pivot selection for metricspace indexing. International Journal of Machine Learning and Cybernetics, 2016, 7(2): 311–323
https://doi.org/10.1007/s13042-016-0504-4
|
11 |
MaoR, Miranker W, MirankerD P . Pivot selection: dimension reduction for distance-based indexing. Journal of Discrete Algorithms, 2012, 13: 32–46
https://doi.org/10.1016/j.jda.2011.10.004
|
12 |
ShiH M, Schaeffer J. Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing, 1992, 14(4): 361–372
https://doi.org/10.1016/0743-7315(92)90075-X
|
13 |
ValiantL G. Parallelism in comparison problems. SIAM Journal on Computing, 1975, 4(3): 348–355
https://doi.org/10.1137/0204030
|
14 |
ShiloachY, Vishkin U. Finding the maximum. Merging and Sorting in a Parallel Computational Model, 1981, 2(1): 88–102
|
15 |
ChenG. Balanced (n, m)-selection networks. Computer Research and Development, 1984, 21(11): 9–22
|
16 |
Samet, H. Foundations of Multidimensional and Metric Data Structures. San Francisco: Morgan-Kaufmann, 2006
|
17 |
HjaltasonG R, SametH. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 2003. 28(4): 517–580
https://doi.org/10.1145/958942.958948
|
18 |
UhlmannJ K. Satisfying general proximity/similarity queries withmetric trees. Information Processing Letter, 1991, 40(4): 175–179
https://doi.org/10.1016/0020-0190(91)90074-R
|
19 |
YianilosP N. Data structures and algorithms for nearest neighbor search in general metric spaces. In: Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms. 1993
|
20 |
MaoR, LiuS, XuH L, Zhang D, MirankerD P . On data partitioning in tree structure metric-space indexes. In: Proceedings of the 19th International Conference on Database Systems for Advanced Applications. 2014, 141–155
https://doi.org/10.1007/978-3-319-05810-8_10
|
21 |
BozkayaT, Ozsoyoglu M. Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems, 1999, 24(3): 361–404
https://doi.org/10.1145/328939.328959
|
22 |
SergeyB. Near neighbor search in large metric spaces. In: Proceedings of the 21st International Conference on Very Large Data Bases. 1995
|
23 |
CiacciaP, Patella M, ZezulaP . M-tree: an efficient access method for similarity search in metric spaces. In: proceedings of the 23rd International Conference on Very Large Data Bases. 1997
|
24 |
NavarroG. Searching in metric spaces by spatial approximation. The VLDB Journal, 2002, 11(1): 28–46
https://doi.org/10.1007/s007780200060
|
25 |
JagadishH V, OoiB C, TanK L, Yu C, ZhangR . iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems (TODS), 2005, 30(2): 364–397
https://doi.org/10.1145/1071610.1071612
|
26 |
ChavezE, Navarro G. Unbalancing: the key to index high dimensional metric spaces. Technical Report. 1999
|
27 |
MaoR, XuW, RamakrishnanS . NuckollsG, Miranker D P. On optimizing distance-based similarity search for biological databases. In: Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference. 2005, 351–361
https://doi.org/10.1109/CSB.2005.42
|
28 |
MacQueenJ B. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967, 281–297
|
29 |
EsterM, Kriegel H P, SanderJ , XuX. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM SIGKDD Conference. 1996. 226–231
|
30 |
CullerD, KarpR, PattersonD, Sahay A, SchauserK E , SantosE, Subramonian R, von EickenT . LogP: towards a realistic model of parallel computation. In: Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1993, 1–12
https://doi.org/10.1145/155332.155333
|
31 |
TuB B, ZouM, ZhanJ F, Zhao X F, FanJ P . Research on parallel computation model with memory hierarchy on multi-core clusters. Chinese Journal of Computers, 2008, 31(11): 1948–1955
https://doi.org/10.3724/SP.J.1016.2008.01948
|
32 |
LadnerR E. The circuit value problem is log space complete for P. ACM SIGACT News, 1975, 7(1): 18–20
https://doi.org/10.1145/990518.990519
|
33 |
PippengerN J. On simultaneous resource bounds. In: Proceedings of the 20th IEEE Annual Symposium on Foundations of Computer Science. 1979, 307–311
https://doi.org/10.1109/sfcs.1979.29
|
34 |
ChandraA K, Stockmeyer L J. Alternation. In: Proceedings of the 17th IEEE Annual Symposium on Foundations of Computer Science. 1976, 98–108
https://doi.org/10.1109/sfcs.1976.4
|
35 |
GoldschlagerL M. Synchronous parallel computation. Dissertation for the Doctoral Degree. Toronto: University of Toronto, 1977
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|