MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

doi:10.1007/s11704-013-3158-3

Front. Comput. Sci.

2014, Vol. 8

Issue (1) : 83-99 https://doi.org/10.1007/s11704-013-3158-3

RESEARCH ARTICLE

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Yaobin HE^1,³(

), Haoyu TAN², Wuman LUO², Shengzhong FENG¹, Jianping FAN¹

¹. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
². Department of Computer Science, Guangzhou HKUST Fok Ying Tung Research Institute, Hong Kong University of Science and Technology, Hong Kong 999077, China
³. University of Chinese Academy of Sciences, Beijing 100049, China

Download: PDF(1578 KB)
Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks

Abstract

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.

Keywords data clustering parallel algorithm data mining load balancing

Corresponding Author(s): Yaobin HE

Issue Date: 01 February 2014

Cite this article:

Yaobin HE,Haoyu TAN,Wuman LUO, et al. MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data[J]. Front. Comput. Sci., 2014, 8(1): 83-99.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-013-3158-3
https://academic.hep.com.cn/fcs/EN/Y2014/V8/I1/83

1	M Ester, H P Kriegel, J Sander, X Xu. A densitybased algorithm for discovering clusters in large spatial databases. Data Mining and Knowledge Discovery, 1996, 96: 226−231
2	J B MacQueen. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967, 281−297
3	T Zhang, R Ramakrishnan, M Livny. Birch: an efficient data clustering method for very large databases. In: Proceedings of 1996 the ACM SIGMOD Conference on Managemnet of Data. 1996, 103−114 https://doi.org/10.1145/233269.233324
4	A P Dempster, N M Laird, D B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statisticai Societ, 1977, 39(1): 1−38
5	W Wang, J Yang, R R Muntz. Sting: A statistical information grid approach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, 186−195
6	Microsoft Academic Search. Top publications in data mining. 2013
7	J Dean, S Ghemawat. MapReduce: simplified data processing on large clusters. 2008, 107−113
8	T White. Hadoop: The Definitive Guide, 1st edition. O’Reilly Media, Inc., 2009
9	M Berger, S Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, 1987, 36: 570−580 https://doi.org/10.1109/TC.1987.1676942
10	B R Dai, I C Lin. Efficient map/reduce-based dbscan algorithm with optimized data partition. In: Proceedings of the 5th IEEE International Conference on Cloud Computing. 2012, 59−66
11	S T Leutenegger, J M Edgington, M A Lopez. Str: a simple and efficient algorithm for r-tree packing. In: Proceedings of the 1997 IEEE International Conference on Data Engineering. 1997, 497−506
12	Y Theodoridis, T Sellis. A model for the prediction of r-tree perfor-mance. In: Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 1996, 161−171
13	United States Census Bureau. TIGER/Line Shapefiles.
14	J Sander, M Ester, H P Kriegel, X Xu. Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data Mining and Knowledge Discovery, 1998, 2(2): 169−194 https://doi.org/10.1023/A:1009745219419
15	M Ankerst, M M Breunig, H P Kriegel, J Sander. Optics: ordering points to identify the clustering structure. SIGMOD Record, 1999, 28: 49−60 https://doi.org/10.1145/304181.304187
16	E Januzaj, H P Kriegel, M Pfeifle. Scalable density-based distributed clustering. In: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases. 2004, 231−244
17	W Zhao, H Ma, Q He. Parallel k-means clustering based on mapreduce. In: Proceedings of the 1st International Conference on Cloud Computing. 2009, 674−679
18	Y Kwon, D Nunley, J P Gardner, M Balazinska, B Howe, S Loebman. Scalable clustering algorithm for n-body simulations in a sharednothing cluster. In: Proceedings of the 22nd International Conference on Scientific and Statistical Database Management. 2010, 132−150
19	J L Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18: 509−517 https://doi.org/10.1145/361002.361007
20	X Xu, J Jäger, H P Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 1999, 3: 263−290 https://doi.org/10.1023/A:1009884809343
21	Y He, H Tan, W Luo, H Mao, D Ma, S Feng, J Fan. MR-DBSCAN: an efficient parallel density-based clustering algorithm using mapreduce. In: Proceedings of the 2011 IEEE International Conference on Parallel and Distributed Systems. 2011, 473−480 https://doi.org/10.1109/ICPADS.2011.83

[1]	Kaimin WEI, Tianqi LI, Feiran HUANG, Jinpeng CHEN, Zefan HE. Cancer classification with data augmentation based on generative adversarial networks[J]. Front. Comput. Sci., 2022, 16(2): 162601-.
[2]	Genan DAI, Xiaoyang HU, Youming GE, Zhiqing NING, Yubao LIU. Attention based simplified deep residual network for citywide crowd flows prediction[J]. Front. Comput. Sci., 2021, 15(2): 152317-.
[3]	Yuling MA, Chaoran CUI, Jun YU, Jie GUO, Gongping YANG, Yilong YIN. Multi-task MIML learning for pre-course student performance prediction[J]. Front. Comput. Sci., 2020, 14(5): 145313-.
[4]	Guijuan ZHANG, Yang LIU, Xiaoning JIN. A survey of autoencoder-based recommender systems[J]. Front. Comput. Sci., 2020, 14(2): 430-450.
[5]	Lu LIU, Shang WANG. Meta-path-based outlier detection in heterogeneous information network[J]. Front. Comput. Sci., 2020, 14(2): 388-403.
[6]	Xu-Ying LIU, Sheng-Tao WANG, Min-Ling ZHANG. Transfer synthetic over-sampling for class-imbalance learning with limited minority class data[J]. Front. Comput. Sci., 2019, 13(5): 996-1009.
[7]	Satoshi MIYAZAWA, Xuan SONG, Tianqi XIA, Ryosuke SHIBASAKI, Hodaka KANEDA. Integrating GPS trajectory and topics from Twitter stream for human mobility estimation[J]. Front. Comput. Sci., 2019, 13(3): 460-470.
[8]	Shuaiqiang WANG, Yilong YIN. Polygene-based evolutionary algorithms with frequent pattern mining[J]. Front. Comput. Sci., 2018, 12(5): 950-965.
[9]	Bo SUN, Haiyan CHEN, Jiandong WANG, Hua XIE. Evolutionary under-sampling based bagging ensemble method for imbalanced data classification[J]. Front. Comput. Sci., 2018, 12(2): 331-350.
[10]	Xiong FU, Juzhou CHEN, Song DENG, Junchang WANG, Lin ZHANG. Layered virtual machine migration algorithm for network resource balancing in cloud computing[J]. Front. Comput. Sci., 2018, 12(1): 75-85.
[11]	Cheqing JIN, Jie CHEN, Huiping LIU. MapReduce-based entity matching with multiple blocking functions[J]. Front. Comput. Sci., 2017, 11(5): 895-911.
[12]	Yuan LI, Yuhai ZHAO, Guoren WANG, Xiaofeng ZHU, Xiang ZHANG, Zhanghui WANG, Jun PANG. Finding susceptible and protective interaction patterns in large-scale genetic association study[J]. Front. Comput. Sci., 2017, 11(3): 541-554.
[13]	Junhua LU,Wei CHEN,Yuxin MA,Junming KE,Zongzhuang LI,Fan ZHANG,Ross MACIEJEWSKI. Recent progress and trends in predictive visual analytics[J]. Front. Comput. Sci., 2017, 11(2): 192-207.
[14]	Chengliang WANG,Yayun PENG,Debraj DE,Wen-Zhan SONG. DPHK: real-time distributed predicted data collecting based on activity pattern knowledge mined from trajectories in smart environments[J]. Front. Comput. Sci., 2016, 10(6): 1000-1011.
[15]	Wenmei LIU,Hui LIU. Major motivations for extract method refactorings: analysis based on interviews and change histories[J]. Front. Comput. Sci., 2016, 10(4): 644-656.

Viewed

Full text

Abstract

Cited

Shared

Discussed