Density estimation-based method to determine sample size for random sample partition of big data
Yulin HE1,2, Jiaqi CHEN1,2, Jiaxing SHEN3, Philippe FOURNIER-VIGER2, Joshua Zhexue HUANG1,2
1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China
2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
3. Department of Computing and Decision Sciences, Lingnan University, Hong Kong 999077, China
|
|
Abstract Random sample partition (RSP) is a recently developed big data representation and management model for approximate computation on big data. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, implementing RSP requires choosing an appropriate sample size for the RSP data blocks: a large sample size increases the computational burden, while a small one leaves each RSP data block with insufficient information about the underlying distribution. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated from the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks as the sample size increases. Finally, a series of experiments is conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method converges when calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with the sample size determined by DEM yields a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than existing sample size determination methods from the perspective of p.d.f. estimation. These results demonstrate that DEM is a viable approach to the sample size determination problem in big data RSP implementation.
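The first step of DEM, solving a DKW-type bound for the sample size by fixed-point iteration, can be illustrated with a minimal sketch. The abstract does not give the iteration function, so the code below is an assumption rather than the authors' formulation: it takes the multivariate DKW bound in the form d(n + 1)exp(-2nε²) ≤ α, rearranges it to n = ln(d(n + 1)/α) / (2ε²), and iterates that map. The names `dkw_bound` and `theoretical_sample_size`, the starting point `n0`, and the tolerance are hypothetical choices made for illustration.

```python
import math


def dkw_bound(n, d, eps):
    """Assumed form of the multivariate DKW bound: d * (n + 1) * exp(-2 * n * eps^2)."""
    return d * (n + 1) * math.exp(-2.0 * n * eps ** 2)


def theoretical_sample_size(d, eps, alpha, n0=1.0, tol=1e-9, max_iter=1000):
    """Smallest integer n with dkw_bound(n, d, eps) <= alpha, found via fixed-point iteration.

    d     : number of attributes (dimension), d >= 1
    eps   : tolerated deviation between empirical and true distribution functions
    alpha : tolerated probability of exceeding eps
    """
    if d < 1 or not (0.0 < eps < 1.0) or not (0.0 < alpha < 1.0):
        raise ValueError("require d >= 1, 0 < eps < 1, 0 < alpha < 1")

    n = n0
    for _ in range(max_iter):
        # Iteration function obtained by solving d*(n+1)*exp(-2*n*eps^2) = alpha for n.
        n_next = math.log(d * (n + 1.0) / alpha) / (2.0 * eps ** 2)
        converged = abs(n_next - n) < tol
        n = n_next
        if converged:
            break

    # Round up, then guard against rounding or tolerance effects near an integer.
    size = math.ceil(n)
    while dkw_bound(size, d, eps) > alpha:
        size += 1
    return size


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    n = theoretical_sample_size(d=2, eps=0.05, alpha=0.05)
    print("theoretical sample size:", n)
    print("bound at that size:", dkw_bound(n, 2, 0.05))
```

Under these assumptions the iteration function is increasing and concave in n, so starting below the fixed point the iterates rise monotonically towards it. The second step of DEM, refining this theoretical size by minimizing the KDE validation error, is not covered by this sketch.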
|
Keywords
random sample partition
big data
sample size
Dvoretzky-Kiefer-Wolfowitz inequality
kernel density estimator
probability density function
|
Corresponding Author(s):
Joshua Zhexue HUANG
|
Just Accepted Date: 28 April 2023
Issue Date: 10 July 2023
|