Frontiers of Computer Science

Front. Comput. Sci.    2024, Vol. 18 Issue (5) : 185322    https://doi.org/10.1007/s11704-023-2356-x
RESEARCH ARTICLE
Density estimation-based method to determine sample size for random sample partition of big data
Yulin HE1,2, Jiaqi CHEN1,2, Jiaxing SHEN3, Philippe FOURNIER-VIGER2, Joshua Zhexue HUANG1,2()
1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China
2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
3. Department of Computing and Decision Sciences, Lingnan University, Hong Kong 999077, China
Abstract

Random sample partition (RSP) is a recently developed big data representation and management model for approximate big data computation. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge in implementing RSP is determining an appropriate sample size for RSP data blocks: a large sample size increases the computational burden, while a small one leaves RSP data blocks with insufficient distribution information. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated from the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks as the sample size increases. Finally, a series of experiments is conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method converges when calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with the sample size determined by DEM yields a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than existing sample size determination methods from the perspective of p.d.f. estimation. These results demonstrate that DEM is a viable approach to the sample size determination problem in big data RSP implementations.
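For concreteness, the following minimal sketch illustrates the second step under stated assumptions: the sample size drawn from an RSP data block is grown until the held-out validation error of a KDE stops improving. The Gaussian kernel, the negative mean log-density error, the half/half train-validation split, and the function name `practical_sample_size` are illustrative assumptions, not the paper's Algorithm 2.

```python
# Sketch of DEM's second step (assumptions noted above): increase the sample
# size m drawn from an RSP data block, fit a Gaussian KDE on one half of the
# sample, and stop once the validation error on the other half stabilizes.
import numpy as np
from scipy.stats import gaussian_kde

def practical_sample_size(block, m0, step, tol=1e-3):
    """Return (m, err) where m is the first sample size whose KDE
    validation error changes by less than tol."""
    rng = np.random.default_rng(0)
    m, prev_err = m0, np.inf
    while m <= len(block):
        idx = rng.choice(len(block), size=m, replace=False)
        train, valid = block[idx[: m // 2]], block[idx[m // 2:]]
        kde = gaussian_kde(train.T)                    # KDE on the first half
        err = -np.mean(np.log(kde(valid.T) + 1e-300))  # error on the second half
        if abs(prev_err - err) < tol:                  # validation error stable
            return m, err
        prev_err, m = err, m + step
    return m, prev_err
```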

Keywords: random sample partition; big data; sample size; Dvoretzky-Kiefer-Wolfowitz inequality; kernel density estimator; probability density function
Corresponding Author(s): Joshua Zhexue HUANG   
Just Accepted Date: 28 April 2023   Issue Date: 10 July 2023
 Cite this article:   
Yulin HE, Jiaqi CHEN, Jiaxing SHEN, et al. Density estimation-based method to determine sample size for random sample partition of big data[J]. Front. Comput. Sci., 2024, 18(5): 185322.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2356-x
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I5/185322
  
α=0.05, D=10        ε=0.05, D=10        α=0.05, ε=0.05
ε       M           α       M           D       M
0.01    83,132      0.01    2,981       2       2,284
0.02    18,933      0.02    2,832       4       2,436
0.03    7,931       0.03    2,745       6       2,524
0.04    4,267       0.04    2,683       8       2,586
0.05    2,634       0.05    2,634       10      2,634
0.06    1,774       0.06    2,595       12      2,674
0.07    1,269       0.07    2,562       14      2,707
0.08    949         0.08    2,533       16      2,736
0.09    734         0.09    2,507       18      2,761
0.10    583         0.10    2,484       20      2,784
Tab.1  Influence of D, α, and ε on sample size M
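The theoretical sample sizes in Tab.1 can be reproduced with a short fixed-point iteration. The sketch below assumes the multivariate DKW bound is applied in the form α = D(M+1)e^(−2Mε²) (cf. ref 10), so that M satisfies M = ln(D(M+1)/α)/(2ε²); it illustrates the idea behind Algorithm 1 rather than reproducing the authors' exact implementation, and the function name is hypothetical.

```python
import math

def theoretical_sample_size(alpha, D, eps, m0=1000.0, tol=0.5):
    """Fixed-point iteration for M in alpha = D * (M + 1) * exp(-2 * M * eps**2),
    rearranged as M = ln(D * (M + 1) / alpha) / (2 * eps**2)."""
    m = m0
    while True:
        m_next = math.log(D * (m + 1) / alpha) / (2 * eps ** 2)
        if abs(m_next - m) < tol:        # iterates have stabilized
            return round(m_next)
        m = m_next

print(theoretical_sample_size(0.05, 10, 0.05))  # ~2,635, cf. 2,634 in Tab.1
print(theoretical_sample_size(0.05, 2, 0.05))   # ~2,285, cf. 2,284 in Tab.1
```

The iteration map is a contraction here because its derivative, 1/(2ε²(M+1)), is far below 1 near the fixed point, which is consistent with the fast convergence shown in Fig.1; small rounding differences from Tab.1 are expected.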
  
Fig.1  Convergence of Algorithm 1 for two error thresholds (ε=0.01 and ε=0.05). (a) M = 10,000 and ε = 0.01; (b) M = 200,000 and ε = 0.01; (c) M = 500 and ε = 0.05; (d) M = 10,000 and ε = 0.05
Fig.2  Influence of α, D, and ε for determining the optimal theoretical sample size. (a) Impact of significance level α; (b) impact of data dimension D; (c) impact of error threshold ε
Fig.3  Convergence of Algorithm 2 on a 10-mode-and-50-dimension synthetic data set (the theoretical sample size is 3,326). (a) 2-norm of log(p.d.f.); (b) estimated error
Fig.4  Convergence of Algorithm 2 on a 20-mode-and-50-dimension synthetic data set (the theoretical sample size is 3,326). (a) 2-norm of log(p.d.f.); (b) estimated error
Fig.5  Comparison of p.d.f. estimations for the 10-mode-and-50-dimension synthetic data set. (a) Dimension pair (#3,#41); (b) dimension pair (#11,#12); (c) dimension pair (#14,#39); (d) dimension pair (#16,#44)
Fig.6  Comparison of p.d.f. estimations for the 20-mode-and-50-dimension synthetic data set. (a) Dimension pair (#3,#41); (b) dimension pair (#11,#12); (c) dimension pair (#14,#39); (d) dimension pair (#16,#44)
Probability distribution          Mode R    Dimension D    Sample number N    Data size
10-mode-and-10-dimension p.d.f.   10        10             1,000,000          239 MB
10-mode-and-30-dimension p.d.f.   10        30             1,000,000          716 MB
10-mode-and-50-dimension p.d.f.   10        50             1,000,000          1.2 GB
10-mode-and-70-dimension p.d.f.   10        70             1,000,000          1.7 GB
10-mode-and-90-dimension p.d.f.   10        90             1,000,000          2.1 GB
20-mode-and-10-dimension p.d.f.   20        10             1,000,000          239 MB
20-mode-and-30-dimension p.d.f.   20        30             1,000,000          716 MB
20-mode-and-50-dimension p.d.f.   20        50             1,000,000          1.2 GB
20-mode-and-70-dimension p.d.f.   20        70             1,000,000          1.7 GB
20-mode-and-90-dimension p.d.f.   20        90             1,000,000          2.1 GB
Tab.2  Details of ten multiple-mode-and-multiple-dimension probability distributions
Data set (R, D)     | DEM: size / KL divergence        | PEM: size / KL divergence  | PMEM: size / KL divergence       | Slovin: size / KL divergence | BLB: size / KL divergence
#1 (10, 10)         | 22,721±1,801 / 5.472±0.215       | 664 / 27.191±0.724         | 120,704,033 / 1.382±0.009        | 400 / 33.443±1.048           | 3,982 / 12.391±0.160
#2 (20, 10)         | 17,231±1,609 / 133.085±2.747     | 664 / 193.707±7.939        | 126,073,272 / 117.481±0.558      | 400 / 210.179±7.095          | 3,982 / 152.980±4.777
#3 (10, 30)         | 23,309±5,806 / 390.160±6.047     | 664 / 461.030±31.830       | 88,161,518 / 387.137±2.550       | 400 / 482.872±32.974         | 3,982 / 408.454±7.421
#4 (20, 30)         | 18,222±2,799 / 1,371.175±11.426  | 664 / 1,518.077±73.759     | 146,430,907 / 1,332.028±5.745    | 400 / 1,570.067±110.786      | 3,982 / 1,418.196±37.204
#5 (10, 50)         | 19,497±1,358 / 1,692.886±17.667  | 664 / 1,766.343±83.468     | 58,328,135 / 1,698.777±5.474     | 400 / 1,832.928±112.976      | 3,982 / 1,713.058±49.177
#6 (20, 50)         | 21,644±6,451 / 2,002.653±18.110  | 664 / 2,325.073±105.687    | 115,185,014 / 1,954.991±8.062    | 400 / 2,311.615±164.011      | 3,982 / 2,074.893±45.890
#7 (10, 70)         | 18,577±2,909 / 3,567.995±36.217  | 664 / 3,659.661±130.785    | 73,566,581 / 3,563.855±11.920    | 400 / 3,714.130±113.911      | 3,982 / 3,593.872±27.881
#8 (20, 70)         | 17,473±2,912 / 14,160.358±42.196 | 664 / 14,727.513±168.544   | 110,208,079 / 14,070.002±14.483  | 400 / 14,649.220±175.349     | 3,982 / 14,241.059±59.866
#9 (10, 90)         | 18,143±4,488 / 4,896.886±29.501  | 664 / 5,104.065±120.971    | 69,407,840 / 4,883.667±7.847     | 400 / 5,215.207±83.576       | 3,982 / 4,913.157±53.618
#10 (20, 90)        | 18,776±2,827 / 12,648.245±42.346 | 664 / 13,025.018±166.676   | 123,067,620 / 12,603.501±16.031  | 400 / 13,102.256±172.041     | 3,982 / 12,697.893±108.720
Tab.3  Comparison of KL divergence for different sample size determination methods
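The Tab.3 scores can be read as follows: each method proposes a sample size, a KDE is built on a sample of that size, and its quality is measured by the KL divergence from the known synthetic density. Below is a minimal Monte Carlo sketch of such a score, assuming the true density of the synthetic mixtures in Tab.2 is available in closed form; the paper's estimator may instead follow the k-nearest-neighbor approach of refs 23 and 24, and the function name is hypothetical.

```python
# Hedged sketch: Monte Carlo estimate of D_KL(p || p_hat), where p is the known
# synthetic density and p_hat is a Gaussian KDE fitted on the evaluated sample.
import numpy as np
from scipy.stats import gaussian_kde

def kl_score(sample, true_logpdf, draw_from_p, n_mc=10_000, seed=0):
    """D_KL(p || p_hat) ~= mean over x ~ p of [log p(x) - log p_hat(x)]."""
    rng = np.random.default_rng(seed)
    p_hat = gaussian_kde(sample.T)          # density estimate from the sample
    x = draw_from_p(n_mc, rng)              # points drawn from the true density
    return float(np.mean(true_logpdf(x) - np.log(p_hat(x.T) + 1e-300)))
```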
Fig.7  Relationship between the sample size and KL divergence. (a) 10-mode-and-50-dimension synthetic data set; (b) 20-mode-and-10-dimension synthetic data set; (c) 20-mode-and-30-dimension synthetic data set; (d) 20-mode-and-50-dimension synthetic data set; (e) 20-mode-and-70-dimension synthetic data set; (f) 20-mode-and-90-dimension synthetic data set
Fig.8  Sample size (P=79,606) determination on the HIGGS data set. (a) Convergence of EDF on attribute #7; (b) convergence of EDF on attribute #11
Fig.9  Sample size (P=99,348) determination on the HEPMASS data set. (a) Convergence of EDF on attribute #1; (b) convergence of EDF on attribute #7
Sample size    HIGGS Attrib #7    HIGGS Attrib #11    HEPMASS Attrib #1    HEPMASS Attrib #7
P              0.57200            0.57350             0.71586              0.62498
P+10,000       0.57207            0.57365             0.71592              0.62513
Tab.4  Average probabilities of EDFs with different sample sizes
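Tab.4 amounts to an EDF-stability test: if the practical sample size P is sufficient, the empirical distribution function (EDF) of an attribute barely moves when 10,000 more records are added. The sketch below interprets "average probability" as the EDF averaged over a fixed evaluation grid, which is an assumption; the paper's exact summary statistic may differ.

```python
import numpy as np

def mean_edf(values, grid):
    """Average of the empirical distribution function of `values`,
    evaluated at the points in `grid`."""
    sorted_vals = np.sort(values)
    edf = np.searchsorted(sorted_vals, grid, side="right") / len(values)
    return float(edf.mean())

# Usage: compare mean_edf(attr[:P], grid) against mean_edf(attr[:P + 10_000], grid);
# differences on the order of 1e-4, as in Tab.4, indicate the EDF has converged.
```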
1 M Sookhak, F R Yu, A Y Zomaya. Auditing big data storage in cloud computing using divide and conquer tables. IEEE Transactions on Parallel and Distributed Systems, 2018, 29(5): 999–1012
2 S Y Zhao, R X Li, W L Tian, W J Xiao, X H Dong, D J Liao, S U Khan, K Q Li. Divide-and-conquer approach for solving singular value decomposition based on MapReduce. Concurrency and Computation: Practice and Experience, 2016, 28(2): 331–350
3 M R Ghazi, D Gangodkar. Hadoop, MapReduce and HDFS: a developers perspective. Procedia Computer Science, 2015, 48: 45–50
4 Neha M P, Narendra M P, Hasan M I, Parth D S, Mayur M P. Improving HDFS write performance using efficient replica placement. In: Proceedings of the 5th International Conference-Confluence the Next Generation Information Technology Summit. 2014, 36−39
5 S Salloum, J Z Huang, Y L He. Random sample partition: a distributed data model for big data analysis. IEEE Transactions on Industrial Informatics, 2019, 15(11): 5846–5854
6 C H Wei, S Salloum, T Z Emara, X L Zhang, J Z Huang, Y L He. A two-stage data processing algorithm to generate random sample partitions for big data analysis. In: Proceedings of the 11th International Conference on Cloud Computing. 2018, 347−364
7 T Yamane. Statistics: An Introductory Analysis. 2nd ed. New York: Harper and Row, 1967
8 W G Cochran. Sampling Techniques. New York: John Wiley & Sons, 2007
9 M F Smith. Sampling considerations in evaluating cooperative extension programs. Gainesville: Florida Cooperative Extension Service, Institute of Food and Agricultural Sciences, University of Florida, 1983
10 M Naaman. On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality. Statistics & Probability Letters, 2021, 173: 109088
11 A Kleiner, A Talwalkar, P Sarkar, M I Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2014, 76(4): 795–816
12 D N Reshef, Y A Reshef, H K Finucane, S R Grossman, G McVean, P J Turnbaugh, E S Lander, M Mitzenmacher, P C Sabeti. Detecting novel associations in large data sets. Science, 2011, 334(6062): 1518–1524
13 S Sengupta, S Volgushev, X F Shao. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 2016, 111(515): 1222–1232
14 R H Browne. On the use of a pilot sample for sample size determination. Statistics in Medicine, 1995, 14(17): 1933–1940
15 R V Lenth. Some practical guidelines for effective sample size determination. The American Statistician, 2001, 55(3): 187–193
16 W M A W Ahmad, W A A W M Amin, N A Aleng, N Mohamed. Some practical guidelines for effective sample-size determination in observational studies. Aceh International Journal of Science and Technology, 2012, 1(2): 51–53
17 E Burmeister, L M Aitken. Sample size: how many is enough? Australian Critical Care, 2012, 25(4): 271–274
18 S Okada, M Ohzeki, S Taguchi. Efficient partition of integer optimization problems with one-hot encoding. Scientific Reports, 2019, 9(1): 13036
19 Y L He, X Ye, D F Huang, P Fournier-Viger, J Z Huang. A hybrid method to measure distribution consistency of mixed-attribute datasets. IEEE Transactions on Artificial Intelligence, 2023, 4(1): 182–196
20 E Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 1962, 33(3): 1065–1076
21 J Jiang, Y L He, D X Dai, J Z Huang. A new kernel density estimator based on the minimum entropy of data set. Information Sciences, 2019, 491: 223–231
22 M C Jones, J S Marron, S J Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 1996, 91(433): 401–407
23 F Perez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In: Proceedings of 2008 IEEE International Symposium on Information Theory. 2008, 1666−1670
24 F Perez-Cruz. Estimation of information theoretic measures for continuous random variables. In: Proceedings of the 21st International Conference on Neural Information Processing Systems. 2008, 1257−1264
25 Y Y Yan, D Z Cheng, J E Feng, H T Li, J M Yue. Survey on applications of algebraic state space theory of logical systems to finite state machines. Science China Information Sciences, 2023, 66(1): 111201