Frontiers of Computer Science

Front. Comput. Sci.    2024, Vol. 18 Issue (5) : 185322    https://doi.org/10.1007/s11704-023-2356-x
RESEARCH ARTICLE
Density estimation-based method to determine sample size for random sample partition of big data
Yulin HE1,2, Jiaqi CHEN1,2, Jiaxing SHEN3, Philippe FOURNIER-VIGER2, Joshua Zhexue HUANG1,2()
1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China
2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
3. Department of Computing and Decision Sciences, Lingnan University, Hong Kong 999077, China
Abstract

Random sample partition (RSP) is a recently developed big data representation and management model for approximate big data computation. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge in implementing RSP is determining an appropriate sample size for RSP data blocks: a large sample size increases the computational burden, while a small one leaves RSP data blocks with insufficient distribution information. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated from the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks as the sample size increases. Finally, a series of experiments is conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method converges when calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with the sample size determined by DEM yields a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than existing sample size determination methods from the perspective of p.d.f. estimation. These results demonstrate that DEM is a viable approach to the sample size determination problem in big data RSP implementations.
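For concreteness, the following minimal sketch illustrates the second step under stated assumptions: the sample size drawn from an RSP data block is grown until the held-out validation error of a KDE stops improving. The Gaussian kernel, the negative mean log-density error, the half/half train-validation split, and the function name `practical_sample_size` are illustrative assumptions, not the paper's Algorithm 2.

```python
# Sketch of DEM's second step (assumptions noted above): increase the sample
# size m drawn from an RSP data block, fit a Gaussian KDE on one half of the
# sample, and stop once the validation error on the other half stabilizes.
import numpy as np
from scipy.stats import gaussian_kde

def practical_sample_size(block, m0, step, tol=1e-3):
    """Return (m, err) where m is the first sample size whose KDE
    validation error changes by less than tol."""
    rng = np.random.default_rng(0)
    m, prev_err = m0, np.inf
    while m <= len(block):
        idx = rng.choice(len(block), size=m, replace=False)
        train, valid = block[idx[: m // 2]], block[idx[m // 2:]]
        kde = gaussian_kde(train.T)                    # KDE on the first half
        err = -np.mean(np.log(kde(valid.T) + 1e-300))  # error on the second half
        if abs(prev_err - err) < tol:                  # validation error stable
            return m, err
        prev_err, m = err, m + step
    return m, prev_err
```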

Keywords: random sample partition; big data; sample size; Dvoretzky-Kiefer-Wolfowitz inequality; kernel density estimator; probability density function
Corresponding Author(s): Joshua Zhexue HUANG   
Just Accepted Date: 28 April 2023   Issue Date: 10 July 2023
 Cite this article:   
Yulin HE, Jiaqi CHEN, Jiaxing SHEN, et al. Density estimation-based method to determine sample size for random sample partition of big data[J]. Front. Comput. Sci., 2024, 18(5): 185322.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2356-x
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I5/185322
  
α=0.05, D=10        ε=0.05, D=10        α=0.05, ε=0.05
ε       M           α       M           D       M
0.01    83,132      0.01    2,981       2       2,284
0.02    18,933      0.02    2,832       4       2,436
0.03    7,931       0.03    2,745       6       2,524
0.04    4,267       0.04    2,683       8       2,586
0.05    2,634       0.05    2,634       10      2,634
0.06    1,774       0.06    2,595       12      2,674
0.07    1,269       0.07    2,562       14      2,707
0.08    949         0.08    2,533       16      2,736
0.09    734         0.09    2,507       18      2,761
0.10    583         0.10    2,484       20      2,784
Tab.1  Influence of D, α, and ε on sample size M
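The theoretical sample sizes in Tab.1 can be reproduced with a short fixed-point iteration. The sketch below assumes the multivariate DKW bound is applied in the form α = D(M+1)e^(−2Mε²) (cf. ref 10), so that M satisfies M = ln(D(M+1)/α)/(2ε²); it illustrates the idea behind Algorithm 1 rather than reproducing the authors' exact implementation, and the function name is hypothetical.

```python
import math

def theoretical_sample_size(alpha, D, eps, m0=1000.0, tol=0.5):
    """Fixed-point iteration for M in alpha = D * (M + 1) * exp(-2 * M * eps**2),
    rearranged as M = ln(D * (M + 1) / alpha) / (2 * eps**2)."""
    m = m0
    while True:
        m_next = math.log(D * (m + 1) / alpha) / (2 * eps ** 2)
        if abs(m_next - m) < tol:        # iterates have stabilized
            return round(m_next)
        m = m_next

print(theoretical_sample_size(0.05, 10, 0.05))  # ~2,635, cf. 2,634 in Tab.1
print(theoretical_sample_size(0.05, 2, 0.05))   # ~2,285, cf. 2,284 in Tab.1
```

The iteration map is a contraction here because its derivative, 1/(2ε²(M+1)), is far below 1 near the fixed point, which is consistent with the fast convergence shown in Fig.1; small rounding differences from Tab.1 are expected.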
  
Fig.1  Convergence of Algorithm 1 for two error thresholds (ε=0.01 and ε=0.05). (a) M = 10,000 and ε = 0.01; (b) M = 200,000 and ε = 0.01; (c) M = 500 and ε = 0.05; (d) M = 10,000 and ε = 0.05
Fig.2  Influence of α, D, and ε for determining the optimal theoretical sample size. (a) Impact of significance level α; (b) impact of data dimension D; (c) impact of error threshold ε
Fig.3  Convergence of Algorithm 2 on a 10-mode-and-50-dimension synthetic data set (the theoretical sample size is 3,326). (a) 2-norm of log(p.d.f.); (b) estimated error
Fig.4  Convergence of Algorithm 2 on a 20-mode-and-50-dimension synthetic data set (the theoretical sample size is 3,326). (a) 2-norm of log(p.d.f.); (b) estimated error
Fig.5  Comparison of p.d.f. estimations for the 10-mode-and-50-dimension synthetic data set. (a) Dimension pair (#3,#41); (b) dimension pair (#11,#12); (c) dimension pair (#14,#39); (d) dimension pair (#16,#44)
Fig.6  Comparison of p.d.f. estimations for the 20-mode-and-50-dimension synthetic data set. (a) Dimension pair (#3,#41); (b) dimension pair (#11,#12); (c) dimension pair (#14,#39); (d) dimension pair (#16,#44)
Probability distribution          Mode R    Dimension D    Sample number N    Data size
10-mode-and-10-dimension p.d.f.   10        10             1,000,000          239 MB
10-mode-and-30-dimension p.d.f.   10        30             1,000,000          716 MB
10-mode-and-50-dimension p.d.f.   10        50             1,000,000          1.2 GB
10-mode-and-70-dimension p.d.f.   10        70             1,000,000          1.7 GB
10-mode-and-90-dimension p.d.f.   10        90             1,000,000          2.1 GB
20-mode-and-10-dimension p.d.f.   20        10             1,000,000          239 MB
20-mode-and-30-dimension p.d.f.   20        30             1,000,000          716 MB
20-mode-and-50-dimension p.d.f.   20        50             1,000,000          1.2 GB
20-mode-and-70-dimension p.d.f.   20        70             1,000,000          1.7 GB
20-mode-and-90-dimension p.d.f.   20        90             1,000,000          2.1 GB
Tab.2  Details of ten multiple-mode-and-multiple-dimension probability distributions
Data set (R, D)     | DEM: size / KL divergence        | PEM: size / KL divergence  | PMEM: size / KL divergence       | Slovin: size / KL divergence | BLB: size / KL divergence
#1 (10, 10)         | 22,721±1,801 / 5.472±0.215       | 664 / 27.191±0.724         | 120,704,033 / 1.382±0.009        | 400 / 33.443±1.048           | 3,982 / 12.391±0.160
#2 (20, 10)         | 17,231±1,609 / 133.085±2.747     | 664 / 193.707±7.939        | 126,073,272 / 117.481±0.558      | 400 / 210.179±7.095          | 3,982 / 152.980±4.777
#3 (10, 30)         | 23,309±5,806 / 390.160±6.047     | 664 / 461.030±31.830       | 88,161,518 / 387.137±2.550       | 400 / 482.872±32.974         | 3,982 / 408.454±7.421
#4 (20, 30)         | 18,222±2,799 / 1,371.175±11.426  | 664 / 1,518.077±73.759     | 146,430,907 / 1,332.028±5.745    | 400 / 1,570.067±110.786      | 3,982 / 1,418.196±37.204
#5 (10, 50)         | 19,497±1,358 / 1,692.886±17.667  | 664 / 1,766.343±83.468     | 58,328,135 / 1,698.777±5.474     | 400 / 1,832.928±112.976      | 3,982 / 1,713.058±49.177
#6 (20, 50)         | 21,644±6,451 / 2,002.653±18.110  | 664 / 2,325.073±105.687    | 115,185,014 / 1,954.991±8.062    | 400 / 2,311.615±164.011      | 3,982 / 2,074.893±45.890
#7 (10, 70)         | 18,577±2,909 / 3,567.995±36.217  | 664 / 3,659.661±130.785    | 73,566,581 / 3,563.855±11.920    | 400 / 3,714.130±113.911      | 3,982 / 3,593.872±27.881
#8 (20, 70)         | 17,473±2,912 / 14,160.358±42.196 | 664 / 14,727.513±168.544   | 110,208,079 / 14,070.002±14.483  | 400 / 14,649.220±175.349     | 3,982 / 14,241.059±59.866
#9 (10, 90)         | 18,143±4,488 / 4,896.886±29.501  | 664 / 5,104.065±120.971    | 69,407,840 / 4,883.667±7.847     | 400 / 5,215.207±83.576       | 3,982 / 4,913.157±53.618
#10 (20, 90)        | 18,776±2,827 / 12,648.245±42.346 | 664 / 13,025.018±166.676   | 123,067,620 / 12,603.501±16.031  | 400 / 13,102.256±172.041     | 3,982 / 12,697.893±108.720
Tab.3  Comparison of KL divergence for different sample size determination methods
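The Tab.3 scores can be read as follows: each method proposes a sample size, a KDE is built on a sample of that size, and its quality is measured by the KL divergence from the known synthetic density. Below is a minimal Monte Carlo sketch of such a score, assuming the true density of the synthetic mixtures in Tab.2 is available in closed form; the paper's estimator may instead follow the k-nearest-neighbor approach of refs 23 and 24, and the function name is hypothetical.

```python
# Hedged sketch: Monte Carlo estimate of D_KL(p || p_hat), where p is the known
# synthetic density and p_hat is a Gaussian KDE fitted on the evaluated sample.
import numpy as np
from scipy.stats import gaussian_kde

def kl_score(sample, true_logpdf, draw_from_p, n_mc=10_000, seed=0):
    """D_KL(p || p_hat) ~= mean over x ~ p of [log p(x) - log p_hat(x)]."""
    rng = np.random.default_rng(seed)
    p_hat = gaussian_kde(sample.T)          # density estimate from the sample
    x = draw_from_p(n_mc, rng)              # points drawn from the true density
    return float(np.mean(true_logpdf(x) - np.log(p_hat(x.T) + 1e-300)))
```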
Fig.7  Relationship between the sample size and KL divergence. (a) 10-mode-and-50-dimension synthetic data set; (b) 20-mode-and-10-dimension synthetic data set; (c) 20-mode-and-30-dimension synthetic data set; (d) 20-mode-and-50-dimension synthetic data set; (e) 20-mode-and-70-dimension synthetic data set; (f) 20-mode-and-90-dimension synthetic data set
Fig.8  Sample size (P=79,606) determination on the HIGGS data set. (a) Convergence of EDF on attribute #7; (b) convergence of EDF on attribute #11
Fig.9  Sample size (P=99,348) determination on the HEPMASS data set. (a) Convergence of EDF on attribute #1; (b) convergence of EDF on attribute #7
Sample size    HIGGS Attrib #7    HIGGS Attrib #11    HEPMASS Attrib #1    HEPMASS Attrib #7
P              0.57200            0.57350             0.71586              0.62498
P+10,000       0.57207            0.57365             0.71592              0.62513
Tab.4  Average probabilities of EDFs with different sample sizes
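Tab.4 amounts to an EDF-stability test: if the practical sample size P is sufficient, the empirical distribution function (EDF) of an attribute barely moves when 10,000 more records are added. The sketch below interprets "average probability" as the EDF averaged over a fixed evaluation grid, which is an assumption; the paper's exact summary statistic may differ.

```python
import numpy as np

def mean_edf(values, grid):
    """Average of the empirical distribution function of `values`,
    evaluated at the points in `grid`."""
    sorted_vals = np.sort(values)
    edf = np.searchsorted(sorted_vals, grid, side="right") / len(values)
    return float(edf.mean())

# Usage: compare mean_edf(attr[:P], grid) against mean_edf(attr[:P + 10_000], grid);
# differences on the order of 1e-4, as in Tab.4, indicate the EDF has converged.
```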
1 M Sookhak, F R Yu, A Y Zomaya. Auditing big data storage in cloud computing using divide and conquer tables. IEEE Transactions on Parallel and Distributed Systems, 2018, 29(5): 999–1012
2 S Y Zhao, R X Li, W L Tian, W J Xiao, X H Dong, D J Liao, S U Khan, K Q Li. Divide-and-conquer approach for solving singular value decomposition based on MapReduce. Concurrency and Computation: Practice and Experience, 2016, 28(2): 331–350
3 M R Ghazi, D Gangodkar. Hadoop, MapReduce and HDFS: a developers perspective. Procedia Computer Science, 2015, 48: 45–50
4 Neha M P, Narendra M P, Hasan M I, Parth D S, Mayur M P. Improving HDFS write performance using efficient replica placement. In: Proceedings of the 5th International Conference-Confluence the Next Generation Information Technology Summit. 2014, 36−39
5 S Salloum, J Z Huang, Y L He. Random sample partition: a distributed data model for big data analysis. IEEE Transactions on Industrial Informatics, 2019, 15(11): 5846–5854
6 C H Wei, S Salloum, T Z Emara, X L Zhang, J Z Huang, Y L He. A two-stage data processing algorithm to generate random sample partitions for big data analysis. In: Proceedings of the 11th International Conference on Cloud Computing. 2018, 347−364
7 T Yamane. Statistics: An Introductory Analysis. 2nd ed. New York: Harper and Row, 1967
8 W G Cochran. Sampling Techniques. New York: John Wiley & Sons, 2007
9 M F Smith. Sampling considerations in evaluating cooperative extension programs. Gainesville: Florida Cooperative Extension Service, Institute of Food and Agricultural Sciences, University of Florida, 1983
10 M Naaman. On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality. Statistics & Probability Letters, 2021, 173: 109088
11 A Kleiner, A Talwalkar, P Sarkar, M I Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2014, 76(4): 795–816
12 D N Reshef, Y A Reshef, H K Finucane, S R Grossman, G McVean, P J Turnbaugh, E S Lander, M Mitzenmacher, P C Sabeti. Detecting novel associations in large data sets. Science, 2011, 334(6062): 1518–1524
13 S Sengupta, S Volgushev, X F Shao. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 2016, 111(515): 1222–1232
14 R H Browne. On the use of a pilot sample for sample size determination. Statistics in Medicine, 1995, 14(17): 1933–1940
15 R V Lenth. Some practical guidelines for effective sample size determination. The American Statistician, 2001, 55(3): 187–193
16 W M A W Ahmad, W A A W M Amin, N A Aleng, N Mohamed. Some practical guidelines for effective sample-size determination in observational studies. Aceh International Journal of Science and Technology, 2012, 1(2): 51–53
17 E Burmeister, L M Aitken. Sample size: how many is enough? Australian Critical Care, 2012, 25(4): 271–274
18 S Okada, M Ohzeki, S Taguchi. Efficient partition of integer optimization problems with one-hot encoding. Scientific Reports, 2019, 9(1): 13036
19 Y L He, X Ye, D F Huang, P Fournier-Viger, J Z Huang. A hybrid method to measure distribution consistency of mixed-attribute datasets. IEEE Transactions on Artificial Intelligence, 2023, 4(1): 182–196
20 E Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 1962, 33(3): 1065–1076
21 J Jiang, Y L He, D X Dai, J Z Huang. A new kernel density estimator based on the minimum entropy of data set. Information Sciences, 2019, 491: 223–231
22 M C Jones, J S Marron, S J Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 1996, 91(433): 401–407
23 F Perez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In: Proceedings of 2008 IEEE International Symposium on Information Theory. 2008, 1666−1670
24 F Perez-Cruz. Estimation of information theoretic measures for continuous random variables. In: Proceedings of the 21st International Conference on Neural Information Processing Systems. 2008, 1257−1264
25 Y Y Yan, D Z Cheng, J E Feng, H T Li, J M Yue. Survey on applications of algebraic state space theory of logical systems to finite state machines. Science China Information Sciences, 2023, 66(1): 111201