Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci. 2019, Vol. 13, Issue 4: 864-878. https://doi.org/10.1007/s11704-018-6308-9
RESEARCH ARTICLE
Fast correlation coefficient estimation algorithm for HBase-based massive time series data
Wen LIU1,2, Tuqian ZHANG2, Yanming SHEN2, Peng WANG3
1. Department of Electrical and Information Engineering, Xinjiang Institute of Engineering, Urumqi 830091, China
2. School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
3. School of Computer Science, Fudan University, Shanghai 201203, China
Abstract

In recent years, the rapid development of the Internet of Things and sensor networks has led to explosive growth in time series data. OpenTSDB and other emerging systems have begun to use Hadoop and HBase to store massive time series data, and how to query and mine time series data on these platforms has become an active research topic. As a typical time series distance measure, the correlation coefficient is widely used in various applications. However, computing the correlation coefficient of long time sequences on HBase in real time requires a large amount of I/O and network transmission, and therefore cannot support interactive queries. To address this problem, we present two methods for estimating the correlation coefficient of two sequences on HBase. We first propose DCE, a fast estimation algorithm for the upper and lower bounds of the correlation coefficient. To further reduce I/O cost, we extend DCE into the ADCE algorithm, which estimates the correlation coefficient quickly in an iterative manner. Experiments show that the proposed algorithms can quickly calculate the correlation coefficient of long time series.

Keywords: time series, HBase, correlation coefficient, fast estimation
Corresponding Author(s): Yanming SHEN   
Just Accepted Date: 04 September 2017   Online First Date: 07 September 2018    Issue Date: 29 May 2019
 Cite this article:   
Wen LIU, Tuqian ZHANG, Yanming SHEN, et al. Fast correlation coefficient estimation algorithm for HBase-based massive time series data[J]. Front. Comput. Sci., 2019, 13(4): 864-878.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-018-6308-9
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I4/864