Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2019, Vol. 13 Issue (3) : 500-515    https://doi.org/10.1007/s11704-017-7119-0
RESEARCH ARTICLE
Prefetch-aware fingerprint cache management for data deduplication systems
Mei LI(), Hongjun ZHANG, Yanjun WU, Chen ZHAO
1. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
 Download: PDF(853 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

Data deduplication has been widely utilized in large-scale storage systems, particularly backup systems. Data deduplication systems typically divide data streams into chunks and identify redundant chunks by comparing chunk fingerprints. Maintaining all fingerprints in memory is not cost-effective because fingerprint indexes are typically very large. Many data deduplication systems maintain a fingerprint cache in memory and exploit fingerprint prefetching to accelerate the deduplication process. Although fingerprint prefetching can improve the performance of data deduplication systems by leveraging the locality of workloads, inaccurately prefetched fingerprints may pollute the cache by evicting useful fingerprints. We observed that most of the prefetched fingerprints in a wide variety of applications are never used or used only once, which severely limits the performance of data deduplication systems. We introduce a prefetch-aware fingerprint cache management scheme for data deduplication systems (PreCache) to alleviate prefetch-related cache pollution. We propose three prefetch-aware fingerprint cache replacement policies (PreCache-UNU, PreCache-UOO, and PreCache-MIX) to handle different types of cache pollution. Additionally, we propose an adaptive policy selector to select suitable policies for prefetch requests. We implement PreCache on two representative data deduplication systems (Block Locality Caching and SiLo) and evaluate its performance utilizing three real-world workloads (Kernel, MacOS, and Homes). The experimental results reveal that PreCache improves deduplication throughput by up to 32.22% based on a reduction of on-disk fingerprint index lookups and improvement of the deduplication ratio by mitigating prefetch-related fingerprint cache pollution.

Keywords data deduplication      fingerprint prefetch      fingerprint cache     
Corresponding Author(s): Mei LI   
Just Accepted Date: 07 July 2017   Online First Date: 13 June 2018    Issue Date: 24 April 2019
 Cite this article:   
Mei LI,Hongjun ZHANG,Yanjun WU, et al. Prefetch-aware fingerprint cache management for data deduplication systems[J]. Front. Comput. Sci., 2019, 13(3): 500-515.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-017-7119-0
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I3/500
1 D TMeyer, W J Bolosky. A study of practical deduplication. ACM Transactions on Storage, 2012, 7(4): 14
https://doi.org/10.1145/2078861.2078864
2 AWildani, E LMiller, ORodeh. Hands: a heuristically arranged nonbackup in-line deduplication system. In: Proceedings of the 29th IEEE International Conference on Data Engineering. 2013, 446–457
https://doi.org/10.1109/ICDE.2013.6544846
3 GWallace, F Douglis, HQian, PShilane, S Smaldone, MChamness, WHsu. Characteristics of backup workloads in production systems. In: Proceedings of the 10th USENIX Conference on File and Storage Technologies. 2012
4 DMeister, S JKaiser, ABrinkmann, T Cortes, MKuhn, JKunkel. A study on data deduplication in HPC storage systems. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2012, 1–11
https://doi.org/10.1109/SC.2012.14
5 W JBolosky, SCorbin, DGoebel, J R Douceur. Single instance storage in Windows 2000. In: Proceedings of the 4th USENIX Windows Systems Symposium. 2000, 13–24
6 SQuinlan, S Dorward. Venti: a new approach to archival storage. In: Proceedings of USENIX Conference on File and Storage Technologies. 2002, 89–101
7 BZhu, KLi, HPatterson. Avoiding the disk bottleneck in the data domain deduplication file system. In: Proceedings of the 6th USENIX Conference on File and Storage Technologies.2008, 1–14
8 BLin, SLi, XLiao, J Zhang, XLiu. Leach: an automatic learning cache for inline primary deduplication system. Frontiers of Computer Science, 2014, 8(2):175–183
https://doi.org/10.1007/s11704-014-3377-2
9 SMandal, G Kuenning, DOk, VShastry, P Shilane, SZhen, VTarasov, EZadok. Using hints to improve inline block-layer deduplication. In: Proceedings of the 14th USENIX Conference on File and Storage Technologies. 2016, 315–322
10 AMuthitacharoen, B Chen, DMazieres. A low-bandwidth network file system. In: Proceedings of the 8th ACM Symposium on Operating Systems Principles. 2001, 174–187
https://doi.org/10.1145/502034.502052
11 PShilane, MHuang, GWallace, W Hsu. WAN-optimized replication of backup datasets using stream-informed delta compression. ACM Transactions on Storage, 2012, 8(4): 13
https://doi.org/10.1145/2385603.2385606
12 YHua, XLiu. Scheduling heterogeneous flows with delay-aware deduplication for avionics applications. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(9): 1790–1802
https://doi.org/10.1109/TPDS.2012.51
13 JSun, HChen, LHe, HTan. Redundant network traffic elimination with GPU accelerated rabin fingerprinting. IEEE Transactions on Parallel and Distributed Systems, 2016, 27(7): 2130–2142
https://doi.org/10.1109/TPDS.2015.2473166
14 MLillibridge, KEshghi, DBhagwat, V Deolalikar, GTrezis, PCamble. Sparse indexing: large scale, inline deduplication using sampling and locality. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies. 2009, 111–123
15 DBhagwat, KEshghi, D D ELong, M Lillibridge. Extreme binning: scalable, parallel deduplication for chunk-based file backup. In: Proceedings of the 17th Annual Meeting of the IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems. 2009, 1–9
https://doi.org/10.1109/MASCOT.2009.5366623
16 DMeister, JKaiser, ABrinkmann. Block locality caching for data deduplication. In: Proceedings of the 6th International Systems and Storage Conference. 2013
https://doi.org/10.1145/2485732.2485748
17 WXia, HJiang, DFeng, Y Hua. Similarity and locality based indexing for high performance data deduplication. IEEE Transactions on Computers, 2015, 64(4): 1162–1176
https://doi.org/10.1109/TC.2014.2308181
18 JMin, DYoon, YWon. Efficient deduplication techniques for modern backup operation. IEEE Transactions on Computers, 2011, 60(6): 824–840
https://doi.org/10.1109/TC.2010.263
19 BDebnath, S Sengupta, JLi. Chunkstash: speeding up inline storage deduplication using flash memory. In: Proceedings of USENIX Annual Technical Conference. 2010, 215–230
20 FGuo, P Efstathopoulos. Building a high-performance deduplication system. In: Proceedings of USENIX Annual Technical Conference. 2011, 271–284
21 ZSun, G Kuenning, SMandal, PShilane, V Tarasov, NXiao, EZadok. A long-term user-centric analysis of deduplication patterns. In: Proceedings of the 32nd International Conference on Massive Storage Systems and Technology. 2016, 1–7
https://doi.org/10.1109/MSST.2016.7897080
22 WXia, HJiang, DFeng, F Douglis, PShilane, YHua, MFu, YZhang, Y Zhou. A comprehensive study of the past, present, and future of data deduplication. Proceedings of the IEEE, 2016, 104(9):1681–1710
https://doi.org/10.1109/JPROC.2016.2571298
23 MFu, DFeng, YHua, X He, ZChen, WXia, YZhang, YTan. Design tradeoffs for data deduplication performance in backup workloads. In: Proceedings of the 13th USENIX Conference on File and Storage Technologies. 2015, 331–344
24 DMeister, A Brinkmann. Dedupv1: improving deduplication throughput using solid state drives (SSD). In: Proceedings of the 26th Symposium on Massive Storage Systems and Technologies. 2010, 1–6
https://doi.org/10.1109/MSST.2010.5496992
25 GLu, Y JNam, D H CDu. Bloomstore: bloom-filter based memoryefficient key-value store for indexing of data deduplication on flash. In: Proceedings of the 28th Symposium on Mass Storage Systems and Technologies. 2012, 1–11
https://doi.org/10.1109/MSST.2012.6232390
26 ZChen, KShen. Ordermergededup: efficient, failure-consistent deduplication on flash. In: Proceedings of the 14th USENIX Conference on File and Storage Technologies. 2016, 291–299
27 YFu, HJiang, NXiao. A scalable inline cluster deduplication framework for big data protection. In: Proceedings of the ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing. 2012, 354–373
https://doi.org/10.1007/978-3-642-35170-9_18
28 DFrey, A M Kermarrec, KKloudas. Probabilistic deduplication for cluster-based storage systems. In: Proceedings of the 3rd ACM Symposium on Cloud Computing. 2012
https://doi.org/10.1145/2391229.2391246
29 SLuo, GZhang, CWu, SKhan, KLi. Boafft: distributed deduplication for big data storage in the cloud. IEEE Transactions on Cloud Computing, 2015, 61(11): 1–13
https://doi.org/10.1109/TCC.2015.2511752
30 AJaleel, K B Theobald, S CSteely Jr, JEmer. High performance cache replacement using re-reference interval prediction (RRIP). In: Proceedings of the 37th Annual International Symposium on Computer Architecture. 2010, 60–71
https://doi.org/10.1145/1815961.1815971
31 C JWu, AJaleel, WHasenplaugh, MMartonosi, S C Steely Jr, JEmer. Ship: signature-based hit predictor for high performance caching. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. 2011, 430–441
https://doi.org/10.1145/2155620.2155671
32 C JWu, AJaleel, MMartonosi, S C Steely Jr, JEmer. Pacman: prefetch-aware cache management for high performance caching. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. 2011, 442–453
https://doi.org/10.1145/2155620.2155672
33 VSeshadri, SYedkar, HXin, O Mutlu, P BGibbons, M AKozuch, T CMowry. Mitigating prefetcher-caused pollution using informed caching policies for prefetched blocks. ACM Transactions on Architecture and Code Optimization, 2015, 11(4): 51
https://doi.org/10.1145/2677956
34 ACidon, A Eisenman, MAlizadeh, SKatti. Cliffhanger: scaling performance cliffs in web memory caches. In: Proceedings of the 13th USENIX Symposium on Networked Systems Design and Implementation. 2016, 379–392
35 MLi, HZhang, YWu, CZhao. Memsc: a scan-resistant and compact cache replacement framework for memory-based key-value cache systems. Journal of Computer Science and Technology, 2017, 32(1): 55–67
https://doi.org/10.1007/s11390-017-1705-3
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed