Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2014, Vol. 8 Issue (2) : 175-183    https://doi.org/10.1007/s11704-014-3377-2
RESEARCH ARTICLE
Leach: an automatic learning cache for inline primary deduplication system
Bin LIN(),Shanshan LI,Xiangke LIAO,Jing ZHANG,Xiaodong LIU
School of Computer, National University of Defense Technology, Changsha 410073, China
 Download: PDF(640 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

Deduplication technology has been increasingly used to reduce storage costs. Though it has been successfully applied to backup and archival systems, existing techniques can hardly be deployed in primary storage systems due to the associated latency cost of detecting duplicated data, where every unit has to be checked against a substantially large fingerprint index before it is written. In this paper we introduce Leach, for inline primary storage, a self-learning in-memory fingerprints cache to reduce the writing cost in deduplication system. Leach is motivated by the characteristics of realworld I/O workloads: highly data skew exist in the access patterns of duplicated data. Leach adopts a splay tree to organize the on-disk fingerprint index, automatically learns the access patterns and maintains hot working sets in cachememory, with a goal to service a majority of duplicated data detection. Leveraging the working set property, Leach provides optimization to reduce the cost of splay operations on the fingerprint index and cache updates. In comprehensive experiments on several real-world datasets, Leach outperforms conventional LRU (least recently used) cache policy by reducing the number of cache misses, and significantly improves write performance without great impact to cache hits.

Keywords deduplication      duplicate detection      splay tree      cache     
Corresponding Author(s): Bin LIN   
Issue Date: 24 June 2014
 Cite this article:   
Bin LIN,Shanshan LI,Xiangke LIAO, et al. Leach: an automatic learning cache for inline primary deduplication system[J]. Front. Comput. Sci., 2014, 8(2): 175-183.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-014-3377-2
https://academic.hep.com.cn/fcs/EN/Y2014/V8/I2/175
1 SrinivasanK, BissonT, GoodsonG, VorugantiK. Idedup: latencyaware, inline data deduplication for primary storage. In: Proceedings of the 10th Usenix Conference on File and Storage Technologies. 2012, 24: 1-24: 14
2 GeerD. Reducing the storage burden via data deduplication. Computer, 2008, 41(12): 15-17
doi: 10.1109/MC.2008.538
3 ZhuB, LiK, PattersonH. Avoiding the disk bottleneck in the data domain deduplication file system. In: Proceedings of the 6th Usenix Conference on File and Storage Technologies. 2008, 18:1-18:14
4 RodehO, WildaniA, MillerE L. Hands: A heuristically arranged nonbackup in-line deduplication system. In: Proceedings of the 2013 IEEE International Conference on Data Engineering. 2013, 446-457
5 LillibridgeM, EshghiK, BhagwatD, DeolalikarV, TreziseG, CambleP. Sparse indexing: large scale, inline deduplication using sampling and locality. In: Proccedings of the 7th Conference on File and Storage technologies. 2009, 111-123
6 BhagwatD, EshghiK, LongD D, LillibridgeM. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In: Proceedings of the 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems. 2009, 1-9
7 MeyerD T, BoloskyW J. A study of practical deduplication. ACM Transactions on Storage, 2012, 7(4): 14:1-14:20
8 JinK, MillerE L. The effectiveness of deduplication on virtual machine disk images. In: Proceedings of the 2009 Israeli Experimental Systems Conference. 2009, 7:1-7:12
doi: 10.1145/1534530.1534540
9 LuM, ChamblissD, GliderJ, ConstantinescuC. Insights for data reduction in primary storage: a practical analysis. In: Proceedings of the 5th Annual International Systems and Storage Conference. 2012, 17:1-17:7
doi: 10.1145/2367589.2367606
10 KollerR, RangaswamiR. I/O deduplication: utilizing content similarity to improve I/O performance. ACM Transactions on Storage, 2010, 6(3): 13:1-13:26
11 AkuÿrekS, SalemK. Adaptive block rearrangement. Technical Report, 1993
12 CarsonS D. A system for adaptive disk rearrangement. Software: Practice and Experience, 1990, 20(3): 225-242
doi: 10.1002/spe.4380200302
13 SleatorD D, TarjanR E. Self-adjusting binary search trees. Journal of the ACM, 1985, 32(3): 652-686
doi: 10.1145/3828.3835
14 ZawE P, TheinN L. Improved live VM migration using LRU and Splay tree algorithm. International Journal of Computer Science and Telecommunications, 2012, 3(3): 1-7
[1] Chenchen HUANG, Huiqi HU, Xing WEI, Weining QIAN, Aoying ZHOU. Partition pruning for range query on distributed log-structured merge-tree[J]. Front. Comput. Sci., 2020, 14(3): 143604-.
[2] Xiaodong MENG, Chentao WU, Minyi GUO, Long ZHENG, Jingyu ZHANG. PAM: an efficient power-aware multilevel cache policy to reduce energy consumption of storage systems[J]. Front. Comput. Sci., 2019, 13(4): 850-863.
[3] Mei LI, Hongjun ZHANG, Yanjun WU, Chen ZHAO. Prefetch-aware fingerprint cache management for data deduplication systems[J]. Front. Comput. Sci., 2019, 13(3): 500-515.
[4] Sa WANG, Yiwen SHAO, Yungang BAO. Practices of backuping homomorphically encrypted databases[J]. Front. Comput. Sci., 2019, 13(2): 220-230.
[5] Jingyu ZHANG, Chentao WU, Dingyu YANG, Yuanyi CHEN, Xiaodong MENG, Liting XU, Minyi GUO. HSCS: a hybrid shared cache scheduling scheme for multiprogrammed workloads[J]. Front. Comput. Sci., 2018, 12(6): 1090-1104.
[6] Cheqing JIN, Jie CHEN, Huiping LIU. MapReduce-based entity matching with multiple blocking functions[J]. Front. Comput. Sci., 2017, 11(5): 895-911.
[7] Zixiao JIA,Jiwei HUANG,Chuang LIN. Hierarchical caches in content-centric networks: modeling and analysis[J]. Front. Comput. Sci., 2015, 9(6): 846-859.
[8] Quanqing XU,Rajesh Vellore ARUMUGAM,Khai Leong YONG,Yonggang WEN,Yew-Soon ONG,Weiya XI. Adaptive and scalable load balancing for metadata server cluster in cloud-scale file systems[J]. Front. Comput. Sci., 2015, 9(6): 904-918.
[9] Chenggang Clarence YAN,Hui YU,Weizhi XU,Yingping ZHANG,Bochuan CHEN,Zhu TIAN,Yuxuan WANG,Jian YIN. Memory bandwidth optimization of SpMV on GPGPUs[J]. Front. Comput. Sci., 2015, 9(3): 431-441.
[10] Xiaolin WANG, Xiang WEN, Yechen LI, Zhenlin WANG, Yingwei LUO, Xiaoming LI. Dynamic cache partitioning based on hot page migration[J]. Front Comput Sci, 2012, 6(4): 363-372.
[11] Quanqing XU , Bin CUI , Yafei DAI , Hengtao SHEN , Zaiben CHEN , Xiaofang ZHOU , . Hybrid information retrieval policies based on cooperative cache in mobile P2P networks[J]. Front. Comput. Sci., 2009, 3(3): 381-395.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed