Frontiers of Optoelectronics

ISSN 2095-2759

ISSN 2095-2767(Online)

CN 10-1029/TN

Postal Subscription Code 80-976

Front. Optoelectron.    2010, Vol. 3 Issue (3) : 308-316    https://doi.org/10.1007/s12200-010-0103-z
Research articles
Optimization for data de-duplication algorithm based on file content
Xuejun NIE, Leihua QIN, Jingli ZHOU, Ke LIU, Jianfeng ZHU, Yu WANG
School of Computer Science and Technology, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
Abstract Content-defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in archival storage systems. Current research on CDC does not consider the distinct content characteristics of different file types: it determines chunk boundaries in a random way and applies a single strategy to all file types. Such a method has been shown unable to achieve optimal performance for compound archival data. We analyze the content characteristics of different file types and propose the candidate anchor histogram (CAH) to capture them. We propose an improved strategy for determining chunk boundaries based on CAH, and we tune some key parameters of CDC based on the data layout of the underlying data de-duplication file system (TriDFS), which can efficiently store variable-sized chunks on fixed-sized physical blocks. These strategies are evaluated with representative archival data, and the results indicate that, on average, they increase the compression ratio by 16.3% and the write throughput by 13.7%, while decreasing the read throughput by only 2.5%.
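For readers unfamiliar with CDC, the following is a minimal sketch of the baseline technique the abstract refers to, not the CAH-based strategy or the TriDFS layout proposed in the article. It assumes a simple polynomial rolling hash in place of Rabin fingerprinting, and the names `chunk_boundaries` and `deduplicate` as well as the window size, mask, and chunk-size limits are illustrative choices, not values taken from the paper.

```python
import hashlib
from typing import Dict, List

# All constants below are assumed values for illustration only.
WINDOW = 48             # bytes covered by the sliding window
MASK = (1 << 11) - 1    # anchor when the low 11 bits of the hash are zero
MIN_CHUNK = 512         # never cut a chunk shorter than this
MAX_CHUNK = 8192        # always cut once a chunk reaches this length
BASE = 257
MOD = (1 << 61) - 1


def chunk_boundaries(data: bytes) -> List[int]:
    """Return the end offset of every chunk found in data."""
    boundaries = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        pos = i - start               # offset inside the current chunk
        if pos >= WINDOW:
            # Drop the oldest byte so the hash covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        length = pos + 1
        at_anchor = length >= MIN_CHUNK and (h & MASK) == 0
        if at_anchor or length >= MAX_CHUNK:
            boundaries.append(i + 1)
            start = i + 1
            h = 0                     # restart the window for the next chunk
    if start < len(data):
        boundaries.append(len(data))  # trailing chunk, possibly short
    return boundaries


def deduplicate(data: bytes, store: Dict[bytes, bytes]) -> List[bytes]:
    """Store each unique chunk once; return the digests that rebuild data."""
    recipe = []
    start = 0
    for end in chunk_boundaries(data):
        chunk = data[start:end]
        digest = hashlib.sha1(chunk).digest()
        store.setdefault(digest, chunk)  # keep only the first copy
        recipe.append(digest)
        start = end
    return recipe
```

Because each cut depends on the content of the last `WINDOW` bytes rather than on a fixed offset, chunks that end before an edit keep their digests, and later boundaries tend to realign with the old ones after the edit. Joining `store[d]` for every digest in the returned recipe reproduces the original data.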
Issue Date: 05 September 2010
 Cite this article:   
Xuejun NIE, Leihua QIN, Jingli ZHOU, et al. Optimization for data de-duplication algorithm based on file content[J]. Front. Optoelectron., 2010, 3(3): 308-316.
 URL:  
https://academic.hep.com.cn/foe/EN/10.1007/s12200-010-0103-z
https://academic.hep.com.cn/foe/EN/Y2010/V3/I3/308