Frontiers of Computer Science

Front. Comput. Sci.    2025, Vol. 19 Issue (1) : 191105    https://doi.org/10.1007/s11704-023-2772-y
Architecture
ICCG: low-cost and efficient consistency with adaptive synchronization for metadata replication
Chenhao ZHANG1,2, Liang WANG1,2, Jing SHANG4, Zhiwen XIAO4, Limin XIAO1,2, Meng HAN1,2, Bing WEI3, Runnan SHEN1,2, Jinquan WANG1,2
1. State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
3. School of Cyberspace Security, Hainan University, Haikou 570228, China
4. China Mobile Information Technology Center, Beijing 100033, China
Abstract

The rapid growth in the storage scale of wide-area distributed file systems (DFS) calls for fast and scalable metadata management. Metadata replication is a widely used technique for improving the performance and scalability of metadata management. Because file systems must preserve POSIX semantics, many existing metadata management techniques adopt costly designs to maintain metadata consistency, incurring unacceptable performance overhead. We propose a new metadata consistency maintenance method (ICCG), which comprises incremental consistency guaranteed directory tree synchronization (ICGDT) and causal consistency guaranteed replica index synchronization (CCGRI), to ensure system performance without sacrificing metadata consistency. ICGDT uses a flexible consistency scheme, driven by the access states of files and directories maintained in a conflict state tree, to provide incremental consistency for metadata, satisfying both consistency and performance requirements. CCGRI ensures low-latency, consistent access to data by establishing causal consistency for replica indexes through multi-version extent trees and logical time. Experimental results demonstrate the effectiveness of our methods. Compared with the strong consistency policies widely used in modern DFSes, our methods significantly improve system performance. For example, for file creation, ICCG improves the performance of directory tree operations by at least 36.4 times.
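To make the adaptive idea concrete, the following minimal Go sketch shows how an I/O proxy might route a directory tree update to different consistency channels according to a node's conflict state. The state names, types, and channel choices are our illustrative assumptions, not the paper's actual implementation.

```go
package main

import "fmt"

// ConflictState is the per-node state that ICGDT tracks in its
// conflict state tree; this three-level split is our assumption
// for illustration, not the paper's exact state set.
type ConflictState int

const (
	NoConflict    ConflictState = iota // no concurrent access observed
	ReadShared                         // concurrent readers only
	WriteConflict                      // concurrent writers detected
)

// propagate chooses a synchronization channel in the spirit of the
// incremental consistency model: conflict-free updates go through a
// cheap asynchronous channel, contended ones through a strongly
// consistent (e.g., Raft-replicated) channel.
func propagate(state ConflictState, op string) {
	switch state {
	case WriteConflict:
		fmt.Println("strong channel (consensus):", op)
	case ReadShared:
		fmt.Println("causal channel:", op)
	default:
		fmt.Println("async channel (eventual):", op)
	}
}

func main() {
	propagate(NoConflict, "mkdir /a")
	propagate(WriteConflict, "create /a/f.txt")
}
```

The appeal of such an incremental scheme is that uncontended operations, which dominate most workloads, avoid the wide-area round trips of consensus entirely.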

Keywords: metadata management; metadata replication; consistency; directory tree; replica index
Corresponding Author(s): Liang WANG, Limin XIAO
Just Accepted Date: 08 December 2023   Issue Date: 03 April 2024
 Cite this article:   
Chenhao ZHANG, Liang WANG, Jing SHANG, et al. ICCG: low-cost and efficient consistency with adaptive synchronization for metadata replication[J]. Front. Comput. Sci., 2025, 19(1): 191105.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2772-y
https://academic.hep.com.cn/fcs/EN/Y2025/V19/I1/191105
Fig.1  Decentralized directory tree replication architecture
Fig.2  Overview of Raft-based consensus protocol architectures. R denotes a Raft group, and Group denotes a consensus group. (a) Wide-area consensus group example; (b) local-area consensus group example
Parameter: Description
T: The access state tree
P: The global path of a node in a directory tree
N: The state set of a node in a directory tree
S: The replication state of a node in a directory tree
η: The left ID of the directory tree synchronization message source
I: The replication ID set
i: The replication ID
s, σ: The access state of a node in a directory tree, represented by color
t, τ: The most recent access timestamp
Tab.1  Parameters of the access state tree
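As an illustration of Tab.1, a node of the access state tree might be represented as in the following minimal Go sketch. Only the symbols come from the table; the field types, the color palette, and the tree layout are our assumptions.

```go
package main

import "time"

// Color encodes the access state s of a node; Tab.1 says this state
// is represented by color, but the concrete palette here is our
// assumption for illustration.
type Color int

const (
	White Color = iota // idle, no recent access
	Gray               // synchronization in progress
	Red                // conflicting concurrent access
)

// StateNode is one node of the access state tree T from Tab.1. The
// field names follow the table's symbols; the types are assumptions.
type StateNode struct {
	P        string                // global path of the node in the directory tree
	S        int                   // replication state of the node
	I        map[int]bool          // replication ID set (each i is one ID)
	Eta      int                   // left ID of the sync message source (η)
	State    Color                 // access state s, color-coded
	LastSeen time.Time             // most recent access timestamp t
	Children map[string]*StateNode // subtree, keyed by path component
}

func main() {
	root := &StateNode{P: "/", State: White, Children: map[string]*StateNode{}}
	root.LastSeen = time.Now()
}
```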
Fig.3  Access state tree
Fig.4  Overview of the incremental consistency synchronization method. According to the node's conflict state, the I/O proxy chooses the appropriate consistency channel in the incremental consistency model to propagate the synchronization message for a directory tree request
Fig.5  An example of a conflict state. A directory tree operation from the client assigns different conflict states to the nodes, and each node selects the corresponding synchronization protocol accordingly
Parameter: Description
TL: The logical timestamp of an I/O request
V: The version vector of the logical timestamp, a mapping from left IDs to their I/O request sequence numbers
C: The left ID group, including all left IDs of the replica file in the replica space
Q: The I/O request sequence number of the edge I/O agent in the left
μ: The machine time obtained from the operating system interface
Tab.2  Logical timestamp symbols
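A minimal Go sketch of how the logical timestamps of Tab.2 could be compared to establish causal order between I/O requests; the types and tie-breaking details are our assumptions, not the paper's exact protocol.

```go
package main

import "fmt"

// TL models the logical timestamp of an I/O request from Tab.2: a
// version vector V mapping left IDs to I/O request sequence numbers
// Q, plus the machine time μ, which could break ties between
// concurrent requests.
type TL struct {
	V  map[int]uint64 // version vector: left ID -> sequence number Q
	Mu int64          // machine time μ (e.g., Unix nanoseconds)
}

// happensBefore reports whether a causally precedes b: every entry of
// a.V is <= the matching entry of b.V, and at least one is strictly
// smaller.
func happensBefore(a, b TL) bool {
	strict := false
	for id, qa := range a.V {
		qb := b.V[id]
		if qa > qb {
			return false
		}
		if qa < qb {
			strict = true
		}
	}
	// b may also have seen sources that a has never heard from.
	for id, qb := range b.V {
		if _, ok := a.V[id]; !ok && qb > 0 {
			strict = true
		}
	}
	return strict
}

func main() {
	a := TL{V: map[int]uint64{1: 3, 2: 5}, Mu: 100}
	b := TL{V: map[int]uint64{1: 4, 2: 5}, Mu: 120}
	fmt.Println(happensBefore(a, b)) // true: a causally precedes b
}
```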
Fig.6  Machine time deviation estimation
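Fig.6 concerns estimating the deviation between machine clocks. A common way to do this, which we assume resembles the figure's approach but cannot confirm from the caption alone, is the NTP-style offset computation from a single request/response exchange:

```go
package main

import "fmt"

// estimateOffset computes an NTP-style clock offset between a local
// and a remote machine from one request/response round trip; t0/t3
// are local send/receive times, t1/t2 are remote receive/send times.
func estimateOffset(t0, t1, t2, t3 int64) int64 {
	// ((t1 - t0) + (t2 - t3)) / 2 cancels the network delay,
	// assuming it is symmetric in both directions.
	return ((t1 - t0) + (t2 - t3)) / 2
}

func main() {
	// The remote clock runs ~6 units ahead of the local clock here.
	fmt.Println(estimateOffset(100, 107, 109, 104)) // prints 6
}
```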
Fig.7  New index extent insert operation with version information
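The following Go sketch illustrates the idea behind Fig.7: inserting a new index extent with version information while keeping older versions, so a reader can resolve the mapping that is causally visible to it. The data layout and method names are our assumptions, and a sorted slice stands in for the real tree.

```go
package main

import (
	"fmt"
	"sort"
)

// Extent is one replica index entry carrying version information.
type Extent struct {
	Offset, Length uint64
	Version        uint64 // logical timestamp of the write that created it
}

// MVExtentTree keeps every version of overlapping extents instead of
// overwriting them in place.
type MVExtentTree struct {
	extents []Extent // kept sorted by Offset
}

// Insert adds a new versioned extent without destroying old versions.
func (t *MVExtentTree) Insert(e Extent) {
	i := sort.Search(len(t.extents), func(i int) bool {
		return t.extents[i].Offset >= e.Offset
	})
	t.extents = append(t.extents, Extent{})
	copy(t.extents[i+1:], t.extents[i:])
	t.extents[i] = e
}

// Lookup returns the newest extent covering off whose version is not
// newer than maxVer, i.e., the causally visible mapping.
func (t *MVExtentTree) Lookup(off, maxVer uint64) (Extent, bool) {
	var best Extent
	found := false
	for _, e := range t.extents {
		if e.Offset <= off && off < e.Offset+e.Length &&
			e.Version <= maxVer && (!found || e.Version > best.Version) {
			best, found = e, true
		}
	}
	return best, found
}

func main() {
	var t MVExtentTree
	t.Insert(Extent{Offset: 0, Length: 4096, Version: 1})
	t.Insert(Extent{Offset: 0, Length: 4096, Version: 3})
	e, _ := t.Lookup(100, 2)
	fmt.Println(e.Version) // 1: version 3 is not yet visible at maxVer 2
}
```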
Fig.8  Replica index synchronization architecture with causal consistency guarantees
Configuration item: Value (Description)
Operating system: Ubuntu 20.04 (better support for docker-tc and mount-point passthrough)
CPU: Intel Xeon Platinum 8269CY (2 cores)
Memory: 8 GB (memory capacity)
Disk: 100 GB (Alibaba Cloud ordinary cloud disk)
Kernel version: 5.4.0 (better support for the FUSE 3 kernel module)
Up/downlink bandwidth: 100 Mbps (a transmission speed of about 12.5 MB/s)
GVDS version: 0.4.3 (ICGDT and CCGRI are included in this release)
MPI version: OpenMPI 4.0.3 (used to run the test programs)
Tab.3  Experimental resource allocation and software environment
Data center (IP address): Average latency
Beijing (39.107.109.1): 3.53 ms
Qingdao (47.104.52.12): 19.2 ms
Hangzhou (120.27.218.166): 34.1 ms
Beijing to Hangzhou: 36.3 ms
Tab.4  Average latency between nodes
Fig.9  The directory operation performance with a single process. (a) Directory tree operation latency of the ICGDT, Raft-log, and Local Center methods with a single process; (b) directory tree operation QPS of the ICGDT, Raft-log, and Local Center methods with a single process
Fig.10  The directory operation performance with multiple processes. (a) QPS comparison of the file creation operation with multiple processes; (b) the relationship between the number of conflicts and the total task execution time on the WAN
Fig.11  Replica number scalability: (a) file create operation; (b) directory stat operation
Fig.12  Client number scalability: (a) file create operation; (b) directory stat operation
Fig.13  The data write performance with an I/O queue depth of 1 and the page cache turned off. (a) The relationship between data write bandwidth and data block size; (b) the relationship between data write throughput and data block size; (c) the relationship between data write latency and data block size
Fig.14  The data write performance with an I/O queue depth of 16 and the page cache turned on. (a) The relationship between data write bandwidth and data block size; (b) the relationship between data write throughput and data block size; (c) the relationship between data write latency and data block size
Fig.15  The data read performance. (a) The relationship between data read bandwidth and data block size; (b) the relationship between data read throughput and data block size; (c) the relationship between data read latency and data block size