Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2022, Vol. 16 Issue (5) : 165612    https://doi.org/10.1007/s11704-020-0210-y
RESEARCH ARTICLE
Efficient and stable quorum-based log replication and replay for modern cluster-databases
Donghui WANG, Peng CAI(), Weining QIAN, Aoying ZHOU
School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
Abstract

The modern in-memory database (IMDB) can support highly concurrent on-line transaction processing (OLTP) workloads and generates massive transactional logs per second. Quorum-based replication protocols such as Paxos and Raft are widely used in distributed databases to offer higher availability and fault tolerance. However, replicating an IMDB is non-trivial because the high transaction rate brings new challenges. First, the leader node in quorum replication should adapt to varying transaction arrival rates and to the processing capability of follower nodes. Second, followers must replay logs under high concurrency to catch up with the leader's state and reduce the visibility gap. Third, modern databases are often built on a cluster of commodity machines connected by low-end networks, in which network anomalies are common. In such cases, performance degrades significantly because a follower node falls into a long exception-handling process (e.g., fetching lost logs from the leader). To this end, we build QuorumX, an efficient and stable quorum-based replication framework for IMDBs under heavy OLTP workloads. QuorumX combines critical-path-based batching and pipeline batching to provide an adaptive log-propagation scheme that delivers stable, high performance across various settings. Further, we propose a safe and coordination-free log-replay scheme to minimize the visibility gap between the leader and follower IMDBs. We also carefully design the follower-side process to alleviate the influence of an unreliable network on replication performance. Our evaluation with YCSB, TPC-C, and a realistic micro-benchmark demonstrates that QuorumX achieves performance close to asynchronous primary-backup replication and always provides a stable service with data consistency and a low visibility gap.
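The quorum commit rule described above can be sketched as follows. This is an illustrative model of majority-based commit in general, not QuorumX's actual code; the class and method names are ours.

```python
# Hypothetical sketch: a leader considers a log entry committed once a
# majority quorum of replicas (counting its own local flush) has acked it.
from dataclasses import dataclass, field

@dataclass
class QuorumLeader:
    n_replicas: int                           # total replicas, leader included
    acks: dict = field(default_factory=dict)  # log index -> set of replica ids
    commit_index: int = -1                    # highest contiguously committed index

    @property
    def majority(self) -> int:
        return self.n_replicas // 2 + 1

    def on_ack(self, index: int, replica_id: int) -> None:
        self.acks.setdefault(index, set()).add(replica_id)
        # Advance commit_index only over contiguous indexes that have a quorum.
        while len(self.acks.get(self.commit_index + 1, ())) >= self.majority:
            self.commit_index += 1

leader = QuorumLeader(n_replicas=3)
leader.on_ack(0, 0)          # leader's own local flush
leader.on_ack(0, 1)          # one follower acks -> quorum of 2 reached
print(leader.commit_index)   # -> 0
```

With 3-way replication an entry is committed after a single follower ack, which is why quorum replication can approach the latency of asynchronous primary-backup replication while remaining fault-tolerant.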

Keywords: log replication; log replay; consensus protocol; high performance; high availability; quorum; unreliable network; packet loss
Corresponding Author(s): Peng CAI   
Just Accepted Date: 16 March 2021   Issue Date: 07 January 2022
 Cite this article:   
Donghui WANG, Peng CAI, Weining QIAN, et al. Efficient and stable quorum-based log replication and replay for modern cluster-databases[J]. Front. Comput. Sci., 2022, 16(5): 165612.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-020-0210-y
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I5/165612
Fig.1  Overall architecture of log replication and replaying in an IMDB cluster. ○ represents the steps of log replication in the leader; ● indicates the process of log replication in the follower; a third marker denotes the flow of log replaying
Batching scheme Parameter-free Workload-adaptive Follower-friendly
JPaxos × ×
N Santos [14] × ×
P Romano [19] × ×
AB [20] × ×
TAB [20] × ×
QuorumX
Tab.1  Features of different batching algorithms
Fig.2  Critical-path-based batching (CB)
Fig.3  Pipeline-based batching (PB). (a) S2 is the most time-consuming part, under low-latency network; (b) S1 is the most time-consuming part, under high-latency network
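The intuition behind Fig.3 can be captured with a simple queueing-style model (our own illustration, not a formula from the paper): a pipelined replication path runs at the rate of its slowest stage, so the useful batch size depends on whether the network stage (S1) or the processing stage (S2) dominates.

```python
# Illustrative model (assumed, not from the paper): each pipeline stage has a
# fixed per-batch cost plus a per-entry cost; with pipelining, throughput for
# batch size b is bounded by the slowest stage: b / max_i(fixed_i + per_i * b).

def pipeline_throughput(batch_size, stages):
    """stages: list of (fixed_cost, per_entry_cost) tuples, in seconds."""
    slowest = max(fixed + per * batch_size for fixed, per in stages)
    return batch_size / slowest  # entries per second

# Stage S1: network send (fixed cost ~ RTT), stage S2: follower-side work.
low_latency  = [(0.0001, 1e-6), (0.0005, 5e-6)]  # S2 dominates, as in Fig.3(a)
high_latency = [(0.0050, 1e-6), (0.0005, 5e-6)]  # S1 dominates, as in Fig.3(b)

# Under high latency, larger batches amortize the bottleneck's fixed cost.
print(pipeline_throughput(1000, high_latency) > pipeline_throughput(10, high_latency))
```

This is why a single fixed batch size cannot be optimal across network settings, motivating the adaptive scheme that combines CB and PB.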
Fig.4  
Fig.5  An example illustrating a fault caused by CLR
Fig.6  The model of asynchronous loading mechanism
Fig.7  An example of 3-way replication’s logs in Raft and Multi-Paxos. (a) Raft replication; (b) Multi-Paxos replication
Fig.8  An example of 5-way replication illustrating the problem caused by flushing logs out of order
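The hazard in Fig.8 is that a follower which persists log entries out of order must not acknowledge an entry before every earlier entry is also durable, or the leader could count a quorum over replicas that would lose data on a crash. A minimal sketch of the safe rule (our own names, not the paper's code):

```python
# Hedged sketch: a follower tracks out-of-order flushes but only acknowledges
# the longest contiguous flushed prefix of the log, so a quorum counted by
# the leader always covers durably stored prefixes.

class Follower:
    def __init__(self):
        self.flushed = set()     # indexes persisted to disk, possibly out of order
        self.durable_index = -1  # highest contiguously flushed index (safe to ack)

    def on_flush(self, index: int) -> int:
        self.flushed.add(index)
        while self.durable_index + 1 in self.flushed:
            self.durable_index += 1
        return self.durable_index  # the highest index this follower may ack

f = Follower()
f.on_flush(0)
f.on_flush(2)                # out-of-order flush: entry 1 still pending
print(f.durable_index)       # -> 0, entry 2 cannot be acked yet
f.on_flush(1)
print(f.durable_index)       # -> 2
```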
Fig.9  Throughput of YCSB
Fig.10  Latency of YCSB
Fig.11  Throughput of TPC-C
Fig.12  Latency of TPC-C
Fig.13  Impact of read/write ratios
Fig.14  Unavailable time
Fig.15  Performance of different batch size
Fig.16  Performance of different tuning algorithms
Fig.17  Performance over various buffer sizes
Fig.18  VGap changing under the write-heavy Micro-Benchmark
Fig.19  The maximum log entries difference between two followers
Fig.20  Rate of log commits in the follower
Fig.21  Impact of network delay
Fig.22  Impact of packet loss
Fig.23  Throughput over the number of replicas. (a) Write-only YCSB; (b) Read/write YCSB
1 T D Chandra, R Griesemer, J Redstone. Paxos made live: an engineering perspective. In: Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing. 2007, 398–407
2 D Ongaro, J Ousterhout. In search of an understandable consensus algorithm. In: Proceedings of 2014 USENIX Annual Technical Conference. 2014, 305–319
3 R van Renesse, D Altinbuken. Paxos made moderately complex. ACM Computing Surveys, 2015, 47(3): 42
4 J Rao, E J Shekita, S Tata. Using paxos to build a scalable, consistent, and highly available datastore. Proceedings of the VLDB Endowment, 2011, 4(4): 243–254
5 J Zheng, Q Lin, J Xu, C Wei, C Zeng, P Yang, Y Zhang. PaxosStore: high-availability storage made practical in WeChat. Proceedings of the VLDB Endowment, 2017, 10(12): 1730–1741
6 T Zhu, Z Zhao, F Li, W Qian, A Zhou, D Xie, R Stutsman, H Li, H Hu. Solar: towards a shared-everything database on distributed log-structured storage. In: Proceedings of 2018 USENIX Annual Technical Conference. 2018, 795–807
7 S Gilbert, N Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 2002, 33(2): 51–59
8 Y Breitbart, H Garcia-Molina, A Silberschatz. Overview of multidatabase transaction management. The VLDB Journal, 1992, 1(2): 181–239
9 K Daudjee, K Salem. Lazy database replication with ordering guarantees. In: Proceedings of the 20th International Conference on Data Engineering. 2004, 424–435
10 S Elnikety, F Pedone, W Zwaenepoel. Database replication using generalized snapshot isolation. In: Proceedings of the 24th IEEE Symposium on Reliable Distributed Systems. 2005, 73–84
11 J C Corbett, J Dean, M Epstein, A Fikes, C Frost. Spanner: Google's globally-distributed database. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation. 2012, 251–264
12 G DeCandia, D Hastorun, M Jampani, G Kakulapati, A Lakshman, A Pilchin, S Sivasubramanian, P Vosshall, W Vogels. Dynamo: Amazon's highly available key-value store. In: Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles. 2007, 205–220
13 A Lakshman, P Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 2010, 44(2): 35–40
14 N Santos, A Schiper. Tuning paxos for high-throughput with batching and pipelining. In: Proceedings of the 13th International Conference on Distributed Computing and Networking. 2012, 153–167
15 F Özcan, Y Tian, P Tözün. Hybrid transactional/analytical processing: a survey. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 1771–1775
16 J Lee, S Moon, K Kim, D H Kim, S K Cha, W S Han. Parallel replication across formats in SAP HANA for scaling out mixed OLTP/OLAP workloads. Proceedings of the VLDB Endowment, 2017, 10(12): 1598–1609
17 D Qin, A D Brown, A Goel. Scalable replay-based replication for fast databases. Proceedings of the VLDB Endowment, 2017, 10(13): 2025–2036
18 W Zheng, S Tu, E Kohler, B Liskov. Fast databases with fast durability and recovery through multicore parallelism. In: Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. 2014, 465–477
19 P Romano, M Leonetti. Self-tuning batching in total order broadcast protocols via analytical modelling and reinforcement learning. In: Proceedings of 2012 International Conference on Computing, Networking and Communications. 2012, 786–792
20 R Friedman, E Hadad. Adaptive batching for replicated servers. In: Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems. 2006, 311–320
21 X Yu, G Bezerra, A Pavlo, S Devadas, M Stonebraker. Staring into the abyss: an evaluation of concurrency control with one thousand cores. Proceedings of the VLDB Endowment, 2014, 8(3): 209–220
22 T Wang, H Kimura. Mostly-optimistic concurrency control for highly contended dynamic workloads on a thousand cores. Proceedings of the VLDB Endowment, 2016, 10(2): 49–60
23 K Ren, A Thomson, D J Abadi. Lightweight locking for main memory database systems. Proceedings of the VLDB Endowment, 2012, 6(2): 145–156
24 B Kemme, G Alonso. Don't be lazy, be consistent: Postgres-R, a new way to implement database replication. In: Proceedings of the 26th International Conference on Very Large Data Bases. 2000, 134–143
25 M Wiesmann, F Pedone, A Schiper, B Kemme, G Alonso. Database replication techniques: a three parameter classification. In: Proceedings of the 19th IEEE Symposium on Reliable Distributed Systems. 2000, 206–215
26 M Stonebraker. Concurrency control and consistency of multiple copies of data in distributed INGRES. IEEE Transactions on Software Engineering, 1979, SE-5(3): 188–194
27 C Hong, D Zhou, M Yang, C Kuo, L Zhang, L Zhou. KuaFu: closing the parallelism gap in database replication. In: Proceedings of the 2013 IEEE 29th International Conference on Data Engineering. 2013, 1186–1195
28 P Hunt, M Konar, F P Junqueira, B Reed. ZooKeeper: wait-free coordination for internet-scale systems. In: Proceedings of 2010 USENIX Annual Technical Conference. 2010
29 D Wang, P Cai, W Qian, A Zhou. Fast quorum-based log replication and replay for fast databases. In: Proceedings of the 24th International Conference on Database Systems for Advanced Applications. 2019, 209–226