Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236 (Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2023, Vol. 17 Issue (2) : 172205    https://doi.org/10.1007/s11704-022-1357-5
RESEARCH ARTICLE
Scalable and adaptive log manager in distributed systems
Huan ZHOU1, Weining QIAN2, Xuan ZHOU2, Qiwen DONG2, Aoying ZHOU2, Wenrong TAN1
1. The Key Laboratory for Computer Systems of State Ethnic Affairs Commission, Southwest Minzu University, Chengdu 610041, China
2. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
Abstract

On-line transaction processing (OLTP) systems rely on transaction logging and a quorum-based consensus protocol to guarantee durability, high availability, and strong consistency. This makes the log manager a key component of distributed database management systems (DDBMSs). The leader of a DDBMS commonly adopts a centralized logging method to write log entries to a stable storage device and uses a constant log replication strategy to periodically synchronize its state to followers. With the advent of new hardware and the high parallelism of transaction processing, the traditional centralized design of logging limits scalability, and the constant trigger condition of replication cannot always maintain optimal performance under dynamic workloads.
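To make the scalability problem concrete, the sketch below models the centralized design the abstract describes: every worker thread appends to one shared log tail behind one lock, so log sequence numbers stay consistent but all threads serialize on that single lock. This is an illustrative minimal model, not the paper's implementation; the class and method names are hypothetical.

```python
import threading

class CentralizedLog:
    """One shared log tail guarded by one lock. Every worker serializes
    here, which is exactly the contention point a scalable (decentralized)
    logging design tries to remove."""
    def __init__(self):
        self.lock = threading.Lock()
        self.buffer = []      # in-memory log tail: (lsn, entry) pairs
        self.next_lsn = 0     # next log sequence number, in bytes

    def append(self, entry: bytes) -> int:
        with self.lock:       # all worker threads contend on this lock
            lsn = self.next_lsn
            self.next_lsn += len(entry)
            self.buffer.append((lsn, entry))
            return lsn

log = CentralizedLog()
workers = [threading.Thread(target=lambda: [log.append(b"tx") for _ in range(1000)])
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# The lock keeps LSNs disjoint: 4 threads x 1000 entries x 2 bytes each.
assert log.next_lsn == 8000 and len(log.buffer) == 4000
```

The invariant (disjoint, monotonically assigned LSNs) is what centralized logging buys; the cost is that adding worker threads only increases queuing on the single lock.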

In this paper, we propose a new log manager named Salmo, with scalable logging and adaptive replication, for distributed database systems. The scalable logging eliminates centralized contention by utilizing a highly concurrent data structure and speedy log-hole tracking. The kernel of adaptive replication is an adaptive log shipping method, which dynamically adjusts the number of log entries transmitted between the leader and followers based on the real-time workload. We implemented and evaluated Salmo in the open-source transaction processing systems Cedar and DBx1000. Experimental results show that Salmo scales well with the number of worker threads, improves peak throughput by 1.56× and reduces latency by more than 4× over the log replication of Raft, and maintains efficient and stable performance under dynamic workloads.
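The adaptive log shipping idea can be sketched with a simple adaptation rule: size each shipped batch to roughly one network round-trip's worth of newly produced log entries, so light loads ship small batches (low latency) and heavy loads ship large batches (high throughput). This rule is an assumption for illustration only, not Salmo's actual formula.

```python
def adaptive_batch_size(arrival_rate: float, rtt_s: float,
                        min_batch: int = 1, max_batch: int = 1024) -> int:
    """Hypothetical adaptation rule: ship about one round-trip's worth of
    log entries per batch, clamped to [min_batch, max_batch].
    arrival_rate is log entries produced per second; rtt_s is the
    leader-to-follower round-trip time in seconds."""
    target = int(arrival_rate * rtt_s)
    return max(min_batch, min(max_batch, target))

# Light load: ship almost immediately (tiny batches keep latency low).
assert adaptive_batch_size(arrival_rate=100, rtt_s=0.0002) == 1
# Moderate load: batch grows with the workload.
assert adaptive_batch_size(arrival_rate=500_000, rtt_s=0.0002) == 100
# Heavy load: clamp so a single batch never grows without bound.
assert adaptive_batch_size(arrival_rate=50_000_000, rtt_s=0.0002) == 1024
```

A constant-trigger replicator (as in standard Raft log shipping) fixes the batch or timer once; the point of an adaptive rule like this is that the same knob is re-evaluated as the measured arrival rate changes.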

Keywords: distributed database systems; transaction logging; log replication; scalable; adaptive
Corresponding Author(s): Huan ZHOU   
Just Accepted Date: 15 March 2022   Issue Date: 02 August 2022
 Cite this article:   
Huan ZHOU, Weining QIAN, Xuan ZHOU, et al. Scalable and adaptive log manager in distributed systems[J]. Front. Comput. Sci., 2023, 17(2): 172205.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1357-5
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I2/172205
Fig.1  Overview of the classical centralized transaction logging (Aether)
Fig.2  Log replication phase of Raft
Fig.3  Architecture of scalable design for centralized logging
Fig.4  The data structure of Tracking Table and access rules on the data structure. (a) Tracking table (G); (b) state transition order; (c) access rules on tracking table
Fig.5  The architecture of decentralized design for scalable logging Salmo
Fig.6  The effect of constant grouping/shipping time (timer) on throughput and latency. (a) Peak throughput; (b) latency
Fig.7  The producer-consumer model for transaction processing in distributed database systems
Hardware resource | Influence | Maximum
CPU | the maximum rate of generating new log entries in a group | 100%
IOPS | the maximum number of groups written to disk per second | 30K
Disk bandwidth | the maximum amount of log entries written to disk per second | 600 MB/s
Sequential write speed | the average latency of log entries flushed to disk | 165 μs per 4 KB
Network bandwidth | the maximum amount of log entries passed over the network per second | 125 MB/s
Network transmission time | the average latency of log entries in the network | 180 μs
Network packet rate (PPS) | the maximum number of groups transferred over the network per second | 1488K
Tab.1  The impact of hardware resources on log replication
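A back-of-the-envelope reading of Tab.1: for a given group size, whichever hardware resource saturates first caps how many log groups can be replicated per second. The sketch below computes that bound from the table's numbers; it is an illustrative interpretation of the table, not the paper's cost model.

```python
def max_groups_per_sec(group_bytes: int,
                       iops: int = 30_000,            # Tab.1: disk IOPS
                       disk_bw: int = 600 * 10**6,    # 600 MB/s sequential write
                       net_bw: int = 125 * 10**6,     # 125 MB/s network bandwidth
                       pps: int = 1_488_000) -> int:  # network packets per second
    """Groups replicated per second are bounded by the tightest of:
    disk IOPS, disk bandwidth / group size, network bandwidth / group
    size, and the network packet rate."""
    return min(iops, disk_bw // group_bytes, net_bw // group_bytes, pps)

# Small 4 KB groups hit the disk IOPS ceiling first.
assert max_groups_per_sec(4 * 1024) == 30_000
# Large 64 KB groups hit the network bandwidth ceiling first.
assert max_groups_per_sec(64 * 1024) == 125 * 10**6 // (64 * 1024)
```

This is why a single fixed grouping/shipping timer cannot be optimal across workloads: the binding resource changes with the group size the workload produces.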
Fig.8  Evaluation of the scalable logging. Salmo shows the best scalability, highest throughput, and lowest latency. (a) Scalability; (b) throughput; (c) latency; (d) CPU utilization
Fig.9  Evaluation of adaptive group commit in scalable logging of Salmo. (a) Throughput and latency; (b) breakdown of processing time
Fig.10  Throughput of decentralized loggings under the YCSB and TPCC benchmarks. Salmo has the same throughput as SiloR. (a) YCSB; (b) TPCC
Fig.11  Latency of decentralized loggings under the YCSB and TPCC benchmarks. Salmo has the lowest latency. (a) YCSB; (b) TPCC
Fig.12  Evaluation of adaptive group commit in decentralized logging
Fig.13  Evaluation of the adaptive replication. Compared with Raft Rep, Salmo Rep shows higher throughput and lower latency. (a) Throughput; (b) latency
Fig.14  Breakdown of processing time for log replication. Salmo Rep always follows timer = R(|Ḡ|, timer)
Fig.15  Hardware resources utilization (network, disk and CPU) of Raft Rep and Salmo Rep. (a) Network utilization of Raft Rep in followers; (b) disk utilization of Raft Rep in followers; (c) CPU utilization of Raft Rep; (d) network utilization of Salmo Rep in followers; (e) disk utilization of Salmo Rep in followers; (f) CPU utilization of Salmo Rep
Fig.16  Overall performance under dynamic workloads. Salmo exhibits the lowest and most stable latency
1 M Stonebraker. Concurrency control and consistency of multiple copies of data in distributed INGRES. IEEE Transactions on Software Engineering, 1979, SE-5(3): 188–194
2 J C Corbett, J Dean, M Epstein, A Fikes, C Frost, et al. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems, 2013, 31(3): 8
3 L Lamport. The part-time parliament. ACM Transactions on Computer Systems, 1998, 16(2): 133–169
4 D Ongaro, J Ousterhout. In search of an understandable consensus algorithm. In: Proceedings of 2014 USENIX Conference on USENIX Annual Technical Conference. 2014, 305–320
5 J Gray, P McJones, M Blasgen, B Lindsay, R Lorie, T Price, F Putzolu, I Traiger. The recovery manager of the system R database manager. ACM Computing Surveys, 1981, 13(2): 223–243
6 C Mohan, D Haderle, B Lindsay, H Pirahesh, P Schwarz. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 1992, 17(1): 94–162
7 C Diaconu, C Freedman, E Ismert, P A Larson, P Mittal, R Stonecipher, N Verma, M Zwilling. Hekaton: SQL server's memory-optimized OLTP engine. In: Proceedings of 2013 ACM SIGMOD International Conference on Management of Data. 2013, 1243–1254
8 J J Levandoski, D B Lomet, S Sengupta, R Stutsman, R Wang. High performance transactions in deuteronomy. In: Proceedings of the 7th Biennial Conference on Innovative Data Systems Research. 2015
9 H Lim, M Kaminsky, D G Andersen. Cicada: dependably fast multi-core in-memory transactions. In: Proceedings of 2017 ACM International Conference on Management of Data. 2017, 21–35
10 R Johnson, I Pandis, R Stoica, M Athanassoulis, A Ailamaki. Aether: a scalable approach to logging. Proceedings of the VLDB Endowment, 2010, 3(1−2): 681–692
11 R Johnson, I Pandis, R Stoica, M Athanassoulis, A Ailamaki. Scalability of write-ahead logging on multicore and multisocket hardware. The VLDB Journal, 2012, 21(2): 239–263
12 K Kim, T Wang, R Johnson, I Pandis. ERMIA: fast memory-optimized database system for heterogeneous workloads. In: Proceedings of 2016 International Conference on Management of Data. 2016, 1675–1687
13 H Jung, H Han, S Kang. Scalable database logging for multicores. Proceedings of the VLDB Endowment, 2017, 11(2): 135–148
14 S Tu, W Zheng, E Kohler, B Liskov, S Madden. Speedy transactions in multicore in-memory databases. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013, 18–32
15 W Zheng, S Tu, E Kohler, B Liskov. Fast databases with fast durability and recovery through multicore parallelism. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. 2014, 465–477
16 H Kimura. FOEDUS: OLTP engine for a thousand cores and NVRAM. In: Proceedings of 2015 ACM SIGMOD International Conference on Management of Data. 2015, 691–706
17 J Kim, H Jang, S Son, H Han, S Kang, H Jung. Border-collie: a wait-free, read-optimal algorithm for database logging on multicore hardware. In: Proceedings of 2019 International Conference on Management of Data. 2019, 723–740
18 Y Xia, X Yu, A Pavlo, S Devadas. Taurus: lightweight parallel logging for in-memory database management systems. Proceedings of the VLDB Endowment, 2020, 14(2): 189–201
19 M Haubenschild, C Sauer, T Neumann, V Leis. Rethinking logging, checkpoints, and recovery for high-performance storage engines. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 877–892
20 H Zhou, J Guo, H Hu, W Qian, X Zhou, A Zhou. Plover: parallel logging for replication systems. Frontiers of Computer Science, 2020, 14(4): 144606
21 C Hong, D Zhou, M Yang, C Kuo, L Zhang, L Zhou. KuaFu: closing the parallelism gap in database replication. In: Proceedings of the 29th International Conference on Data Engineering. 2013, 1186–1195
22 M Poke, T Hoefler. DARE: high-performance state machine replication on RDMA networks. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. 2015, 107–118
23 D Qin, A D Brown, A Goel. Scalable replay-based replication for fast databases. Proceedings of the VLDB Endowment, 2017, 10(13): 2025–2036
24 T Wang, R Johnson, I Pandis. Query fresh: log shipping on steroids. Proceedings of the VLDB Endowment, 2017, 11(4): 406–419
25 J Guo, J Chu, P Cai, M Zhou, A Zhou. Low-overhead Paxos replication. Data Science and Engineering, 2017, 2(2): 169–177
26 V Arora, T Mittal, D Agrawal, A El-Abbadi, X Xue, Y Zhi, J Zhu. Leader or majority: why have one when you can have both? Improving read scalability in raft-like consensus protocols. In: Proceedings of the 9th USENIX Conference on Hot Topics in Cloud Computing. 2017
27 W Cao, Z Liu, P Wang, S Chen, C Zhu, S Zheng, Y Wang, G Ma. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment, 2018, 11(12): 1849–1862
28 Z Zhang, H Hu, Y Yu, W Qian, K Shu. Dependency preserved raft for transactions. In: Proceedings of the 25th International Conference on Database Systems for Advanced Applications. 2020, 228–245
29 Y Pan, Y Li, C Zhang, R Zhang, D Hong. The design and implementation of an efficient order management system based on CEDAR. Journal of East China Normal University (Natural Science), 2018, (3): 88−96
30 P Helland, H Sammer, J Lyon, R Carr, P Garrett, A Reuter. Group commit timers and high volume transaction systems. In: Proceedings of the 2nd International Workshop on High Performance Transaction Systems. 1987, 301–329
31 P M Spiro, A M Joshi, T K Rengarajan. Designing an optimized transaction commit protocol. Digital Technical Journal, 1991, 3(1): 1–32
32 H Howard. ARC: analysis of raft consensus. Technical Report UCAM-CL-TR-857. Cambridge: University of Cambridge, 2014
33 D Huang, Q Liu, Q Cui, Z Fang, X Ma, F Xu, L Shen, L Tang, Y Zhou, M Huang, W Wei, C Liu, J Zhang, J Li, X Wu, L Song, R Sun, S Yu, L Zhao, N Cameron, L Pei, X Tang. TiDB: a raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13(12): 3072–3084
34 T Wang, R Johnson. Scalable logging through emerging non-volatile memory. Proceedings of the VLDB Endowment, 2014, 7(10): 865–876
35 K Li, F Han. Memory transaction engine of OceanBase. Journal of East China Normal University (Natural Science), 2014, 5: 149−163
36 B F Cooper, A Silberstein, E Tam, R Ramakrishnan, R Sears. Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM Symposium on Cloud Computing. 2010, 143–154
37 T Zhu, Z Zhao, F Li, W Qian, A Zhou, D Xie, R Stutsman, H Li, H Hu. SolarDB: toward a shared-everything database on distributed log-structured storage. ACM Transactions on Storage, 2019, 15(2): 11
38 J Gray. Notes on data base operating systems. In: Proceedings of Operating Systems, an Advanced Course. 1978, 393–481
39 S Harizopoulos, D J Abadi, S Madden, M Stonebraker. OLTP through the looking glass, and what we found there. In: Proceedings of 2008 ACM SIGMOD International Conference on Management of Data. 2008, 981–992
40 J Huang, K Schwan, M K Qureshi. NVRAM-aware logging in transaction systems. Proceedings of the VLDB Endowment, 2014, 8(4): 389–400
41 J Arulraj, M Perron, A Pavlo. Write-behind logging. Proceedings of the VLDB Endowment, 2016, 10(4): 337–348
42 R Hagmann. Reimplementing the Cedar file system using logging and group commit. In: Proceedings of the 11th ACM Symposium on Operating Systems Principles. 1987, 155–162
43 L Lamport. Paxos made simple, fast, and byzantine. In: Proceedings of the 6th International Conference on Principles of Distributed Systems. 2002, 7–9
44 L Lamport. Fast Paxos. Distributed Computing, 2006, 19(2): 79–103
45 M Burrows. The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation. 2006, 335–350
46 J Baker, C Bond, J C Corbett, J J Furman, A Khorlin, J Larson, J M Leon, Y Li, A Lloyd, V Yushprakh. Megastore: providing scalable, highly available storage for interactive services. In: Proceedings of the 5th Biennial Conference on Innovative Data Systems Research. 2011, 223–234
47 J Shute, R Vingralek, B Samwel, B Handy, C Whipkey, E Rollins, M Oancea, K Littlefield, D Menestrina, S Ellner, J Cieslewicz, I Rae, T Stancescu, H Apte. F1: a distributed SQL database that scales. Proceedings of the VLDB Endowment, 2013, 6(11): 1068–1079
48 J Zheng, Q Lin, J Xu, C Wei, C Zeng, P Yang, Y Zhang. PaxosStore: high-availability storage made practical in WeChat. Proceedings of the VLDB Endowment, 2017, 10(12): 1730–1741
49 J Rao, E J Shekita, S Tata. Using Paxos to build a scalable, consistent, and highly available datastore. Proceedings of the VLDB Endowment, 2011, 4(4): 243–254
50 J Ousterhout, P Agrawal, D Erickson, C Kozyrakis, J Leverich, D Mazières, S Mitra, A Narayanan, G Parulkar, M Rosenblum, S M Rumble, E Stratmann, R Stutsman. The case for RAMClouds: scalable high-performance storage entirely in DRAM. ACM SIGOPS Operating Systems Review, 2010, 43(4): 92–105