Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

邮发代号 80-970

2019 Impact Factor: 1.275

Frontiers of Computer Science  2022, Vol. 16 Issue (2): 162603   https://doi.org/10.1007/s11704-020-0248-x
  本期目录
A comprehensive study on fault tolerance in stream processing systems
Xiaotong WANG1, Chunxi ZHANG1, Junhua FANG2, Rong ZHANG1(), Weining QIAN1, Aoying ZHOU1
1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
2. Advanced Data Analytics Laboratory of Soochow University, Suzhou 215006, China
 全文: PDF(9155 KB)   HTML
Abstract

Stream processing has emerged as a useful technology for applications which require continuous and low latency computation on infinite streaming data. Since stream processing systems (SPSs) usually require distributed deployment on clusters of servers in face of large-scale of data, it is especially common to meet with failures of processing nodes or communication networks, but should be handled seriously considering service quality. A failed system may produce wrong results or become unavailable, resulting in a decline in user experience or even significant financial loss. Hence, a large amount of fault tolerance approaches have been proposed for SPSs. These approaches often have their own priorities on specific performance concerns, e.g., runtime overhead and recovery efficiency. Nevertheless, there is a lack of a systematic overview and classification of the state-of-the-art fault tolerance approaches in SPSs, which will become an obstacle for the development of SPSs. Therefore, we investigate the existing achievements and develop a taxonomy of the fault tolerance in SPSs. Furthermore, we propose an evaluation framework tailored for fault tolerance, demonstrate the experimental results on two representative open-sourced SPSs and exposit the possible disadvantages in current designs. Finally, we specify future research directions in this domain.

Key wordsfault tolerance    performance evaluation    stream processing
收稿日期: 2020-06-04      出版日期: 2021-09-18
Corresponding Author(s): Rong ZHANG   
 引用本文:   
. [J]. Frontiers of Computer Science, 2022, 16(2): 162603.
Xiaotong WANG, Chunxi ZHANG, Junhua FANG, Rong ZHANG, Weining QIAN, Aoying ZHOU. A comprehensive study on fault tolerance in stream processing systems. Front. Comput. Sci., 2022, 16(2): 162603.
 链接本文:  
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-020-0248-x
https://academic.hep.com.cn/fcs/CN/Y2022/V16/I2/162603
Survey General Classification Study Sample Analysis
Active Passive Hybrid Syst. Lit. Qual. Quant.
Elnozahy et al. [ 36] ? ? ? ?
Balazinska et al. [ 37] ? ? ? ?
Andre et al. [ 38] ? ? ?
Heinze et al. [ 5] ? ?
? ?
Muhammad et al. [ 39] ? ? ? ? ?
Our study ? ? ? ? ? ? ?
Tab.1  
Fig.1  
Benchmarks System Datasets Workloads Metrics
LinearRoad [ 40] Aurora SREAM Position reports Historical query requests Toll processing Toll notifications Toll assessments Accident detection Throughput
StreamBench [ 41] Storm Flink Spark Streaming AOL search data CAIDA internet traces four atomic operators Word count Distinct count statistics Throughput Latency Penalty factor
Yahoo! StreamingBench [ 42] Storm Flink Spark Streaming Advertisements User clicks Advertisement click analytics Throughput Latency
YCSB Extension [ 43] Storm Flink Spark Streaming Advertisements User clicks Advertisement click analytics Throughput Latency
StreamBench [ 44] Storm Flink Spark Streaming Synthetic data Word count Advertisement click analytics KMeans Throughput Latency
RIoTBench [ 45] Storm Urban sensing data Energy consumption MHealth records ETL Statistics Model training Predictive Analytics Peak rate Latency Resource usage Jitter
Benchmark Suite [ 46] Storm Spark Streaming Project Gutenberg 1998 World Cup Taxi traces Word count Click analytics Traffic monitoring Throughput Latency Scalability Tuple loss Resource usage
Benchmarking DSDPS [ 47] Storm Flink Spark Streaming Purchasing records Advertisements Windowed aggregation Windowed join Throughput Sustainable throughput Latency
Our work Storm Flink English novel set Synthetic data Word count π calculation Dummy topology Throughput Normal latency Resource usage Recovery latency
Tab.2  
Runtime Overhead Normal latency
Hardware resource CPU
Memory
Network
Recovery Efficiency Recovery latency
Recovery accuracy Precise
Duplicate
Lost
Erroneous
Disordered
Tab.3  
Fig.2  
Fig.3  
Fig.4  
Fig.5  
Fig.6  
Deployments Steps
Operator Initiation Connection Establishment Checkpoint Retrieval
Hot
Cold ?
Deployed ? ?
Vacant ? ? ?
Tab.4  
Fig.7  
Fig.8  
Fig.9  
Object Knob Abbr Unit Value
Application State size State MB 1, 10, 15, 30
Computation intensity CPI round 0 (low), 2000(medium), 5000 (high)
Topology depth TD # of operators 2, 20
Checkpoint Checkpoint interval CI second NC (disabled), 1, 30, 45, 60
Checkpoint synchronicity CS ? sync, async
Barrier alignment BA ? true, false
Tab.5  
Workload Suite Use Case Description Object
1 WC Effects of CI on NL and CPU for applications with different State Runtime Overhead
2 PI Effects of CS on NL and CPU for applications with different CPI
3 WC Effects of BA on NL and CPU when processing data with different Skew
4 WC NL under different workload fluctuations
5 DT Effects of TD on RL Recovery Efficiency
6 WC Effects of CI on RL
Tab.6  
Knob Abbr Unit Value
Input Rate IR tuples/s 500, 5000, [0, 7500]
Skew Distribution Skew ? 0, 1
Tab.7  
Fig.10  
Fig.11  
Fig.12  
Fig.13  
Fig.14  
Fig.15  
Fig.16  
Fig.17  
Fig.18  
1 J F Naughton , D J DeWitt , D Maier , A Aboulnaga , J J Chen , L Galanis , J Kang , R Krishnamurthy , Q Luo , N Prakash , R Ramamurthy , J Shanmugasundaram , F Tian , K Tufte , S Viglas , Y Wang , C Zhang , B Jackson , R Chen . The niagara internet query system. IEEE Data Engineering Bulletin, 2001, 24( 2): 27– 33
2 D J Abadi , D Carney , U Çetintemel , M Cherniack , C Convey , S Lee , M Stonebraker , N Tatbul , Zdonik S B . Aurora: a new model and architecture for data stream management. The International Journal on Very Large Data Bases, 2003, 12( 2): 120– 139
3 Motwani R, Arasu J, Widomand A, Babcock B, Babu S, Datar M, Olston G S, Mankuand C, Varma J, Rosensteinand R. Query processing, approximation, and resource management in a data stream management system. In: Proceedings of International Conference on Innovative Data Systems Research. 2003
4 Chandrasekaran S, Cooper O, Deshpande A, J Franklin M, M Hellerstein J, Hong W, Krishnamurthy S, Madden S, Raman V, Reiss F, A Shah M. Telegraphcq: continuous dataflow processing for an uncertain world. In: Proceedings of Conference on Innovative Data Systems Research. 2003
5 Heinze T, Aniello L, Querzoni L, Jerzak Z. Cloud-based data stream processing. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2014. 238–245
6 Cherniack M, Balakrishnan H, Balazinska M, Carney D, Çetintemel U, Xing Y, Zdonik S B. Scalable distributed stream processing. In: Proceedings of International Conference on Innovative Data Systems Research. 2003
7 Abadi D J, Ahmad Y, Balazinska M, Çetintemel U, Cherniack M, Hwang J H, Lindner W, Maskey A, Rasin A, Ryvkina E, Tatbul N, Xing Y, B Zdonik S. The design of the borealis stream processing engine. In: Proceedings of Conference on Innovative Data Systems Research. 2005. 277–289
8 Dean J Ghemawat S. Mapreduce: simplified data processing on large clusters. In: Proceedings of USENIX Symposium on Operating System Design and Implementation. 2004, 137–150
9 Toshniwal A, Taneja S, Shukla A, Ramasamy K, M Patel J, Kulkarni S, Jackson J, Gade K, Fu M S, Donham J, Bhagat N, Mittal S, V Ryaboy D. Storm@twitter. In: Proceedings of ACM International Conference on Management of Data. 2014, 147–156
10 Zaharia M, Das T, Li H Y, Hunter T, Shenker S, Stoica I. Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of ACM Symposium on Operating Systems Principles. 2013, 423–438
11 P Carbone , A Katsifodimos , S Ewen , V Markl , S Haridi , K Tzoumas . Apache flink TM: stream and batch processing in a single engine. IEEE Data Engineering Bulletin, 2015, 38( 4): 28– 38
12 Noghabi S A, Paramasivam K, Pan Y, Ramesh N, Bringhurst J, Gupta I, H Campbell R. Stateful scalable stream processing at linkedin. Proceedings of the VLDB Endowment, 2017, 10(12): 1634–1645
13 Gill P, Jain N, Nagappan N. Understanding network failures in data centers: measurement, analysis, and implications. In: Proceedings of ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 2011, 350–361
14 Dean J. Handling large datasets at google: current systems and future directions. In Data-Intensive Computing Symposium, 2008
15 V Vishwanath K Nagappan N. Characterizing cloud computing hardware reliability. In: Proceedings of Symposium on Cloud Computing. 2010, 193–204
16 Shankland S. Google spotlights data center inner workings. CNET News, 2008
17 Raphael J R. In pictures: the worst cloud outages of 2013. PCWorld, 2013
18 Shah M A, Hellerstein J M, Brewer E A. Highly-available, faulttolerant, parallel dataflows. In: Proceedings of ACM International Conference on Management of Data. 2004, 827–838
19 Hwang J H, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik S B. High-availability algorithms for distributed stream processing. In: Proceedings of IEEE International Conference on Data Engineering. 2005, 779–790
20 Balazinska M, Balakrishnan H, Madden S, Stonebraker M. Faulttolerance in the borealis distributed stream processing system. In: Proceedings of ACM International Conference on Management of Data. 2005, 13–24
21 Hwang J H, Çetintemel U, B Zdonik S. Fast and highly-available stream processing over wide area networks. In: Proceedings of IEEE International Conference on Data Engineering. 2008, 804–813
22 Hwang J H, Xing Y, Çetintemel U, B Zdonik S. A cooperative, self-configuring high-availability solution for stream processing. In: Proceedings of IEEE International Conference on Data Engineering. 2007, 176–185
23 Y C Kwon , M Balazinska , A G Greenberg . Fault-tolerant stream processing using a distributed, replicated file system. Proceedings of the VLDB Endowment, 2008, 1( 1): 574– 585
24 Gu Y, Zhang Z, Ye F, Yang H, Kim M, Lei H, Liu Z. An empirical study of high availability in stream processing systems. In: Proceedings of ACM/IFIP/USENIX International Middleware Conference. 2009
25 Sebepou Z Magoutis K. Cec: continuous eventual checkpointing for data stream processing operators. In: Proceedings of International Conference on Dependable Systems and Networks. 2011, 145–156
26 Fernandez R C, Migliavacca M, Kalyvianaki E, R Pietzuch P. Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of ACM International Conference on Management of Data. 2013, 725–736
27 Koldehofe B, Mayer R, Ramachandran U, Rothermel K, Völz M. Rollback-recovery without checkpoints in distributed event processing systems. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2013, 27–38
28 P Huang Q Lee . Toward high-performance distributed stream processing via approximate fault tolerance. Proceedings of the VLDB Endowment, 2016, 10( 3): 73– 84
29 Zhang Z, Gu Y, Ye F, Yang H, Kim M, Lei H, Liu Z. A hybrid approach to high availability in stream processing systems. In: Proceedings of IEEE International Conference on Distributed Computing Systems. 2010, 138–148
30 Heinze T, Zia M, Krahn R, Jerzak Z, Fetzer C. An adaptive replication scheme for elastic data stream processing systems. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2015, 150–161
31 Su L Zhou Y L. Tolerating correlated failures in massively parallel stream processing engines. In: Proceedings of IEEE International Conference on Data Engineering. 2016, 517–528
32 Y L Su L Zhou . Passive and partially active fault tolerance for massively parallel stream processing engines. IEEE Transactions on Knowledge and Data Engineering, 2019, 31( 1): 32– 45
33 Martin A, Smaneoto T, Dietze T, Brito A, Fetzer C. User-constraint and self-adaptive fault tolerance for event stream processing systems. In: Proceedings og IEEE/IFIP International Conference on Dependable Systems and Networks. 2015
34 F B Schneider . Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 1990, 22( 4): 299– 319
35 Chandra T D, Griesemer R, Redstone J. Paxos made live: an engineering perspective. In: Proceedings of ACM Symposium on Principles of Distributed Computing. 2007, 398–407
36 E N Elnozahy , L Alvisi , Y M Wang , D B Johnson . A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 2002, 34( 3): 375– 408
37 Balazinska M, Hwang J H, Shah M A. Fault-tolerance and high availability in data stream management systems. In Encyclopedia of Database Systems. 2009, 1109–1115
38 A L S Gradvohl , H Senger , L Arantes , P Sens . Comparing distributed online stream processing systems considering fault tolerance issues. Journal of Emerging Technologies in Web Intelligence, 2014, 6( 2): 174– 179
39 Nasir M A U. Fault tolerance for stream processing engines. arXiv preprint, abs/1605.00928, 2016
40 Arasu A, Cherniack M, Galvez E F, Maier D, Maskey A, Ryvkina E, Stonebraker M, Tibbetts R. Linear road: a stream data management benchmark. In: Proceedings of ACM International Conference on Very Large Data Bases. 2004, 480–491
41 Lu R R, Wu G, Xie B, Hu J T. Streambench: towards benchmarking modern distributed stream computing frameworks. In: Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing. 2014, 69–78
42 Chintapalli S, Dagit D, Evans B, Farivar R, Graves T, Holderbaugh M, Liu Z, Nusbaum K, Patil K, Peng B Y, Poulosky P. Benchmarking streaming computation engines: storm, flink and spark streaming. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium Workshops. 2016, 1789–1792
43 Grier J. Extending the yahoo! streaming benchmark. Ververica, 2016
44 Wang Y J. Stream processing systems benchmark: streambench. Master’s thesis, Aalto University, 2016
45 Shukla A, Chaturvedi S, Simmhan Y. Riotbench: a real-time iot benchmark for distributed stream processing platforms. arXiv preprint, abs/1701.08530, 2017
46 Bordin M V. A benchmark suite for distributed stream processing systems. PhD thesis, Universidade Federal do Rio Grande Do Su, 2017
47 Karimov J, Rabl T, Katsifodimos A, Samarev R, Heiskanen H, Markl V. Benchmarking distributed stream data processing systems. In: Proceedings of IEEE International Conference on Data Engineering. 2018, 1507–1518
48 Venkataraman S, Panda A, Ousterhout K, Armbrust M, Ghodsi A, Franklin M J, Recht B, Stoica I. Drizzle: fast and adaptable stream processing at scale. In: Proceedings of ACM Symposium on Operating Systems Principles. 2017, 374–389
49 T Akidau , R Bradshaw , C Chambers , S Chernyak , R FernándezMoctezuma , R Lax , S McVeety , D Mills , F Perry , E Schmidt , S Whittle . The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-oforder data processing. Proceedings of the VLDB Endowment, 2015, 8( 12): 1792– 1803
50 P A Tucker , D Maier , T Sheard , L Fegaras . Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering, 2003, 15( 3): 555– 568
51 L Chandy K M Lamport . Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, 1985, 3( 1): 63– 75
52 Y M Wang , P Y Chung , I J Lin , W K Fuchs . Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1995, 6( 5): 546– 554
53 Neumeyer L, Robbins B, Nair A, Kesari A. S4: distributed stream computing platform. In: Proceedings of IEEE International Conference on Data Mining Workshops. 2010, 170–177
54 B Randell . System structure for software fault tolerance. IEEE Transactions on Software Engineering, 1975, 1( 2): 221– 232
55 Murray D G, McSherry F, Isaacs R, Isard M, Barham P, Abadi M. Naiad: a timely dataflow system. In: Proceedings of ACM Symposium on Operating Systems Principles. 2013, 439–455
56 T Akidau , A Balikov , K Bekiroglu , S Chernyak , J Haberman , R Lax , S McVeety , D Mills , P Nordstrom , S Whittle . Millwheel: faulttolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 2013, 6( 11): 1033– 1044
57 Qian Z P, He Y, Su C Z, Wu Z J, Zhu H Y, Zhang T Z, Zhou L D, Yu Y, Zhang Z. Timestream: reliable stream computation in the cloud. In: Proceedings of Eurosys Conference. 2013, 1–14
58 Bhargava B K S Lian. Independent checkpointing and concurrent rollback for recovery in distributed systems - an optimistic approach. In: Proceedings of Symposium on Reliable Distributed Systems. 1988, 3–12
59 Feldman S I Brown C B. Igor: a system for program debugging via reversible execution. In: Proceedings of ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging. 1988, 112–123
60 Ghemawat S, Gobioff H, Leung S. The google file system. In: Proceedings of ACM Symposium on Operating Systems Principles. 2003, 29–43
61 Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: Proceedings of IEEE Symposium on Mass Storage Systems and Technologies. 2010, 1–10
62 Pohl C, Götze P, Sattler K. A cost model for data stream processing on modern hardware. In: Proceedings of International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures. 2017
63 S Zeuch , S Breß , T Rabl , B D Monte , J Karimov , C Lutz , M Renz , J Traub , V Markl . Analyzing efficient stream processing on modern hardware. Proceedings of the VLDB Endowment, 2019, 12( 5): 516– 530
[1] Highlights Download
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed