Big data challenge: a data management perspective
Jinchuan CHEN, Yueguo CHEN, Xiaoyong DU, Cuiping LI, Jiaheng LU, Suyun ZHAO, Xuan ZHOU
Key Laboratory of Data Engineering and Knowledge Engineering, School of Information, Renmin University of China, Beijing 100872, China

Abstract Virtually everyone, from big Web companies to traditional enterprises to physical science researchers to social scientists, is either already experiencing or anticipating unprecedented growth in the amount of data available in their world, along with new opportunities and great untapped value. This paper reviews big data challenges from a data management perspective. In particular, we discuss big data diversity, big data reduction, big data integration and cleaning, big data indexing and query, and finally big data analysis and mining. Our survey gives a brief overview of big-data-oriented research and problems.

Keywords big data, performance, databases

Corresponding Author(s):
DU Xiaoyong, Email: duyong@ruc.edu.cn; LU Jiaheng, Email: jiahenglu@ruc.edu.cn

Issue Date: 01 April 2013