Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front Comput Sci: 157-164    https://doi.org/10.1007/s11704-013-3903-7
RESEARCH ARTICLE
Big data challenge: a data management perspective
Jinchuan CHEN, Yueguo CHEN, Xiaoyong DU, Cuiping LI, Jiaheng LU, Suyun ZHAO, Xuan ZHOU
Key Laboratory of Data Engineering and Knowledge Engineering, School of Information, Renmin University of China, Beijing 100872, China
Abstract

Virtually everyone, from big Web companies to traditional enterprises to physical science researchers to social scientists, is either already experiencing or anticipating unprecedented growth in the amount of data available in their world, along with new opportunities and great untapped value. This paper reviews big data challenges from a data management perspective. In particular, we discuss big data diversity, big data reduction, big data integration and cleaning, big data indexing and query, and finally big data analysis and mining. Our survey gives a brief overview of big-data-oriented research and problems.

Keywords: big data, performance, databases
Corresponding Author(s): DU Xiaoyong, Email: duyong@ruc.edu.cn; LU Jiaheng, Email: jiahenglu@ruc.edu.cn
Issue Date: 01 April 2013
Cite this article:
Xiaoyong DU, Cuiping LI, Jiaheng LU, et al. Big data challenge: a data management perspective[J]. Front Comput Sci: 157-164.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-013-3903-7
https://academic.hep.com.cn/fcs/EN/Y0/V/I/157