Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2018, Vol. 12 Issue (5) : 984-999    https://doi.org/10.1007/s11704-016-5551-1
RESEARCH ARTICLE
Efficient histogram-based range query estimation for dirty data
Yan ZHANG(), Hongzhi WANG(), Long YANG(), Jianzhong LI()
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
 Download: PDF(564 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

In recent years, data quality issues have attracted wide attentions. Data quality problems are mainly caused by dirty data. Currently, many methods for dirty data management have been proposed, and one of them is entity-based relational database in which one tuple represents an entity. The traditional query optimizations are not suitable for the new entity-based model. Then new query optimizations need to be developed. In this paper, we propose a new query selectivity estimation strategy based on histogram, and focus on solving the overestimation which traditional methods lead to. We prove our approaches are unbiased. The experimental results on both real and synthetic data sets show that our approaches can give good estimates with low error.

Keywords query estimation      data quality      histogram      dirty data management     
Corresponding Author(s): Yan ZHANG,Hongzhi WANG,Long YANG,Jianzhong LI   
Just Accepted Date: 19 July 2016   Online First Date: 22 September 2017    Issue Date: 21 September 2018
 Cite this article:   
Yan ZHANG,Hongzhi WANG,Long YANG, et al. Efficient histogram-based range query estimation for dirty data[J]. Front. Comput. Sci., 2018, 12(5): 984-999.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-5551-1
https://academic.hep.com.cn/fcs/EN/Y2018/V12/I5/984
1 Batini C, Scannapieco M. Data Quality: Concepts, Methodologies and Techniques. New York: Springer Publishing Company, Inc., 2006
2 Lenzerini M. Data integration: a theoretical perspective. In: Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 2015, 233–246
3 Dong X L, Halevy A, Yu C. Data integration with uncertainty. The VLDB Journal—The International Journal on Very Large Data Bases, 2009, 18(2): 469–500
4 Redman T. The impact of poor data quality on the typical enterprise. Communications of the ACM, 1998, 41(2): 49–71
https://doi.org/10.1145/269012.269025
5 Raman D, Ton Z. Execution: the missing link in retail operations. Jutas Bus.l, 2001, 43(3): 489–503
https://doi.org/10.2307/41166093
6 English L P. Information quality management: the next frontier. In: Proceedings of ASQ World Conference on Quality and Improvement. 2001
7 Rahm E, Do H H. Data cleaning: problems and current approaches. IEEE Data Engineering Bulletin, 2000, 23(23): 3–13
8 Fan W F, Li J, Ma S, Tang N, Yu W. Interaction between record matching and data repairing. Journal of Data & Information Quality, 2011, 4(4): 469–480
https://doi.org/10.1145/1989323.1989373
9 Fuxman A D, Miller R J. First-order query rewriting for inconsistent databases. In: Proceedings of International Conference on Database Theory. 2005, 337–351
10 Andritsos P, Fuxman A, Miller R J. Clean answers over dirty databases: a probabilistic approach. IEEE Computer Society, 2006, 30
11 Wolf G, Kalavagattu A, Khatri H, Balakrishnan R, Chokshi B, Fan J, Chen Y, Kambhampati S. Query processing over incomplete autonomous databases: query rewriting using learned data dependencies. The VLDB Journal, 2009, 18(5): 1167–1190
https://doi.org/10.1007/s00778-009-0155-0
12 Fuxman A, Fazli E, Miller J. Conquer: efficient management of inconsistent databases. In: Proceedings of SIGMOD Conference. 2005, 155–166
https://doi.org/10.1145/1066157.1066176
13 Boulos J, Dalvi N, Mandhani B, Mathur S, Re C, Suciu D. MYSTIQ: a system for finding more answers by using probabilities. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2005, 891–893
https://doi.org/10.1145/1066157.1066277
14 Dalvi N, Suciu D. Management of probabilistic data: foundations and challenges. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2007, 1–12
https://doi.org/10.1145/1265530.1265531
15 Widom J. Trio: a system for integrated management of data, accuracy, and lineage. In: Proceedings of the Conference on Innovative Data Systems Research (CIDR). 2005, 262–276
16 Hassanzadeh O, Miller R J. Creating probabilistic databases from duplicated data. The VLDB Journal—The International Journal on Very Large Data Bases, 2009, 18(5): 1141–1166
17 Benjelloun O, Garcia-Molina H, Menestrina D, Whang S E, Su Q, Widom J. Swoosh: a generic approach to entity resolution. The VLDB Journal—The International Journal on Very Large Data Bases, 2009, 18(1): 255–276
18 Whang S E, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H. Entity resolution with iterative blocking. In: Proceedings of the 35th SIGMOD International Conference on Management of Data. 2009, 219–232
https://doi.org/10.1145/1559845.1559870
19 Li Y, Wang H, Gao H. Efficient entity resolution based on sequence rules. In: Proceedings of Communications in Computer and Information Science. 2011, 381–388
https://doi.org/10.1007/978-3-642-21402-8_61
20 Lu W, Fung G P C, Du X, Zhou X, Chen L, Deng K. Approximate entity extraction in temporal databases. World Wide Web, 2011, 14(2): 157–186
https://doi.org/10.1007/s11280-011-0109-5
21 Zhang W J, Zhan L M, Zhang Y, Cheema M A, Lin X M. Efficient top-k similarity join processing over multi-valued objects. World Wide Web, 2014, 17(3): 285–309
https://doi.org/10.1007/s11280-012-0201-5
22 Ioannidis Y E. The history of histograms (abridged). In: Proceedings of the 29th International Conference on Very Large Data Bases. 2004, 19–30
23 Cormode G, Garofalakis M. Histograms and wavelets on probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(8): 1142–1157
https://doi.org/10.1109/TKDE.2010.66
24 Cormode G, Deligiannakis A, Garofalakis M, McGregor A. Probabilistic histograms for probabilistic data. Proceedings of the VLDB Endowment, 2009, 2(1): 526–537
https://doi.org/10.14778/1687627.1687687
25 Wang H Z, Liu X L, Li J Z, Tong X, Yang L, Li Y K. EntityManager: an entity-based dirty data management system. In: Proceedings of International Conference on Database Systems for Advanced Applications. 2013, 468–471
https://doi.org/10.1007/978-3-642-37450-0_38
26 Abiteboul S, Kanellakis P, Grahne G. On the representation and querying of sets of possible worlds. Theoretical Computer Science, 1987, 16(3): 34–48
https://doi.org/10.1145/38713.38724
27 Fuhr N, Rolleke T. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 1997, 15(1): 32–66
https://doi.org/10.1145/239041.239045
28 Lakshmanan L, Leone N, Ross R, Subrahmanian V S. Probview: a flexible probabilistic database system. ACM Transactions on Database Systems, 1997, 22(3): 419–469
https://doi.org/10.1145/261124.261131
29 Nierman A, Jagadish H. ProTDB: probabilistic data in XML. In: Proceedings of the 28th International Conference on Very Large Data Bases. 2002, 646–657
https://doi.org/10.1016/B978-155860869-6/50063-9
30 Jin C Q, Yi K, Chen L, Yu J X, Lin X. Sliding-window top-k queries on uncertain streams. Proceedings of the VLDB Endowment, 2008, 1(1): 301–312
https://doi.org/10.14778/1453856.1453892
31 Burdick D, Deshpande P M, Jayram T S, Ramakrishnan R, Vaithyanathan S. OLAP over uncertain and imprecise data. The VLDB Journal—The International Journal on Very Large Data Bases, 2007, 16(1): 123–144
32 Qi Y, Jain R, Singh S, Prabhakar S. Threshold query optimization for uncertain data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2010, 315–326
https://doi.org/10.1145/1807167.1807203
33 Tao Y F, Cheng R, Xiao X K, Ngai W K, Kao B, Prabhakar S. Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: Proceedings of the 31st International Conference on Very Large Data Bases. 2005, 922–933
34 Tao Y F, Xiao X K, Cheng R. Range search on multidimensional uncertain data. ACM Transactions on Database Systems, 2007, 32(3): 15
https://doi.org/10.1145/1272743.1272745
35 Dalvi N, Suciu D. Efficient query evaluation on probabilistic databases. In: Proceedings of International Conference on Very Large Databases. 2008, 16(1): 119–128
36 Cheng R, Kalashnikov D V, Prabhakar S. Evaluating probabilistic queries over imprecise data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2003, 551–562
https://doi.org/10.1145/872757.872823
37 Pei J, Jiang B, Lin X M, Yuan Y D. Probabilistic skylines on uncertain data. In: Proceedings of the 33rd International Conference on Very Large Data Bases. 2007, 15–26
38 Dellis E, Seeger B. Efficient computation of reverse skyline queries. In: Proceedings of the 33rd International Conference on Very Large Data Bases. 2007, 291–302
39 Soliman M A, Ilyas I F, Chang K C C. Top-k query processing in uncertain databases. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 896–905
https://doi.org/10.1109/ICDE.2007.367935
40 Ge T, Zdonik S, Madden S. Top-k queries on uncertain data: on score distribution and typical answers. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 375–388
https://doi.org/10.1145/1559845.1559886
41 Wang G R, Huo H, Han D H, Hui X Y. Query processing and optimization techniques over streamed fragmented XML. World Wide Web, 2008, 11(3): 339–359
https://doi.org/10.1007/s11280-007-0041-x
42 Barbosa D, Mignet L, Veltri P. Studying the XML Web: gathering statistics from an XML sample. World Wide Web, 2006, 9(2): 187–212
https://doi.org/10.1007/s11280-006-8437-6
43 Kooi R. The optimization of queries in relational databases. Dissertation for the Doctoral Degree. Cleveland, Ohio: Case Western Reserve University, 1980
44 Piatetsky-Shapiro G, Connell C. Accurate estimation of the number of tuples satisfying a condition. ACM SIGMOD Record, 1984, 14(2): 256–276
https://doi.org/10.1145/971697.602294
45 Ioannidis Y, Poosala V. Balancing histogram optimality and practicality for query result size estimation. ACM SIGMOD Record, 1995, 24(2): 233–244
https://doi.org/10.1145/568271.223841
46 Gunopulos D, Kollios G, Tsotras V J, Domeniconi C. Approximating multi-dimensional aggregate range queries over real attributes. ACM SIGMOD Record, 2000, 29(2): 463–474.
https://doi.org/10.1145/335191.335448
47 Bruno N, Chaudhuri S, Gravano L. STHoles: a multidimensional workload aware histogram. ACM SIGMOD Record, 2001, 30(2): 211–222
https://doi.org/10.1145/376284.375686
48 Haas P J, Naughton J F, Seshadri S, Swami A N. Selectivity and cost estimation for joins based on random sampling. Journal of Computer and System Sciences, 1996, 52(3): 550–569
https://doi.org/10.1006/jcss.1996.0041
49 Lipton R J, Naughton J F. Query size estimation by adaptive sampling. Journal of Computer and System Sciences, 1995, 51(1): 18–25
https://doi.org/10.1006/jcss.1995.1050
50 Olken F. Random sampling from databases. Dissertation for the Doctoral Degree. University of California at Berkeley, 1997
51 Ngu A, Harangsri B, Shepherd J. Query size estimation for joins using systematic sampling. Distributed and Parallel Databases, 2004, 15(3): 237–275
https://doi.org/10.1023/B:DAPD.0000018573.35050.25
52 Chaudhuri S, Das G, Narasayya V R. Optimized stratified sampling for approximate query processing. ACM Transactions on Database Systems, 2007, 32(2): 9
https://doi.org/10.1145/1242524.1242526
53 Zhang Y, Yang L, Wang H Z. Range query estimation for dirty data management system. In: Proceedings of International Conference on Web-Age Information Management. 2012, 152–164
https://doi.org/10.1007/978-3-642-32281-5_15
[1] Chenchen SUN,Derong SHEN,Yue KOU,Tiezheng NIE,Ge YU. A genetic algorithm based entity resolution approach with active learning[J]. Front. Comput. Sci., 2017, 11(1): 147-159.
[2] Vahid MEHRDAD,Hossein EBRAHIMNEZHAD. 3D object retrieval based on histogram of local orientation using one-shot score support vector machine[J]. Front. Comput. Sci., 2015, 9(6): 990-1005.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed