Automatic Web-based relational data imputation
Hailong LIU, Zhanhuai LI, Qun CHEN, Zhaoqiang CHEN
School of Computer Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
Abstract Data incompleteness is one of the most important data quality problems in enterprise information systems. Most existing data imputation techniques merely deduce approximate values for incomplete attributes by means of specific data quality rules or mathematical methods. Unfortunately, such approximations may be far from the truth, and these techniques do not work well when the observed data is inadequate. The World Wide Web (WWW) has become the most important and most widely used information source, and several recent works have shown that Web data can improve the quality of databases. In this paper, we propose a Web-based relational data imputation framework that tries to automatically retrieve real values from the WWW for incomplete attributes. We take full advantage of relations among different kinds of objects, based on the idea that objects of the same kind must have the same kind of relations with their relatives in a specific world. Our proposed techniques consist of two automatic query formulation algorithms and one graph-based candidate extraction model. Evaluations on two high-quality real datasets and one poor-quality real dataset demonstrate the effectiveness of our approaches.
Keywords data incompleteness, imputation, World Wide Web, query formulation, candidate selection, semantic relation
Corresponding Author(s): Hailong LIU
Just Accepted Date: 23 December 2016
Online First Date: 06 March 2018
Issue Date: 04 December 2018