Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front Comput Sci 2012, Vol. 6, Issue 3: 293-312. https://doi.org/10.1007/s11704-012-2002-5
RESEARCH ARTICLE
Linking temporal records
Pei LI (1), Xin Luna DONG (2), Andrea MAURINO (1), Divesh SRIVASTAVA (2)
1. Department of Informatics, Systems and Communication, University of Milan-Bicocca, Milan 20126, Italy; 2. Data Management Department, AT&T Labs-Research, Florham Park, NJ 07932, USA
Abstract

Many data sets contain temporal records that span a long period of time; each record carries a time stamp and describes some aspects of a real-world entity at a particular time (e.g., author information in DBLP). In such cases, we often wish to identify the records that describe the same entity over time, so that we can perform longitudinal data analysis. However, existing record linkage techniques ignore temporal information and therefore fall short on temporal data.

This article studies linking temporal records. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets.
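The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the exponential half-life decay, the attribute weights, the threshold, and the function names `disagreement_decay`, `temporal_similarity`, and `link_in_time_order` are all assumptions chosen for illustration, and the greedy single pass in time-stamp order is a deliberately simpler stand-in for the global clustering methods the article proposes.

```python
def disagreement_decay(gap_years, half_life_years=5.0):
    """Assumed decay form: the weight of a disagreement halves every half-life."""
    return 0.5 ** (abs(gap_years) / half_life_years)


def temporal_similarity(a, b, weights, half_life_years=5.0):
    """Weighted fraction of agreeing attributes between two time-stamped records.

    An attribute on which the records disagree counts less as the time gap
    grows, because the entity's value may simply have changed in between.
    """
    decay = disagreement_decay(a["year"] - b["year"], half_life_years)
    agree = 0.0
    total = 0.0
    for attr, w in weights.items():
        if a[attr] == b[attr]:
            agree += w
            total += w
        else:
            total += w * decay  # a stale disagreement is weaker evidence
    return agree / total if total > 0 else 0.0


def link_in_time_order(records, weights, threshold=0.6, half_life_years=5.0):
    """Greedy pass in time order (hypothetical simplification, not global).

    Each record joins the cluster whose most recent record is most similar,
    provided the similarity reaches the threshold; otherwise it starts a
    new cluster.
    """
    clusters = []
    for rec in sorted(records, key=lambda r: r["year"]):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = temporal_similarity(cluster[-1], rec, weights, half_life_years)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([rec])
        else:
            best.append(rec)
    return clusters
```

With equal weights on a name and an affiliation attribute, two records that share a name but differ in affiliation score 0.5 when they carry the same time stamp, and the score rises as the gap between them grows. A traditional, time-agnostic comparison would keep the score at 0.5 regardless of the gap.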

Keywords: temporal data, record linkage, data integration
Corresponding Author(s): LI Pei, Email: pei.li@disco.unimib.it
Issue Date: 01 June 2012
Cite this article:
Pei LI, Xin Luna DONG, Andrea MAURINO, et al. Linking temporal records[J]. Front Comput Sci, 2012, 6(3): 293-312.
URL:
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-012-2002-5
https://academic.hep.com.cn/fcs/EN/Y2012/V6/I3/293