Knowledge discovery through directed probabilistic topic models: a survey
Ali DAUD1, Juanzi LI1, Lizhu ZHOU1, Faqir MUHAMMAD2
1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. Department of Mathematics and Statistics, Allama Iqbal Open University, Sector H-8, Islamabad 44000, Pakistan
Graphical models have become the basic framework for topic-based probabilistic modeling. Models with latent variables in particular have proved effective in capturing hidden structures in data. In this paper, we survey an important subclass, Directed Probabilistic Topic Models (DPTMs), which have soft clustering abilities, and their applications for knowledge discovery in text corpora. From an unsupervised learning perspective, “topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling”. In topic modeling, a document consists of different hidden topics, and the topic probabilities provide an explicit representation of a document that smooths the data at the semantic level. Topic modeling has been an active area of research during the last decade. Many models have been proposed to handle text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We present the basic concepts, advantages and disadvantages of these models in chronological order, a classification of existing models into categories, their parameter estimation and inference algorithms, and measures for evaluating model performance. We also discuss their applications, open challenges and future directions in this dynamic area of research.
Corresponding Author(s):
DAUD Ali, Email: ali_msdb@hotmail.com; LI Juanzi, Email: ljz@keg.cs.tsinghua.edu.cn; ZHOU Lizhu, Email: dcszlz@tsinghua.edu.cn; MUHAMMAD Faqir, Email: aioufsd@yahoo.com
Cite this article:
Ali DAUD, Juanzi LI, Lizhu ZHOU, Faqir MUHAMMAD. Knowledge discovery through directed probabilistic topic models: a survey[J]. Frontiers of Computer Science in China, 2010, 4(2): 280-301.
Ali DAUD, Juanzi LI, Lizhu ZHOU, Faqir MUHAMMAD. Knowledge discovery through directed probabilistic topic models: a survey. Front Comput Sci Chin, 2010, 4(2): 280-301.
| Model | Category | Inference algorithm | Application | Datasets |
| --- | --- | --- | --- | --- |
| | | | | New York Times dataset (http://www.ldc.upenn.edu), Foreign Broadcast Information Service (FBIS) dataset (http://www.fbis.gov) |
| Bigram Topic Model | IaCDPTMs | Gibbs EM | Topic Discovery | Psychological Review abstracts dataset (http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm), 20 Newsgroups dataset (http://people.csail.mit.edu/jrennie/20Newsgroups/) |
| PAM | IrCDPTMs | Gibbs Sampling | Super and Sub Topic Discovery, Document Classification | NIPS00-12 Proceedings dataset (www.cs.toronto.edu/~roweis/data.html), 20 Newsgroups dataset (http://www.cs.cmu.edu/~textlearning/), Rexa research paper search engine (http://Rexa.info) |
| TOT Model | TDPTMs | Gibbs Sampling | Topics Evolution over Time | State of the Union Addresses dataset (http://www.gutenberg.org/dirs/etext04/suall11.txt), Researchers Email Archive, NIPS00-12 Proceedings dataset (www.cs.toronto.edu/~roweis/data.html) |
| Continuous-Time Model | TDPTMs | Gibbs Sampling | Topics Evolution over Time and their Correlations | Rexa research paper search engine (http://Rexa.info) |
| CPLSA | IrCDPTMs | EM | Temporal (Entities-Topic) Correlations, Topics Evolution over Time, Event Impact Analysis | Abstracts of 282 papers of two Data Mining researchers from the ACM Digital Library, MSN Space documents, Abstracts of 28 years’ SIGIR conferences from the ACM Digital Library |
| HTMM | IaCDPTMs | EM and Forward-backward algorithm | Topic Discovery | NIPS00-12 Proceedings dataset (www.cs.toronto.edu/~roweis/data.html), used dataset (http://www.cs.huji.ac.il/~amitg/htmm.html) |
| | | | | TREC-1 AP newswire articles corpus, “Election 08” dataset (digg.com) |
| LTHM | IrCDPTMs | EM | Relationship between Topics and Links | Webkb web pages dataset (http://www.cs.huji.ac.il/~amitg/lthm.html), Wikipedia (http://www.cs.cmu.edu/~webkb/) |
| TAT | TDPTMs | Gibbs Sampling | Temporal Authors Interests and Correlations | Computer science research papers taken from http://www.informatik.uni-trier.de/~ley/db/ |
| ACT | IrCDPTMs | Gibbs Sampling | Expertise Search in Academic Social Networks | Computer science research papers taken from http://www.arnetminer.org/ |
| STMS | IrCDPTMs | Gibbs Sampling | Expert Finding | Computer science research papers taken from http://www.informatik.uni-trier.de/~ley/db/ |
| GLDA | IrCDPTMs | Gibbs Sampling | Conference Mining | Computer science research papers taken from http://www.informatik.uni-trier.de/~ley/db/ |
Tab.5