Front Comput Sci Chin 2010, Vol. 4 Issue (2): 280-301
Knowledge discovery through directed probabilistic topic models: a survey
Ali DAUD1, Juanzi LI1, Lizhu ZHOU1, Faqir MUHAMMAD2
1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. Department of Mathematics and Statistics, Allama Iqbal Open University, Sector H-8, Islamabad 44000, Pakistan
Graphical models have become the basic framework for topic-based probabilistic modeling. Models with latent variables, in particular, have proved effective in capturing hidden structures in data. In this paper, we survey an important subclass, Directed Probabilistic Topic Models (DPTMs), which have soft clustering abilities, and their applications to knowledge discovery in text corpora. From an unsupervised learning perspective, “topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling”. In topic modeling, a document consists of different hidden topics, and the topic probabilities provide an explicit representation of the document that smooths the data at the semantic level. Topic modeling has been an active area of research during the last decade. Many models have been proposed to handle text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We present the basic concepts, advantages and disadvantages of these models in chronological order, a classification of existing models into categories, their parameter estimation and inference algorithms, and the measures used to evaluate model performance. We also discuss their applications, open challenges and future directions in this dynamic area of research.

Keywords: text corpora; parametric Directed Probabilistic Topic Models (DPTMs); soft clustering; unsupervised learning; knowledge discovery
Corresponding Author(s): DAUD Ali; LI Juanzi; ZHOU Lizhu; MUHAMMAD Faqir
Issue Date: 05 June 2010
Fig.1  Graphical models
Tab.1  Topic examples
[…] | Number of documents
N | Number of words
T | Number of topics
A | Number of unique authors
V | Number of unique words
N_d | Number of word tokens in document d
[…]_d | Vector form of document d
[…]_d | Vector form of authors in document d
w_di | The ith word token in document d
z_di | Topic assigned to word token w_di
x_di | The author associated with w_di
y_di | The timestamp associated with token w_di
θ_d | Multinomial distribution over topics with parameter α
Φ_z | Multinomial distribution of words specific to z with parameter β
Ψ_z | Time-specific Beta distribution of topic z
α | Dirichlet distribution associated with topic z
β | Dirichlet distribution associated with word w_di
? | Binomial distribution associated with transition Ω_i
rn | Root node (or root topic)
R | Response variable used as observed value in supervised topic models
L | Link between documents
d | Source document
d' | Target document
τ | Link value between documents
γ | Dirichlet distribution associated with link τ
λ | Multinomial distribution for link generation between documents
[…] | Class of word, e.g., Noun Phrase (NP), Not Noun Phrase (NNP)
Tab.2  Notations
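As an illustration of how the symbols in Tab.2 fit together, the following is a minimal sketch of the standard generative process of smoothed LDA: each topic z draws a word distribution Φ_z from a Dirichlet with parameter β, each document d draws a topic distribution θ_d from a Dirichlet with parameter α, and each token first draws a topic z_di from θ_d and then a word w_di from the Φ of that topic. The function names, the `num_docs` and `seed` parameters, the use of a single N_d for every document, and the toy values are assumptions made for this sketch; they do not come from the survey.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw one point from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]


def generate_corpus(num_docs, T, V, N_d, alpha, beta, seed=0):
    """Sample a toy corpus from the smoothed LDA generative process.

    num_docs: number of documents (hypothetical name)
    T, V, N_d: topics, unique words, tokens per document (as in Tab.2)
    alpha, beta: symmetric Dirichlet parameters for theta_d and Phi_z
    """
    rng = random.Random(seed)
    # Phi_z: one distribution over the V unique words for each topic z
    phi = [sample_dirichlet(rng, beta, V) for _ in range(T)]
    # theta_d: one distribution over the T topics for each document d
    theta = [sample_dirichlet(rng, alpha, T) for _ in range(num_docs)]
    docs, assignments = [], []
    for d in range(num_docs):
        # z_di: topic assigned to each of the N_d tokens of document d
        z_d = rng.choices(range(T), weights=theta[d], k=N_d)
        # w_di: word drawn from the word distribution of the assigned topic
        w_d = [rng.choices(range(V), weights=phi[z], k=1)[0] for z in z_d]
        docs.append(w_d)
        assignments.append(z_d)
    return theta, phi, docs, assignments


# Toy run: 3 documents, 2 topics, a vocabulary of 6 word ids, 8 tokens each
theta, phi, docs, assignments = generate_corpus(
    num_docs=3, T=2, V=6, N_d=8, alpha=0.5, beta=0.5
)
```

Parameter estimation and inference run this process in reverse: given only the observed word tokens, they recover estimates of θ_d and Φ_z.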
Tab.3  Polysemy with topics
Fig.2  Graph plate notations symbols
Year/Type | Basic PDPTMs | Inter-Document Correlated PDPTMs | Intra-Document Correlated PDPTMs | Temporal PDPTMs | Supervised PDPTMs
2001 |  | PLSA → A Joint Probabilistic Model |  |  |
2002 | A probabilistic Approach |  |  |  |
2003 | LDA, A Topic Model |  |  |  | LDA → Corr-LDA
2004 | Discrete PCA | LDA → Mixed Membership Models, LDA → Author-Topic Model, LDA Author-Topic Model → ART |  |  |
2006 |  | LDA → PAM, LDA → CTM, LDA → Statistical Entity-Topic Models | Bigram Topic Model, PLSA → CPLSA | LDA → TOT, LDA → DTM, (PAM, TOT) → Continuous Time Model |
2007 |  | LDA → GWN-LDA, LDA → Citation Influence Model | LDA → HTMM, LDA → TNG | MTTM | LDA → sLDA
2008 |  | LDA → LTHM, LDA Author-Topic Model → ACT Model, (A Joint Probabilistic Model, LDA) → Link-PLSA-LDA |  | LDA DTM → cDTM |
2009 |  | LDA → Generalized LDA, LDA Author-Topic Model ACT → Generalized ACT |  | LDA Author-Topic Model → TAT |
Tab.4  Historical paradigms of PDPTMs from 1999-2009
Fig.3  PLSA
Fig.4  Smoothed LDA
Fig.5  Author-Topic Model
Fig.6  LTHM Model
Fig.7  HMM-LDA Model
Fig.8  HTMM Model
Fig.9  Continuous-Time model
Fig.10  sLDA model
Models | Type | Parameter Estimation and Inference Making Algorithms | Problem Domain(s) | Dataset(s)
PLSA | BDPTMs | EM | Ranking (automatic document indexing) | LOB corpus, MED abstracts dataset, CRAN abstracts dataset, CACM abstracts dataset, CISI abstracts dataset
A Joint Probabilistic Model | IrCDPTMs | EM | Document Classification, Relationship between Topics and Links | WebKB web pages dataset, Cora abstracts dataset
A probabilistic Approach | BDPTMs | Gibbs Sampling | Topic Discovery (semantics of words) | TASA corpus “a collection of children reading”
LDA | BDPTMs | Variational EM | Topic Discovery, Document Classification, Collaborative Filtering | TREC AP newswire articles corpus, Reuters news articles dataset, C. elegans literature, EachMovie collaborative filtering dataset
A Topic Model | BDPTMs | Gibbs Sampling | Topic Discovery (semantics of words) | TASA corpus “a collection of children reading”
Corr-LDA | SuDPTMs | Variational EM | Automating Annotation, Text-based Image Retrieval | Corel images and caption dataset
Discrete PCA |  | Gibbs Sampling | Text Classification, Information Retrieval | 20 Newsgroups dataset, Reuters news articles dataset
Mixed-Membership Models | IrCDPTMs | EM | Topic Discovery, Document Classification | PNAS scientific articles dataset
Author-Topic Model | IrCDPTMs | Gibbs Sampling | Entities and Topics Correlations, Topics Evolution over Time | CiteSeer dataset
ART Model | IrCDPTMs | Gibbs Sampling | Topic and Role Discovery | Enron email dataset, Researchers email archive
A Composite Model (HMM-LDA) | IaCDPTMs | Gibbs Sampling | Document Classification, Part-of-Speech Tagging | Brown and TASA corpus “a collection of children reading” datasets, NIPS00-12 Proceedings dataset
LLDA Model | SuDPTMs | Variational EM | Topic Discovery | Microarray dataset
CTM | IrCDPTMs | Variational EM | Topics Correlations | JSTOR science articles dataset
DTM | TDPTMs | Variational Kalman Filtering | Topics Evolution over Time | JSTOR science articles dataset
Statistical Entity-Topic Models | IrCDPTMs | Gibbs Sampling | Entities and Topics Correlations | New York Times dataset, Foreign Broadcast Information Service (FBIS) dataset
Bigram Topic Model | IaCDPTMs | Gibbs EM | Topic Discovery | Psychological Review abstracts dataset, 20 Newsgroups dataset
PAM | IrCDPTMs | Gibbs Sampling | Super and Sub Topic Discovery, Document Classification | NIPS00-12 Proceedings dataset, 20 Newsgroups dataset, Rexa research paper search engine
TOT Model | TDPTMs | Gibbs Sampling | Topics Evolution over Time | State of the Union Addresses dataset, Researchers email archive, NIPS00-12 Proceedings dataset
Continuous-Time Model | TDPTMs | Gibbs Sampling | Topics Evolution over Time and their Correlations | Rexa research paper search engine
CPLSA | IrCDPTMs | EM | Temporal (Entities-Topic) Correlations, Topics Evolution over Time, Event Impact Analysis | Abstracts of 282 papers of two Data Mining researchers from the ACM Digital Library, MSN Space documents, Abstracts of 28 years’ SIGIR conferences from the ACM Digital Library
HTMM | IaCDPTMs | EM and Forward-backward algorithm | Topic Discovery | NIPS00-12 Proceedings dataset, used dataset
MTTM | TDPTMs | Variational EM | Topics Evolution over Time | JSTOR science articles dataset
sLDA Model | SuDPTMs | Variational EM | Ranking Movies and Web Pages | Newspaper movie reviews dataset, Digg Links
Citation Influence Model | IrCDPTMs | Gibbs Sampling | Citation Influence | CiteSeer dataset
GWN-LDA Model | IrCDPTMs | Gibbs Sampling | Entities and Topics Correlations | NanoSci articles dataset (2000-2006) taken from […], CiteSeer dataset
TNG Model | IaCDPTMs | Gibbs Sampling | Topic Discovery, Information Retrieval | TREC dataset, NIPS00-12 Proceedings dataset
Link-PLSA-LDA | IrCDPTMs | Variational EM | Blogs Influence | Nielsen BuzzMetrics blog postings dataset
cDTM | TDPTMs | Variational Kalman Filtering | Topics Evolution over Continuous Time | TREC-1 AP newswire articles corpus, “Election 08” dataset
LTHM | IrCDPTMs | EM | Relationship between Topics and Links | WebKB web pages dataset, Wikipedia
TAT | TDPTMs | Gibbs Sampling | Temporal Authors Interests and Correlations | Computer science research papers taken from
ACT | IrCDPTMs | Gibbs Sampling | Expertise Search in Academic Social Network | Computer science research papers taken from
STMS | IrCDPTMs | Gibbs Sampling | Expert Finding | Computer science research papers taken from
GLDA | IrCDPTMs | Gibbs Sampling | Conference Mining | Computer science research papers taken from
Tab.5  Summary of PDPTMs applications
