Frontiers of Electrical and Electronic Engineering

ISSN 2095-2732 | ISSN 2095-2740 (Online) | CN 10-1028/TM
Front Elect Electr Eng, 2012, Vol. 7, Issue 2: 242-253. https://doi.org/10.1007/s11460-011-0174-7
RESEARCH ARTICLE
Can prior knowledge help graph-based methods for keyword extraction?
Zhiyuan LIU, Maosong SUN
Department of Computer Science and Technology, State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China
Abstract

Graph-based methods are among the most widely used unsupervised approaches to keyword extraction. In these methods, words are linked according to their co-occurrences within a document, graph-based ranking algorithms are then applied to score the words, and the words with the highest scores are selected as keywords. Although graph-based methods are effective for keyword extraction, they rank words based solely on the topology of the word graph. In practice, however, various kinds of prior knowledge are available for estimating how likely a word is to be a keyword; such knowledge may be frequency-based, position-based, or semantics-based. In this paper, we propose to incorporate prior knowledge into graph-based methods for keyword extraction and investigate its contribution. Experiments reveal that prior knowledge can significantly improve the performance of graph-based keyword extraction. Moreover, by combining prior knowledge with neighborhood knowledge, we achieve the best results among the graph-based methods compared in our experiments.
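The pipeline described in the abstract can be made concrete with a short example. Below is a minimal, illustrative Python sketch of graph-based keyword extraction in which the prior knowledge enters a Biased-PageRank as the reset (teleportation) distribution; the window size, the frequency-based prior, and all function names are assumptions made for illustration, not the exact formulation or settings used in the paper.

# A minimal sketch under the assumptions stated above, not the authors' implementation.
from collections import Counter, defaultdict

def build_cooccurrence_graph(words, window=2):
    """Link words that co-occur within a sliding window over the document."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    return graph

def biased_pagerank(graph, prior, damping=0.85, iters=100):
    """PageRank whose reset (teleportation) distribution is the normalized prior."""
    nodes = list(graph)
    total = sum(prior.get(n, 0.0) for n in nodes)
    if total > 0:
        reset = {n: prior.get(n, 0.0) / total for n in nodes}
    else:
        reset = {n: 1.0 / len(nodes) for n in nodes}  # no prior: plain PageRank
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        score = {n: (1 - damping) * reset[n]
                    + damping * sum(score[m] / len(graph[m]) for m in graph[n])
                 for n in nodes}
    return score

def extract_keywords(words, prior, top_k=10):
    """Rank words on the co-occurrence graph and return the top-scoring ones."""
    graph = build_cooccurrence_graph(words)
    scores = biased_pagerank(graph, prior)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example with a frequency-based prior (raw term frequency), one of the
# kinds of prior knowledge mentioned in the abstract.
tokens = "graph based ranking methods score words using the word graph topology".split()
print(extract_keywords(tokens, prior=Counter(tokens), top_k=3))

Setting the prior to a uniform distribution recovers the plain topology-only ranking, which is one simple way to see how much the prior knowledge itself contributes.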

Keywords: keyword extraction; prior knowledge; PageRank; DiffusionRank
Corresponding Author(s): LIU Zhiyuan, Email: lzy.thu@gmail.com
Issue Date: 05 June 2012
 Cite this article:   
Zhiyuan LIU, Maosong SUN. Can prior knowledge help graph-based methods for keyword extraction?[J]. Front Elect Electr Eng, 2012, 7(2): 242-253.
 URL:  
https://academic.hep.com.cn/fee/EN/10.1007/s11460-011-0174-7
https://academic.hep.com.cn/fee/EN/Y2012/V7/I2/242
Fig.1  Performance of Biased-PageRank on precision, recall, and F-measure with different values of the damping factor (trust ratio = 1.0) and different values of the trust ratio (damping factor = 0.85). The number of extracted keywords is set to 10. (a) Precision versus damping factor (trust ratio = 1.0); (b) recall versus damping factor (trust ratio = 1.0); (c) F-measure versus damping factor (trust ratio = 1.0); (d) precision versus trust ratio (damping factor = 0.85); (e) recall versus trust ratio (damping factor = 0.85); (f) F-measure versus trust ratio (damping factor = 0.85)
Fig.2  Performance of DiffusionRank on precision, recall, and F-measure versus the trust ratio. The number of extracted keywords is 10 and the diffusion factor is set to 0.5 and 2.0. (a) Precision (diffusion factor = 0.5); (b) recall (diffusion factor = 0.5); (c) F-measure (diffusion factor = 0.5); (d) precision (diffusion factor = 2.0); (e) recall (diffusion factor = 2.0); (f) F-measure (diffusion factor = 2.0)
Fig.3  Performance of DiffusionRank on precision, recall, and F-measure with different numbers of extracted keywords. The trust ratio is 0.1 and the diffusion factor is 0.5
method          | correct | precision | recall | F-measure
PageRank        | 744     | 0.243     | 0.300  | 0.269
Biased-PageRank | 791     | 0.259     | 0.319  | 0.286
DiffusionRank   | 793     | 0.259     | 0.320  | 0.287
Tab.1  Results of PageRank and DiffusionRank on the total number of correctly extracted keywords, precision, recall, and F-measure when the number of extracted keywords is 10
method          | correct | precision | recall | F-measure
PageRank        | 818     | 0.268     | 0.330  | 0.296
Biased-PageRank | 881     | 0.288     | 0.355  | 0.318
DiffusionRank   | 885     | 0.290     | 0.357  | 0.320
Tab.2  Results of PageRank and DiffusionRank with neighborhood knowledge on the total number of correctly extracted keywords, precision, recall, and F-measure when the number of extracted keywords is 10
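The precision, recall, and F-measure columns in Tab.1 and Tab.2 follow the standard definitions used in keyword extraction evaluation: precision is the fraction of extracted keywords that are correct, recall is the fraction of gold-standard keywords that are extracted, and F-measure is their harmonic mean. The small Python helper below only illustrates these formulas; the counts in the usage example are hypothetical and are not the totals behind the tables.

def precision_recall_f(num_correct, num_extracted, num_gold):
    """precision = correct / extracted; recall = correct / gold;
    F-measure = harmonic mean of precision and recall."""
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Hypothetical counts (illustration only): 7 correct out of 10 extracted,
# against 12 gold-standard keywords for the document.
print(precision_recall_f(7, 10, 12))  # -> (0.7, 0.5833..., 0.6363...)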
Fig.4  Performance of PageRank and DiffusionRank using neighborhood knowledge on precision, recall, and F-measure versus different numbers of extracted keywords. The number of nearest neighbors is set to 5; for DiffusionRank, the trust ratio is 0.1 and the diffusion factor is 0.5. (a) Precision; (b) recall; (c) F-measure