Frontiers of Electrical and Electronic Engineering

ISSN 2095-2732 | ISSN 2095-2740 (Online) | CN 10-1028/TM
Front Elect Electr Eng, 2012, Vol. 7, Issue 2: 242-253. https://doi.org/10.1007/s11460-011-0174-7
RESEARCH ARTICLE
Can prior knowledge help graph-based methods for keyword extraction?
Zhiyuan LIU, Maosong SUN
Department of Computer Science and Technology, State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China
Abstract

Graph-based methods are among the most widely used unsupervised approaches to keyword extraction. In these methods, words are linked according to their co-occurrences within a document, graph-based ranking algorithms are then applied to score the words, and the words with the highest scores are selected as keywords. Although graph-based methods are effective for keyword extraction, they rank words based solely on the topology of the word graph. In practice, however, various kinds of prior knowledge are available for estimating how likely a word is to be a keyword; such knowledge may be frequency-based, position-based, or semantics-based. In this paper, we propose to incorporate prior knowledge into graph-based methods for keyword extraction and investigate its contribution. Experiments reveal that prior knowledge can significantly improve the performance of graph-based keyword extraction. Moreover, by combining prior knowledge with neighborhood knowledge, we achieve the best results among the graph-based methods compared in our experiments.
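The pipeline described in the abstract can be made concrete with a short example. Below is a minimal, illustrative Python sketch of graph-based keyword extraction in which the prior knowledge enters a Biased-PageRank as the reset (teleportation) distribution; the window size, the frequency-based prior, and all function names are assumptions made for illustration, not the exact formulation or settings used in the paper.

# A minimal sketch under the assumptions stated above, not the authors' implementation.
from collections import Counter, defaultdict

def build_cooccurrence_graph(words, window=2):
    """Link words that co-occur within a sliding window over the document."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    return graph

def biased_pagerank(graph, prior, damping=0.85, iters=100):
    """PageRank whose reset (teleportation) distribution is the normalized prior."""
    nodes = list(graph)
    total = sum(prior.get(n, 0.0) for n in nodes)
    if total > 0:
        reset = {n: prior.get(n, 0.0) / total for n in nodes}
    else:
        reset = {n: 1.0 / len(nodes) for n in nodes}  # no prior: plain PageRank
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        score = {n: (1 - damping) * reset[n]
                    + damping * sum(score[m] / len(graph[m]) for m in graph[n])
                 for n in nodes}
    return score

def extract_keywords(words, prior, top_k=10):
    """Rank words on the co-occurrence graph and return the top-scoring ones."""
    graph = build_cooccurrence_graph(words)
    scores = biased_pagerank(graph, prior)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example with a frequency-based prior (raw term frequency), one of the
# kinds of prior knowledge mentioned in the abstract.
tokens = "graph based ranking methods score words using the word graph topology".split()
print(extract_keywords(tokens, prior=Counter(tokens), top_k=3))

Setting the prior to a uniform distribution recovers the plain topology-only ranking, which is one simple way to see how much the prior knowledge itself contributes.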

Keywords: keyword extraction; prior knowledge; PageRank; DiffusionRank
Corresponding Author(s): LIU Zhiyuan, Email: lzy.thu@gmail.com
Issue Date: 05 June 2012
 Cite this article:   
Zhiyuan LIU, Maosong SUN. Can prior knowledge help graph-based methods for keyword extraction?[J]. Front Elect Electr Eng, 2012, 7(2): 242-253.
 URL:  
https://academic.hep.com.cn/fee/EN/10.1007/s11460-011-0174-7
https://academic.hep.com.cn/fee/EN/Y2012/V7/I2/242
Fig.1  Performance of Biased-PageRank on precision, recall, and F-measure with different values of the damping factor (trust ratio = 1.0) and different values of the trust ratio (damping factor = 0.85). The number of extracted keywords is set to 10. (a) Precision versus damping factor (trust ratio = 1.0); (b) recall versus damping factor (trust ratio = 1.0); (c) F-measure versus damping factor (trust ratio = 1.0); (d) precision versus trust ratio (damping factor = 0.85); (e) recall versus trust ratio (damping factor = 0.85); (f) F-measure versus trust ratio (damping factor = 0.85)
Fig.2  Performance of DiffusionRank on precision, recall, and F-measure versus the trust ratio. The number of extracted keywords is 10 and the diffusion factor is set to 0.5 and 2.0. (a) Precision (diffusion factor = 0.5); (b) recall (diffusion factor = 0.5); (c) F-measure (diffusion factor = 0.5); (d) precision (diffusion factor = 2.0); (e) recall (diffusion factor = 2.0); (f) F-measure (diffusion factor = 2.0)
Fig.3  Performance of DiffusionRank on precision, recall, and F-measure with different numbers of extracted keywords. The trust ratio is 0.1 and the diffusion factor is 0.5
method          | correct | precision | recall | F-measure
PageRank        | 744     | 0.243     | 0.300  | 0.269
Biased-PageRank | 791     | 0.259     | 0.319  | 0.286
DiffusionRank   | 793     | 0.259     | 0.320  | 0.287
Tab.1  Results of PageRank and DiffusionRank on the total number of correctly extracted keywords, precision, recall, and F-measure when the number of extracted keywords is 10
method          | correct | precision | recall | F-measure
PageRank        | 818     | 0.268     | 0.330  | 0.296
Biased-PageRank | 881     | 0.288     | 0.355  | 0.318
DiffusionRank   | 885     | 0.290     | 0.357  | 0.320
Tab.2  Results of PageRank and DiffusionRank with neighborhood knowledge on the total number of correctly extracted keywords, precision, recall, and F-measure when the number of extracted keywords is 10
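The precision, recall, and F-measure columns in Tab.1 and Tab.2 follow the standard definitions used in keyword extraction evaluation: precision is the fraction of extracted keywords that are correct, recall is the fraction of gold-standard keywords that are extracted, and F-measure is their harmonic mean. The small Python helper below only illustrates these formulas; the counts in the usage example are hypothetical and are not the totals behind the tables.

def precision_recall_f(num_correct, num_extracted, num_gold):
    """precision = correct / extracted; recall = correct / gold;
    F-measure = harmonic mean of precision and recall."""
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Hypothetical counts (illustration only): 7 correct out of 10 extracted,
# against 12 gold-standard keywords for the document.
print(precision_recall_f(7, 10, 12))  # -> (0.7, 0.5833..., 0.6363...)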
Fig.4  Performance of PageRank and DiffusionRank using neighborhood knowledge on precision, recall, and F-measure versus different numbers of extracted keywords. The number of nearest neighbors is set to 5; for DiffusionRank, the trust ratio is 0.1 and the diffusion factor is 0.5. (a) Precision; (b) recall; (c) F-measure