Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2020, Vol. 14 Issue (4) : 144607    https://doi.org/10.1007/s11704-019-8324-9
RESEARCH ARTICLE
Diversification on big data in query processing
Meifan ZHANG, Hongzhi WANG, Jianzhong LI, Hong GAO
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Abstract

Recently, popular big data applications such as web search engines and recommendation systems have faced the problem of diversifying results during query processing. It is therefore important to develop methods that increase the diversity of the result set when dealing with big data. In this paper, we first define the diversity of a set and the ability of an element to improve the overall diversity. Based on these definitions, we propose a diversification framework that performs well in terms of both effectiveness and efficiency and has a theoretical guarantee on its probability of success. Second, we design implementation algorithms based on this framework for both numerical and string data. Third, we carry out extensive experiments on real numerical and string data to verify the performance of the proposed framework, and perform scalability experiments on synthetic data.
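
The abstract does not state the formal definitions, so the following Python sketch is only an illustration of the general idea and not the authors' framework. It assumes, hypothetically, that the diversity of a set is its smallest pairwise distance, that an element's ability to improve diversity is its distance to the nearest already selected element, and that results are chosen greedily. The names `set_diversity`, `marginal_gain`, `greedy_diversify`, and `edit_distance` are invented for this example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def set_diversity(items: Sequence[T], dist: Distance) -> float:
    """Assumed definition: the smallest pairwise distance within the set."""
    if len(items) < 2:
        return float("inf")
    return min(
        dist(a, b) for i, a in enumerate(items) for b in items[i + 1:]
    )


def marginal_gain(candidate: T, selected: Sequence[T], dist: Distance) -> float:
    """Assumed definition: distance from the candidate to its nearest selected element."""
    if not selected:
        return float("inf")
    return min(dist(candidate, s) for s in selected)


def greedy_diversify(data: Sequence[T], k: int, dist: Distance) -> List[T]:
    """Greedily pick up to k elements, each time taking the one with the largest gain."""
    selected: List[T] = []
    remaining = list(data)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda x: marginal_gain(x, selected, dist))
        selected.append(best)
        remaining.remove(best)
    return selected


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as one possible distance for string data."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Numerical data with absolute difference as the distance.
numbers = greedy_diversify([1, 2, 3, 10, 20], 3, lambda a, b: abs(a - b))

# String data with edit distance as the distance.
words = greedy_diversify(["apple", "apply", "maple", "zebra"], 2, edit_distance)
```

The same selection routine serves both data types because only the distance function changes, which mirrors the abstract's split between numerical and string data. This greedy pass scans all remaining candidates per pick, so it is a toy for small inputs rather than a method suited to big data.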

Keywords: diversification, query processing, big data
Corresponding Author(s): Hongzhi WANG   
Just Accepted Date: 20 March 2019   Issue Date: 11 March 2020
 Cite this article:   
Meifan ZHANG,Hongzhi WANG,Jianzhong LI, et al. Diversification on big data in query processing[J]. Front. Comput. Sci., 2020, 14(4): 144607.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-019-8324-9
https://academic.hep.com.cn/fcs/EN/Y2020/V14/I4/144607