Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2019, Vol. 13 Issue (2) : 343-356    https://doi.org/10.1007/s11704-016-6252-5
RESEARCH ARTICLE
Optimizing partitioning strategies for faster inverted index compression
Xingshen SONG1(), Yuexiang YANG1, Yu JIANG1, Kun JIANG2
1. College of Computer, National University of Defense Technology, Changsha 410000, China
2. School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710000, China
 Download: PDF(876 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

The inverted index is a key component for search engines to manage billions of documents and quickly respond to users’ queries.Whereas substantial effort has been devoted to reducing space occupancy and decoding speed, the encoding speed when constructing the index has been overlooked. Partitioning the index aligning to its clustered distribution can effectively minimize the compressed size while accelerating its construction procedure. In this study, we introduce compression speed as one criterion to evaluate compression techniques, and thoroughly analyze the performance of different partitioning strategies. Optimizations are also proposed to enhance state-of-the-art methods with faster compression speed and more flexibility to partition an index. Experiments show that our methods offer a much better compression speed, while retaining an excellent space occupancy and decompression speed. networks.

Keywords inverted index      index compression      optimal partition      approximation algorithm     
Corresponding Author(s): Xingshen SONG   
Just Accepted Date: 30 September 2016   Online First Date: 24 January 2018    Issue Date: 08 April 2019
 Cite this article:   
Xingshen SONG,Yuexiang YANG,Yu JIANG, et al. Optimizing partitioning strategies for faster inverted index compression[J]. Front. Comput. Sci., 2019, 13(2): 343-356.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-6252-5
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I2/343
1 C DManning, P Raghavan, HSchütze. Introduction to Information Retrieval, Vol. 1. Cambridge: Cambridge University Press, 2008
https://doi.org/10.1017/CBO9780511809071
2 I HWitten, AMoffat, T CBell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA: Morgan Kaufmann, 1999
3 JZobel, AMoffat. Inverted files for text search engines. ACM Computing Surveys, 2006, 38(2): 6
https://doi.org/10.1145/1132956.1132959
4 MCatena, C Macdonald, IOunis. On inverted index compression for search engine efficiency. In: Proceedings of European Conference on Information Retrieval. 2014, 359–371
https://doi.org/10.1007/978-3-319-06028-6_30
5 DLemire, L Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 2015, 45(1): 1–29
https://doi.org/10.1002/spe.2203
6 GOttaviano, N Tonellotto, RVenturini. Optimal space-time tradeoffs for inverted indexes. In: Proceedings of the 8th ACM International Conference on Web Search and Data Mining. 2015, 47–56
https://doi.org/10.1145/2684822.2685297
7 FSilvestri, R Venturini. Vsencoding: efficient coding and fast decoding of integer lists via dynamic programming. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 2010, 1219–1228
https://doi.org/10.1145/1871437.1871592
8 HYan, SDing, TSuel. Inverted index compression and query processing with optimized document ordering. In: Proceedings of the 18th International Conference on World Wide Web. 2009, 401–410
https://doi.org/10.1145/1526709.1526764
9 GOttaviano, RGrossi. Semi-indexing semi-structured data in tiny space. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 2011, 1485–1494
https://doi.org/10.1145/2063576.2063790
10 V NAnh, AMoffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 2005, 8(1): 151–166
https://doi.org/10.1023/B:INRT.0000048490.99518.5c
11 V NAnh, AMoffat. Index compression using 64-bit words. Software: Practice and Experience, 2010, 40(2): 131–147
https://doi.org/10.1002/spe.948
12 V NAnh, AMoffat. Index compression using fixed binary codewords. In: Proceedings of the 15th Australasian Database Conference. 2004, 61–67
13 RDelbru, S Campinas, GTummarello. Searching Web data: an entity retrieval and high-performance indexing model. Journal of Web Semantics, 2012, 10: 33–58
https://doi.org/10.1016/j.websem.2011.04.004
14 GOttaviano, R Venturini . Partitioned elias-fano indexes. In: Proceedings of the 37th international ACM SIGIR Conference on Research & Development in Information Retrieval. 2014, 273–282
https://doi.org/10.1145/2600428.2609615
15 PFerragina, INitto, RVenturini. On optimally partitioning a text to improve its compression. Algorithmica, 2011, 61(1): 51–74
https://doi.org/10.1007/s00453-010-9437-6
16 ATrotman. Compression, SIMD, and postings lists. In: Proceedings of the Australasian Document Computing Symposium. 2014
https://doi.org/10.1145/2682862.2682870
17 SDing, TSuel. Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th international ACM SIGIR Conference on Research and Development in Information Retrieval. 2011, 993–1002
https://doi.org/10.1145/2009916.2010048
18 GNavarro, S J Puglisi. Dual-sorted inverted lists. In: Proceedings of International Symposium on String Processing and Information Retrieval. 2010, 309–321
https://doi.org/10.1007/978-3-642-16321-0_33
19 CDimopoulos, S Nepomnyachiy, TSuel. Optimizing top-k document retrieval strategies for block-max indexes. In: Proceedings of the 6th ACMInternational Conference onWeb Search and DataMining. 2013, 113–122
https://doi.org/10.1145/2433396.2433412
20 A AStepanov, A R Gangolli, D ERose, R JErnst, P SOberoi. SIMDbased decoding of posting lists. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 2011, 317–326
21 W XZhao, XZhang, DLemire, D Shan, J YNie, H FYan, J RWen. A general SIMD-based approach to accelerating compression algorithms. ACM Transactions on Information Systems, 2015, 33(3): 15
https://doi.org/10.1145/2735629
22 JGoldstein, R Ramakrishnan, UShaft. Compressing relations and indexes. In: Proceedings of the 14th International Conference on Data Engineering. 1998, 370–379
https://doi.org/10.1109/ICDE.1998.655800
23 PBoldi , SVigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In: Proceedings of International Symposium on String Processing and Information Retrieval. 2005, 25–28
https://doi.org/10.1007/11575832_3
24 SJonassen, S E Bratsberg. Efficient compressed inverted index skipping for disjunctive text-queries. In: Proceedings of European Conference on Information Retrieval. 2011, 530–542
https://doi.org/10.1007/978-3-642-20161-5_53
25 G MSacco. Fast block-compressed inverted lists. In: Proceedings of International Conference on Database and Expert Systems Applications. 2012, 412–421
https://doi.org/10.1007/978-3-642-32600-4_30
26 J SCulpepper, AMoffat. Efficient set intersection for inverted indexing. ACM Transactions on Information Systems, 2010, 29(1): 1
https://doi.org/10.1145/1877766.1877767
27 N YAo, FZhang, DWu, D SStones, GWang, X G Liu, JLiu, SLin. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proceedings of the VLDB Endowment. 2011, 8(4): 470–481
https://doi.org/10.14778/2002974.2002975
28 DLemire, L Boytsov, NKurz. SIMD compression and the intersection of sorted integers. Software: Practice and Experience, 2016, 46(6): 723–749
29 T HCormen, C E Leiserson, R LRivest, CStein. Introduction to Algorithms, Vol 3. Cambridge, MA: The MIT Press, 2009
30 SGog, R Venturini. Succinct data structures in information retrieval: theory and practice. In: Proceedings of the 39th International ACM SIGIR Conference on Research & Development in Information Retrieval. 2016, 1231–1233
https://doi.org/10.1145/2911451.2914802
[1] Yudong QIN, Deke GUO, Zhiyao HU, Bangbang REN. Uncertain multicast under dynamic behaviors[J]. Front. Comput. Sci., 2020, 14(1): 130-145.
[2] Kun SU, Gongping YANG, Lu YANG, Peng SU, Yilong YIN. Non-negative locality-constrained vocabulary tree for finger vein image retrieval[J]. Front. Comput. Sci., 2019, 13(2): 318-332.
[3] Le DONG, Wenpu DONG, Ning FENG, Mengdie MAO, Long CHEN, Gaipeng KONG. Color space quantization-based clustering for image retrieval[J]. Front. Comput. Sci., 2017, 11(6): 1023-1035.
[4] Peng ZHANG. Unbalanced graph cuts with minimum capacity[J]. Front. Comput. Sci., 2014, 8(4): 676-683.
[5] Deying LI, Qinghua ZHU, Jiannong CAO, . Approximation algorithm for constructing data aggregation trees for wireless sensor networks[J]. Front. Comput. Sci., 2009, 3(4): 524-534.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed