|
|
Optimizing partitioning strategies for faster inverted index compression |
Xingshen SONG1( ), Yuexiang YANG1, Yu JIANG1, Kun JIANG2 |
1. College of Computer, National University of Defense Technology, Changsha 410000, China 2. School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710000, China |
|
|
Abstract The inverted index is a key component for search engines to manage billions of documents and quickly respond to users’ queries.Whereas substantial effort has been devoted to reducing space occupancy and decoding speed, the encoding speed when constructing the index has been overlooked. Partitioning the index aligning to its clustered distribution can effectively minimize the compressed size while accelerating its construction procedure. In this study, we introduce compression speed as one criterion to evaluate compression techniques, and thoroughly analyze the performance of different partitioning strategies. Optimizations are also proposed to enhance state-of-the-art methods with faster compression speed and more flexibility to partition an index. Experiments show that our methods offer a much better compression speed, while retaining an excellent space occupancy and decompression speed. networks.
|
Keywords
inverted index
index compression
optimal partition
approximation algorithm
|
Corresponding Author(s):
Xingshen SONG
|
Just Accepted Date: 30 September 2016
Online First Date: 24 January 2018
Issue Date: 08 April 2019
|
|
1 |
C DManning, P Raghavan, HSchütze. Introduction to Information Retrieval, Vol. 1. Cambridge: Cambridge University Press, 2008
https://doi.org/10.1017/CBO9780511809071
|
2 |
I HWitten, AMoffat, T CBell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA: Morgan Kaufmann, 1999
|
3 |
JZobel, AMoffat. Inverted files for text search engines. ACM Computing Surveys, 2006, 38(2): 6
https://doi.org/10.1145/1132956.1132959
|
4 |
MCatena, C Macdonald, IOunis. On inverted index compression for search engine efficiency. In: Proceedings of European Conference on Information Retrieval. 2014, 359–371
https://doi.org/10.1007/978-3-319-06028-6_30
|
5 |
DLemire, L Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 2015, 45(1): 1–29
https://doi.org/10.1002/spe.2203
|
6 |
GOttaviano, N Tonellotto, RVenturini. Optimal space-time tradeoffs for inverted indexes. In: Proceedings of the 8th ACM International Conference on Web Search and Data Mining. 2015, 47–56
https://doi.org/10.1145/2684822.2685297
|
7 |
FSilvestri, R Venturini. Vsencoding: efficient coding and fast decoding of integer lists via dynamic programming. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 2010, 1219–1228
https://doi.org/10.1145/1871437.1871592
|
8 |
HYan, SDing, TSuel. Inverted index compression and query processing with optimized document ordering. In: Proceedings of the 18th International Conference on World Wide Web. 2009, 401–410
https://doi.org/10.1145/1526709.1526764
|
9 |
GOttaviano, RGrossi. Semi-indexing semi-structured data in tiny space. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 2011, 1485–1494
https://doi.org/10.1145/2063576.2063790
|
10 |
V NAnh, AMoffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 2005, 8(1): 151–166
https://doi.org/10.1023/B:INRT.0000048490.99518.5c
|
11 |
V NAnh, AMoffat. Index compression using 64-bit words. Software: Practice and Experience, 2010, 40(2): 131–147
https://doi.org/10.1002/spe.948
|
12 |
V NAnh, AMoffat. Index compression using fixed binary codewords. In: Proceedings of the 15th Australasian Database Conference. 2004, 61–67
|
13 |
RDelbru, S Campinas, GTummarello. Searching Web data: an entity retrieval and high-performance indexing model. Journal of Web Semantics, 2012, 10: 33–58
https://doi.org/10.1016/j.websem.2011.04.004
|
14 |
GOttaviano, R Venturini . Partitioned elias-fano indexes. In: Proceedings of the 37th international ACM SIGIR Conference on Research & Development in Information Retrieval. 2014, 273–282
https://doi.org/10.1145/2600428.2609615
|
15 |
PFerragina, INitto, RVenturini. On optimally partitioning a text to improve its compression. Algorithmica, 2011, 61(1): 51–74
https://doi.org/10.1007/s00453-010-9437-6
|
16 |
ATrotman. Compression, SIMD, and postings lists. In: Proceedings of the Australasian Document Computing Symposium. 2014
https://doi.org/10.1145/2682862.2682870
|
17 |
SDing, TSuel. Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th international ACM SIGIR Conference on Research and Development in Information Retrieval. 2011, 993–1002
https://doi.org/10.1145/2009916.2010048
|
18 |
GNavarro, S J Puglisi. Dual-sorted inverted lists. In: Proceedings of International Symposium on String Processing and Information Retrieval. 2010, 309–321
https://doi.org/10.1007/978-3-642-16321-0_33
|
19 |
CDimopoulos, S Nepomnyachiy, TSuel. Optimizing top-k document retrieval strategies for block-max indexes. In: Proceedings of the 6th ACMInternational Conference onWeb Search and DataMining. 2013, 113–122
https://doi.org/10.1145/2433396.2433412
|
20 |
A AStepanov, A R Gangolli, D ERose, R JErnst, P SOberoi. SIMDbased decoding of posting lists. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 2011, 317–326
|
21 |
W XZhao, XZhang, DLemire, D Shan, J YNie, H FYan, J RWen. A general SIMD-based approach to accelerating compression algorithms. ACM Transactions on Information Systems, 2015, 33(3): 15
https://doi.org/10.1145/2735629
|
22 |
JGoldstein, R Ramakrishnan, UShaft. Compressing relations and indexes. In: Proceedings of the 14th International Conference on Data Engineering. 1998, 370–379
https://doi.org/10.1109/ICDE.1998.655800
|
23 |
PBoldi , SVigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In: Proceedings of International Symposium on String Processing and Information Retrieval. 2005, 25–28
https://doi.org/10.1007/11575832_3
|
24 |
SJonassen, S E Bratsberg. Efficient compressed inverted index skipping for disjunctive text-queries. In: Proceedings of European Conference on Information Retrieval. 2011, 530–542
https://doi.org/10.1007/978-3-642-20161-5_53
|
25 |
G MSacco. Fast block-compressed inverted lists. In: Proceedings of International Conference on Database and Expert Systems Applications. 2012, 412–421
https://doi.org/10.1007/978-3-642-32600-4_30
|
26 |
J SCulpepper, AMoffat. Efficient set intersection for inverted indexing. ACM Transactions on Information Systems, 2010, 29(1): 1
https://doi.org/10.1145/1877766.1877767
|
27 |
N YAo, FZhang, DWu, D SStones, GWang, X G Liu, JLiu, SLin. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proceedings of the VLDB Endowment. 2011, 8(4): 470–481
https://doi.org/10.14778/2002974.2002975
|
28 |
DLemire, L Boytsov, NKurz. SIMD compression and the intersection of sorted integers. Software: Practice and Experience, 2016, 46(6): 723–749
|
29 |
T HCormen, C E Leiserson, R LRivest, CStein. Introduction to Algorithms, Vol 3. Cambridge, MA: The MIT Press, 2009
|
30 |
SGog, R Venturini. Succinct data structures in information retrieval: theory and practice. In: Proceedings of the 39th International ACM SIGIR Conference on Research & Development in Information Retrieval. 2016, 1231–1233
https://doi.org/10.1145/2911451.2914802
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|