FOLLOWUS
College of Computer, National University of Defense Technology, Changsha 410073, China
E-mail:jk_365@126.com
Published:2016-01,
Received:06 June 2015,
Accepted:2015-10-14
Scan QR Code
KUN JIANG, YUE-XIANG YANG. Efficient dynamic pruning on largest scores first (LSF) retrieval. [J]. Frontiers of information technology & electronic engineering, 2016, 17(1): 1-14.
KUN JIANG, YUE-XIANG YANG. Efficient dynamic pruning on largest scores first (LSF) retrieval. [J]. Frontiers of information technology & electronic engineering, 2016, 17(1): 1-14. DOI: 10.1631/FITEE.1500190.
目的
2
不断增长的网页数量和查询请求量对搜索引擎的查询性能提出非常大的挑战。当前索引遍历算法存在的大量候选文档失效问题依然制约着搜索引擎查询性能的提升。本文通过研究倒排索引的遍历方式和动态剪枝算法来加快搜索引擎top-
k
查询处理的性能。
创新点
2
提出最大重要度优先(Largest Scores First,LSF)查询算法,使得具有较高重要度的查询词项所指向的倒排链表能够优先得到处理。提出两种精确的动态剪枝算法:基于LSF的去除倒排链表技术(List Omitting,LSF_LO)和基于LSF的文档部分打分技术(Partial Scoring,LSF_PS)。
方法
2
首先,通过对现有动态剪枝算法的对比分析得出词项重要度对于搜索引擎top-
k
查询性能的影响:优先处理重要度较高的查询词项能够快速提升结果集的阈值,从而避免对估计得分较低的文档的处理。其次,通过设计倒排链表实体的各种操作方法来实现对倒排链表按照最大重要度的排序和处理,给出算法的伪码并分析了算法的计算复杂度。最后,利用最大重要度优先查询算法在top-
k
查询中的优势,实时估计每个倒排项在每计算一个词项的贡献之后的最大可能分数,同时在一个倒排链表遍历结束后估计其剩余最大可能贡献分数,避免对于估计最大得分低于结果集阈值的文档的各种处理操作,从而达到对搜索引擎top-
k
查询性能的提升。
结论
2
提出了LSF查询和其上的两种动态剪枝算法LSF_LO和LSF_PS。实验结果表明本文所提LSF查询相比传统DAAT查询在性能上有了明显的提升。
Inverted index traversal techniques have been studied in addressing the query processing performance challenges of web search engines
but still leave much room for improvement. In this paper
we focus on the inverted index traversal on document-sorted indexes and the optimization technique called dynamic pruning
which can efficiently reduce the hardware computational resources required. We propose another novel exhaustive index traversal scheme called largest scores first (LSF) retrieval
in which the candidates are first selected in the posting list of important query terms with the largest upper bound scores and then fully scored with the contribution of the remaining query terms. The scheme can effectively reduce the memory consumption of existing term-at-a-time (TAAT) and the candidate selection cost of existing document-at-a-time (DAAT) retrieval at the expense of revisiting the posting lists of the remaining query terms. Preliminary analysis and implementation show comparable performance between LSF and the two well-known baselines. To further reduce the number of postings that need to be revisited
we present efficient rank safe dynamic pruning techniques based on LSF
including two important optimizations called list omitting (LSF_LO) and partial scoring (LSF_PS) that make full use of query term importance. Finally
experimental results with the TREC GOV2 collection show that our new index traversal approaches reduce the query latency by almost 27% over the WAND baseline and produce slightly better results compared with the MaxScore baseline
while returning the same results as exhaustive evaluation.
倒排索引索引遍历查询延迟最大重要度优先查询动态剪枝
Inverted indexIndex traversalQuery latencyLargest scores first (LSF) retrievalDynamic pruning
VN Anh,,,A Moffat..Simplified similarity scoring using term ranks..2005..Proc.28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval..226--233..DOI:10.1145/1076034.1076075http://doi.org/10.1145/1076034.1076075..
VN Anh,,,A Moffat..Pruned query evaluation using pre-computed impacts..2006..Proc. 29th Annual ACM SIGIR Conf. on Research and Development in Information Retrieval..372--379..DOI:10.1145/1148170.1148235http://doi.org/10.1145/1148170.1148235..
VN Anh,,,A Moffat,,,等..Index compression using 64-bit words..Softw. Pract. Exper.,,2010..40((2):):131--147..DOI:10.1002/spe.948http://doi.org/10.1002/spe.948..
C Badue,,,B Ribeiro-Neto,,,R Baeza-Yates,,,等..Distributed query processing using partitioned inverted files..2001..Proc. 8th Int. Symp. on String Processing and Information Retrieval..10--20..DOI:10.1109/SPIRE.2001.989733http://doi.org/10.1109/SPIRE.2001.989733..
AZ Broder,,,D Carmel,,,M Herscovici,,,等..Efficient query evaluation using a two-level retrieval process..2003..Proc. 12th Int. Conf. on Information and Knowledge Management..426--434..DOI:10.1145/956863.956944http://doi.org/10.1145/956863.956944..
C Buckley,,,AF Lewit..Optimization of inverted vector searches..1985..Proc. 8th Annual Int.ACM SIGIR Conf. on Research and Development in Information Retrieval..97--110..DOI:10.1145/253495.253515http://doi.org/10.1145/253495.253515..
S Büttcher,,,CLA Clarke..Index compression isgood, especially for random access..2007..Proc. 16th ACM Conf. on Information and Knowledge Management..761--770..DOI:10.1145/1321440.1321546http://doi.org/10.1145/1321440.1321546..
S Büttcher,,,CLA Clarke,,,GV Cormack..Information Retrieval: Implementing and Evaluating Search Engines,,2010..:USA:The MIT Press,,..
K Chakrabarti,,,S Chaudhuri,,,V Ganti..Intervalbased pruning for top-k processing over compressed lists..2011..Proc. 27th Int. Conf. on Data Engineering..709--720..DOI:10.1109/ICDE.2011.5767855http://doi.org/10.1109/ICDE.2011.5767855..
B Croft,,,D Metzler,,,T Strohman..Search Engines: Information Retrieval in Practice,,2010..:USA:Addison Wesley,,..
J Dean..Challenges in building large-scale informationretrieval systems: invited talk..2009..Proc. 2nd ACM Int. Conf. on Web Search and Data Mining..1DOI:10.1145/1498759.1498761http://doi.org/10.1145/1498759.1498761..
R Delbru,,,S Campinas,,,G Tummarello..Searching web data: an entity retrieval and high-performance indexing model..Web Semant. Sci. Serv. Agents World Wide Web,,2012..1033--58..DOI:10.1016/j.websem.2011.04.004http://doi.org/10.1016/j.websem.2011.04.004..
C Dimopoulos,,,S Nepomnyachiy,,,T Suel..Optimizing top-k document retrieval strategies for block-max indexes..2013..Proc. 6th ACM Int. Conf. on Web Search and Data Mining..113--122..DOI:10.1145/2433396.2433412http://doi.org/10.1145/2433396.2433412..
S Ding,,,T Suel..Faster top-kdocument retrievalusing block-max indexes..2011..Proc. 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval..993--1002..DOI:10.1145/2009916.2010048http://doi.org/10.1145/2009916.2010048..
M Fontoura,,,V Josifovski,,,JH Liu,,,等..Evaluation strategies for top-k queries over memory-resident inverted indexes..Proc. VLDB Endow,,2011..1213--1224....
K Jiang,,,YX Yang..Exhaustive hybrid postinglists traversing technique..2015..Proc. 5th Int. Conf. onIntelligence Science and Big Data Engineering..1--11..DOI:10.1007/978-3-319-23862-3_1http://doi.org/10.1007/978-3-319-23862-3_1..
K Jiang,,,XS Song,,,YX Yang..Performance evaluation of inverted index traversal techniques..2014..Proc. 17thInt. Conf. on Computational Science and Engineering..1715--1720..DOI:10.1109/CSE.2014.315http://doi.org/10.1109/CSE.2014.315..
S Jonassen,,,SE Bratsberg..Efficient compressed inverted index skipping for disjunctive text-queries..2011..Proc.33rd European Conf. on Advances in Information Retrieval..530--542..DOI:10.1007/978-3-642-20161-5_53http://doi.org/10.1007/978-3-642-20161-5_53..
P Lacour,,,C Macdonald,,,I Ounis..Efficiency comparison of document matching techniques..2008..Proc. European Conf. on Information Retrieval..37--46....
N Lester,,,A Moffat,,,W Webber,,,等..Spacelimited ranked query evaluation using adaptive pruning..2005..Proc. 6th Int. Conf. on Web Information Systems Engineering..470--477..DOI:10.1007/11581062_37http://doi.org/10.1007/11581062_37..
C Macdonald,,,I Ounis,,,N Tonellotto..Upperbound approximations for dynamic pruning..ACM Trans. Inform. Syst.,,2011..29((4):):17.1--17.28..DOI:10.1145/2037661.2037662http://doi.org/10.1145/2037661.2037662..
CD Manning,,,P Raghavan,,,H Schütze..Introduction to Information Retrieval,,2008..:USA:Cambridge University Press, Cambridge,,..
S Melink,,,S Raghavan,,,B Yang,,,等..Building adistributed full-text index for the Web..2001..Proc. 10th Int.Conf. on World Wide Web..396--406..DOI:10.1145/371920.372095http://doi.org/10.1145/371920.372095..
A Moffat,,,J Zobel..Self-indexing inverted files for fast text retrieval..ACM Trans. Inform. Syst.,,1996..14((4):):349--379..DOI:10.1145/237496.237497http://doi.org/10.1145/237496.237497..
I Ounis,,,G Amati,,,V Plachouras,,,等..Terrier:a high performance and scalable information retrievalplatform..2006..Proc. OSIR Workshop..18--25....
D Puppin,,,F Silvestri,,,R Perego,,,等..Tuningthe capacity of search engines: load-driven routing andincremental caching to reduce and balance the load..ACM Trans. Inform. Syst.,,2010..28((2):):5.1--5.36..DOI:10.1145/1740592.1740593http://doi.org/10.1145/1740592.1740593..
F Silvestri,,,R Venturini..VSEncoding: efficientcoding and fast decoding of integer lists via dynamicprogramming..2010..Proc. 19th ACM Int. Conf. on Information and Knowledge Management..1219--1228..DOI:10.1145/1871437.1871592http://doi.org/10.1145/1871437.1871592..
T Strohman,,,WB Croft..Efficient document retrieval in main memory..2007..Proc. 30th Annual Int. ACMSIGIR Conf. on Research and Development in Information Retrieval..175--182..DOI:10.1145/1277741.1277774http://doi.org/10.1145/1277741.1277774..
T Strohman,,,H Turtle,,,W.B Croft..Optimizationstrategies for complex queries..2005..Proc. 28th Annual Int.ACM SIGIR Conf. on Research and Development inInformation Retrieval..219--225..DOI:10.1145/1076034.1076074http://doi.org/10.1145/1076034.1076074..
H Turtle,,,J Flood..Query evaluation: strategies andoptimizations..Inform. Process. Manag.,,1995..31((6):):831--850..DOI:10.1016/0306-4573(95)00020-Hhttp://doi.org/10.1016/0306-4573(95)00020-H..
LD Wang,,,J Lin,,,D Metzler..A cascade rankingmodel for efficient ranked retrieval..2011..Proc. 34th Int.ACM SIGIR Conf. on Research and Development inInformation Retrieval..105--114..DOI:10.1145/2009916.2009934http://doi.org/10.1145/2009916.2009934..
J Zobel,,,A Moffat..Inverted files for text searchengines..ACM Comput. Surv,,2006..38((2):):6.1--6.56..DOI:10.1145/1132956.1132959http://doi.org/10.1145/1132956.1132959..
M Zukowski,,,S Heman,,,N Nes,,,等..Super-scalarRAM-CPU cache compression..2006..Proc. 22nd Int. Conf.on Data Engineering..59DOI:10.1109/ICDE.2006.150http://doi.org/10.1109/ICDE.2006.150..
Publicity Resources
Related Articles
Related Author
Related Institution