Summary of Semi-parametric Retrieval Via Binary Bag-of-tokens Index, by Jiawei Zhou et al.


Semi-Parametric Retrieval via Binary Bag-of-Tokens Index

by Jiawei Zhou, Li Dong, Furu Wei, Lei Chen

First submitted to arXiv on: 3 May 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract of the paper.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
SemI-parametric Disentangled Retrieval (SiDR) is a bi-encoder retrieval framework designed for efficient, low-cost, and parameter-agnostic indexing in emerging applications. Existing neural retrieval methods use embeddings as their index, which ties the index to the model's parameters. SiDR decouples the retrieval index from the neural parameters, so in addition to an embedding index it supports a non-parametric tokenization index for search. This gives it BM25-like indexing complexity with significantly better effectiveness. The paper evaluates SiDR on 16 retrieval benchmarks and reports that it outperforms both neural and term-based retrieval baselines under the same indexing workload. Three results stand out: (i) with an embedding-based index, SiDR surpasses conventional neural retrievers while keeping similar training complexity; (ii) with a tokenization-based index, SiDR drastically reduces indexing cost and time, matching the indexing complexity of traditional term-based retrieval while outperforming BM25 on all in-domain datasets; and (iii) with a late parametric mechanism, SiDR matches BM25 index preparation time while outperforming other neural retrieval baselines in effectiveness.
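To make the idea of a tokenization index more concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes a toy whitespace tokenizer, and the query weights are hand-written stand-ins for what a trained query encoder might produce. The function names (`tokenize`, `build_binary_index`, `score`, `search`) are hypothetical. The point it illustrates is that building the passage index needs no neural model: each passage is reduced to the set of tokens it contains, and scoring is a dot product between a weighted query vector and that binary vector.

```python
def tokenize(text):
    """Toy whitespace tokenizer (an assumption; a real system would
    reuse the tokenizer of its query encoder)."""
    return text.lower().split()


def build_binary_index(passages):
    """Binary bag-of-tokens index: each passage becomes the set of tokens
    it contains. No model parameters are involved in this step."""
    return [frozenset(tokenize(p)) for p in passages]


def score(query_weights, token_set):
    """Dot product of a weighted query vector with a binary passage vector:
    sum the query weights of the tokens present in the passage."""
    return sum(w for tok, w in query_weights.items() if tok in token_set)


def search(query_weights, index, k=2):
    """Return the top-k (passage_id, score) pairs, highest score first."""
    scored = [(score(query_weights, toks), pid) for pid, toks in enumerate(index)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(pid, s) for s, pid in scored[:k]]


passages = [
    "the cat sat on the mat",
    "dogs chase the ball in the park",
    "a kitten sleeps on the sofa",
]
index = build_binary_index(passages)

# Hypothetical output of a query encoder for "where does the cat sleep":
# weights over vocabulary tokens, including related terms such as "kitten".
query_weights = {"cat": 2.0, "kitten": 1.5, "sleeps": 1.0, "sofa": 0.5}

results = search(query_weights, index, k=2)
print(results)
```

In this toy example the third passage ranks first even though it never contains the word "cat", because the assumed query weights also cover related tokens. That is the intuition for how a learned query representation can beat plain term matching over an index that costs no more to build than a term-based one.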

Low Difficulty Summary (written by GrooveSquid.com, original content)
Imagine searching for information quickly and cheaply without sacrificing accuracy. That is the goal of a new search method called SemI-parametric Disentangled Retrieval (SiDR). SiDR can search with either of two kinds of index: a simple one built directly from the words in each document, and one built from their meanings. The word-based index can be prepared about as quickly as a traditional keyword index, yet it finds better results. In tests on 16 retrieval benchmarks, SiDR performed better than other popular search methods given the same amount of indexing work. This could make large-scale search cheaper to set up and more accurate.

Keywords

» Artificial intelligence  » Embedding  » Encoder  » Tokenization