Adaptive Threshold

Survey

Adaptive threshold (or Ranked List Truncation) has been studied since the 2000s. Prior to the Deep Learning era, the approach was to model the score distribution:

  • The similarity score is modelled as a mixture of distributions (see the formula after this list).
  • Commonly, the mixture is formed from 2 components: (1) relevant scores, and (2) irrelevant scores.
    • Relevant scores: commonly modelled as a Gaussian distribution.
    • Irrelevant scores: modelled as a (truncated) exponential distribution.
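
Concretely, a minimal version of this classic model writes the score density as a two-component mixture (my own paraphrase; the exact parametrisation varies by paper, and the exponential component is often truncated to the valid score range):

p(s) = π · N(s; μ, σ²) + (1 − π) · Exp(s; λ)

Here π is the prior probability that a score comes from a relevant product, the Gaussian component models relevant scores, and the exponential component models irrelevant scores.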

[Figure: pre-DL approach to adaptive threshold]

To learn the parameters of those distributions, an EM algorithm is applied with labelled training data. A more recent attempt uses extreme value theory to model the scores [3].
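
A minimal sketch of what fitting such a mixture via EM might look like (my own illustration, not code from the cited papers; it assumes unlabelled, non-negative scores, uses a plain rather than truncated exponential, needs scipy, and the name fit_score_mixture is made up):

```python
import numpy as np
from scipy.stats import norm, expon


def fit_score_mixture(scores, n_iter=100):
    """EM for p(s) = pi * N(s; mu, sigma^2) + (1 - pi) * Exp(s; lam).

    The Gaussian component models relevant scores; the exponential
    component models irrelevant scores (untruncated, for simplicity).
    """
    s = np.asarray(scores, dtype=float)
    # Crude initialisation: treat the top half of scores as "relevant".
    pi = 0.5
    mu = s[s >= np.median(s)].mean()
    sigma = s.std() + 1e-6
    lam = 1.0 / (s[s < np.median(s)].mean() + 1e-6)

    for _ in range(n_iter):
        # E-step: responsibility of the relevant (Gaussian) component.
        p_rel = pi * norm.pdf(s, mu, sigma)
        p_irr = (1.0 - pi) * expon.pdf(s, scale=1.0 / lam)
        r = p_rel / (p_rel + p_irr + 1e-12)

        # M-step: weighted parameter updates.
        pi = r.mean()
        mu = np.average(s, weights=r)
        sigma = np.sqrt(np.average((s - mu) ** 2, weights=r)) + 1e-6
        lam = (1.0 - r).sum() / (((1.0 - r) * s).sum() + 1e-12)

    return pi, mu, sigma, lam
```

Once fitted, a threshold can be derived from the responsibilities, e.g. cut where the probability of being relevant drops below 0.5.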

Now comes the DL era, in which we can use a discriminative network to determine whether a score belongs to relevant or irrelevant products. BiCut frames the problem as deciding whether the current product is a good termination point given the ranked list, and trains a network to predict whether to continue or stop. Choppy, on the other hand, went back to the traditional approach: modelling the score distribution given the ranked list, using a DL network.

Despite these recent novel approaches to ranked list truncation, the benchmark in [4] gives some noteworthy insights:

  • DL-based approaches aren’t significantly better than unsupervised/simpler approaches (e.g., a fixed top-K cutoff).
  • Incorporating the query embedding into the network also doesn’t show any advantage.
  • Distribution-based approaches are generally better than sequence-based approaches.
  • With good retrievers (the recall stage), even a simple approach like returning the top K can yield a good trade-off between effectiveness and efficiency.

[Figure: recent benchmark]

Note that the authors use NDCG@10 to evaluate.

What we gather from those findings:

  • A good embedding is still the key, with or without an adaptive threshold.
  • The greedy approach is already good enough. More complicated (DL-based) methods may not be worth it, given the negligible improvement and the high latency tax.

Design

My idea is much simpler: whenever the score drops dramatically (because one of the embeddings contributes a negligible score), we stop returning results. The problem then becomes finding the peak[1] of the change in the stream of scores.
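
As a rough sketch of this idea (my own simplification, not the exact production code): take successive differences of the rank-ordered scores and cut at the first position whose drop is anomalously large compared with the drops seen so far, in the spirit of the z-score peak detection in [5]. The names find_cutoff, z_thresh, and warmup are illustrative.

```python
import numpy as np


def find_cutoff(scores, z_thresh=3.0, warmup=10):
    """Return how many results to keep.

    scores: similarity scores sorted in descending (rank) order.
    Cuts at the first position whose score drop is an outlier
    (more than z_thresh robust z-scores) versus the drops so far.
    """
    s = np.asarray(scores, dtype=float)
    drops = -np.diff(s)  # positive values; a large value means a sharp drop
    for i in range(warmup, len(drops)):
        history = drops[:i]
        med = np.median(history)
        mad = np.median(np.abs(history - med)) + 1e-9  # robust spread
        if (drops[i] - med) / (1.4826 * mad) > z_thresh:
            return i + 1  # keep results up to and including position i
    return len(s)  # no sharp drop found: keep everything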

Below is one example. It shows the score values of the top 1000 results, with blue indicating positive, red negative, and black unannotated; the leftmost dashed red line shows the cutoff point.

[Figures: score visualization and score gradient]

For product search, recall is an important factor, so we don’t want the cutoff to be applied prematurely. In my implementation, I always return the top 50 results, on the assumption that they include all exact matches and relevant products. Beyond the 50th result, there tends to be a mix of relevant/partially relevant results, which can be safely trimmed. Plus, using an adaptive threshold helps prevent the reranker from pushing irrelevant products to the top of the list. The second example shows a case where my approach does not help, i.e., the cutoff is applied too late and the final results contain a large number of irrelevant results. This is probably an exact-match search, where users copy the product title into the search bar.
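
Combining the cutoff with this minimum-recall guard is then just a lower bound (a sketch reusing the illustrative find_cutoff from above):

```python
MIN_RESULTS = 50  # always keep at least the top 50 to protect recall


def truncate(results, scores):
    # scores are the reranked similarity scores, sorted in descending order
    cutoff = max(find_cutoff(scores), MIN_RESULTS)
    return results[:cutoff]
```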

Why not use a mixture-model approach? It needs data to train; although ChatGPT could surely annotate it, that would complicate the release timeline. Meanwhile, a DL-based approach seems excessive and hasn’t shown significant improvement.

Caveats

This surely is not an optimal solution. After releasing this feature, we observed several issues:

  • The current implementation prioritizes precision by cutting off the ranked list as early as possible, which dramatically reduces recall. This is detrimental to product search.
  • The score distribution for each query is heavily influenced by the query type, whether it’s an exact, feature, or symptom-based search. Each type may exhibit a different distribution. For instance, we might want to continue returning partially relevant results for exact searches, as this can enhance recall and potentially improve conversion rates.

References

  1. Arampatzis, Avi, Jaap Kamps, and Stephen Robertson. “Where to stop reading a ranked list? Threshold optimization using truncated score distributions.” Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2009.
  2. Bahri, Dara, et al. “Choppy: Cut transformer for ranked list truncation.” Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020.
  3. Bahri, Dara, et al. “Surprise: Result List Truncation via Extreme Value Theory.” Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2023.
  4. Meng, Chuan, et al. “Ranked List Truncation for Large Language Model-based Re-Ranking.” Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024.
  5. Brakel, J.P.G. van (2014). “Robust peak detection algorithm using z-scores.” Stack Overflow. Available at: https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 (version: 2020-11-08).

  1. I also spotted a bug. ↩︎