Adaptive Threshold

Survey

Adaptive threshold (or Ranked List Truncation) has been evolving since the 2000s. Prior to the deep learning era, researchers modeled the score distribution (Arampatzis et al. 2009):

  • The similarity score follows a mixture of distributions.
  • Typically, it comprises two components: (1) relevant scores and (2) irrelevant scores.
    • Relevant scores: commonly modeled as a Gaussian distribution.
    • Irrelevant scores: modeled as a (truncated) exponential distribution.

Pre-DL approach to adaptive threshold

To learn the parameters of those distributions, an EM algorithm is applied to labelled training data. A more recent attempt uses extreme value theory to model the scores (Bahri et al. 2023).
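To make the classical recipe concrete, here is a minimal EM sketch under stated assumptions: the function name `fit_score_mixture` is hypothetical, a plain exponential stands in for the truncated one, and initialisation is deliberately crude (the Gaussian component simply starts above the bulk of the scores).

```python
import numpy as np

def fit_score_mixture(scores, n_iter=200):
    """EM for a two-component score mixture:
    relevant ~ Normal(mu, sigma), irrelevant ~ Exponential(lam).
    A plain exponential stands in for the truncated one."""
    s = np.asarray(scores, dtype=float)
    # Crude initialisation: place the relevant component above the bulk.
    pi = 0.5                                 # weight of the relevant component
    mu, sigma = s.mean() + s.std(), s.std()
    lam = 1.0 / max(s.mean(), 1e-9)
    for _ in range(n_iter):
        # E-step: responsibility of the relevant (Gaussian) component.
        gauss = np.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        expo = lam * np.exp(-lam * np.clip(s, 0.0, None))
        r = pi * gauss / (pi * gauss + (1.0 - pi) * expo + 1e-12)
        # M-step: weighted maximum-likelihood updates for each component.
        pi = r.mean()
        mu = (r * s).sum() / r.sum()
        sigma = np.sqrt((r * (s - mu) ** 2).sum() / r.sum()) + 1e-9
        lam = (1.0 - r).sum() / (((1.0 - r) * np.clip(s, 0.0, None)).sum() + 1e-12)
    return pi, mu, sigma, lam
```

With labelled data the components could be initialised (or supervised) directly; the unsupervised version above only works when the two score populations are reasonably separated.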

Enter the DL era, where we can train a discriminative network to determine whether a score belongs to relevant or irrelevant products. BiCut (Lien et al. 2019) frames the problem as deciding whether the current product serves as a good stopping point, training a network to predict “continue” or “stop”. Choppy (Bahri et al. 2020), on the other hand, revisits the traditional approach: modeling the score distribution given the ranked list using a deep network.

Despite these novel approaches to truncated reranking, a recent benchmark (Meng et al. 2024) offers some noteworthy insights:

  • DL-based approaches aren’t significantly better than unsupervised or simpler methods (e.g., a top-K cutoff).
  • Incorporating query embeddings into the network shows no clear advantage.
  • Distribution-based approaches generally outperform sequence-based ones.
  • With effective retrievers (at the recall stage), even a straightforward top-K approach yields a favorable trade-off between effectiveness and efficiency.

Recent benchmark

Note that the authors use NDCG@10 to evaluate.

Key takeaways from these findings:

  • Quality embeddings remain essential, with or without adaptive thresholding.
  • The greedy approach already performs well. More sophisticated (DL-based) methods may not justify their complexity given the marginal improvements and latency overhead.

Design

My idea is much simpler: whenever the score drops dramatically (indicating one of the embeddings has a negligible score), we stop returning results. The problem then becomes finding the peak (van Brakel 2014)[1] in the stream of score changes.

Below is an example showing the score values of the top 1,000 results: blue indicates positive, red indicates negative, and black marks unannotated items. The leftmost dashed red line shows the cutoff point.

[Figures: Score Visualization; Score Gradient]

For product search, recall matters significantly, so we want to avoid premature cutoffs. In my implementation, I always return the top 50 results, assuming this set includes all exact matches and relevant products. Beyond the 50th result, the list tends to mix relevant and partially relevant items, a zone that can be safely trimmed. Additionally, adaptive thresholding helps prevent the reranker from pushing irrelevant products to the top of the list. The second example illustrates a case where this approach falls short: the cutoff occurs too late, and the final results contain too many irrelevant items. This likely represents an exact-match search, where users copy a product title into the search bar.
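The cutoff rule described above can be sketched as follows. The function name `adaptive_cutoff`, the `lag` window, and the z-score threshold are my assumptions in the spirit of van Brakel’s detector, not the exact production code:

```python
import numpy as np

def adaptive_cutoff(scores, min_keep=50, lag=30, threshold=3.0):
    """Return how many results to keep: cut at the first score drop that is
    a `threshold`-sigma outlier relative to the trailing `lag` drops,
    but never return fewer than `min_keep` results."""
    s = np.asarray(scores, dtype=float)
    drops = -np.diff(s)  # positive where the score falls between neighbours
    for i in range(max(min_keep - 1, lag), len(drops)):
        window = drops[i - lag:i]
        mu, sd = window.mean(), window.std()
        if sd > 0 and (drops[i] - mu) / sd > threshold:
            return i + 1  # keep results[0 .. i]
    return len(s)  # no dramatic drop found: keep everything
```

Because the drop is standardised against the recent window, the rule adapts to each query’s own score scale instead of relying on a fixed absolute threshold.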

Why not adopt the mixture-model approach? It requires training data; although ChatGPT could certainly annotate the data, doing so would complicate the release timeline. Meanwhile, DL-based approaches seem excessive and haven’t demonstrated significant improvements.

Caveats

This certainly isn’t an optimal solution. After releasing this feature, we observed several issues:

  • The current implementation prioritizes precision by cutting off the ranked list as early as possible, which dramatically reduces recall, a significant drawback for product search.
  • The score distribution for each query depends heavily on the query type: exact, feature-based, or symptom-based searches. Each type may exhibit a different distribution. For instance, continuing to return partially relevant results for exact searches might enhance recall and potentially improve conversion rates.

  1. I also spotted a bug. See also (van Brakel 2014). ↩︎

References

  • Arampatzis, Avi, Jaap Kamps, and Stephen Robertson. 2009. “Where to Stop Reading a Ranked List? Threshold Optimization Using Truncated Score Distributions.” In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Bahri, Dara, Yi Tay, Che Zheng, Donald Metzler, and Andrew Tomkins. 2020. “Choppy: Cut Transformer for Ranked List Truncation.” In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Bahri, Dara, Donald Metzler, and others. 2023. “Surprise: Result List Truncation via Extreme Value Theory.” In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Brakel, J.P.G. van. 2014. “Robust Peak Detection Algorithm Using Z-Scores.” Stack Overflow.
  • Lien, Yen-Chieh, Daniel Cohen, and W. Bruce Croft. 2019. “An Assumption-Free Approach to the Dynamic Truncation of Ranked Lists.” In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval.
  • Meng, Chuan, and others. 2024. “Ranked List Truncation for Large Language Model-Based Re-Ranking.” In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.