ReScorer: An aggregation and alignment technique for building trust into LLM reasons

Jay Mohta; Brian de Silva; Sugumar Murugesan; Dantong Liu; Yan Xu; Mingwei Shen

2024

Large language models (LLMs) offer substantial potential for automating labeling tasks and show robust zero-shot performance across diverse classification tasks. The reasons an LLM generates alongside its classifications carry signals about the quality of those classifications, so estimates of reason quality can be used to detect potentially incorrect predictions. Conventional metrics for scoring reasons, such as ROUGE-L and BLEU, depend on ground-truth reference reasons, which are challenging and expensive to acquire and are not available at inference time for new examples. In this paper, we use a product classification dataset to evaluate two reason-scoring strategies that do not rely on reference reasons: one based on an LLM scorer and another using the recently proposed ROSCOE metrics. Our analysis reveals that LLM-based approaches are computationally intensive, while aligning ROSCOE metrics with human judgment is challenging. Consequently, we propose an extension to the ROSCOE framework called ReScorer, which achieves 7% better alignment with human judgment than LLM-based evaluation and 59% better than ROSCOE, while being 89% cheaper than LLM-based scoring.
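
To illustrate why reference-based metrics cannot be used at inference time, the following is a minimal sketch of a ROUGE-L style score in Python. It is not the scoring code from the paper. It assumes whitespace tokenization, lowercasing, and a balanced F-measure; the function names `lcs_length` and `rouge_l_f1` are hypothetical. The point is that the score cannot be computed without a `reference` reason.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L style F-measure between a generated reason and a reference.

    Assumptions (not from the paper): whitespace tokens, lowercased text,
    and equal weighting of precision and recall.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate matched in order
    recall = lcs / len(ref)      # share of the reference matched in order
    return 2 * precision * recall / (precision + recall)
```

A call such as `rouge_l_f1(llm_reason, human_written_reason)` needs a human-written reason for every example, which is exactly what is missing for new, unlabeled inputs. Reference-free scorers avoid this requirement.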
