Prometheus

Inducing Fine-grained Evaluation Capability in Language Models

Seungone Kim*

KAIST LK Lab

Jamin Shin*

NAVER AI Lab

Yejin Cho*

KAIST XFACT Lab

Shayne Longpre

MIT Media Lab

Hwaran Lee

NAVER AI Lab

Sangdoo Yun

NAVER AI Lab

Seongjin Shin

NAVER Cloud

Sungdong Kim

KAIST LK Lab

James Thorne

KAIST XFACT Lab

Minjoon Seo

KAIST LK Lab


How can you evaluate whether your LLM is humorous or not? Across the many versions you build during development, how can you track whether your LLM is inspiring while remaining culturally sensitive?

Figure: Fine-grained evaluation across diverse skill-specific rubrics.

Current evaluation resources (e.g., MMLU, Big Bench, AlpacaFarm) are confined to generic, single-dimensional evaluation metrics that are either too domain/task specific (e.g., EM, ROUGE) or too coarse-grained (e.g., helpfulness/harmlessness). To overcome this issue, recent work has introduced Fine-grained Evaluation (e.g., VicunaBench, MTBench, Flask), which measures an LLM's performance across diverse skill sets (e.g., Creativity, Writing Ability, Role Playing Ability, Logical Ability) using GPT-4 as the evaluator.

However, employing GPT-4 as an evaluator LM has the following disadvantages:

- Closed-source nature: relying on a proprietary model limits the transparency and controllability of the evaluation pipeline.
- Uncontrolled versioning: the model behind the API can change over time, making evaluation results hard to reproduce.
- Prohibitive costs: scoring large numbers of responses (and re-scoring them for every new model version) quickly becomes expensive.

To this end, we introduce Prometheus 🔥, a fully open-source LLM (7B & 13B) that shows high correlation with both human evaluators and GPT-4!
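Concretely, "high correlation" refers to score-level agreement between Prometheus and the other evaluators on the same set of responses. Below is a minimal sketch of how one might check this on a held-out benchmark; the scores are hypothetical placeholders, and Pearson/Spearman are only two of several possible agreement measures.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical 1-5 scores on the same responses; real values would come from
# running Prometheus, GPT-4, and human annotators on an evaluation benchmark.
prometheus_scores = [5, 3, 4, 2, 5, 1, 4, 3]
gpt4_scores       = [5, 3, 5, 2, 4, 1, 4, 2]
human_scores      = [4, 3, 4, 2, 5, 1, 5, 3]

print("Prometheus vs. GPT-4  (Pearson) :", pearsonr(prometheus_scores, gpt4_scores))
print("Prometheus vs. human  (Pearson) :", pearsonr(prometheus_scores, human_scores))
print("Prometheus vs. human (Spearman) :", spearmanr(prometheus_scores, human_scores))
```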



Inducing Fine-grained Evaluation Capability

The main obstacle to obtaining a language model specialized in evaluation is that it needs to know which aspects matter for the given instruction and must be able to internally estimate what the answer to the instruction might be in the first place. Only then can the evaluator LM assess the quality of responses based on the information derived from these two steps.

Our main intuition is that, by incorporating the appropriate reference materials, the evaluator LM can focus solely on assessing the quality of the response instead of determining the important aspects or solving the instruction itself.

Figure: Input/output format of Prometheus.

Specifically, we append a Score Rubric and a Reference Answer for the following purposes:

- The Score Rubric specifies which aspects of the response matter for the given instruction, so the evaluator does not have to infer the criteria on its own.
- The Reference Answer shows what a score-5 response looks like, so the evaluator does not have to solve the instruction itself before judging the response.

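To make this concrete, here is a minimal sketch of how these reference materials could be assembled into a single evaluation prompt and how the output could be parsed. The section headers and the "[RESULT] <score>" output convention are illustrative assumptions, not the exact released template; please consult our code for the official format.

```python
import re

# Illustrative template: the instruction, the response to evaluate, the
# reference answer, and the score rubric are all packed into one prompt.
PROMPT_TEMPLATE = """###Task Description:
Given an instruction, a response, a reference answer (score 5), and a score rubric,
write feedback assessing the response strictly by the rubric, then output a score
from 1 to 5 in the form "[RESULT] <score>".

###Instruction:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubric:
{rubric}

###Feedback:"""


def build_prompt(instruction, response, reference_answer, rubric):
    """Fill the evaluation template with one instance's reference materials."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
        rubric=rubric,
    )


def parse_output(generated_text):
    """Split the generated text into (feedback, score); score is None if missing."""
    match = re.search(r"\[RESULT\]\s*([1-5])", generated_text)
    score = int(match.group(1)) if match else None
    feedback = generated_text.split("[RESULT]")[0].strip()
    return feedback, score
```

Swapping in a different rubric (e.g., "Is the response culturally sensitive?") is all that is needed to evaluate a new skill.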

The Feedback Collection Dataset

Along with the model, we release the Feedback Collection, which is a new feedback dataset that was used to train Prometheus 🔥!

Compared to previous feedback datasets (e.g., Selfee, Shepherd), the Feedback Collection consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4.

The construction process consists of (1) obtaining 50 seed score rubrics from human annotators, (2) expanding the score rubrics through brainstorming and paraphrasing, (3) obtaining instructions closely related to each score rubric, and (4) acquiring the remaining components (responses, reference answers, and feedback) for each instance.
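A rough sketch of how such an augmentation loop could look in code is shown below. The prompts, the `call_gpt4` helper, and the expansion factors (chosen so that 50 seeds yield roughly 1K rubrics, 20K instructions, and 100K responses) are illustrative assumptions; the released Feedback Collection scripts define the actual procedure.

```python
# Sketch of the four construction stages, assuming a call_gpt4(prompt) helper
# that wraps an API call and returns the model's text completion.

def call_gpt4(prompt: str) -> str:
    """Placeholder for an actual GPT-4 API call."""
    raise NotImplementedError

def expand_rubrics(seed_rubrics, per_seed=20):
    """Stage 2: brainstorm and paraphrase each human-written seed rubric."""
    expanded = []
    for rubric in seed_rubrics:
        prompt = (
            "Brainstorm and paraphrase fine-grained score rubrics similar to:\n"
            f"{rubric}\nReturn {per_seed} new rubrics, one per line."
        )
        expanded.extend(call_gpt4(prompt).splitlines())
    return expanded

def build_instances(rubrics, instructions_per_rubric=20, responses_per_instruction=5):
    """Stages 3-4: generate rubric-specific instructions, then for each instruction
    collect a reference answer plus responses with feedback across the score range."""
    dataset = []
    for rubric in rubrics:
        instructions = call_gpt4(
            f"Write {instructions_per_rubric} instructions that test: {rubric}"
        ).splitlines()
        for instruction in instructions:
            completion = call_gpt4(
                f"Rubric: {rubric}\nInstruction: {instruction}\n"
                f"Write a score-5 reference answer, then {responses_per_instruction} "
                "responses (one per score from 1 to 5), each with feedback and its score."
            )
            dataset.append({"rubric": rubric, "instruction": instruction,
                            "raw_completion": completion})
    return dataset
```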


The main considerations while constructing the Feedback Collection were:

- Including as many reference materials (score rubrics and reference answers) as possible in each training instance.
- Keeping response lengths roughly uniform across scores, so the evaluator does not learn to associate longer responses with higher scores.
- Keeping the score distribution uniform (a balanced number of instances for each score from 1 to 5), so the evaluator is not biased toward particular scores.

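If you load the released Feedback Collection yourself, these properties are easy to spot-check. Below is a small sketch assuming a list of instances with `score` and `response` fields; the field names are assumptions and may differ from the released JSON.

```python
from collections import Counter
from statistics import mean

def check_balance(instances):
    """Spot-check score uniformity and per-score response length.

    `instances` is assumed to be a list of dicts with an integer `score` (1-5)
    and a string `response`; the actual field names in the release may differ.
    """
    score_counts = Counter(inst["score"] for inst in instances)
    print("Score distribution:", dict(sorted(score_counts.items())))

    for score in range(1, 6):
        lengths = [len(inst["response"].split())
                   for inst in instances if inst["score"] == score]
        if lengths:
            print(f"Score {score}: mean response length = {mean(lengths):.1f} words")
```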
For more information about our work, please check out our paper! We also plan to continually update our model based on your feedback, so feel free to reach out to us via email or Twitter!

Bibtex

@misc{kim2023prometheus,
      title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models}, 
      author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo},
      year={2023},
      eprint={2310.08491},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


This research was supported by the KAIST-NAVER Hypercreative AI Center.