## Quick Start

### 1. Prepare

#### 1.1 Prepare Dataset

### 2. Run LLM Inference on CriticBench

First, run inference on CriticBench with the LLMs you want to evaluate. The generation results on CriticBench can be found in the `inference/outputs` folder.
If you are interested in the prompts we use for the LLMs, see [inference/utils/prompts.py](inference/utils/prompts.py).
Specifically, the inference code looks like this:
```python
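import json
import os

from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

# load all CriticBench evaluation datasets (`load_all_datasets` comes from `inference/utils`)
datasets = load_all_datasets(args['data_dir'])

# initialize the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(
    args['model_name'],
    trust_remote_code=True
)
# `device_map="auto"` already places the weights on the available devices,
# so no extra `.cuda()` call is needed
model = AutoModelForCausalLM.from_pretrained(
    args['model_name'],
    device_map="auto",
    trust_remote_code=True
).eval()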

...

# run inference with the LLM and save the results as JSON files
for abbr, dataset in tqdm(datasets.items()):
    path = os.path.join(folder_path, abbr + ".json")
    results = {}
    for item in tqdm(dataset['dev']):
        # To run inference with a different LLM, revise this line
        response, history = model.chat(tokenizer, item['question'], history=[])

        results[str(len(results))] = {
            'origin_prompt': item['question'],
            'prediction': response
        }
    # save the results to a JSON file named after the dataset abbreviation
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=4)
```
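
The loop above relies on `model.chat`, which only some Hugging Face checkpoints expose. As a minimal sketch of how the same output format could be produced for another model, the helper below takes any callable that maps a prompt to a response string. The names `run_inference` and `generate_fn` are hypothetical and introduced here only for illustration; they are not part of the CriticBench toolkit.

```python
import json
import os
from typing import Callable, Dict, Iterable


def run_inference(
    items: Iterable[dict],
    generate_fn: Callable[[str], str],
    path: str,
) -> Dict[str, dict]:
    """Query a model for every item and save the results as a JSON file.

    `generate_fn` is a hypothetical callable (prompt -> response string)
    standing in for whatever generation call your model provides.
    """
    results = {}
    for item in items:
        prompt = item["question"]
        # Keys are consecutive string indices, as in the loop above.
        results[str(len(results))] = {
            "origin_prompt": prompt,
            "prediction": generate_fn(prompt),
        }
    # Create the output folder if needed, then write UTF-8 JSON.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=4)
    return results
```

With the chat-style model from the snippet above, `generate_fn` could be something like `lambda q: model.chat(tokenizer, q, history=[])[0]`; for other models, wrap their own generation call in the same way.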

### 3. Compute the Evaluation Results on CriticBench

After obtaining the generation results under `inference/outputs`, the next step is to compute the objective and subjective scores on CriticBench with our toolkit.
See Section 4 of our paper for more details about the objective and subjective scores.

The `critic_bench` folder provides two ways of computing scores, `objective` and `subjective`:
* Objective scores can be computed automatically at no cost.
* Subjective scores rely on the advanced GPT-4-turbo model for automatic evaluation.

#### Compute Scores

The scores can be computed by running the following commands.

Before running them, please make sure that your own OpenAI API key is set in [critic_bench/run.sh](critic_bench/run.sh):

```bash
export OPENAI_API_KEY=...
```

Then run the following command for evaluation:

```bash
./run.sh <dimension> <format> <split> <save_dir>
```

* `dimension` is one of the critique dimensions defined in CriticBench: `feedback`, `correction`, `comp_feedback`, or `meta_feedback`. See Section 2 of our paper for more details about these critique dimensions.
* `format` is the critique format, either `objective` or `subjective`. Objective scores (Spearman correlation, pass rate, and preference accuracy) can be computed automatically at no cost, while subjective scores are computed by prompting GPT-4-turbo to compare the generated critiques with our human-annotated high-quality critiques in CriticBench.
* `split` is the set to be evaluated, either `test` or `dev`.
* `save_dir` is the path where the evaluation results are saved.
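
If you launch evaluations from a Python script, it can help to validate the four positional arguments before calling `run.sh`. The sketch below is only an illustration: `build_run_command` is a hypothetical helper, the allowed values are copied from the list above, and the save path in the example is made up.

```python
from typing import List

# Allowed values, as listed in this README.
DIMENSIONS = ("feedback", "correction", "comp_feedback", "meta_feedback")
FORMATS = ("objective", "subjective")
SPLITS = ("test", "dev")


def build_run_command(dimension: str, fmt: str, split: str, save_dir: str) -> List[str]:
    """Validate the arguments and return the argv list for run.sh."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if not save_dir:
        raise ValueError("save_dir must not be empty")
    return ["./run.sh", dimension, fmt, split, save_dir]


# Hypothetical save path, used only for this example.
cmd = build_run_command("feedback", "objective", "dev", "results/feedback_obj")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True, cwd="critic_bench")`, assuming the script is started from the repository root.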

In the [run.sh](critic_bench/run.sh) file, you can find the corresponding commands for the objective and subjective evaluation processes.
For example, for the feedback critique dimension, the objective evaluation looks like:
```bash
python run_feedback.py --root_dir "../data/CriticBench" --prediction_dir "../example_data/prediction_v1.3" --split $3 --obj True
```
* `root_dir` is the path containing the `test` and `dev` sets of CriticBench.
* `prediction_dir` contains the inference results of the LLMs to be evaluated. We also provide the inference results of some representative LLMs in `example_data`. To evaluate your own LLMs, please refer to `inference/README.md` for more details; `prediction_dir` can then be set to `../inference/outputs`.
* `split` denotes whether the `test` or the `dev` set is used.
* `obj` activates the objective evaluation when set to `True`.
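
Objective scores need no API calls because they are plain statistics over the predictions and the human annotations. As an illustration, the sketch below computes a Spearman correlation between two score lists in pure Python. It is not the toolkit's implementation (which may rely on a library such as SciPy), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

For example, `spearman(model_scores, human_scores)` would measure how well a model's scores follow the ordering of the human scores, where both lists are aligned sample by sample.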

For the subjective evaluation of the feedback critique dimension, the evaluation command looks like:
```bash
python run_feedback.py --root_dir "../data/CriticBench" --prediction_dir "../example_data/prediction_v1.3" --evaluation_dir "../example_data/evaluation_v1.3/" --batch_size 1 --split $3 --obj False
```
* `evaluation_dir` saves the subjective evaluation scores from GPT-4, which can be reloaded if the subjective evaluation process is interrupted. The order of the samples in each file in `evaluation_dir` follows the order of the original data in CriticBench (`data/CriticBench`).
* `batch_size` controls the number of processes used to access the GPT-4 API under the multiprocessing setting.

The GPT-4 evaluation results under `save_dir` are stored in `jsonl` format, with one evaluation result per line. The chain-of-thought evaluation generated by GPT-4 is stored under the `evaluation` key of each line, which is a `dict` consisting of GPT-4's chain-of-thought rationale (key `cot`) and a Likert score (key `score`) for each critique, ranging from 1 to 10.
* 1 denotes the worst performance
* 10 denotes the best performance
* 8 denotes performance comparable to our human-annotated high-quality critiques, and scores higher than 8 indicate that the evaluated critique performs better than them.
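
To get a quick overview of a subjective evaluation run, the `jsonl` output can be aggregated with a few lines of Python. The sketch below assumes only what is described above, namely that each line is a JSON object whose `evaluation` entry holds a `score`; the helper name `summarize_scores` and the summary fields are hypothetical.

```python
import json
from typing import Dict, Iterable


def summarize_scores(lines: Iterable[str], human_level: float = 8) -> Dict[str, float]:
    """Aggregate Likert scores from jsonl lines.

    Assumes each non-empty line is a JSON object with an `evaluation`
    dict that contains a `score` between 1 and 10.
    """
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        record = json.loads(line)
        scores.append(float(record["evaluation"]["score"]))
    if not scores:
        raise ValueError("no evaluation results found")
    return {
        "count": len(scores),
        "mean": sum(scores) / len(scores),
        # Share of critiques rated above the human-annotated reference level.
        "above_human": sum(s > human_level for s in scores) / len(scores),
    }
```

Since an open file is an iterable of lines, the helper can be called as `summarize_scores(f)` inside a `with open(...) as f:` block pointing at one of the result files under `save_dir`.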

## License

This project is released under the Apache 2.0 [license](./LICENSE).
