# Evaluation

## Common Arguments

These arguments apply to every benchmark. In addition, you must select at least one benchmark from [Arguments for Each Benchmark](#arguments-for-each-benchmark).

<table><thead><tr><th width="190">args</th><th>default</th><th>description</th></tr></thead><tbody><tr><td>ckpt_path</td><td>upstage/SOLAR-10.7B-Instruct-v1.0</td><td>Model name or ckpt_path to be evaluated</td></tr><tr><td>output_path</td><td>./results</td><td>Path to save evaluation results</td></tr><tr><td>model_name</td><td>SOLAR-10.7B-Instruct-v1.0</td><td>Model name used in saving eval results</td></tr><tr><td>use_fast_tokenizer</td><td>False</td><td>Flag to use fast tokenizer</td></tr><tr><td>use_flash_attention_2</td><td>False</td><td>Flag to use flash attention 2 (highly suggested)</td></tr></tbody></table>

### Example

> Running the command below executes the <mark style="color:blue;">`h6_en`</mark>, <mark style="color:blue;">`mt_bench`</mark>, <mark style="color:blue;">`ifeval`</mark>, and <mark style="color:blue;">`eq_bench`</mark> benchmarks.

```
python3 evaluator.py \
    --h6_en \
    --data_parallel 8 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --ifeval \
    --eq_bench \
    --devices 0,1,2,3,4,5,6,7 \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
```
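
When the evaluation is launched from a script rather than a shell, the same command can be assembled in Python. The sketch below is only an illustration: `build_evaluator_command` is a hypothetical helper, not part of Evalverse, and it uses only the flags shown in the example above. It assumes that benchmark selectors are passed as bare flags and every other option as a flag followed by a value, as in that example.

```python
import shlex


def build_evaluator_command(ckpt_path, benchmarks, options=None):
    """Assemble an `evaluator.py` command line as a list of arguments.

    Hypothetical helper for illustration; not part of Evalverse.
    """
    if not benchmarks:
        # At least one benchmark must be selected.
        raise ValueError("select at least one benchmark")
    cmd = ["python3", "evaluator.py", "--ckpt_path", ckpt_path]
    for name in benchmarks:
        cmd.append(f"--{name}")  # benchmark selectors are bare flags
    for key, value in (options or {}).items():
        cmd.extend([f"--{key}", str(value)])  # other options take a value
    return cmd


cmd = build_evaluator_command(
    "upstage/SOLAR-10.7B-Instruct-v1.0",
    ["h6_en", "mt_bench", "ifeval", "eq_bench"],
    {
        "data_parallel": 8,
        "num_gpus_total": 8,
        "parallel_api": 4,
        "devices": "0,1,2,3,4,5,6,7",
    },
)
print(shlex.join(cmd))  # shell-quoted form of the command
```

Passing the returned list to `subprocess.run(cmd, check=True)` would then start the run from the directory that contains `evaluator.py`.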

## Arguments for Each Benchmark

{% tabs %}
{% tab title="H6" %}
Evaluate an LLM with the benchmarks in the Open LLM Leaderboard

<table><thead><tr><th width="167">args</th><th width="86">default</th><th>description</th></tr></thead><tbody><tr><td>h6_en</td><td>False</td><td>ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K </td></tr><tr><td>use_vllm</td><td>False</td><td>Inference with vLLM</td></tr><tr><td>model_parallel</td><td>1</td><td>Size of model_parallel</td></tr><tr><td>data_parallel</td><td>1</td><td>Size of data_parallel</td></tr></tbody></table>

## Example

#### Library

```
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    data_parallel=8,
    output_path="./results"
)
```

#### CLI

<pre><code><strong>python3 evaluator.py \
</strong>    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --h6_en \
    --data_parallel 8 \
    --output_path ./results
</code></pre>

## Result

```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂h6_en
     ┣ 📜arc_challenge_25.json
     ┣ 📜gsm8k_5.json
     ┣ 📜hellaswag_10.json
     ┣ 📜mmlu_5.json
     ┣ 📜truthfulqa_mc2_0.json
     ┗ 📜winogrande_5.json
```
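
The result files in the tree above can be collected with a few lines of Python. This is a minimal sketch, assuming only the directory layout shown; the internal structure of each JSON file is not described here, so the code loads the files without interpreting them. `load_h6_results` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_h6_results(output_path, model_name):
    """Load every JSON file under <output_path>/<model_name>/h6_en.

    Returns a dict keyed by file stem, e.g. "arc_challenge_25".
    Hypothetical helper; the JSON schema is not assumed.
    """
    result_dir = Path(output_path) / model_name / "h6_en"
    results = {}
    if not result_dir.is_dir():
        return results  # nothing has been evaluated yet
    for path in sorted(result_dir.glob("*.json")):
        with path.open(encoding="utf-8") as f:
            results[path.stem] = json.load(f)
    return results


results = load_h6_results("./results", "SOLAR-10.7B-Instruct-v1.0")
for name in results:
    print(name)  # one line per result file found
```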

{% endtab %}

{% tab title="MT-bench" %}
To run the `mt_bench` benchmark, an OpenAI API key is required in the `.env` file.
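
A quick check before starting a long run can confirm that the key is available. The sketch below is an assumption-laden illustration: the variable name `OPENAI_API_KEY` is the name conventionally used by OpenAI tooling and is not confirmed by this page, the `.env` file is assumed to hold plain `KEY=VALUE` lines, and both helpers are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_openai_key(path=".env", var_name="OPENAI_API_KEY"):
    """Return True if the key is set in the environment or in the .env file.

    `var_name` is an assumed default, not taken from the Evalverse docs.
    """
    return bool(os.environ.get(var_name) or read_env_file(path).get(var_name))


print("OpenAI key found:", has_openai_key())
```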

| args                  | default | description                                        |
| --------------------- | ------- | -------------------------------------------------- |
| mt\_bench             | False   | Evaluate an LLM with MT-Bench                      |
| baselines             | None    | The baseline LLMs to compare                       |
| judge\_model          | gpt-4   | The model used as the judge                        |
| num\_gpus\_per\_model | 1       | The number of GPUs per model (for gen\_answer)     |
| num\_gpus\_total      | 1       | The total number of GPUs (for gen\_answer)         |
| parallel\_api         | 1       | The number of concurrent API calls (for gen\_judge) |

## Example

#### Library

```
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="mt_bench",
    num_gpus_total=8,
    parallel_api=4,
    output_path="./results"
)
```

#### CLI

<pre><code><strong>python3 evaluator.py \
</strong>    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --output_path ./results
</code></pre>

## Result

```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂mt_bench
     ┣ 📂model_answer
     ┃ ┗ 📜SOLAR-10.7B-Instruct-v1.0.jsonl
     ┣ 📂model_judgment
     ┃ ┗ 📜gpt-4_single.jsonl
     ┗ 📜scores.txt
```
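
The answer and judgment files above are JSON Lines files, so a quick sanity check is to count the records in each. This is a minimal sketch assuming only the directory layout shown in the tree; it does not rely on any particular field inside the records, and `count_jsonl_records` is a hypothetical helper name.

```python
import json
from pathlib import Path


def count_jsonl_records(mt_bench_dir):
    """Count JSON records in every .jsonl file below `mt_bench_dir`.

    Returns a dict mapping each file's relative path to its record count.
    Hypothetical helper; record contents are parsed but not interpreted.
    """
    counts = {}
    root = Path(mt_bench_dir)
    if not root.is_dir():
        return counts
    for path in sorted(root.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            # Each non-empty line is expected to be one JSON document.
            records = [json.loads(line) for line in f if line.strip()]
        counts[path.relative_to(root).as_posix()] = len(records)
    return counts


print(count_jsonl_records("./results/SOLAR-10.7B-Instruct-v1.0/mt_bench"))
```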

{% endtab %}

{% tab title="IFEval" %}
Evaluate an LLM with Instruction Following Eval (IFEval)

<table><thead><tr><th width="183">args</th><th width="86">default</th><th>description</th></tr></thead><tbody><tr><td>ifeval</td><td>False</td><td>Evaluate an LLM with Instruction Following Eval</td></tr><tr><td>gpu_per_inst_eval</td><td>1</td><td>Number of GPUs per IFEval process. Set to a value greater than 1 for larger models. Keep <code>len(devices) / gpu_per_inst_eval = 1</code> for repeatable results</td></tr></tbody></table>

## Example

#### Library

```
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="ifeval",
    devices="0,1,2,3,4,5,6,7",
    output_path="./results"
)
```

#### CLI

```
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --ifeval \
    --devices 0,1,2,3,4,5,6,7 \
    --output_path ./results
```
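
The `gpu_per_inst_eval` description recommends keeping `len(devices) / gpu_per_inst_eval = 1` for repeatable results. A minimal sketch of that arithmetic follows, assuming `devices` is the comma-separated string used in the examples and that the ratio corresponds to the number of parallel IFEval processes (an interpretation of the table, not confirmed by the source). `ifeval_process_count` is a hypothetical helper, not an Evalverse function.

```python
def ifeval_process_count(devices: str, gpu_per_inst_eval: int = 1) -> int:
    """Return len(devices) // gpu_per_inst_eval for a comma-separated device list.

    Hypothetical helper; treating the ratio as a process count is an assumption.
    """
    if gpu_per_inst_eval < 1:
        raise ValueError("gpu_per_inst_eval must be at least 1")
    device_ids = [d.strip() for d in devices.split(",") if d.strip()]
    return len(device_ids) // gpu_per_inst_eval


print(ifeval_process_count("0,1,2,3,4,5,6,7"))     # 8
print(ifeval_process_count("0,1,2,3,4,5,6,7", 8))  # 1
print(ifeval_process_count("0"))                   # 1
```

Under this reading, eight devices with the default `gpu_per_inst_eval` give a ratio of 8, so keeping the ratio at 1 would mean either passing a single device or setting `gpu_per_inst_eval` to the number of devices.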

## Result

```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂ifeval
     ┣ 📜eval_results_loose.jsonl
     ┣ 📜eval_results_strict.jsonl
     ┣ 📜output.jsonl
     ┗ 📜scores.txt
```

{% endtab %}

{% tab title="EQ-Bench" %}
Evaluate an LLM with EQ-Bench

<table><thead><tr><th width="183">args</th><th width="104">default</th><th>description</th></tr></thead><tbody><tr><td>eq_bench</td><td>False</td><td>Evaluate an LLM with EQ-Bench</td></tr><tr><td>eq_bench_prompt_type</td><td>"ChatML"</td><td>Set chat template</td></tr><tr><td>eq_bench_lora_path</td><td>None</td><td>Path to LoRA adapters</td></tr><tr><td>eq_bench_quantization</td><td>None</td><td>Whether to quantize the LLM when loading</td></tr></tbody></table>

## Example

#### Library

```
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="eq_bench",
    eq_bench_prompt_type="Solar-v1",
    devices="0,1,2,3,4,5,6,7",
    output_path="./results"
)
```

#### CLI

```
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --eq_bench \
    --eq_bench_prompt_type Solar-v1 \
    --devices 0,1,2,3,4,5,6,7 \
    --output_path ./results
```

## Result

```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂eq_bench
     ┣ 📜benchmark_results.csv
     ┗ 📜raw_results.json
```

{% endtab %}
{% endtabs %}


