Evaluation
You can run various evaluations by simply adding a few arguments.
Common Arguments
These arguments are common across all benchmarks. Note that you must select at least one of the benchmarks described below.
| Argument | Default | Description |
| --- | --- | --- |
| `ckpt_path` | upstage/SOLAR-10.7B-Instruct-v1.0 | Model name or checkpoint path to be evaluated |
| `output_path` | ./results | Path to save evaluation results |
| `model_name` | SOLAR-10.7B-Instruct-v1.0 | Model name used when saving evaluation results |
| `use_fast_tokenizer` | False | Flag to use the fast tokenizer |
| `use_flash_attention_2` | False | Flag to use Flash Attention 2 (highly suggested) |
Example
By running the code below, the h6_en, mt_bench, ifeval, and eq_bench benchmarks will be executed.
```bash
python3 evaluator.py \
    --h6_en \
    --data_parallel 8 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --ifeval \
    --eq_bench \
    --devices 0,1,2,3,4,5,6,7 \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
```
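The same multi-benchmark run can also be expressed with the Python library. The sketch below is illustrative only: it assumes that `benchmark` accepts a list of benchmark names and that the common arguments above map one-to-one onto `evaluator.run()` keyword arguments, which this page does not confirm.

```python
import evalverse as ev

# Hedged sketch: the list-valued `benchmark` argument and the extra keyword
# arguments below are assumed to mirror the CLI flags documented above.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark=["h6_en", "mt_bench", "ifeval", "eq_bench"],  # assumed to accept a list
    output_path="./results",
    model_name="SOLAR-10.7B-Instruct-v1.0",  # name used when saving results
    use_flash_attention_2=True,              # highly suggested
)
```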
Arguments for Each Benchmark

Evaluate an LLM with the benchmarks in the Open LLM Leaderboard
| Argument | Default | Description |
| --- | --- | --- |
| `h6_en` | False | ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K |
| `use_vllm` | False | Inference with vLLM |
| `model_parallel` | 1 | Size of model parallelism |
| `data_parallel` | 1 | Size of data parallelism |
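The examples below only exercise `data_parallel`. Here is a minimal sketch of the remaining flags; it assumes `use_vllm` and `model_parallel` can be passed to `evaluator.run()` under the same names as in the table above, which is an assumption rather than documented API.

```python
import evalverse as ev

# Hedged sketch: keyword names assumed to match the flags in the table above.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    use_vllm=True,      # run inference with vLLM
    model_parallel=2,   # shard each model replica across 2 GPUs
    data_parallel=4,    # 4 replicas, e.g. on an 8-GPU node
    output_path="./results",
)
```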
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    data_parallel=8,
    output_path="./results"
)
```

CLI
```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --h6_en \
    --data_parallel 8 \
    --output_path ./results
```

Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
    ┗ 📂h6_en
       ┣ 📜arc_challenge_25.json
       ┣ 📜gsm8k_5.json
       ┣ 📜hellaswag_10.json
       ┣ 📜mmlu_5.json
       ┣ 📜truthfulqa_mc2_0.json
       ┗ 📜winogrande_5.json
```

To run the mt_bench benchmark, an OpenAI API key in a .env file is required.
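A quick way to confirm the key is visible before launching. This is a generic sketch, not part of evalverse: it assumes the standard OPENAI_API_KEY variable name and uses python-dotenv to read the .env file (evalverse may already load it internally).

```python
import os
from dotenv import load_dotenv

# The .env file would contain a line such as:
#   OPENAI_API_KEY=<your key>
# Variable name assumed to follow the standard OpenAI convention.
load_dotenv()
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY in your .env file before running mt_bench.")
```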
| Argument | Default | Description |
| --- | --- | --- |
| `mt_bench` | False | Evaluate an LLM with MT-Bench |
| `baselines` | None | The baseline LLMs to compare against |
| `judge_model` | gpt-4 | The model used as the judge |
| `num_gpus_per_model` | 1 | The number of GPUs per model (for gen_answer) |
| `num_gpus_total` | 1 | The total number of GPUs (for gen_answer) |
| `parallel_api` | 1 | The number of concurrent API calls (for gen_judge) |
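For comparison against a baseline and a different judge, a hedged sketch is shown below; the keyword names, and the way a baseline model is specified, are assumed to mirror the flags in the table above rather than documented API.

```python
import evalverse as ev

# Hedged sketch: `baselines` and `judge_model` keywords, and the baseline value
# format, are assumptions based on the table above.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="mt_bench",
    baselines="mistralai/Mistral-7B-Instruct-v0.2",  # baseline LLM to compare (format assumed)
    judge_model="gpt-4",     # model used as the judge
    num_gpus_total=8,        # total GPUs for answer generation
    parallel_api=4,          # concurrent API calls for judging
    output_path="./results",
)
```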
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="mt_bench",
    num_gpus_total=8,
    parallel_api=4,
    output_path="./results"
)
```

CLI
```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --output_path ./results
```

Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
    ┗ 📂mt_bench
       ┣ 📂model_answer
       ┃  ┗ 📜SOLAR-10.7B-Instruct-v1.0.jsonl
       ┣ 📂model_judgment
       ┃  ┗ 📜gpt-4_single.jsonl
       ┗ 📜scores.txt
```

Evaluate an LLM with Instruction Following Eval
| Argument | Default | Description |
| --- | --- | --- |
| `ifeval` | False | Evaluate an LLM with Instruction Following Eval |
| `gpu_per_inst_eval` | 1 | Number of GPUs per ifeval process. Set to > 1 for larger models. Keep len(devices) / gpu_per_inst_eval = 1 for repeatable results |
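The gpu_per_inst_eval constraint is easiest to see in code. The sketch below is only illustrative: it assumes `gpu_per_inst_eval` can be passed to `evaluator.run()` under the same name as the CLI flag, and the larger model name is a placeholder.

```python
import evalverse as ev

# Hedged sketch for a larger checkpoint: all 8 visible devices serve one ifeval
# process, so len(devices) / gpu_per_inst_eval == 8 / 8 == 1 (repeatable results).
evaluator = ev.Evaluator()
evaluator.run(
    model="your-org/your-70b-instruct-model",  # hypothetical larger model
    benchmark="ifeval",
    devices="0,1,2,3,4,5,6,7",
    gpu_per_inst_eval=8,   # keyword name assumed to match the CLI flag
    output_path="./results",
)
```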
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="ifeval",
    devices="0,1,2,3,4,5,6,7",
    output_path="./results"
)
```

CLI
```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --ifeval \
    --devices 0,1,2,3,4,5,6,7 \
    --output_path ./results
```

Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
    ┗ 📂ifeval
       ┣ 📜eval_results_loose.jsonl
       ┣ 📜eval_results_strict.jsonl
       ┣ 📜output.jsonl
       ┗ 📜scores.txt
```

Evaluate an LLM with EQ-Bench
| Argument | Default | Description |
| --- | --- | --- |
| `eq_bench` | False | Evaluate an LLM with EQ-Bench |
| `eq_bench_prompt_type` | "ChatML" | Chat template to use |
| `eq_bench_lora_path` | None | Path to LoRA adapters |
| `eq_bench_quantization` | None | Whether to quantize the LLM when loading |
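The LoRA and quantization options do not appear in the examples below, so here is a hedged sketch combining them; the keyword names follow the table above, while the adapter path and the "8bit" quantization value are placeholders, since accepted values are not listed on this page.

```python
import evalverse as ev

# Hedged sketch: keyword names mirror the flags in the table above; the LoRA
# path and quantization setting are placeholders, not documented values.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="eq_bench",
    eq_bench_prompt_type="ChatML",          # chat template (default shown above)
    eq_bench_lora_path="./lora_adapters",   # placeholder path to LoRA adapters
    eq_bench_quantization="8bit",           # placeholder; check supported values
    output_path="./results",
)
```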
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="eq_bench",
    eq_bench_prompt_type="Solar-v1",
    devices="0,1,2,3,4,5,6,7",
    output_path="./results"
)
```

CLI
```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --eq_bench \
    --eq_bench_prompt_type Solar-v1 \
    --devices 0,1,2,3,4,5,6,7 \
    --output_path ./results
```

Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
    ┗ 📂eq_bench
       ┣ 📜benchmark_results.csv
       ┗ 📜raw_results.json
```
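The output files above are plain JSON, JSONL, CSV, and text. The snippet below is not part of the evalverse API; it is a generic sketch for inspecting a few of them, assuming the ./results layout shown in the trees and, for h6_en, an lm-evaluation-harness-style JSON schema (an assumption).

```python
import csv
import json
from pathlib import Path

# Generic inspection sketch (not evalverse API); paths follow the result trees above.
root = Path("./results/SOLAR-10.7B-Instruct-v1.0")

# h6_en: per-task JSON files; top-level "results" key assumed (harness-style schema).
for task_file in sorted((root / "h6_en").glob("*.json")):
    data = json.loads(task_file.read_text())
    print(task_file.name, list(data.get("results", {}).keys()))

# eq_bench: summary CSV; column names are not documented here, so print rows as-is.
with open(root / "eq_bench" / "benchmark_results.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```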