Evaluation

You can run various evaluations by simply adding a few arguments.

Common Arguments

These arguments are common across all benchmarks. Note that you must select at least one benchmark from the list provided in this section.

| args | default | description |
| --- | --- | --- |
| ckpt_path | upstage/SOLAR-10.7B-Instruct-v1.0 | Model name or checkpoint path of the model to be evaluated |
| output_path | ./results | Path to save evaluation results |
| model_name | SOLAR-10.7B-Instruct-v1.0 | Model name used when saving evaluation results |
| use_fast_tokenizer | False | Flag to use a fast tokenizer |
| use_flash_attention_2 | False | Flag to use Flash Attention 2 (highly recommended) |

Example

Running the command below executes the h6_en, mt_bench, ifeval, and eq_bench benchmarks.

```bash
python3 evaluator.py \
    --h6_en \
    --data_parallel 8 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --ifeval \
    --eq_bench \
    --devices 0,1,2,3,4,5,6,7 \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
```
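
The common arguments can be added to the same command. The sketch below is illustrative only: it assumes the boolean flags (use_fast_tokenizer, use_flash_attention_2) are plain on/off switches.

```bash
# Sketch: a single-benchmark run with the common arguments spelled out.
# The bare on/off form of --use_flash_attention_2 is an assumption.
python3 evaluator.py \
    --h6_en \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --model_name SOLAR-10.7B-Instruct-v1.0 \
    --output_path ./results \
    --use_flash_attention_2
```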

Arguments for Each Benchmark

Evaluate an LLM with the benchmarks on the Open LLM Leaderboard.

| args | default | description |
| --- | --- | --- |
| h6_en | False | Run the six Open LLM Leaderboard benchmarks: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K |
| use_vllm | False | Run inference with vLLM |
| model_parallel | 1 | Degree of model parallelism |
| data_parallel | 1 | Degree of data parallelism |

Example

Library

```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    data_parallel=8,
    output_path="./results"
)
```

CLI

```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --h6_en \
    --data_parallel 8 \
    --output_path ./results
```
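
The other h6_en arguments combine in the same way. A hedged variant is sketched below; the bare --use_vllm switch and the way the parallelism sizes are combined are assumptions, not documented behavior.

```bash
# Sketch: H6 with vLLM inference, 2-way model parallelism and 4-way data
# parallelism. Flag forms are assumed; adjust to your GPU budget.
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --h6_en \
    --use_vllm \
    --model_parallel 2 \
    --data_parallel 4 \
    --output_path ./results
```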

Result

```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂h6_en
     ┣ 📜arc_challenge_25.json
     ┣ 📜gsm8k_5.json
     ┣ 📜hellaswag_10.json
     ┣ 📜mmlu_5.json
     ┣ 📜truthfulqa_mc2_0.json
     ┗ 📜winogrande_5.json
```
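
Each file holds the scores for one task; the numeric suffix is the few-shot setting used on the Open LLM Leaderboard (e.g. 25-shot ARC, 5-shot GSM8K). For a quick look at a result, pretty-printing the JSON is enough; the exact schema is not documented here, so the command below is just an inspection sketch.

```bash
# Pretty-print one per-task result file; the path follows the tree above.
python3 -m json.tool ./results/SOLAR-10.7B-Instruct-v1.0/h6_en/gsm8k_5.json
```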
