Evaluation
You can run various evaluations by simply adding a few arguments.
Common Arguments
These arguments are common across all benchmarks. Note that you must select at least one benchmark from the list provided in this section.
| Argument | Default | Description |
| --- | --- | --- |
| `ckpt_path` | `upstage/SOLAR-10.7B-Instruct-v1.0` | Model name or checkpoint path to be evaluated |
| `output_path` | `./results` | Path to save evaluation results |
| `model_name` | `SOLAR-10.7B-Instruct-v1.0` | Model name used when saving evaluation results |
| `use_fast_tokenizer` | `False` | Flag to use the fast tokenizer |
| `use_flash_attention_2` | `False` | Flag to use Flash Attention 2 (highly recommended) |
Example
Running the command below will execute the `h6_en`, `mt_bench`, `ifeval`, and `eq_bench` benchmarks.
```bash
python3 evaluator.py \
  --h6_en \
  --data_parallel 8 \
  --mt_bench \
  --num_gpus_total 8 \
  --parallel_api 4 \
  --ifeval \
  --eq_bench \
  --devices 0,1,2,3,4,5,6,7 \
  --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
```
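The same multi-benchmark run can presumably be launched from the library as well. The sketch below assumes that the `benchmark` argument of `Evaluator.run()` also accepts a list of benchmark names, mirroring the CLI flags above; this page does not confirm that, and the per-benchmark options (`num_gpus_total`, `parallel_api`, `devices`, ...) are left at their defaults here.

```python
import evalverse as ev

evaluator = ev.Evaluator()

# Sketch only: assumes `benchmark` accepts a list of benchmark names,
# mirroring the CLI flags above (not confirmed by this page).
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark=["h6_en", "mt_bench", "ifeval", "eq_bench"],
    data_parallel=8,
    output_path="./results",
)
```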
Arguments for Each Benchmark
Evaluate an LLM with the benchmarks used in the Open LLM Leaderboard.
| Argument | Default | Description |
| --- | --- | --- |
| `h6_en` | `False` | ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K |
| `use_vllm` | `False` | Inference with vLLM |
| `model_parallel` | `1` | Size of model parallelism |
| `data_parallel` | `1` | Size of data parallelism |
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    data_parallel=8,
    output_path="./results"
)
```
CLI
```bash
python3 evaluator.py \
  --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
  --h6_en \
  --data_parallel 8 \
  --output_path ./results
```
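To run `h6_en` on top of vLLM with model parallelism, the benchmark-specific arguments from the table above would be added in the same way. The sketch below is an assumption rather than a confirmed signature: it supposes that `run()` accepts `use_vllm` and `model_parallel` keyword arguments named after the CLI flags.

```python
import evalverse as ev

evaluator = ev.Evaluator()

# Sketch only: assumes `use_vllm` and `model_parallel` are forwarded as
# keyword arguments named after the CLI flags (not confirmed here).
# This would shard the model across 2 GPUs and run 4 data-parallel replicas.
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    use_vllm=True,
    model_parallel=2,
    data_parallel=4,
    output_path="./results",
)
```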
Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂h6_en
     ┣ 📜arc_challenge_25.json
     ┣ 📜gsm8k_5.json
     ┣ 📜hellaswag_10.json
     ┣ 📜mmlu_5.json
     ┣ 📜truthfulqa_mc2_0.json
     ┗ 📜winogrande_5.json
```
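Each JSON file holds the scores for one task, and the trailing number in the file name corresponds to the few-shot setting (e.g. 25-shot ARC, 5-shot GSM8K). To take a quick look at a result file, you can load it directly; the snippet below is only a sketch, since the exact schema depends on the underlying evaluation harness.

```python
import json
from pathlib import Path

# Path follows the layout shown above; adjust the model and benchmark
# directories to match your own run.
result_file = Path("./results/SOLAR-10.7B-Instruct-v1.0/h6_en/arc_challenge_25.json")

with result_file.open() as f:
    data = json.load(f)

# The schema depends on the evaluation harness, so start by
# listing the top-level keys.
print(list(data.keys()))
```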