Evaluation
You can run various evaluations by simply adding a few arguments.
Common Arguments
These arguments are common across all benchmarks. Note that you must select at least one of the benchmarks described below.
| Argument | Default | Description |
| --- | --- | --- |
| `ckpt_path` | upstage/SOLAR-10.7B-Instruct-v1.0 | Model name or checkpoint path to be evaluated |
| `output_path` | ./results | Path to save evaluation results |
| `model_name` | SOLAR-10.7B-Instruct-v1.0 | Model name used when saving evaluation results |
| `use_fast_tokenizer` | False | Flag to use the fast tokenizer |
| `use_flash_attention_2` | False | Flag to use Flash Attention 2 (highly suggested) |
Example
By running the code below, the h6_en, mt_bench, ifeval, and eq_bench benchmarks will be executed.
```bash
python3 evaluator.py \
    --h6_en \
    --data_parallel 8 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --ifeval \
    --eq_bench \
    --devices 0,1,2,3,4,5,6,7 \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
```
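The same multi-benchmark run can also be expressed with the Python library. The sketch below is illustrative only: it assumes that `benchmark` accepts a list of benchmark names and that the common arguments above map one-to-one onto `evaluator.run()` keyword arguments, which this page does not confirm.

```python
import evalverse as ev

# Hedged sketch: the list-valued `benchmark` argument and the extra keyword
# arguments below are assumed to mirror the CLI flags documented above.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark=["h6_en", "mt_bench", "ifeval", "eq_bench"],  # assumed to accept a list
    output_path="./results",
    model_name="SOLAR-10.7B-Instruct-v1.0",  # name used when saving results
    use_flash_attention_2=True,              # highly suggested
)
```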
Arguments for Each Benchmark

Evaluate an LLM with the benchmarks in the Open LLM Leaderboard
| Argument | Default | Description |
| --- | --- | --- |
| `h6_en` | False | ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K |
| `use_vllm` | False | Inference with vLLM |
| `model_parallel` | 1 | Size of model parallelism |
| `data_parallel` | 1 | Size of data parallelism |
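The examples below only exercise `data_parallel`. Here is a minimal sketch of the remaining flags; it assumes `use_vllm` and `model_parallel` can be passed to `evaluator.run()` under the same names as in the table above, which is an assumption rather than documented API.

```python
import evalverse as ev

# Hedged sketch: keyword names assumed to match the flags in the table above.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    use_vllm=True,      # run inference with vLLM
    model_parallel=2,   # shard each model replica across 2 GPUs
    data_parallel=4,    # 4 replicas, e.g. on an 8-GPU node
    output_path="./results",
)
```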
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    data_parallel=8,
    output_path="./results"
)
```

CLI
```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --h6_en \
    --data_parallel 8 \
    --output_path ./results
```

Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
    ┗ 📂h6_en
       ┣ 📜arc_challenge_25.json
       ┣ 📜gsm8k_5.json
       ┣ 📜hellaswag_10.json
       ┣ 📜mmlu_5.json
       ┣ 📜truthfulqa_mc2_0.json
       ┗ 📜winogrande_5.json
```

To run the mt_bench benchmark, an OpenAI API key in a .env file is required.
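A quick way to confirm the key is visible before launching. This is a generic sketch, not part of evalverse: it assumes the standard OPENAI_API_KEY variable name and uses python-dotenv to read the .env file (evalverse may already load it internally).

```python
import os
from dotenv import load_dotenv

# The .env file would contain a line such as:
#   OPENAI_API_KEY=<your key>
# Variable name assumed to follow the standard OpenAI convention.
load_dotenv()
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY in your .env file before running mt_bench.")
```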
| Argument | Default | Description |
| --- | --- | --- |
| `mt_bench` | False | Evaluate an LLM with MT-Bench |
| `baselines` | None | The baseline LLMs to compare against |
| `judge_model` | gpt-4 | The model used as the judge |
| `num_gpus_per_model` | 1 | The number of GPUs per model (for gen_answer) |
| `num_gpus_total` | 1 | The total number of GPUs (for gen_answer) |
| `parallel_api` | 1 | The number of concurrent API calls (for gen_judge) |
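For comparison against a baseline and a different judge, a hedged sketch is shown below; the keyword names, and the way a baseline model is specified, are assumed to mirror the flags in the table above rather than documented API.

```python
import evalverse as ev

# Hedged sketch: `baselines` and `judge_model` keywords, and the baseline value
# format, are assumptions based on the table above.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="mt_bench",
    baselines="mistralai/Mistral-7B-Instruct-v0.2",  # baseline LLM to compare (format assumed)
    judge_model="gpt-4",     # model used as the judge
    num_gpus_total=8,        # total GPUs for answer generation
    parallel_api=4,          # concurrent API calls for judging
    output_path="./results",
)
```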
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="mt_bench",
    num_gpus_total=8,
    parallel_api=4,
    output_path="./results"
)
```

CLI
```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --output_path ./results
```

Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
    ┗ 📂mt_bench
       ┣ 📂model_answer
       ┃  ┗ 📜SOLAR-10.7B-Instruct-v1.0.jsonl
       ┣ 📂model_judgment
       ┃  ┗ 📜gpt-4_single.jsonl
       ┗ 📜scores.txt
```

Evaluate an LLM with Instruction Following Eval
| Argument | Default | Description |
| --- | --- | --- |
| `ifeval` | False | Evaluate an LLM with Instruction Following Eval |
| `gpu_per_inst_eval` | 1 | Number of GPUs per ifeval process. Set to > 1 for larger models. Keep len(devices) / gpu_per_inst_eval = 1 for repeatable results |
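The gpu_per_inst_eval constraint is easiest to see in code. The sketch below is only illustrative: it assumes `gpu_per_inst_eval` can be passed to `evaluator.run()` under the same name as the CLI flag, and the larger model name is a placeholder.

```python
import evalverse as ev

# Hedged sketch for a larger checkpoint: all 8 visible devices serve one ifeval
# process, so len(devices) / gpu_per_inst_eval == 8 / 8 == 1 (repeatable results).
evaluator = ev.Evaluator()
evaluator.run(
    model="your-org/your-70b-instruct-model",  # hypothetical larger model
    benchmark="ifeval",
    devices="0,1,2,3,4,5,6,7",
    gpu_per_inst_eval=8,   # keyword name assumed to match the CLI flag
    output_path="./results",
)
```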
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="ifeval",
    devices="0,1,2,3,4,5,6,7",
    output_path="./results"
)
```

CLI
```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --ifeval \
    --devices 0,1,2,3,4,5,6,7 \
    --output_path ./results
```

Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
    ┗ 📂ifeval
       ┣ 📜eval_results_loose.jsonl
       ┣ 📜eval_results_strict.jsonl
       ┣ 📜output.jsonl
       ┗ 📜scores.txt
```

Evaluate an LLM with EQ-Bench
| Argument | Default | Description |
| --- | --- | --- |
| `eq_bench` | False | Evaluate an LLM with EQ-Bench |
| `eq_bench_prompt_type` | "ChatML" | Chat template to use |
| `eq_bench_lora_path` | None | Path to LoRA adapters |
| `eq_bench_quantization` | None | Whether to quantize the LLM when loading |
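The LoRA and quantization options do not appear in the examples below, so here is a hedged sketch combining them; the keyword names follow the table above, while the adapter path and the "8bit" quantization value are placeholders, since accepted values are not listed on this page.

```python
import evalverse as ev

# Hedged sketch: keyword names mirror the flags in the table above; the LoRA
# path and quantization setting are placeholders, not documented values.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="eq_bench",
    eq_bench_prompt_type="ChatML",          # chat template (default shown above)
    eq_bench_lora_path="./lora_adapters",   # placeholder path to LoRA adapters
    eq_bench_quantization="8bit",           # placeholder; check supported values
    output_path="./results",
)
```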
Example
Library
```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="eq_bench",
    eq_bench_prompt_type="Solar-v1",
    devices="0,1,2,3,4,5,6,7",
    output_path="./results"
)
```

CLI
```bash
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --eq_bench \
    --eq_bench_prompt_type Solar-v1 \
    --devices 0,1,2,3,4,5,6,7 \
    --output_path ./results
```

Result
```
📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
    ┗ 📂eq_bench
       ┣ 📜benchmark_results.csv
       ┗ 📜raw_results.json
```
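The output files above are plain JSON, JSONL, CSV, and text. The snippet below is not part of the evalverse API; it is a generic sketch for inspecting a few of them, assuming the ./results layout shown in the trees and, for h6_en, an lm-evaluation-harness-style JSON schema (an assumption).

```python
import csv
import json
from pathlib import Path

# Generic inspection sketch (not evalverse API); paths follow the result trees above.
root = Path("./results/SOLAR-10.7B-Instruct-v1.0")

# h6_en: per-task JSON files; top-level "results" key assumed (harness-style schema).
for task_file in sorted((root / "h6_en").glob("*.json")):
    data = json.loads(task_file.read_text())
    print(task_file.name, list(data.get("results", {}).keys()))

# eq_bench: summary CSV; column names are not documented here, so print rows as-is.
with open(root / "eq_bench" / "benchmark_results.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```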