Evaluation
You can run various evaluations simply by adding a few arguments.
Common Arguments
These arguments are common across all benchmarks. Note that you must select at least one benchmark from the list provided in this section.
| Argument | Default | Description |
| --- | --- | --- |
| `ckpt_path` | `upstage/SOLAR-10.7B-Instruct-v1.0` | Model name or checkpoint path to be evaluated |
| `output_path` | `./results` | Path to save evaluation results |
| `model_name` | `SOLAR-10.7B-Instruct-v1.0` | Model name used in saving evaluation results |
| `use_fast_tokenizer` | `False` | Flag to use the fast tokenizer |
| `use_flash_attention_2` | `False` | Flag to use Flash Attention 2 (highly recommended) |
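For instance, a run that overrides these common arguments could look like the sketch below. Only `--ckpt_path` and the benchmark flag `--h6_en` are confirmed by the example in the next section; the remaining flag spellings are assumed to mirror the argument names above.

```bash
# Sketch, not a verbatim command: flags other than --h6_en and --ckpt_path are
# assumed to follow the --<argument_name> pattern shown elsewhere on this page.
python3 evaluator.py \
    --h6_en \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --output_path ./results \
    --model_name SOLAR-10.7B-Instruct-v1.0 \
    --use_flash_attention_2  # assumed to be a boolean switch (default False)
```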
Example
Running the command below executes the `h6_en`, `mt_bench`, `ifeval`, and `eq_bench` benchmarks.
```bash
python3 evaluator.py \
    --h6_en \
    --data_parallel 8 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --ifeval \
    --eq_bench \
    --devices 0,1,2,3,4,5,6,7 \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
```

Arguments for Each Benchmark
To run the MT-Bench benchmark, an OpenAI API key must be provided in a .env file.
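A minimal .env sketch is shown below; the variable name `OPENAI_API_KEY` is the conventional one and is assumed here rather than taken from this page.

```bash
# .env (assumed variable name; replace the placeholder with your own key)
OPENAI_API_KEY=sk-your-api-key
```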
| Argument | Default | Description |
| --- | --- | --- |
| `mt_bench` | `False` | Evaluate an LLM with MT-Bench |
| `baselines` | `None` | The baseline LLMs to compare against |
| `judge_model` | `gpt-4` | The model used as the judge |
| `num_gpus_per_model` | `1` | The number of GPUs per model (for `gen_answer`) |
| `num_gpus_total` | `1` | The total number of GPUs (for `gen_answer`) |
| `parallel_api` | `1` | The number of concurrent API calls (for `gen_judge`) |
Example
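As an illustrative sketch, the command below runs only MT-Bench with the judge-related arguments from the table above. `--mt_bench`, `--ckpt_path`, `--num_gpus_total`, and `--parallel_api` appear in the earlier example; `--judge_model` and `--num_gpus_per_model` are assumed to mirror the argument names.

```bash
# Sketch: MT-Bench only. --judge_model and --num_gpus_per_model are assumed
# flag spellings; the other flags appear in the example above.
python3 evaluator.py \
    --mt_bench \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --judge_model gpt-4 \
    --num_gpus_per_model 1 \
    --num_gpus_total 8 \
    --parallel_api 4
```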