Evaluation
You can run various evaluations simply by adding a few arguments.
The arguments in the table below are common across all benchmarks. Note that you must select at least one of the benchmarks described in this section.
args | default | description |
---|---|---|
ckpt_path | upstage/SOLAR-10.7B-Instruct-v1.0 | Model name or ckpt_path to be evaluated |
output_path | ./results | Path to save evaluation results |
model_name | SOLAR-10.7B-Instruct-v1.0 | Model name used when saving evaluation results |
use_fast_tokenizer | False | Flag to use the fast tokenizer |
use_flash_attention_2 | False | Flag to use Flash Attention 2 (highly recommended) |
By running the code below, the h6_en, mt_bench, ifeval, and eq_bench benchmarks will all be executed.
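A minimal sketch of such a run, assuming the evaluation entry point is a script named `evaluator.py` (substitute your project's actual entry point) and that benchmark flags are plain switches; the flags mirror the arguments documented in the tables below:

```bash
# Run all four benchmarks in a single pass.
# Note: `evaluator.py` is an assumed entry point name.
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --output_path ./results \
    --model_name SOLAR-10.7B-Instruct-v1.0 \
    --use_flash_attention_2 \
    --h6_en \
    --mt_bench \
    --ifeval \
    --eq_bench
```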
args | default | description |
---|---|---|
h6_en | False | ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K |
use_vllm | False | Inference with vLLM |
model_parallel | 1 | Size of model_parallel |
data_parallel | 1 | Size of data_parallel |
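For instance, a run restricted to h6_en with vLLM inference and two-way data parallelism might look like the sketch below (again assuming the hypothetical `evaluator.py` entry point):

```bash
# h6_en only, served through vLLM with 2-way data parallelism.
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --h6_en \
    --use_vllm \
    --data_parallel 2
```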
args | default | description |
---|---|---|
mt_bench | False | Evaluate an LLM with MT-Bench |
baselines | None | The baseline LLMs to compare against |
judge_model | gpt-4 | The model used as the judge |
num_gpus_per_model | 1 | The number of GPUs per model (for gen_answer) |
num_gpus_total | 1 | The total number of GPUs (for gen_answer) |
parallel_api | 1 | The number of concurrent API calls (for gen_judge) |
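A hedged example of an MT-Bench run judged by gpt-4 with four concurrent judge API calls (same assumed entry point):

```bash
# MT-Bench with gpt-4 as the judge and 4 parallel judge API calls.
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --mt_bench \
    --judge_model gpt-4 \
    --parallel_api 4
```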
args | default | description |
---|---|---|
ifeval | False | Evaluate an LLM with Instruction-Following Eval (IFEval) |
gpu_per_inst_eval | 1 | Number of GPUs per IFEval process. Set to > 1 for larger models. Keep len(devices) / gpu_per_inst_eval = 1 for repeatable results |
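For a larger model you might give each IFEval process two GPUs; the sketch below assumes the same hypothetical `evaluator.py` entry point and that visible devices are controlled via CUDA_VISIBLE_DEVICES, so that len(devices) / gpu_per_inst_eval stays at 1:

```bash
# IFEval with 2 GPUs per eval process; 2 visible devices keeps the ratio at 1.
CUDA_VISIBLE_DEVICES=0,1 python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --ifeval \
    --gpu_per_inst_eval 2
```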
args | default | description |
---|---|---|
eq_bench | False | Evaluate an LLM with EQ-Bench |
eq_bench_prompt_type | "ChatML" | Chat template to use |
eq_bench_lora_path | None | Path to LoRA adapters |
eq_bench_quantization | None | Whether to quantize the LLM when loading |
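And an EQ-Bench sketch using the default ChatML prompt type together with LoRA adapters (entry point name and adapter path are illustrative):

```bash
# EQ-Bench with the ChatML template and LoRA adapters loaded from a local path.
python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --eq_bench \
    --eq_bench_prompt_type ChatML \
    --eq_bench_lora_path ./path/to/lora-adapters
```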