Evaluation

You can run various evaluations by simply adding a few arguments.

Common Arguments

These arguments are common across all benchmarks. Note that you must select at least one benchmark from the list provided in the Arguments for Each Benchmark section below.

| args | default | description |
| --- | --- | --- |
| `ckpt_path` | `upstage/SOLAR-10.7B-Instruct-v1.0` | Model name or checkpoint path to be evaluated |
| `output_path` | `./results` | Path to save evaluation results |
| `model_name` | `SOLAR-10.7B-Instruct-v1.0` | Model name used in saving eval results |
| `use_fast_tokenizer` | `False` | Flag to use the fast tokenizer |
| `use_flash_attention_2` | `False` | Flag to use Flash Attention 2 (highly suggested) |

Example

Running the command below executes the h6_en, mt_bench, ifeval, and eq_bench benchmarks.

python3 evaluator.py \
    --h6_en \
    --data_parallel 8 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --ifeval \
    --eq_bench \
    --devices 0,1,2,3,4,5,6,7 \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
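
If you prefer to drive the same run from a script, the sketch below launches the CLI through subprocess. It is only a sketch: it assumes evaluator.py is available in the working directory and uses exactly the flags shown above.

import subprocess

# Launch the same multi-benchmark run from Python.
# Assumes evaluator.py sits in the current working directory.
cmd = [
    "python3", "evaluator.py",
    "--h6_en",
    "--data_parallel", "8",
    "--mt_bench",
    "--num_gpus_total", "8",
    "--parallel_api", "4",
    "--ifeval",
    "--eq_bench",
    "--devices", "0,1,2,3,4,5,6,7",
    "--ckpt_path", "upstage/SOLAR-10.7B-Instruct-v1.0",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError if the run fails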

Arguments for Each Benchmark

Evaluate an LLM with the benchmarks of the Open LLM Leaderboard.

| args | default | description |
| --- | --- | --- |
| `h6_en` | `False` | ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K |
| `use_vllm` | `False` | Inference with vLLM |
| `model_parallel` | `1` | Size of model parallelism |
| `data_parallel` | `1` | Size of data parallelism |

Example

Library

import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    data_parallel=8,
    output_path="./results"
)

CLI

python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --h6_en \
    --data_parallel 8 \
    --output_path ./results
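
The h6_en arguments above can also be passed through the library. The snippet below is a hedged sketch: it assumes use_vllm, model_parallel, and data_parallel are accepted as keyword arguments by Evaluator.run(), mirroring the CLI flags of the same names.

import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
    use_vllm=True,       # assumed kwarg: run inference through vLLM
    model_parallel=2,    # assumed kwarg: shard the model across 2 GPUs
    data_parallel=4,     # assumed kwarg: run 4 data-parallel replicas
    output_path="./results"
)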

Result

📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂h6_en
     ┣ 📜arc_challenge_25.json
     ┣ 📜gsm8k_5.json
     ┣ 📜hellaswag_10.json
     ┣ 📜mmlu_5.json
     ┣ 📜truthfulqa_mc2_0.json
     ┗ 📜winogrande_5.json
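
The per-task JSON files can be inspected with the standard library. The sketch below simply walks the h6_en folder and prints the top-level keys of each file; the exact schema depends on the underlying evaluation harness, so no particular keys are assumed.

import json
from pathlib import Path

# Print the top-level keys of each per-task result file under h6_en.
result_dir = Path("./results/SOLAR-10.7B-Instruct-v1.0/h6_en")
for result_file in sorted(result_dir.glob("*.json")):
    with result_file.open() as f:
        data = json.load(f)
    print(result_file.name, "->", list(data.keys()))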

Evaluate an LLM with MT-Bench. To run the mt_bench benchmark, an OpenAI API key must be provided in a .env file.
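
A minimal .env sketch is shown below. The variable name OPENAI_API_KEY is the conventional one and is an assumption here; check your Evalverse setup if a different name is expected.

# .env (place in the project root; keep it out of version control)
OPENAI_API_KEY=sk-...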

| args | default | description |
| --- | --- | --- |
| `mt_bench` | `False` | Evaluate an LLM with MT-Bench |
| `baselines` | `None` | The baseline LLMs to compare against |
| `judge_model` | `gpt-4` | The model used as the judge |
| `num_gpus_per_model` | `1` | The number of GPUs per model (for gen_answer) |
| `num_gpus_total` | `1` | The total number of GPUs (for gen_answer) |
| `parallel_api` | `1` | The number of concurrent API calls (for gen_judge) |

Example

Library

import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="mt_bench",
    num_gpus_total=8,
    parallel_api=4,
    output_path="./results"
)

CLI

python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --mt_bench \
    --num_gpus_total 8 \
    --parallel_api 4 \
    --output_path ./results

Result

📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂mt_bench
     ┣ 📂model_answer
     ┃ ┗ 📜SOLAR-10.7B-Instruct-v1.0.jsonl
     ┣ 📂model_judgment
     ┃ ┗ 📜gpt-4_single.jsonl
     ┗ 📜scores.txt
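
For a quick aggregate of the judgments, the sketch below averages the scores in model_judgment/gpt-4_single.jsonl. It assumes each line is a JSON object with a numeric score field, which is the usual single-judgment format, and skips entries the judge could not score.

import json
from pathlib import Path

judgment_file = Path(
    "./results/SOLAR-10.7B-Instruct-v1.0/mt_bench/model_judgment/gpt-4_single.jsonl"
)

# Collect numeric scores; unscorable judgments are often recorded as -1.
scores = []
with judgment_file.open() as f:
    for line in f:
        score = json.loads(line).get("score")
        if isinstance(score, (int, float)) and score >= 0:
            scores.append(score)

if scores:
    print(f"MT-Bench average over {len(scores)} judgments: {sum(scores) / len(scores):.2f}")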

Evaluate an LLM with Instruction-Following Eval (IFEval).

| args | default | description |
| --- | --- | --- |
| `ifeval` | `False` | Evaluate an LLM with Instruction-Following Eval |
| `gpu_per_inst_eval` | `1` | Number of GPUs per IFEval process. Set to > 1 for larger models. Keep `len(devices) / gpu_per_inst_eval = 1` for repeatable results |

Example

Library

import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="ifeval",
    devices="0,1,2,3,4,5,6,7",
    output_path="./results"
)

CLI

python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --ifeval \
    --devices 0,1,2,3,4,5,6,7 \
    --output_path ./results

Result

📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂ifeval
     ┣ 📜eval_results_loose.jsonl
     ┣ 📜eval_results_strict.jsonl
     ┣ 📜output.jsonl
     ┗ 📜scores.txt
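
A quick way to look at the IFEval output is to print scores.txt and count the strict versus loose evaluation records. The sketch below assumes only that the two eval_results_*.jsonl files contain one JSON record per line.

from pathlib import Path

ifeval_dir = Path("./results/SOLAR-10.7B-Instruct-v1.0/ifeval")

# Print the summary written by the benchmark.
print((ifeval_dir / "scores.txt").read_text())

# Count records in the strict and loose result files (one JSON object per line).
for name in ("eval_results_strict.jsonl", "eval_results_loose.jsonl"):
    lines = (ifeval_dir / name).read_text().splitlines()
    print(f"{name}: {sum(1 for line in lines if line.strip())} records")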

Evaluate an LLM with EQ-Bench.

| args | default | description |
| --- | --- | --- |
| `eq_bench` | `False` | Evaluate an LLM with EQ-Bench |
| `eq_bench_prompt_type` | `"ChatML"` | Chat template to use |
| `eq_bench_lora_path` | `None` | Path to LoRA adapters |
| `eq_bench_quantization` | `None` | Whether to quantize the LLM when loading |

Example

Library

import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="eq_bench",
    eq_bench_prompt_type="Solar-v1",
    devices="0,1,2,3,4,5,6,7",
    output_path="./results"
)

CLI

python3 evaluator.py \
    --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0 \
    --eq_bench \
    --eq_bench_prompt_type Solar-v1 \
    --devices 0,1,2,3,4,5,6,7 \
    --output_path ./results

Result

📦results
 ┗ 📂SOLAR-10.7B-Instruct-v1.0
   ┗ 📂eq_bench
     ┣ 📜benchmark_results.csv
     ┗ 📜raw_results.json
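
The CSV summary can be read with the standard library. The sketch below prints each row as a dictionary, since the exact column names are not documented here.

import csv
from pathlib import Path

# Print each row of the EQ-Bench summary CSV as a dict.
csv_path = Path("./results/SOLAR-10.7B-Instruct-v1.0/eq_bench/benchmark_results.csv")
with csv_path.open(newline="") as f:
    for row in csv.DictReader(f):
        print(row)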