Evalverse

The Universe of Evaluation. All about the evaluation for LLMs.

Introduction

Evalverse is a freely accessible, open-source project designed to support your LLM (Large Language Model) evaluations. We provide a simple, standardized, and user-friendly solution for the processing and management of LLM evaluations, catering to the needs of AI research engineers and scientists. We also support no-code evaluation processes for people who may have less experience working with LLMs. Moreover, you will receive a well-organized report with figures summarizing the evaluation results.

Key Features

  • Unified evaluation with Submodules: For unified and expandable evaluation, Evalverse uses Git submodules to integrate external evaluation frameworks such as lm-evaluation-harness and FastChat. New submodules can be added to support more external evaluation frameworks, and upstream changes to existing submodules can be fetched at any time to keep up with evaluation practice in the fast-paced LLM field.

  • No-code evaluation request: Evalverse supports no-code evaluation via Slack requests. The user types Request! in a direct message or in a Slack channel where the Evalverse Slack bot is active. The Slack bot then asks the user to enter either the model name on the Hugging Face Hub or the path to a local model directory, and executes the evaluation process.

  • LLM evaluation report: Evalverse can also provide reports on finished evaluations in a no-code manner. To receive an evaluation report, the user first types Report!. Once the user selects the models and evaluation criteria, Evalverse calculates the average scores and rankings from the evaluation results stored in the Database and provides a report with a performance table and a visualized graph.
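The averaging and ranking step described for the report can be illustrated with a minimal sketch. This is not Evalverse's actual implementation: the function names, the result layout, the model names, and the scores below are all hypothetical, and the sketch assumes every benchmark score is on the same 0-100 scale so that a plain average is meaningful.

```python
from statistics import mean

# Hypothetical stored results: model name -> benchmark -> score (0-100 scale).
# Model names and numbers are invented for illustration only.
results = {
    "model-a": {"ifeval": 50.0, "eq_bench": 40.0},
    "model-b": {"ifeval": 60.0, "eq_bench": 50.0},
    "model-c": {"ifeval": 70.0},  # no eq_bench result stored yet
}


def build_report(results, benchmarks):
    """Return (rank, model, average) rows, best average first.

    Models missing a score for any selected benchmark are left out,
    so that every average covers the same set of benchmarks.
    """
    averages = {
        model: mean(scores[b] for b in benchmarks)
        for model, scores in results.items()
        if all(b in scores for b in benchmarks)
    }
    ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return [(rank, model, avg) for rank, (model, avg) in enumerate(ordered, start=1)]


def format_table(rows):
    """Render the ranked rows as a fixed-width text table."""
    lines = [f"{'Rank':<6}{'Model':<12}{'Average':>8}"]
    for rank, model, avg in rows:
        lines.append(f"{rank:<6}{model:<12}{avg:>8.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(build_report(results, ["ifeval", "eq_bench"])))
```

Excluding models with incomplete results is one possible design choice; a real report might instead show the missing entries explicitly.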

Supported evaluations

Evaluation | Original repository

H6 (Open LLM Leaderboard)

MT-bench

IFEval

EQ-Bench

Architecture of Evalverse

  • Submodule. The Submodule serves as the evaluation engine that is responsible for the heavy lifting involved in evaluating LLMs. Publicly available LLM evaluation libraries can be integrated into Evalverse as submodules. This component makes Evalverse expandable, thereby ensuring that the library remains up-to-date.

  • Connector. The Connector plays a role in linking the Submodules with the Evaluator. It contains evaluation scripts, along with the necessary arguments, from various external libraries.

  • Evaluator. The Evaluator performs the requested evaluations on the Compute Cluster by utilizing the evaluation scripts from the Connector. The Evaluator can receive evaluation requests either from the Reporter, which facilitates a no-code evaluation approach, or directly from the end-user for code-based evaluation.

  • Compute Cluster. The Compute Cluster is the collection of hardware accelerators needed to execute the LLM evaluation processes. When the Evaluator schedules an evaluation job, the Compute Cluster fetches the required model and data files from the Database. The results of the evaluation jobs are sent to the Database for storage.

  • Database. The Database stores the model files and data needed in the evaluation processes, along with evaluation results. The stored evaluation results are used by the Reporter to create evaluation reports for the user.

  • Reporter. The Reporter handles the evaluation and report requests sent by users, allowing for a no-code approach to LLM evaluation. The Reporter sends the requested evaluation jobs to the Evaluator and fetches the evaluation results from the Database; the results are then sent to the user via an external communication platform such as Slack. In this way, users receive a table and a figure that summarize the evaluation results.
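The data flow between the components above can be sketched in a few lines of Python. The class and method names, the script path, and the `--model` and `--batch_size` arguments are hypothetical and do not reflect Evalverse's real API; the `runner` callable stands in for the Compute Cluster so that the sketch runs without any hardware or submodules.

```python
from dataclasses import dataclass, field


@dataclass
class Connector:
    """Maps a benchmark name to a submodule script and its default arguments."""
    scripts: dict = field(default_factory=dict)

    def register(self, benchmark, script, default_args):
        self.scripts[benchmark] = (script, dict(default_args))

    def command_for(self, benchmark, model):
        script, args = self.scripts[benchmark]
        cmd = ["python", script, "--model", model]
        for key, value in args.items():
            cmd += [f"--{key}", str(value)]
        return cmd


@dataclass
class Database:
    """Stores evaluation results as model -> benchmark -> score."""
    results: dict = field(default_factory=dict)

    def save(self, model, benchmark, score):
        self.results.setdefault(model, {})[benchmark] = score

    def load(self, model):
        return dict(self.results.get(model, {}))


class Evaluator:
    """Runs requested evaluations using the commands supplied by the Connector."""

    def __init__(self, connector, database, runner):
        self.connector = connector
        self.database = database
        self.runner = runner  # stand-in for the Compute Cluster

    def evaluate(self, model, benchmarks):
        for benchmark in benchmarks:
            cmd = self.connector.command_for(benchmark, model)
            score = self.runner(cmd)
            self.database.save(model, benchmark, score)


class Reporter:
    """Forwards evaluation requests and builds reports from stored results."""

    def __init__(self, evaluator, database):
        self.evaluator = evaluator
        self.database = database

    def handle_request(self, model, benchmarks):
        self.evaluator.evaluate(model, benchmarks)

    def handle_report(self, model):
        scores = self.database.load(model)
        summary = ", ".join(f"{name}={score}" for name, score in sorted(scores.items()))
        return f"{model}: {summary}"


if __name__ == "__main__":
    connector = Connector()
    # Hypothetical script path and argument, for illustration only.
    connector.register("ifeval", "submodules/example/run_ifeval.py", {"batch_size": 4})
    database = Database()
    # A fake runner that returns a fixed score instead of launching a job.
    evaluator = Evaluator(connector, database, runner=lambda cmd: 42.0)
    reporter = Reporter(evaluator, database)
    reporter.handle_request("my-model", ["ifeval"])
    print(reporter.handle_report("my-model"))
```

In this sketch a code-based user would call `Evaluator.evaluate` directly, while a no-code user would go through `Reporter`, mirroring the two entry points described above.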

Citation

If you want to cite our 🌌 Evalverse project, feel free to use the following BibTeX entry:

@misc{evalverse,
  title = {Evalverse},
  author = {Jihoo Kim and Wonho Song and Dahyun Kim and Yoonsoo Kim and Yungi Kim and Chanjun Park},
  year = {2024},
  publisher = {GitHub, Upstage AI},
  howpublished = {\url{https://github.com/UpstageAI/evalverse}},
}
