HiThink-Research/GAGE

📐 GAGE: General AI evaluation and Gauge Engine



📧 Contact: zhangrongjunchen@myhexin.com

Overview · Sample Schema · Smart Defaults · Game Arena · Arena Visual Control · Agent Eval · Benchmark · Contributing · Standards


GAGE is a unified, extensible evaluation framework for large language models, multimodal models, audio models, diffusion models, agents, and game environments. It provides one evaluation engine for datasets, model backends, metrics, arena runtimes, structured outputs, and replayable artifacts.

Game Arena Showcase

Demo clips: Gomoku, Doudizhu, and Mahjong GameArena matches; Space Invaders, Retro Mario, and ViZDoom sessions.

Why GAGE?

  • Fast evaluation engine: Run local smoke tests, model-backed jobs, and larger benchmark batches through the same pipeline shape.
  • Unified evaluation surface: Datasets, backends, role adapters, metrics, and output contracts are configured instead of hand-wired per benchmark.
  • Game and agent sandboxing: Game Arena, AppWorld, SWE-bench-style agent tasks, GUI interaction, and tool-augmented workflows share the same run/output model.
  • Replayable GameKit runtime: Gomoku, Tic-Tac-Toe, Doudizhu, Mahjong, PettingZoo Space Invaders, Retro Mario, and ViZDoom now emit structured arena traces plus arena_visual sessions.
  • Operational visibility: Runs write summary.json, sample outputs, logs, and visual artifacts so failures can be inspected after the fact.

Design Overview

Core design philosophy: everything is a step, everything is configurable.
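To illustrate the "everything is a step, everything is configurable" idea, here is a hypothetical run config sketch. The step names and keys below are illustrative only and do not reflect GAGE's actual schema:

```yaml
# Hypothetical sketch of a step-based run config (field names are illustrative).
run_id: demo_echo
steps:
  - name: load_dataset      # dataset loading is a configurable step
    type: dataset
    path: data/demo.jsonl
  - name: generate          # model backend is another step
    type: backend
    model: gpt-5.4
  - name: score             # metrics are steps too
    type: metric
    metric: exact_match
outputs:
  dir: runs/demo_echo
```

The point is that datasets, backends, and metrics are swapped by editing configuration rather than writing per-benchmark glue code.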

Architecture Design

End-to-end flow

Orchestration Design

Step view

Game Arena Design

GameArena runtime core design

Quick Start

1. Installation

```shell
# If you are in the mono-repo root:
cd gage-eval-main

# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

For Game Arena LLM configs, use the *_openai_gamekit.yaml variants and export OPENAI_API_KEY. The model defaults to gpt-5.4; set GAGE_GAME_ARENA_LLM_MODEL to override it, or set OPENAI_API_BASE for an OpenAI-compatible endpoint.
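A rough sketch of how these environment overrides could be resolved. This is illustrative only, not GAGE's actual code; the variable names come from the paragraph above, and the OpenAI base URL default is an assumption:

```python
import os

def resolve_llm_settings() -> dict:
    """Resolve Game Arena LLM settings from the environment.

    Illustrative sketch: GAGE_GAME_ARENA_LLM_MODEL overrides the gpt-5.4
    default, OPENAI_API_KEY is required, and OPENAI_API_BASE optionally
    points at an OpenAI-compatible endpoint.
    """
    return {
        "model": os.environ.get("GAGE_GAME_ARENA_LLM_MODEL", "gpt-5.4"),
        "api_key": os.environ["OPENAI_API_KEY"],  # required for *_openai_gamekit.yaml runs
        "api_base": os.environ.get("OPENAI_API_BASE", "https://api.openai.com/v1"),
    }
```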

2. Run a Basic Demo

```shell
python run.py \
  --config config/run_configs/demo_echo_run_1.yaml \
  --output-dir runs \
  --run-id demo_echo
```

3. View Reports

Default output structure:

```
runs/<run_id>/
  events.jsonl
  samples.jsonl
  summary.json
  samples/
    <task_id>/
      <sample_id>.json
```
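The layout above can be inspected programmatically. The helper below is a sketch that assumes only the file names listed; the JSON fields it returns are whatever the run actually wrote, and `load_run` is not a GAGE API:

```python
import json
from pathlib import Path

def load_run(run_dir: str) -> dict:
    """Collect one run directory's artifacts into a dict.

    Assumes the layout shown above: summary.json at the top level and
    per-sample JSON files under samples/<task_id>/<sample_id>.json.
    """
    root = Path(run_dir)
    summary = json.loads((root / "summary.json").read_text())
    samples: dict = {}
    for sample_file in root.glob("samples/*/*.json"):
        task_id = sample_file.parent.name
        samples.setdefault(task_id, {})[sample_file.stem] = json.loads(
            sample_file.read_text()
        )
    return {"summary": summary, "samples": samples}
```

This keeps post-hoc failure inspection simple: load the summary first, then drill into the per-task sample files for the cases that scored poorly.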

Advanced Configurations

| Scenario | Config Example | Description |
| --- | --- | --- |
| GameArena Human-vs-AI | config/custom/doudizhu/doudizhu_human_visual_gamekit.yaml | Browser-controlled Doudizhu match against LLM players |
| GameArena Pure Human Control | config/custom/retro_mario/retro_mario_human_visual_gamekit.yaml | Browser-controlled real-time Retro Mario session |
| Agent Evaluation | config/custom/appworld/appworld_official_jsonl.yaml | AppWorld sandbox evaluation |
| Code Gen | config/custom/swebench_pro/swebench_pro_smoke_agent.yaml | SWE-bench-style smoke run; Docker required |
| Text | config/custom/aime24/aime2024_chat.yaml | AIME, GPQA, Math500, and related text benchmarks |
| Multimodal | config/custom/mathvista/chat.yaml | MathVista and related multimodal benchmarks |
| LLM Judge | config/custom/examples/single_task_local_judge_qwen.yaml | Local LLM judge example |

Roadmap

  • Agent evaluation: Add stronger native agent benchmarking support with trajectory scoring and safety checks.
  • Game Arena expansion: Grow the GameKit catalog and keep browser control, replay, and output contracts consistent.
  • Gage-Client: Add a client tool for configuration management, failure diagnostics, and benchmark onboarding.
  • Distributed inference: Support multi-node task sharding and load balancing for large runs.
  • Benchmark expansion: Continue adding benchmark configs, metrics, and troubleshooting guidance.

Status

This project is in internal validation; APIs, configs, and docs may change rapidly.
