English · 中文
📧 Contact: zhangrongjunchen@myhexin.com
Overview · Sample Schema · Smart Defaults · Game Arena · Arena Visual Control · Agent Eval · Benchmark · Contributing · Standards
GAGE is a unified, extensible evaluation framework for large language models, multimodal models, audio models, diffusion models, agents, and game environments. It provides one evaluation engine for datasets, model backends, metrics, arena runtimes, structured outputs, and replayable artifacts.
- Fast evaluation engine: Run local smoke tests, model-backed jobs, and larger benchmark batches through the same pipeline.
- Unified evaluation surface: Datasets, backends, role adapters, metrics, and output contracts are configured instead of hand-wired per benchmark.
- Game and agent sandboxing: Game Arena, AppWorld, SWE-bench-style agent tasks, GUI interaction, and tool-augmented workflows share the same run/output model.
- Replayable GameKit runtime: Gomoku, Tic-Tac-Toe, Doudizhu, Mahjong, PettingZoo Space Invaders, Retro Mario, and ViZDoom now emit structured arena traces plus `arena_visual` sessions.
- Operational visibility: Runs write `summary.json`, sample outputs, logs, and visual artifacts so failures can be inspected after the fact.
Core design philosophy: everything is a step, everything is configurable.
```shell
# If you are in the mono-repo root:
cd gage-eval-main

# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

For Game Arena LLM configs, use the `*_openai_gamekit.yaml` variants and export `OPENAI_API_KEY`. The model defaults to `gpt-5.4`; set `GAGE_GAME_ARENA_LLM_MODEL` to override it, or set `OPENAI_API_BASE` to point at an OpenAI-compatible endpoint.
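A minimal environment setup for the Game Arena LLM configs might look like the following sketch; the variable names come from the paragraph above, while all concrete values are placeholders:

```shell
# Required for the *_openai_gamekit.yaml configs:
export OPENAI_API_KEY="sk-..."

# Optional overrides (variable names from the docs; values are examples only):
export GAGE_GAME_ARENA_LLM_MODEL="gpt-4o"        # override the default model
export OPENAI_API_BASE="https://example.com/v1"  # OpenAI-compatible endpoint
```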
```shell
python run.py \
  --config config/run_configs/demo_echo_run_1.yaml \
  --output-dir runs \
  --run-id demo_echo
```

Default output structure:

```text
runs/<run_id>/
  events.jsonl
  samples.jsonl
  summary.json
  samples/
    <task_id>/
      <sample_id>.json
```
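Once a run finishes, the artifacts above can be inspected programmatically. The sketch below assumes only the file names shown in the layout (`summary.json`, `samples.jsonl`); the fields inside each record are hypothetical and will vary by benchmark:

```python
import json
import tempfile
from pathlib import Path


def load_run(run_dir: Path) -> dict:
    """Load a run's summary and per-sample records (file names per the layout above)."""
    summary = json.loads((run_dir / "summary.json").read_text())
    samples = [
        json.loads(line)
        for line in (run_dir / "samples.jsonl").read_text().splitlines()
        if line.strip()
    ]
    return {"summary": summary, "samples": samples}


# Demo against a synthetic run directory (field names are illustrative):
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "demo_echo"
    run_dir.mkdir()
    (run_dir / "summary.json").write_text(
        json.dumps({"run_id": "demo_echo", "num_samples": 2})
    )
    (run_dir / "samples.jsonl").write_text(
        '{"sample_id": "s1", "score": 1.0}\n{"sample_id": "s2", "score": 0.0}\n'
    )
    run = load_run(run_dir)
    print(run["summary"]["run_id"], len(run["samples"]))  # → demo_echo 2
```

Because every record in `samples.jsonl` is one JSON object per line, the same loader works regardless of which benchmark produced the run.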
| Scenario | Config Example | Description |
|---|---|---|
| GameArena Human-vs-AI | `config/custom/doudizhu/doudizhu_human_visual_gamekit.yaml` | Browser-controlled Doudizhu match against LLM players |
| GameArena Pure Human Control | `config/custom/retro_mario/retro_mario_human_visual_gamekit.yaml` | Browser-controlled real-time Retro Mario session |
| Agent Evaluation | `config/custom/appworld/appworld_official_jsonl.yaml` | AppWorld sandbox evaluation |
| Code Gen | `config/custom/swebench_pro/swebench_pro_smoke_agent.yaml` | SWE-bench-style smoke run; Docker required |
| Text | `config/custom/aime24/aime2024_chat.yaml` | AIME, GPQA, Math500, and related text benchmarks |
| Multimodal | `config/custom/mathvista/chat.yaml` | MathVista and related multimodal benchmarks |
| LLM Judge | `config/custom/examples/single_task_local_judge_qwen.yaml` | Local LLM judge example |
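Any of these configs plugs into the same `run.py` entry point shown earlier. For example, a text-benchmark run might look like the sketch below (the run ID is an arbitrary label of your choosing):

```shell
python run.py \
  --config config/custom/aime24/aime2024_chat.yaml \
  --output-dir runs \
  --run-id aime24_smoke
```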
- Agent evaluation: Add stronger native agent benchmarking support with trajectory scoring and safety checks.
- Game Arena expansion: Grow the GameKit catalog and keep browser control, replay, and output contracts consistent.
- Gage-Client: Add a client tool for configuration management, failure diagnostics, and benchmark onboarding.
- Distributed inference: Support multi-node task sharding and load balancing for large runs.
- Benchmark expansion: Continue adding benchmark configs, metrics, and troubleshooting guidance.
This project is in internal validation; APIs, configs, and docs may change rapidly.