English · 中文
📧 Contact: zhangrongjunchen@myhexin.com
Overview · Sample Schema · Smart Defaults · Game Arena · Arena Visual Control · Agent Eval · Benchmark · Contributing · Standards
GAGE is a unified, extensible evaluation framework for large language models, multimodal models, audio models, diffusion models, agents, and game environments. It provides one evaluation engine for datasets, model backends, metrics, arena runtimes, structured outputs, and replayable artifacts.
- Fast evaluation engine: Run local smoke tests, model-backed jobs, and larger benchmark batches through the same pipeline.
- Unified evaluation surface: Datasets, backends, role adapters, metrics, and output contracts are configured instead of hand-wired per benchmark.
- Game and agent sandboxing: Game Arena, AppWorld, SWE-bench-style agent tasks, GUI interaction, and tool-augmented workflows share the same run/output model.
- Replayable GameKit runtime: Gomoku, Tic-Tac-Toe, Doudizhu, Mahjong, PettingZoo Space Invaders, Retro Mario, and ViZDoom now emit structured arena traces plus `arena_visual` sessions.
- Operational visibility: Runs write `summary.json`, sample outputs, logs, and visual artifacts so failures can be inspected after the fact.
Core design philosophy: everything is a step, everything is configurable.
```shell
# If you are in the mono-repo root:
cd gage-eval-main

# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

For Game Arena LLM configs, use the `*_openai_gamekit.yaml` variants and export `OPENAI_API_KEY`. The model defaults to `gpt-5.4`; set `GAGE_GAME_ARENA_LLM_MODEL` to override it, or set `OPENAI_API_BASE` to point at an OpenAI-compatible endpoint.
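A minimal environment setup for the Game Arena LLM configs might look like the following sketch; the variable names come from the paragraph above, while all concrete values are placeholders:

```shell
# Required for the *_openai_gamekit.yaml configs:
export OPENAI_API_KEY="sk-..."

# Optional overrides (variable names from the docs; values are examples only):
export GAGE_GAME_ARENA_LLM_MODEL="gpt-4o"        # override the default model
export OPENAI_API_BASE="https://example.com/v1"  # OpenAI-compatible endpoint
```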
```shell
python run.py \
  --config config/run_configs/demo_echo_run_1.yaml \
  --output-dir runs \
  --run-id demo_echo
```

Default output structure:

```text
runs/<run_id>/
  events.jsonl
  samples.jsonl
  summary.json
  samples/
    <task_id>/
      <sample_id>.json
```
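Once a run finishes, the artifacts above can be inspected programmatically. The sketch below assumes only the file names shown in the layout (`summary.json`, `samples.jsonl`); the fields inside each record are hypothetical and will vary by benchmark:

```python
import json
import tempfile
from pathlib import Path


def load_run(run_dir: Path) -> dict:
    """Load a run's summary and per-sample records (file names per the layout above)."""
    summary = json.loads((run_dir / "summary.json").read_text())
    samples = [
        json.loads(line)
        for line in (run_dir / "samples.jsonl").read_text().splitlines()
        if line.strip()
    ]
    return {"summary": summary, "samples": samples}


# Demo against a synthetic run directory (field names are illustrative):
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "demo_echo"
    run_dir.mkdir()
    (run_dir / "summary.json").write_text(
        json.dumps({"run_id": "demo_echo", "num_samples": 2})
    )
    (run_dir / "samples.jsonl").write_text(
        '{"sample_id": "s1", "score": 1.0}\n{"sample_id": "s2", "score": 0.0}\n'
    )
    run = load_run(run_dir)
    print(run["summary"]["run_id"], len(run["samples"]))  # → demo_echo 2
```

Because every record in `samples.jsonl` is one JSON object per line, the same loader works regardless of which benchmark produced the run.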
| Scenario | Config Example | Description |
|---|---|---|
| GameArena Human-vs-AI | `config/custom/doudizhu/doudizhu_human_visual_gamekit.yaml` | Browser-controlled Doudizhu match against LLM players |
| GameArena Pure Human Control | `config/custom/retro_mario/retro_mario_human_visual_gamekit.yaml` | Browser-controlled real-time Retro Mario session |
| Agent Evaluation | `config/custom/appworld/appworld_official_jsonl.yaml` | AppWorld sandbox evaluation |
| Code Gen | `config/custom/swebench_pro/swebench_pro_smoke_agent.yaml` | SWE-bench-style smoke run; Docker required |
| Text | `config/custom/aime24/aime2024_chat.yaml` | AIME, GPQA, Math500, and related text benchmarks |
| Multimodal | `config/custom/mathvista/chat.yaml` | MathVista and related multimodal benchmarks |
| LLM Judge | `config/custom/examples/single_task_local_judge_qwen.yaml` | Local LLM judge example |
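Any of these configs plugs into the same `run.py` entry point shown earlier. For example, a text-benchmark run might look like the sketch below (the run ID is an arbitrary label of your choosing):

```shell
python run.py \
  --config config/custom/aime24/aime2024_chat.yaml \
  --output-dir runs \
  --run-id aime24_smoke
```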
- Agent evaluation: Add stronger native agent benchmarking support with trajectory scoring and safety checks.
- Game Arena expansion: Grow the GameKit catalog and keep browser control, replay, and output contracts consistent.
- Gage-Client: Add a client tool for configuration management, failure diagnostics, and benchmark onboarding.
- Distributed inference: Support multi-node task sharding and load balancing for large runs.
- Benchmark expansion: Continue adding benchmark configs, metrics, and troubleshooting guidance.
This project is in internal validation; APIs, configs, and docs may change rapidly.