Catch AI behavioral regressions before merge.
Run eval suites for prompts, agents, and workflows in GitHub Actions.
You can validate code quality, costs, infrastructure, and skill structure — but none of that tells you whether your AI workflow still produces good outputs. Behavioral regressions are invisible until they hit production.
No existing GitHub Action covers this gap. ai-workflow-evals is the missing capstone of the AI DevOps stack.
```yaml
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions
```

Create `evals/my-workflow.eval.yml`:

```yaml
version: "1"
target:
  type: command
  run: "node dist/my-agent.js"
cases:
  - id: basic-summary
    input: "Summarize this PR in one sentence."
    checks:
      - type: contains
        value: "pull request"
      - type: max-length
        value: 200
      - type: not-contains
        value: "I cannot"
  - id: cost-estimate
    input: "What will this deployment cost?"
    checks:
      - type: regex
        pattern: "\\$[0-9]+\\.?[0-9]*"
```

The full configuration reference:

```yaml
version: "1"
target:
  type: command                # command | http | file
  run: "node dist/agent.js"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  timeout: 30                  # seconds
cases:
  - id: my-case                # required, unique ID
    description: "What this checks"
    input: "The prompt or input to send"
    skip: false                # set true to skip
    tags: [smoke, fast]
    checks:
      - type: contains
        value: "expected text"
      - type: not-contains
        value: "bad text"
      - type: regex
        pattern: "\\d+"
        flags: "i"
      - type: max-length
        value: 500
      - type: json-schema
        schema: "./schemas/response.json"
      - type: llm-judge
        model: gpt-4o-mini
        criteria: "Response identifies at least one concrete issue"
        threshold: 0.8
```

| Type | Description | Required Fields |
|---|---|---|
| `command` | Run a CLI command, pipe input via stdin | `run` |
| `http` | POST input as JSON to an HTTP endpoint | `url` |
| `file` | Read a static file as output (no execution) | `path` |
| Checker | Description | Required |
|---|---|---|
| `contains` | Output must contain the string | `value: "text"` |
| `not-contains` | Output must NOT contain the string | `value: "text"` |
| `regex` | Output must match the pattern | `pattern: "\\d+"` |
| `max-length` | Output must be ≤ N chars | `value: 200` |
| `json-schema` | Output (parsed as JSON) must match the schema | `schema: "./path.json"` or inline object |
| `llm-judge` | LLM scores output against criteria (optional) | `criteria`, `threshold`, `model` |

The LLM judge is optional. If `llm-judge-key` is not set, `llm-judge` checks are skipped gracefully — CI never fails due to missing API keys.
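Because the judge degrades gracefully, the key can be passed unconditionally; a sketch (the secret name here is an assumption — use whichever secret holds your key):

```yaml
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    llm-judge-key: ${{ secrets.OPENAI_API_KEY }}  # llm-judge checks skip if unset
```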
| Input | Description | Default |
|---|---|---|
| `eval-dir` | Directory containing `.eval.yml` files | `evals/` |
| `baseline-artifact` | Artifact name for baseline scores | `eval-baseline` |
| `pass-threshold` | Minimum pass rate (used with `fail-on: threshold`) | `1.0` |
| `regression-threshold` | Max allowed regressions before fail | `0` |
| `fail-on` | `regressions` \| `threshold` \| `none` | `regressions` |
| `post-comment` | Post a PR comment with results | `true` |
| `llm-judge-key` | API key for LLM-as-judge | (empty) |
| `dry-run` | Run without updating the baseline | `false` |
| Output | Description |
|---|---|
| `pass-rate` | Fraction of cases passed (e.g. `0.9500`) |
| `total-cases` | Total cases run |
| `passed` | Cases passed |
| `failed` | Cases failed |
| `regressions` | Cases newly failing vs the baseline |
| `improvements` | Cases newly passing vs the baseline |
| `report-path` | Path to the JSON report artifact |
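Downstream steps can read these outputs via a step `id`; a sketch (the `id: evals` and the echo step are illustrative):

```yaml
- uses: ollieb89/ai-workflow-evals@v1
  id: evals
  with:
    eval-dir: evals/
- name: Summarize eval results
  run: |
    echo "Pass rate: ${{ steps.evals.outputs.pass-rate }}"
    echo "Regressions: ${{ steps.evals.outputs.regressions }}"
```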
On first run (no baseline exists), all results are saved as the new baseline — CI passes.
On subsequent runs, ai-workflow-evals compares against the stored baseline:
- Regressions — cases that were passing before but now fail → fail CI by default
- Improvements — cases that were failing before but now pass → noted in PR comment
- Unchanged — expected; no action
Use `dry-run: true` in PR jobs to run evals without updating the baseline.
```yaml
# Main branch — update baseline
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions

# PRs — check for regressions without updating baseline
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions
    dry-run: true
```

A complete workflow:

```yaml
name: AI Regression Tests

on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci && npm run build
      - uses: ollieb89/ai-workflow-evals@v1
        with:
          eval-dir: evals/
          fail-on: regressions
          dry-run: ${{ github.event_name == 'pull_request' }}
          llm-judge-key: ${{ secrets.OPENAI_API_KEY }}
```

Run evals locally with the CLI:

```shell
npx ai-workflow-evals evals/
npx ai-workflow-evals --fail-on threshold --pass-threshold 0.9 evals/
npx ai-workflow-evals --dry-run --format markdown evals/
```

| Tool | Purpose |
|---|---|
| workflow-guardian | Validate GitHub Actions workflows |
| ai-pr-guardian | Gate AI-generated / low-quality PRs |
| llm-cost-tracker | LLM API cost visibility in CI |
| mcp-server-tester | MCP server health + schema validation |
| actions-lockfile-generator | SHA-pin actions for supply chain security |
| agent-skill-validator | Skill repo linting + registry compatibility |
| ai-workflow-evals | Behavioral regression testing for AI workflows |
```shell
npm test
```

81 tests cover the loader, the checkers (contains, not-contains, regex, max-length, json-schema), the scorer, baseline management, config, and the reporter.
MIT — see LICENSE
This action is one of eight tools that form the AI DevOps Actions suite — end-to-end CI/CD for AI systems.
| Action | Purpose |
|---|---|
| ai-pr-guardian | Gate low-quality and AI-generated PRs |
| pr-context-enricher | Rich context summaries for AI code reviewers |
| ai-output-redacter | Scan and redact secrets/PII in AI-generated outputs |
| actions-lockfile-generator | Pin Actions to SHA, prevent supply chain attacks |
| ai-workflow-evals | Run eval suites — catch AI behavioral regressions |
| mcp-server-tester | Validate MCP servers: health, compliance, discovery |
| agent-skill-validator | Lint and validate agent skill repos |
| llm-cost-tracker | Track AI API costs in CI, alert on overruns |