# ai-workflow-evals


Catch AI behavioral regressions before merge.
Run eval suites for prompts, agents, and workflows in GitHub Actions.

## The Problem

You can validate code quality, costs, infrastructure, and skill structure — but none of that tells you whether your AI workflow still produces good outputs. Behavioral regression is invisible until it hits production.

No GitHub Actions exist for this. ai-workflow-evals is the missing capstone of the AI DevOps stack.

## Quick Start

```yaml
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions
```

Create `evals/my-workflow.eval.yml`:

```yaml
version: "1"
target:
  type: command
  run: "node dist/my-agent.js"

cases:
  - id: basic-summary
    input: "Summarize this PR in one sentence."
    checks:
      - type: contains
        value: "pull request"
      - type: max-length
        value: 200
      - type: not-contains
        value: "I cannot"

  - id: cost-estimate
    input: "What will this deployment cost?"
    checks:
      - type: regex
        pattern: "\\$[0-9]+\\.?[0-9]*"
```
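For a `command` target like the one above, each case's `input` is piped to the process via stdin and its stdout is checked. As an illustration only, a hypothetical stub of `dist/my-agent.js` that would pass the `basic-summary` checks could look like this (a real agent would call an LLM instead of returning a canned string):

```javascript
// Hypothetical stand-in for dist/my-agent.js. The `command` target pipes
// the case input via stdin and treats stdout as the output to check.
function respond(input) {
  // Stub answer: a real agent would call an LLM here.
  return `This pull request makes one focused change (${input.length} chars of input).`;
}

// Read all of stdin, answer on stdout (skipped in an interactive terminal).
if (!process.stdin.isTTY) {
  let input = "";
  process.stdin.on("data", (chunk) => (input += chunk));
  process.stdin.on("end", () => process.stdout.write(respond(input.trim())));
}
```

The stub's output contains "pull request", stays under 200 characters, and never says "I cannot", so all three checks pass.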

## Eval Case Schema

```yaml
version: "1"

target:
  type: command              # command | http | file
  run: "node dist/agent.js"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  timeout: 30                # seconds

cases:
  - id: my-case              # required, unique ID
    description: "What this checks"
    input: "The prompt or input to send"
    skip: false              # set true to skip
    tags: [smoke, fast]
    checks:
      - type: contains
        value: "expected text"
      - type: not-contains
        value: "bad text"
      - type: regex
        pattern: "\\d+"
        flags: "i"
      - type: max-length
        value: 500
      - type: json-schema
        schema: "./schemas/response.json"
      - type: llm-judge
        model: gpt-4o-mini
        criteria: "Response identifies at least one concrete issue"
        threshold: 0.8
```

## Target Types

| Type | Description | Required fields |
|---|---|---|
| `command` | Run a CLI command, pipe input via stdin | `run` |
| `http` | POST input as JSON to an HTTP endpoint | `url` |
| `file` | Read a static file as output (no execution) | `path` |

## Checkers

| Checker | Description | Required |
|---|---|---|
| `contains` | Output must contain string | `value: "text"` |
| `not-contains` | Output must NOT contain string | `value: "text"` |
| `regex` | Output must match pattern | `pattern: "\\d+"` |
| `max-length` | Output must be ≤ N chars | `value: 200` |
| `json-schema` | Output (parsed as JSON) must match schema | `schema: "./path.json"` or inline object |
| `llm-judge` | LLM scores output against criteria (optional) | `criteria`, `threshold`, `model` |

The LLM judge is optional. If `llm-judge-key` is not set, `llm-judge` checks are skipped gracefully; CI never fails due to missing API keys.
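The deterministic checkers have straightforward semantics. As an illustration (not the action's actual code), the string-based checks could be re-implemented as:

```javascript
// Illustrative semantics of the string-based checkers described above;
// not the action's real implementation.
function runCheck(check, output) {
  switch (check.type) {
    case "contains":
      return output.includes(check.value);
    case "not-contains":
      return !output.includes(check.value);
    case "regex":
      return new RegExp(check.pattern, check.flags || "").test(output);
    case "max-length":
      return output.length <= check.value;
    default:
      throw new Error(`unhandled check type: ${check.type}`);
  }
}
```

Presumably a case passes only when every entry in its `checks` list passes.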

## Action Inputs

| Input | Description | Default |
|---|---|---|
| `eval-dir` | Directory containing `.eval.yml` files | `evals/` |
| `baseline-artifact` | Artifact name for baseline scores | `eval-baseline` |
| `pass-threshold` | Minimum pass rate (used with `fail-on: threshold`) | `1.0` |
| `regression-threshold` | Max allowed regressions before failure | `0` |
| `fail-on` | `regressions` \| `threshold` \| `none` | `regressions` |
| `post-comment` | Post a PR comment with results | `true` |
| `llm-judge-key` | API key for LLM-as-judge | `""` |
| `dry-run` | Run without updating the baseline | `false` |

## Action Outputs

| Output | Description |
|---|---|
| `pass-rate` | Fraction of cases passed (e.g. `0.9500`) |
| `total-cases` | Total cases run |
| `passed` | Cases passed |
| `failed` | Cases failed |
| `regressions` | Cases newly failing vs. baseline |
| `improvements` | Cases newly passing vs. baseline |
| `report-path` | Path to the JSON report artifact |
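Later steps can read these outputs via the step's `id`, using standard GitHub Actions output syntax:

```yaml
- uses: ollieb89/ai-workflow-evals@v1
  id: evals
  with:
    eval-dir: evals/
- name: Report pass rate
  run: echo "Passed ${{ steps.evals.outputs.passed }}/${{ steps.evals.outputs.total-cases }} (rate ${{ steps.evals.outputs.pass-rate }})"
```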

## Baseline Management

On first run (no baseline exists), all results are saved as the new baseline — CI passes.

On subsequent runs, ai-workflow-evals compares against the stored baseline:

- **Regressions**: cases that passed before but now fail → fails CI by default
- **Improvements**: cases that failed before but now pass → noted in the PR comment
- **Unchanged**: expected; no action

Use `dry-run: true` in PR jobs to run evals without updating the baseline.

```yaml
# Main branch — update baseline
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions

# PRs — check for regressions without updating baseline
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions
    dry-run: true
```

## Full CI Example

```yaml
name: AI Regression Tests

on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci && npm run build
      - uses: ollieb89/ai-workflow-evals@v1
        with:
          eval-dir: evals/
          fail-on: regressions
          dry-run: ${{ github.event_name == 'pull_request' }}
          llm-judge-key: ${{ secrets.OPENAI_API_KEY }}
```

## CLI

```sh
npx ai-workflow-evals evals/
npx ai-workflow-evals --fail-on threshold --pass-threshold 0.9 evals/
npx ai-workflow-evals --dry-run --format markdown evals/
```

## The AI DevOps Actions Suite

| Tool | Purpose |
|---|---|
| `workflow-guardian` | Validate GitHub Actions workflows |
| `ai-pr-guardian` | Gate AI-generated / low-quality PRs |
| `llm-cost-tracker` | LLM API cost visibility in CI |
| `mcp-server-tester` | MCP server health + schema validation |
| `actions-lockfile-generator` | SHA-pin actions for supply-chain security |
| `agent-skill-validator` | Skill repo linting + registry compatibility |
| `ai-workflow-evals` | Behavioral regression testing for AI workflows |

## Tests

```sh
npm test
```

81 tests covering the loader, checkers (`contains`, `not-contains`, `regex`, `max-length`, `json-schema`), scorer, baseline management, config, and reporter.

## License

MIT — see LICENSE


## Part of the AI DevOps Actions suite

This action is one of eight tools that form the AI DevOps Actions suite — end-to-end CI/CD for AI systems.

| Action | Purpose |
|---|---|
| `ai-pr-guardian` | Gate low-quality and AI-generated PRs |
| `pr-context-enricher` | Rich context summaries for AI code reviewers |
| `ai-output-redacter` | Scan and redact secrets/PII in AI-generated outputs |
| `actions-lockfile-generator` | Pin Actions to SHA, prevent supply-chain attacks |
| `ai-workflow-evals` | Run eval suites — catch AI behavioral regressions |
| `mcp-server-tester` | Validate MCP servers: health, compliance, discovery |
| `agent-skill-validator` | Lint and validate agent skill repos |
| `llm-cost-tracker` | Track AI API costs in CI, alert on overruns |

View the full suite
