# ai-workflow-evals


Catch AI behavioral regressions before merge.
Run eval suites for prompts, agents, and workflows in GitHub Actions.

## The Problem

You can validate code quality, costs, infrastructure, and skill structure — but none of that tells you whether your AI workflow still produces good outputs. Behavioral regression is invisible until it hits production.

No GitHub Actions exist for this. ai-workflow-evals is the missing capstone of the AI DevOps stack.

## Quick Start

```yaml
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions
```

Create `evals/my-workflow.eval.yml`:

```yaml
version: "1"
target:
  type: command
  run: "node dist/my-agent.js"

cases:
  - id: basic-summary
    input: "Summarize this PR in one sentence."
    checks:
      - type: contains
        value: "pull request"
      - type: max-length
        value: 200
      - type: not-contains
        value: "I cannot"

  - id: cost-estimate
    input: "What will this deployment cost?"
    checks:
      - type: regex
        pattern: "\\$[0-9]+\\.?[0-9]*"
```
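For a `command` target like the one above, each case's `input` is piped to the process via stdin and its stdout is checked. As an illustration only, a hypothetical stub of `dist/my-agent.js` that would pass the `basic-summary` checks could look like this (a real agent would call an LLM instead of returning a canned string):

```javascript
// Hypothetical stand-in for dist/my-agent.js. The `command` target pipes
// the case input via stdin and treats stdout as the output to check.
function respond(input) {
  // Stub answer: a real agent would call an LLM here.
  return `This pull request makes one focused change (${input.length} chars of input).`;
}

// Read all of stdin, answer on stdout (skipped in an interactive terminal).
if (!process.stdin.isTTY) {
  let input = "";
  process.stdin.on("data", (chunk) => (input += chunk));
  process.stdin.on("end", () => process.stdout.write(respond(input.trim())));
}
```

The stub's output contains "pull request", stays under 200 characters, and never says "I cannot", so all three checks pass.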

## Eval Case Schema

```yaml
version: "1"

target:
  type: command              # command | http | file
  run: "node dist/agent.js"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  timeout: 30                # seconds

cases:
  - id: my-case              # required, unique ID
    description: "What this checks"
    input: "The prompt or input to send"
    skip: false              # set true to skip
    tags: [smoke, fast]
    checks:
      - type: contains
        value: "expected text"
      - type: not-contains
        value: "bad text"
      - type: regex
        pattern: "\\d+"
        flags: "i"
      - type: max-length
        value: 500
      - type: json-schema
        schema: "./schemas/response.json"
      - type: llm-judge
        model: gpt-4o-mini
        criteria: "Response identifies at least one concrete issue"
        threshold: 0.8
```

## Target Types

| Type | Description | Required fields |
|---|---|---|
| `command` | Run a CLI command, pipe input via stdin | `run` |
| `http` | POST input as JSON to an HTTP endpoint | `url` |
| `file` | Read a static file as output (no execution) | `path` |

## Checkers

| Checker | Description | Required |
|---|---|---|
| `contains` | Output must contain string | `value: "text"` |
| `not-contains` | Output must NOT contain string | `value: "text"` |
| `regex` | Output must match pattern | `pattern: "\\d+"` |
| `max-length` | Output must be ≤ N chars | `value: 200` |
| `json-schema` | Output (parsed as JSON) must match schema | `schema: "./path.json"` or inline object |
| `llm-judge` | LLM scores output against criteria (optional) | `criteria`, `threshold`, `model` |

The LLM judge is optional. If `llm-judge-key` is not set, `llm-judge` checks are skipped gracefully; CI never fails due to missing API keys.
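The deterministic checkers have straightforward semantics. As an illustration (not the action's actual code), the string-based checks could be re-implemented as:

```javascript
// Illustrative semantics of the string-based checkers described above;
// not the action's real implementation.
function runCheck(check, output) {
  switch (check.type) {
    case "contains":
      return output.includes(check.value);
    case "not-contains":
      return !output.includes(check.value);
    case "regex":
      return new RegExp(check.pattern, check.flags || "").test(output);
    case "max-length":
      return output.length <= check.value;
    default:
      throw new Error(`unhandled check type: ${check.type}`);
  }
}
```

Presumably a case passes only when every entry in its `checks` list passes.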

## Action Inputs

| Input | Description | Default |
|---|---|---|
| `eval-dir` | Directory containing `.eval.yml` files | `evals/` |
| `baseline-artifact` | Artifact name for baseline scores | `eval-baseline` |
| `pass-threshold` | Minimum pass rate (used with `fail-on: threshold`) | `1.0` |
| `regression-threshold` | Max allowed regressions before failure | `0` |
| `fail-on` | `regressions` \| `threshold` \| `none` | `regressions` |
| `post-comment` | Post a PR comment with results | `true` |
| `llm-judge-key` | API key for LLM-as-judge | `""` |
| `dry-run` | Run without updating the baseline | `false` |

## Action Outputs

| Output | Description |
|---|---|
| `pass-rate` | Fraction of cases passed (e.g. `0.9500`) |
| `total-cases` | Total cases run |
| `passed` | Cases passed |
| `failed` | Cases failed |
| `regressions` | Cases newly failing vs. baseline |
| `improvements` | Cases newly passing vs. baseline |
| `report-path` | Path to the JSON report artifact |
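Later steps can read these outputs via the step's `id`, using standard GitHub Actions output syntax:

```yaml
- uses: ollieb89/ai-workflow-evals@v1
  id: evals
  with:
    eval-dir: evals/
- name: Report pass rate
  run: echo "Passed ${{ steps.evals.outputs.passed }}/${{ steps.evals.outputs.total-cases }} (rate ${{ steps.evals.outputs.pass-rate }})"
```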

## Baseline Management

On first run (no baseline exists), all results are saved as the new baseline — CI passes.

On subsequent runs, ai-workflow-evals compares against the stored baseline:

- **Regressions**: cases that passed before but now fail → fails CI by default
- **Improvements**: cases that failed before but now pass → noted in the PR comment
- **Unchanged**: expected; no action

Use `dry-run: true` in PR jobs to run evals without updating the baseline.

```yaml
# Main branch — update baseline
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions

# PRs — check for regressions without updating baseline
- uses: ollieb89/ai-workflow-evals@v1
  with:
    eval-dir: evals/
    fail-on: regressions
    dry-run: true
```

## Full CI Example

```yaml
name: AI Regression Tests

on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci && npm run build
      - uses: ollieb89/ai-workflow-evals@v1
        with:
          eval-dir: evals/
          fail-on: regressions
          dry-run: ${{ github.event_name == 'pull_request' }}
          llm-judge-key: ${{ secrets.OPENAI_API_KEY }}
```

## CLI

```sh
npx ai-workflow-evals evals/
npx ai-workflow-evals --fail-on threshold --pass-threshold 0.9 evals/
npx ai-workflow-evals --dry-run --format markdown evals/
```

## The AI DevOps Actions Suite

| Tool | Purpose |
|---|---|
| `workflow-guardian` | Validate GitHub Actions workflows |
| `ai-pr-guardian` | Gate AI-generated / low-quality PRs |
| `llm-cost-tracker` | LLM API cost visibility in CI |
| `mcp-server-tester` | MCP server health + schema validation |
| `actions-lockfile-generator` | SHA-pin actions for supply-chain security |
| `agent-skill-validator` | Skill repo linting + registry compatibility |
| `ai-workflow-evals` | Behavioral regression testing for AI workflows |

## Tests

```sh
npm test
```

81 tests covering the loader, checkers (`contains`, `not-contains`, `regex`, `max-length`, `json-schema`), scorer, baseline management, config, and reporter.

## License

MIT — see LICENSE


## Part of the AI DevOps Actions suite

This action is one of eight tools that form the AI DevOps Actions suite — end-to-end CI/CD for AI systems.

| Action | Purpose |
|---|---|
| `ai-pr-guardian` | Gate low-quality and AI-generated PRs |
| `pr-context-enricher` | Rich context summaries for AI code reviewers |
| `ai-output-redacter` | Scan and redact secrets/PII in AI-generated outputs |
| `actions-lockfile-generator` | Pin Actions to SHA, prevent supply-chain attacks |
| `ai-workflow-evals` | Run eval suites — catch AI behavioral regressions |
| `mcp-server-tester` | Validate MCP servers: health, compliance, discovery |
| `agent-skill-validator` | Lint and validate agent skill repos |
| `llm-cost-tracker` | Track AI API costs in CI, alert on overruns |

View the full suite
