feat: parallel multi-grader fan-out with consensus strategies #906

@christso

Problem

Currently each target has a single grader_target for LLM-as-judge scoring. A single grader can have blind spots or biases that affect score reliability.

Proposal

Add a grader_targets (plural) field for parallel grader fan-out:

- name: copilot-cli
  provider: copilot-cli
  grader_targets:
    - grader-openrouter
    - grader-gemini
  grader_strategy: consensus  # or: majority, any, all

Each grader scores independently in parallel, then the strategy aggregates:

Strategy    Behavior
consensus   All graders must agree (strictest)
majority    >50% of graders pass
any         At least one grader passes (most lenient)
all         Return all scores without aggregation (for analysis/comparison)
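
To make the strategies above concrete, here is a minimal Python sketch of the fan-out and aggregation step. It is an illustration under stated assumptions, not the project's implementation: fan_out, aggregate, and PASS_THRESHOLD are hypothetical names, the 0.5 pass cutoff is a guess (the proposal does not define one), and consensus is read as "every grader passes".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

PASS_THRESHOLD = 0.5  # assumption: the proposal does not define a pass cutoff


def fan_out(graders: Dict[str, Callable[[str], float]], answer: str) -> Dict[str, float]:
    """Run every grader on the same answer in parallel and collect the scores."""
    if not graders:
        raise ValueError("at least one grader is required")
    with ThreadPoolExecutor(max_workers=len(graders)) as pool:
        futures = {name: pool.submit(grade, answer) for name, grade in graders.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(scores: Dict[str, float], strategy: str = "majority",
              threshold: float = PASS_THRESHOLD) -> Optional[bool]:
    """Combine per-grader scores into one pass/fail verdict.

    Returns None for "all", which reports raw scores without aggregating.
    """
    if not scores:
        raise ValueError("at least one grader score is required")
    passed = [score >= threshold for score in scores.values()]
    if strategy == "consensus":
        # Read here as "every grader passes"; the proposal only says "must agree".
        return all(passed)
    if strategy == "majority":
        return sum(passed) * 2 > len(passed)  # strictly more than 50%
    if strategy == "any":
        return any(passed)
    if strategy == "all":
        return None
    raise ValueError(f"unknown grader_strategy: {strategy!r}")
```

Note that with exactly two graders, majority behaves like consensus under this reading, because one passing grader out of two is not more than 50%.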

Result JSONL

When multiple graders are used, the result should include per-grader scores:

{
  "scores": [
    { "type": "llm-grader", "grader": "grader-openrouter", "score": 0.9 },
    { "type": "llm-grader", "grader": "grader-gemini", "score": 0.8 }
  ],
  "grader_strategy": "majority",
  "grader_agreement": 1.0
}

The grader_agreement field (0.0–1.0) measures inter-grader reliability.
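
The proposal does not say how grader_agreement is computed. One possible definition, sketched below purely as an assumption, is pairwise agreement on pass/fail verdicts: the fraction of grader pairs that reach the same verdict. The function name and the 0.5 threshold are hypothetical. This choice spans the full 0.0–1.0 range and yields 1.0 for the example scores of 0.9 and 0.8.

```python
from itertools import combinations
from typing import Sequence


def grader_agreement(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of grader pairs whose pass/fail verdicts match.

    Assumption: agreement is measured on verdicts, not on raw scores.
    """
    verdicts = [score >= threshold for score in scores]
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        # Zero or one grader: there is nothing to disagree with.
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

A score-based measure (for example, one derived from the spread of raw scores) would be an equally valid reading; the verdict-based version is used here only because the strategies themselves are defined in terms of passing.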

Use cases

  • Reduce grader bias: one LLM's blind spots covered by another
  • Cross-provider validation: ensure scores aren't provider-dependent
  • Confidence scoring: high agreement = high confidence in score
  • A/B testing graders: compare grader quality before switching

Prior art

  • Google ADK: multiple evaluator judges with voting
  • LMSYS Chatbot Arena: multi-judge ranking
  • As far as I know, no eval framework exposes this declaratively in YAML config yet

Backward compatibility

  • grader_target (singular) continues to work unchanged
  • grader_targets is optional; default strategy is majority
  • When only one grader is specified in grader_targets, it behaves identically to grader_target
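
The compatibility rules above could be implemented by normalizing both fields into one list when the config is loaded. The following Python sketch assumes a target is already parsed into a dict; resolve_graders is a hypothetical helper, and rejecting configs that set both fields is an assumption, since the proposal does not say which one should win.

```python
from typing import Any, Dict, List, Tuple

VALID_STRATEGIES = {"consensus", "majority", "any", "all"}


def resolve_graders(target: Dict[str, Any]) -> Tuple[List[str], str]:
    """Normalize grader_target / grader_targets into (grader names, strategy)."""
    single = target.get("grader_target")
    multiple = target.get("grader_targets")
    if single is not None and multiple is not None:
        # Assumption: precedence is unspecified, so treat this as a config error.
        raise ValueError("set either grader_target or grader_targets, not both")
    if multiple is not None:
        graders = list(multiple)
    elif single is not None:
        graders = [single]  # the singular field becomes a one-element list
    else:
        graders = []
    strategy = target.get("grader_strategy", "majority")  # default from the proposal
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown grader_strategy: {strategy!r}")
    return graders, strategy
```

Because a one-element list passes through the same code path, a single entry in grader_targets and a plain grader_target resolve to the same result.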
