[BREAKING] FEAT: Restructure Psychosocial scenario for per-subharm scoring #1943

Open
varunj-msft wants to merge 1 commit into microsoft:main from varunj-msft:varunj-msft/8380-Standardizing-Scenarios-Psychosocial-bugfix

@varunj-msft varunj-msft commented Jun 4, 2026

Contributor

Description

Restructures the Psychosocial scenario so strategies are techniques and subharms are datasets. Started as the --max-dataset-size fast-path bugfix and grew into the full standardization Rich sketched in review.

  1. Strategies are techniques now. The strategy enum is prompt_sending, role_play, crescendo instead of the old subharm-as-strategy anti-pattern. Subharm selection moves off --strategies and onto --dataset-names.

  2. Seed file split per subharm. psychosocial.prompt becomes airt_imminent_crisis.prompt + airt_licensed_therapist.prompt, each carrying its own scorer rubric and crescendo escalation prompt. This removes runtime harm-filtering entirely — there's no longer a cap-vs-filter ordering to get wrong, which is what made --max-dataset-size 1 fail before.

  3. Per-subharm scorers. One FloatScaleThresholdScorer is built per subharm and routed to both the AtomicAttack and the technique's AttackScoringConfig, so running all no longer scores attacks with the wrong rubric.

  4. Per-subharm baselines. Baselines are named baseline_imminent_crisis / baseline_licensed_therapist instead of two atomics both named "baseline", which collided in _display_group_map / attack_results (keyed on name alone). include_baseline=False is forced through to the base class so its auto-injection rescue doesn't re-add a generic baseline.

  5. initialize_async validates dataset names. A user-supplied dataset_config is allowed only if its dataset names are a subset of the subharms, so --max-dataset-size N still works while custom names are rejected with a clear error.
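
The subset check described in item 5 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SUBHARM_DATASET_NAMES` and `validate_dataset_names` are hypothetical names, and only the two dataset names are taken from the description above.

```python
# Hypothetical sketch of the subset check; not the actual PyRIT implementation.
SUBHARM_DATASET_NAMES = frozenset({"airt_imminent_crisis", "airt_licensed_therapist"})


def validate_dataset_names(requested: list[str]) -> list[str]:
    """Return the requested names unchanged if every one is a known subharm dataset."""
    unknown = sorted(set(requested) - SUBHARM_DATASET_NAMES)
    if unknown:
        # Custom names are rejected up front with an explicit message.
        raise ValueError(
            f"Unsupported dataset name(s) {unknown}; "
            f"expected a subset of {sorted(SUBHARM_DATASET_NAMES)}"
        )
    return requested
```

Because the check only looks at names, a size cap on a valid subharm dataset passes through untouched, which is the behavior the fast path relies on.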

Crescendo is kept out of the default aggregate (opt-in via --strategies all / --strategies crescendo) since it's the heaviest technique — default runs are single-turn. BASELINE_ATTACK_POLICY is back to Enabled, and VERSION is bumped 2→3 because the default behavior changed; stored v2 results raise cleanly on --resume rather than silently mixing semantics.

One deliberate divergence from the review sketch, flagged inline: TARGET_REQUIREMENTS is left at the base default rather than EDITABLE_HISTORY. With Crescendo opt-in, the default run is single-turn, and requiring editable history at the scenario level would reject OpenAIChatTarget before the strategy resolves. Crescendo enforces that requirement itself when it runs. Easy to flip back if preferred.

Tests and Documentation

Reworked tests/unit/scenario/airt/test_psychosocial.py around the new shape — 10 test classes:

  • TestPsychosocialStrategyEnum / TestPsychosocialTechniques — enum members, default tags, Crescendo excluded from default
  • TestSubharmConfigs — both subharms wired with the right dataset, scorer prompt, and crescendo path
  • TestPsychosocialInitialization — VERSION == 3, baseline policy, no scenario-level target requirement
  • TestPsychosocialDatasetConfigValidation: max_dataset_size=1 on a valid subharm doesn't raise; custom / unknown names do
  • TestPsychosocialCrossProduct — atomic attacks are the (technique × subharm) product with correct names
  • TestPsychosocialPerSubharmScorer — each subharm's attacks (and baseline) get that subharm's scorer
  • TestPsychosocialBaselineHandling — per-subharm baseline names, both present in the display map, include_baseline=False suppresses them
  • TestPsychosocialLazyAdversarialResolution / TestPsychosocialSingleSubharmOverride — lazy adversarial target, single-subharm narrowing
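
The (technique × subharm) shape that TestPsychosocialCrossProduct exercises can be pictured with a small sketch. The `<technique>_<subharm>` name format and the `atomic_attack_names` helper are assumptions made for illustration; only the technique names, the subharm names, and the `baseline_<subharm>` pattern come from the description above.

```python
from itertools import product

TECHNIQUES = ["prompt_sending", "role_play", "crescendo"]
SUBHARMS = ["imminent_crisis", "licensed_therapist"]


def atomic_attack_names(techniques, subharms, include_baseline=True):
    """Build one name per (technique, subharm) pair, plus one baseline per subharm."""
    # Assumed naming scheme: "<technique>_<subharm>" (hypothetical).
    names = [f"{technique}_{subharm}" for technique, subharm in product(techniques, subharms)]
    if include_baseline:
        # Per-subharm baseline names avoid two attacks both called "baseline".
        names += [f"baseline_{subharm}" for subharm in subharms]
    return names
```

With all three techniques and both subharms this yields six technique attacks and two baselines, and every name is unique, so nothing collides in a map keyed on name alone.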

Validation:

  • pytest tests/unit/scenario/airt/test_psychosocial.py → 53 passed
  • pytest tests/unit/scenario/ (full scenario suite) → 725 passed, no regressions
  • pytest tests/unit/backend/test_scenario_run_service.py → 35 passed
  • ruff check + ruff format --check + ty → all clean
  • Live pyrit_scan fast path (--strategies prompt_sending --dataset-names airt_imminent_crisis --max-dataset-size 1) and a default run both complete successfully

Docs updated: doc/scanner/airt.py (+notebook) for the fast-path command and the --dataset-names model, and the programming guide (0_scenarios.ipynb). The built-in dataset list (1_loading_datasets.ipynb) now includes airt_licensed_therapist.

Comment thread pyrit/scenario/scenarios/airt/psychosocial.py Outdated
# Load the unsampled seed pool so the harm-category filter sees every seed
# the dataset config would otherwise sample over. Temporarily zero the cap
# and restore it in a finally so a raising loader leaves the config intact.
sampling_cap = self._dataset_config.max_dataset_size

Contributor


I'd avoid temporarily zeroing and restoring a global cap here; there are also some corner-case bugs with this implementation.

Instead, I'd subclass DatasetConfiguration to something like this

class PsychosocialDatasetConfiguration(DatasetConfiguration):
    def get_seed_groups(self) -> dict[str, list[SeedGroup]]:
        loaded = self._load_unsampled()          # per-dataset, no cap yet
        filtered = self._filter_by_harm(loaded)  # uses self._scenario_strategies
        return {k: self._apply_max_dataset_size(v) for k, v in filtered.items()}
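
To make the ordering issue concrete, here is a toy sketch of why capping before filtering can leave a subharm with no seeds while filtering before capping does not. `Seed`, `cap`, and `filter_by_harm` are hypothetical stand-ins, and the cap is modeled as "keep the first N" rather than whatever sampling the real configuration performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    value: str
    harm: str


def cap(seeds, max_size):
    # Deterministic stand-in for sampling: keep the first max_size seeds.
    return seeds if max_size is None else seeds[:max_size]


def filter_by_harm(seeds, harm):
    return [seed for seed in seeds if seed.harm == harm]


POOL = [
    Seed("seed A", "imminent_crisis"),
    Seed("seed B", "licensed_therapist"),
]

# Cap first: the single surviving seed belongs to the other subharm,
# so the filter leaves nothing behind.
cap_then_filter = filter_by_harm(cap(POOL, 1), "licensed_therapist")

# Filter first: the cap is applied to seeds that already match.
filter_then_cap = cap(filter_by_harm(POOL, "licensed_therapist"), 1)
```

Here `cap_then_filter` is empty while `filter_then_cap` holds one seed, which is the kind of failure a cap of 1 can produce when the cap runs first.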

Contributor Author


Went a different route that drops the subclass entirely: split the seed file into one dataset per subharm (airt_imminent_crisis/airt_licensed_therapist). With the harm filtering gone there's nothing left to cap manually, so plain DatasetConfiguration works. initialize_async just checks the names are a subset of the subharms, so --max-dataset-size N still works and those corner cases disappear. What do you think of this?

Comment thread pyrit/scenario/scenarios/airt/psychosocial.py Outdated
Comment thread pyrit/scenario/scenarios/airt/psychosocial.py Outdated
…oring

Replace the subharm-as-strategy anti-pattern with a technique-axis strategy
enum (prompt_sending, role_play, crescendo). Subharm selection now happens via
--dataset-names; the seed file is split into per-subharm datasets
(airt_imminent_crisis, airt_licensed_therapist), each with its own scorer rubric
and crescendo escalation prompt.

Key changes:
- Per-subharm FloatScaleThresholdScorer routed to both the AtomicAttack
  objective_scorer and the technique AttackScoringConfig, so baseline and
  technique attacks are scored with the matching rubric.
- Per-subharm baselines named baseline_<subharm> to avoid the
  _display_group_map / attack_results key collision from duplicate baseline
  names; include_baseline=False is forced through to the base class to suppress
  the auto-injection rescue.
- initialize_async validates dataset_config against the subharm dataset names
  (subset only) so --max-dataset-size still works while custom names are rejected.
- Crescendo kept out of the default aggregate (opt-in via --strategies all /
  crescendo) as it is the heaviest technique.
- BASELINE_ATTACK_POLICY re-enabled; VERSION bumped 2 -> 3 (BREAKING) so stored
  results from the prior default behavior cannot silently resume.
- TARGET_REQUIREMENTS left at base default (no EDITABLE_HISTORY) since the
  default run is single-turn; crescendo enforces its own requirements at attack
  instantiation.
- Docs/datasets updated: fast-path command, programming guide, and the built-in
  dataset list now reflect the per-subharm split.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@varunj-msft varunj-msft force-pushed the varunj-msft/8380-Standardizing-Scenarios-Psychosocial-bugfix branch from 6106e65 to 5dcb74a Compare June 17, 2026 01:30
@varunj-msft varunj-msft changed the title from "[BREAKING] FIX: Psychosocial harm-category filtering and baseline default" to "[BREAKING] FEAT: Restructure Psychosocial scenario for per-subharm scoring" Jun 17, 2026
@varunj-msft

Contributor Author

Pushed the restructure. The only open question is TARGET_REQUIREMENTS (in the refactor thread). Also retitling, since this isn't a bugfix anymore.

@rlundeen2 rlundeen2 self-assigned this Jun 22, 2026
@rlundeen2

Contributor

Maybe we should hold off until the DatasetConfiguration refactor is in.

if not seed_groups_for_subharm:
    continue
baseline_scorer = scorers_by_dataset[cfg.dataset_name]
baseline_attack_technique = PromptSendingAttack(

Contributor


Rich agrees with this comment, but it is Copilot-generated:

Let's fix this in the base class rather than hand-rolling. The hand-rolled PromptSendingAttack + AtomicAttack here is effectively a fork of _build_baseline_atomic_attack, which hard-codes atomic_attack_name="baseline" and self._objective_scorer; that's the only reason it can't be reused for per-subharm baselines.

Proposal: generalize the base helper to _build_baseline_atomic_attacks(...) (plural) that accepts the seed groups plus an optional objective_scorer, name, and display_group, and returns list[AtomicAttack]. Single-baseline scenarios call it with one spec; this scenario passes one per subharm. That deletes the bespoke construction here and keeps baseline wiring (labels, attribution, future converters) in one place.

I think this also dissolves the initialize_async workaround; it's the same root cause. The reason you have to resolve include_baseline locally and force include_baseline=False into super is that the base rescue at scenario.py:687 detects "did the override already emit a baseline?" via the literal atomic_attack_name != "baseline", and ours are baseline_<subharm>. If we add an is_baseline: bool to AtomicAttack (set by the base helper) and switch that rescue to any(aa.is_baseline for aa in self._atomic_attacks), the per-subharm baselines are recognized, the rescue stops firing, and the whole _effective_include_baseline interception + forced False can go away. (Side note: the comments here reference scenario.py:670, but the rescue is at 687 now; that drift is exactly the smell we'd remove.)

Separately: matching seed prompts to techniques/subharms is the genuinely awkward part of this file. I'm refactoring DatasetConfiguration to make that mapping first-class, so let's leave the _SUBHARMS table as-is for now and migrate it once that lands.
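
The difference between name-based and flag-based baseline detection proposed above can be sketched in isolation. `AtomicAttackStub` and both helper functions are hypothetical stand-ins written for this illustration; the real AtomicAttack has no `is_baseline` field unless the proposal is adopted, and the `role_play_imminent_crisis` name is assumed.

```python
from dataclasses import dataclass


@dataclass
class AtomicAttackStub:
    """Minimal stand-in for AtomicAttack, carrying only what the check needs."""

    atomic_attack_name: str
    is_baseline: bool = False  # proposed field, set by the base helper


def has_baseline_by_name(attacks):
    # Name-based detection: only the literal "baseline" counts.
    return any(attack.atomic_attack_name == "baseline" for attack in attacks)


def has_baseline_by_flag(attacks):
    # Flag-based detection: any attack marked as a baseline counts.
    return any(attack.is_baseline for attack in attacks)


attacks = [
    AtomicAttackStub("baseline_imminent_crisis", is_baseline=True),
    AtomicAttackStub("baseline_licensed_therapist", is_baseline=True),
    AtomicAttackStub("role_play_imminent_crisis"),
]
```

With these inputs the name-based check reports no baseline, so a rescue built on it would inject a generic one, while the flag-based check recognizes the per-subharm baselines.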


2 participants