[BREAKING] FEAT: Restructure Psychosocial scenario for per-subharm scoring#1943
Conversation
| # Load the unsampled seed pool so the harm-category filter sees every seed | ||
| # the dataset config would otherwise sample over. Temporarily zero the cap | ||
| # and restore it in a finally so a raising loader leaves the config intact. | ||
| sampling_cap = self._dataset_config.max_dataset_size |
There was a problem hiding this comment.
Rather than creating a global cap; there are also some corner case bugs with this implementation.
Instead, I'd subclass DatasetConfiguration to something like this
class PsychosocialDatasetConfiguration(DatasetConfiguration):
def get_seed_groups(self) -> dict[str, list[SeedGroup]]:
loaded = self._load_unsampled() # per-dataset, no cap yet
filtered = self._filter_by_harm(loaded) # uses self._scenario_strategies
return {k: self._apply_max_dataset_size(v) for k, v in filtered.items()}
There was a problem hiding this comment.
Went a different route that drops the subclass entirely: split the seed file into one dataset per subharm (airt_imminent_crisis/airt_licensed_therapist). With the harm filtering gone there's nothing left to cap manually, so plain DatasetConfiguration works. initialize_async just checks the names are a subset of the subharms, so --max-dataset-size N still works and those corner cases disappear. What do you think of this?
…oring Replace the subharm-as-strategy anti-pattern with a technique-axis strategy enum (prompt_sending, role_play, crescendo). Subharm selection now happens via --dataset-names; the seed file is split into per-subharm datasets (airt_imminent_crisis, airt_licensed_therapist), each with its own scorer rubric and crescendo escalation prompt. Key changes: - Per-subharm FloatScaleThresholdScorer routed to both the AtomicAttack objective_scorer and the technique AttackScoringConfig, so baseline and technique attacks are scored with the matching rubric. - Per-subharm baselines named baseline_<subharm> to avoid the _display_group_map / attack_results key collision from duplicate baseline names; include_baseline=False is forced through to the base class to suppress the auto-injection rescue. - initialize_async validates dataset_config against the subharm dataset names (subset only) so --max-dataset-size still works while custom names are rejected. - Crescendo kept out of the default aggregate (opt-in via --strategies all / crescendo) as it is the heaviest technique. - BASELINE_ATTACK_POLICY re-enabled; VERSION bumped 2 -> 3 (BREAKING) so stored results from the prior default behavior cannot silently resume. - TARGET_REQUIREMENTS left at base default (no EDITABLE_HISTORY) since the default run is single-turn; crescendo enforces its own requirements at attack instantiation. - Docs/datasets updated: fast-path command, programming guide, and the built-in dataset list now reflect the per-subharm split. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6106e65 to
5dcb74a
Compare
|
Pushed the restructure. Only open question is TARGET_REQUIREMENTS (in the refactor thread). Also retitling since this isn't a bugfix anymore |
|
Maybe we should hold off until the DataConfiguration refactor is in |
| if not seed_groups_for_subharm: | ||
| continue | ||
| baseline_scorer = scorers_by_dataset[cfg.dataset_name] | ||
| baseline_attack_technique = PromptSendingAttack( |
There was a problem hiding this comment.
Rich agrees with this comment but it is copilot generated:
Let''s fix this in the base class rather than hand-rolling. The hand-rolled PromptSendingAttack + AtomicAttack here is effectively a fork of _build_baseline_atomic_attack, which hard-codes atomic_attack_name="baseline" and self._objective_scorer — that''s the only reason it can''t be reused for per-subharm baselines.
Proposal: generalize the base helper to _build_baseline_atomic_attacks(...) (plural) that accepts the seed groups plus an optional objective_scorer, name, and display_group, and returns list[AtomicAttack]. Single-baseline scenarios call it with one spec; this scenario passes one per subharm. That deletes the bespoke construction here and keeps baseline wiring (labels, attribution, future converters) in one place.
I think this also dissolves the initialize_async workaround — it''s the same root cause. The reason you have to resolve include_baseline locally and force include_baseline=False into super is that the base rescue at scenario.py:687 detects "did the override already emit a baseline?" via the literal atomic_attack_name != "baseline", and ours are baseline_<subharm>. If we add an is_baseline: bool to AtomicAttack (set by the base helper) and switch that rescue to any(aa.is_baseline for aa in self._atomic_attacks), the per-subharm baselines are recognized, the rescue stops firing, and the whole _effective_include_baseline interception + forced False can go away. (Side note: the comments here reference scenario.py:670, but the rescue is at 687 now — that drift is exactly the smell we''d remove.)
Separately: matching seed prompts to techniques/subharms is the genuinely awkward part of this file. I''m refactoring DatasetConfiguration to make that mapping first-class, so let''s leave the _SUBHARMS table as-is for now and migrate it once that lands.
Description
Restructures the Psychosocial scenario so strategies are techniques and subharms are datasets. Started as the
--max-dataset-sizefast-path bugfix and grew into the full standardization Rich sketched in review.Strategies are techniques now. The strategy enum is
prompt_sending,role_play,crescendoinstead of the old subharm-as-strategy anti-pattern. Subharm selection moves off--strategiesand onto--dataset-names.Seed file split per subharm.
psychosocial.promptbecomesairt_imminent_crisis.prompt+airt_licensed_therapist.prompt, each carrying its own scorer rubric and crescendo escalation prompt. This removes runtime harm-filtering entirely — there's no longer a cap-vs-filter ordering to get wrong, which is what made--max-dataset-size 1fail before.Per-subharm scorers. One
FloatScaleThresholdScoreris built per subharm and routed to both theAtomicAttackand the technique'sAttackScoringConfig, so runningallno longer scores attacks with the wrong rubric.Per-subharm baselines. Baselines are named
baseline_imminent_crisis/baseline_licensed_therapistinstead of two atomics both named"baseline", which collided in_display_group_map/attack_results(keyed on name alone).include_baseline=Falseis forced through to the base class so its auto-injection rescue doesn't re-add a generic baseline.initialize_asyncvalidates dataset names. A user-supplieddataset_configis allowed only if its dataset names are a subset of the subharms, so--max-dataset-size Nstill works while custom names are rejected with a clear error.Crescendo is kept out of the
defaultaggregate (opt-in via--strategies all/--strategies crescendo) since it's the heaviest technique — default runs are single-turn.BASELINE_ATTACK_POLICYis back toEnabled, andVERSIONis bumped 2→3 because the default behavior changed; stored v2 results raise cleanly on--resumerather than silently mixing semantics.One deliberate divergence from the review sketch, flagged inline:
TARGET_REQUIREMENTSis left at the base default rather thanEDITABLE_HISTORY. With Crescendo opt-in, the default run is single-turn, and requiring editable history at the scenario level would rejectOpenAIChatTargetbefore the strategy resolves. Crescendo enforces that requirement itself when it runs. Easy to flip back if preferred.Tests and Documentation
Reworked
tests/unit/scenario/airt/test_psychosocial.pyaround the new shape — 10 test classes:TestPsychosocialStrategyEnum/TestPsychosocialTechniques— enum members, default tags, Crescendo excluded fromdefaultTestSubharmConfigs— both subharms wired with the right dataset, scorer prompt, and crescendo pathTestPsychosocialInitialization— VERSION == 3, baseline policy, no scenario-level target requirementTestPsychosocialDatasetConfigValidation—max_dataset_size=1on a valid subharm doesn't raise; custom / unknown names doTestPsychosocialCrossProduct— atomic attacks are the (technique × subharm) product with correct namesTestPsychosocialPerSubharmScorer— each subharm's attacks (and baseline) get that subharm's scorerTestPsychosocialBaselineHandling— per-subharm baseline names, both present in the display map,include_baseline=Falsesuppresses themTestPsychosocialLazyAdversarialResolution/TestPsychosocialSingleSubharmOverride— lazy adversarial target, single-subharm narrowingValidation:
pytest tests/unit/scenario/airt/test_psychosocial.py→ 53 passedpytest tests/unit/scenario/(full scenario suite) → 725 passed, no regressionspytest tests/unit/backend/test_scenario_run_service.py→ 35 passedruff check+ruff format --check+ty→ all cleanpyrit_scanfast path (--strategies prompt_sending --dataset-names airt_imminent_crisis --max-dataset-size 1) and a default run both complete successfullyDocs updated:
doc/scanner/airt.py(+notebook) for the fast-path command and the--dataset-namesmodel, the programming guide (0_scenarios.ipynb), and the built-in dataset list (1_loading_datasets.ipynb) now includesairt_licensed_therapist.