FEAT Allow dataset parameters on dataset configuration when loading from memory by behnam-o · Pull Request #2120 · microsoft/PyRIT

behnam-o · 2026-07-02T21:03:13Z

Adds support for setting parameters on DatasetConfigurations, which are used to filter for a subset of seeds found in memory. This, in combination with dataset_names let a DatasetConfiguration have a more granular way to describe what seeds it provides to callers.

Filters are applied at the memory query layer (get_seeds) during dataset resolution, so --max-dataset-size still caps the filtered set.

Exposed filters: harm_categories, data_types (keys are exactly the get_seeds kwargs; values accept a single token or comma-separated list).
Adds an optional --dataset-parameters flag (and RunScenarioRequest.dataset_parameters field) that filters a scenario's seeds by field before sampling.
Wired through pyrit_scan and pyrit_shell; kept orthogonal to --max-dataset-size.
DATASET_FILTERS is the single authoritative allowlist; a guard test asserts every exposed key is a real Sequence-typed get_seeds parameter and stays in sync with the CLI help.

Example:

pyrit_scan airt.rapid_response --target openai_chat --dataset-names airt_hate \
  --dataset-parameters harm_categories=cyber --max-dataset-size 13

rlundeen2 · 2026-07-03T00:08:30Z

    "Creates a new dataset config; fetches all items unless --max-dataset-size is also specified",
    "max_dataset_size": "Maximum number of items to use from the dataset (must be >= 1). "
    "Limits new datasets if --dataset-names provided, otherwise overrides scenario's default limit",
+    "dataset_parameters": "Dataset seed filters as KEY=VALUE tokens "


Rich agrees with this comment but it is copilot generated:

The comma-list syntax presents one uniform surface, but the two exposed filters resolve with different semantics in get_seeds:

harm_categories → _add_list_conditions ANDs a .contains() per value, so harm_categories=cyber,violence means "tagged with both cyber AND violence," and it's a substring match.

data_types → .in_(values), so data_types=text,image_path means "either type," exact match.

A user writing harm_categories=cyber,violence will almost certainly expect a union and instead silently get the (likely empty) intersection. At minimum the help text and the filters docstring should spell out that comma-list semantics are per-field (AND/substring for harm categories, OR/exact for data types). Worth a test that pins each field's behavior against real seeds, since this is the surface most likely to confuse people.

rlundeen2 · 2026-07-03T00:08:38Z

+# get_seeds kwarg. Every exposed filter must be a list-valued (Sequence) get_seeds parameter --
+# a guard test enforces this and keeps the CLI's advertised keys in sync. Adding a filterable
+# field is a one-line change here.
+DATASET_FILTERS: frozenset[str] = frozenset({"harm_categories", "data_types"})


Rich agrees with this comment but it is copilot generated:

DATASET_FILTERS, _coerce_filter_values, and build_dataset_filters don't belong in dataset_configuration. This module should only consume clean, typed filters (dict[str, list[str]]) and thread them into get_seeds — that part (the filters field, update_filters, the property) is correctly placed. But parsing and validating raw user input is a CLI / request-surface responsibility, not this module's:

_coerce_filter_values (comma-splitting raw strings) is CLI input parsing — it only exists because the CLI hands over raw strings. It belongs in _cli_args.py next to parse_dataset_parameter/_parse_initializer_arg.

build_dataset_filters (allow-list validation + the "Unknown dataset parameter" error) is request validation. The GUI submits requests too, so this should validate server-side — naturally a RunScenarioRequest validator or the service layer.

DATASET_FILTERS describes the CLI/API exposed surface, not a property of this config (the config happily accepts any get_seeds kwarg via update_filters/filters= — the authors test proves it). Housing it here is what forced the duplicated _ADVERTISED_DATASET_FILTER_KEYS in the CLI plus a cross-module sync test.

Moving these out to the CLI/request surface collapses the two allow-lists into one and lets the guard test go away.

rlundeen2 · 2026-07-03T00:08:46Z

    strategies: list[str] | None = Field(None, description="Strategy names to use (uses scenario default if omitted)")
    dataset_names: list[str] | None = Field(None, description="Dataset names to use (uses scenario default if omitted)")
    max_dataset_size: int | None = Field(None, ge=1, description="Maximum items per dataset")
+    dataset_parameters: dict[str, Any] | None = Field(


Rich agrees with this comment but it is copilot generated:

Three names for one concept makes this hard to follow: it's dataset_parameters at the CLI/request boundary, filters internally, and the comment/test call it a "registry." It isn't a registry — PyRIT registries build and store components (TargetRegistry, AttackTechniqueRegistry); this is a hardcoded frozenset of two strings. Calling it a registry implies an extension point that doesn't exist (you can't register a filter, you edit a literal). Please drop the "registry" framing (the comment text and the TestDatasetFilterRegistry name), and ideally settle on one term — I'd pick "filters" end-to-end, or if dataset_parameters is the intended public name, at least don't call the constant a registry.

rlundeen2 · 2026-07-03T00:08:54Z

    if parsed_args.max_dataset_size is not None:
        kwargs["max_dataset_size"] = parsed_args.max_dataset_size
+    if parsed_args.dataset_parameters:
+        kwargs["dataset_parameters"] = dict(parsed_args.dataset_parameters)


Rich agrees with this comment but it is copilot generated:

--dataset-parameters harm_categories=cyber harm_categories=violence → dict(...) keeps only the last, silently dropping cyber. Given harm_categories is AND/substring, a user who repeats the key is almost certainly trying to add constraints and gets none of the earlier ones with no warning. Prefer failing loud: detect duplicate keys when folding the token list into a dict and raise ValueError("Duplicate dataset parameter 'harm_categories'; combine values with commas: harm_categories=cyber,violence"). Cheap to do at the same spot the tuples are collapsed (same applies to the shell path in pyrit_shell.py).

Behnam Ousat and others added 7 commits July 2, 2026 13:53

allow parameters on dataset configuration when loading from memory

136b57e

fix

8e6065f

extra file

c192cb1

keep list

50ed6d0

format

2150369

Merge branch 'main' into ds-params

bafd4de

fix

07108ca

rlundeen2 self-assigned this Jul 3, 2026

rlundeen2 reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

FEAT Allow dataset parameters on dataset configuration when loading from memory#2120

FEAT Allow dataset parameters on dataset configuration when loading from memory#2120
behnam-o wants to merge 7 commits into
microsoft:mainfrom
behnam-o:ds-params

behnam-o commented Jul 2, 2026

Uh oh!

rlundeen2 Jul 3, 2026

Uh oh!

rlundeen2 Jul 3, 2026

Uh oh!

rlundeen2 Jul 3, 2026

Uh oh!

rlundeen2 Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

behnam-o commented Jul 2, 2026

Uh oh!

rlundeen2 Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

rlundeen2 Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

rlundeen2 Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

rlundeen2 Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants