Skip to content

FEAT Allow dataset parameters on dataset configuration when loading from memory#2120

Open
behnam-o wants to merge 7 commits into
microsoft:mainfrom
behnam-o:ds-params
Open

FEAT Allow dataset parameters on dataset configuration when loading from memory#2120
behnam-o wants to merge 7 commits into
microsoft:mainfrom
behnam-o:ds-params

Conversation

@behnam-o

@behnam-o behnam-o commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Adds support for setting parameters on DatasetConfigurations, which are used to filter for a subset of seeds found in memory. This, in combination with dataset_names let a DatasetConfiguration have a more granular way to describe what seeds it provides to callers.

Filters are applied at the memory query layer (get_seeds) during dataset resolution, so --max-dataset-size still caps the filtered set.

  • Exposed filters: harm_categories, data_types (keys are exactly the get_seeds kwargs; values accept a single token or comma-separated list).
  • Adds an optional --dataset-parameters flag (and RunScenarioRequest.dataset_parameters field) that filters a scenario's seeds by field before sampling.
  • Wired through pyrit_scan and pyrit_shell; kept orthogonal to --max-dataset-size.
  • DATASET_FILTERS is the single authoritative allowlist; a guard test asserts every exposed key is a real Sequence-typed get_seeds parameter and stays in sync with the CLI help.

Example:

pyrit_scan airt.rapid_response --target openai_chat --dataset-names airt_hate \
  --dataset-parameters harm_categories=cyber --max-dataset-size 13

@rlundeen2 rlundeen2 self-assigned this Jul 3, 2026
Comment thread pyrit/cli/_cli_args.py
"Creates a new dataset config; fetches all items unless --max-dataset-size is also specified",
"max_dataset_size": "Maximum number of items to use from the dataset (must be >= 1). "
"Limits new datasets if --dataset-names provided, otherwise overrides scenario's default limit",
"dataset_parameters": "Dataset seed filters as KEY=VALUE tokens "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rich agrees with this comment but it is copilot generated:

The comma-list syntax presents one uniform surface, but the two exposed filters resolve with different semantics in get_seeds:

  • harm_categories_add_list_conditions ANDs a .contains() per value, so harm_categories=cyber,violence means "tagged with both cyber AND violence," and it's a substring match.
  • data_types.in_(values), so data_types=text,image_path means "either type," exact match.

A user writing harm_categories=cyber,violence will almost certainly expect a union and instead silently get the (likely empty) intersection. At minimum the help text and the filters docstring should spell out that comma-list semantics are per-field (AND/substring for harm categories, OR/exact for data types). Worth a test that pins each field's behavior against real seeds, since this is the surface most likely to confuse people.

# get_seeds kwarg. Every exposed filter must be a list-valued (Sequence) get_seeds parameter --
# a guard test enforces this and keeps the CLI's advertised keys in sync. Adding a filterable
# field is a one-line change here.
DATASET_FILTERS: frozenset[str] = frozenset({"harm_categories", "data_types"})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rich agrees with this comment but it is copilot generated:

DATASET_FILTERS, _coerce_filter_values, and build_dataset_filters don't belong in dataset_configuration. This module should only consume clean, typed filters (dict[str, list[str]]) and thread them into get_seeds — that part (the filters field, update_filters, the property) is correctly placed. But parsing and validating raw user input is a CLI / request-surface responsibility, not this module's:

  • _coerce_filter_values (comma-splitting raw strings) is CLI input parsing — it only exists because the CLI hands over raw strings. It belongs in _cli_args.py next to parse_dataset_parameter/_parse_initializer_arg.
  • build_dataset_filters (allow-list validation + the "Unknown dataset parameter" error) is request validation. The GUI submits requests too, so this should validate server-side — naturally a RunScenarioRequest validator or the service layer.
  • DATASET_FILTERS describes the CLI/API exposed surface, not a property of this config (the config happily accepts any get_seeds kwarg via update_filters/filters= — the authors test proves it). Housing it here is what forced the duplicated _ADVERTISED_DATASET_FILTER_KEYS in the CLI plus a cross-module sync test.

Moving these out to the CLI/request surface collapses the two allow-lists into one and lets the guard test go away.

strategies: list[str] | None = Field(None, description="Strategy names to use (uses scenario default if omitted)")
dataset_names: list[str] | None = Field(None, description="Dataset names to use (uses scenario default if omitted)")
max_dataset_size: int | None = Field(None, ge=1, description="Maximum items per dataset")
dataset_parameters: dict[str, Any] | None = Field(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rich agrees with this comment but it is copilot generated:

Three names for one concept makes this hard to follow: it's dataset_parameters at the CLI/request boundary, filters internally, and the comment/test call it a "registry." It isn't a registry — PyRIT registries build and store components (TargetRegistry, AttackTechniqueRegistry); this is a hardcoded frozenset of two strings. Calling it a registry implies an extension point that doesn't exist (you can't register a filter, you edit a literal). Please drop the "registry" framing (the comment text and the TestDatasetFilterRegistry name), and ideally settle on one term — I'd pick "filters" end-to-end, or if dataset_parameters is the intended public name, at least don't call the constant a registry.

Comment thread pyrit/cli/pyrit_scan.py
if parsed_args.max_dataset_size is not None:
kwargs["max_dataset_size"] = parsed_args.max_dataset_size
if parsed_args.dataset_parameters:
kwargs["dataset_parameters"] = dict(parsed_args.dataset_parameters)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rich agrees with this comment but it is copilot generated:

--dataset-parameters harm_categories=cyber harm_categories=violencedict(...) keeps only the last, silently dropping cyber. Given harm_categories is AND/substring, a user who repeats the key is almost certainly trying to add constraints and gets none of the earlier ones with no warning. Prefer failing loud: detect duplicate keys when folding the token list into a dict and raise ValueError("Duplicate dataset parameter 'harm_categories'; combine values with commas: harm_categories=cyber,violence"). Cheap to do at the same spot the tuples are collapsed (same applies to the shell path in pyrit_shell.py).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants