FEAT Allow dataset parameters on dataset configuration when loading from memory#2120
FEAT Allow dataset parameters on dataset configuration when loading from memory#2120behnam-o wants to merge 7 commits into
Conversation
| "Creates a new dataset config; fetches all items unless --max-dataset-size is also specified", | ||
| "max_dataset_size": "Maximum number of items to use from the dataset (must be >= 1). " | ||
| "Limits new datasets if --dataset-names provided, otherwise overrides scenario's default limit", | ||
| "dataset_parameters": "Dataset seed filters as KEY=VALUE tokens " |
There was a problem hiding this comment.
Rich agrees with this comment but it is copilot generated:
The comma-list syntax presents one uniform surface, but the two exposed filters resolve with different semantics in get_seeds:
harm_categories→_add_list_conditionsANDs a.contains()per value, soharm_categories=cyber,violencemeans "tagged with both cyber AND violence," and it's a substring match.data_types→.in_(values), sodata_types=text,image_pathmeans "either type," exact match.
A user writing harm_categories=cyber,violence will almost certainly expect a union and instead silently get the (likely empty) intersection. At minimum the help text and the filters docstring should spell out that comma-list semantics are per-field (AND/substring for harm categories, OR/exact for data types). Worth a test that pins each field's behavior against real seeds, since this is the surface most likely to confuse people.
| # get_seeds kwarg. Every exposed filter must be a list-valued (Sequence) get_seeds parameter -- | ||
| # a guard test enforces this and keeps the CLI's advertised keys in sync. Adding a filterable | ||
| # field is a one-line change here. | ||
| DATASET_FILTERS: frozenset[str] = frozenset({"harm_categories", "data_types"}) |
There was a problem hiding this comment.
Rich agrees with this comment but it is copilot generated:
DATASET_FILTERS, _coerce_filter_values, and build_dataset_filters don't belong in dataset_configuration. This module should only consume clean, typed filters (dict[str, list[str]]) and thread them into get_seeds — that part (the filters field, update_filters, the property) is correctly placed. But parsing and validating raw user input is a CLI / request-surface responsibility, not this module's:
_coerce_filter_values(comma-splitting raw strings) is CLI input parsing — it only exists because the CLI hands over raw strings. It belongs in_cli_args.pynext toparse_dataset_parameter/_parse_initializer_arg.build_dataset_filters(allow-list validation + the "Unknown dataset parameter" error) is request validation. The GUI submits requests too, so this should validate server-side — naturally aRunScenarioRequestvalidator or the service layer.DATASET_FILTERSdescribes the CLI/API exposed surface, not a property of this config (the config happily accepts anyget_seedskwarg viaupdate_filters/filters=— theauthorstest proves it). Housing it here is what forced the duplicated_ADVERTISED_DATASET_FILTER_KEYSin the CLI plus a cross-module sync test.
Moving these out to the CLI/request surface collapses the two allow-lists into one and lets the guard test go away.
| strategies: list[str] | None = Field(None, description="Strategy names to use (uses scenario default if omitted)") | ||
| dataset_names: list[str] | None = Field(None, description="Dataset names to use (uses scenario default if omitted)") | ||
| max_dataset_size: int | None = Field(None, ge=1, description="Maximum items per dataset") | ||
| dataset_parameters: dict[str, Any] | None = Field( |
There was a problem hiding this comment.
Rich agrees with this comment but it is copilot generated:
Three names for one concept makes this hard to follow: it's dataset_parameters at the CLI/request boundary, filters internally, and the comment/test call it a "registry." It isn't a registry — PyRIT registries build and store components (TargetRegistry, AttackTechniqueRegistry); this is a hardcoded frozenset of two strings. Calling it a registry implies an extension point that doesn't exist (you can't register a filter, you edit a literal). Please drop the "registry" framing (the comment text and the TestDatasetFilterRegistry name), and ideally settle on one term — I'd pick "filters" end-to-end, or if dataset_parameters is the intended public name, at least don't call the constant a registry.
| if parsed_args.max_dataset_size is not None: | ||
| kwargs["max_dataset_size"] = parsed_args.max_dataset_size | ||
| if parsed_args.dataset_parameters: | ||
| kwargs["dataset_parameters"] = dict(parsed_args.dataset_parameters) |
There was a problem hiding this comment.
Rich agrees with this comment but it is copilot generated:
--dataset-parameters harm_categories=cyber harm_categories=violence → dict(...) keeps only the last, silently dropping cyber. Given harm_categories is AND/substring, a user who repeats the key is almost certainly trying to add constraints and gets none of the earlier ones with no warning. Prefer failing loud: detect duplicate keys when folding the token list into a dict and raise ValueError("Duplicate dataset parameter 'harm_categories'; combine values with commas: harm_categories=cyber,violence"). Cheap to do at the same spot the tuples are collapsed (same applies to the shell path in pyrit_shell.py).
Adds support for setting parameters on DatasetConfigurations, which are used to filter for a subset of seeds found in memory. This, in combination with
dataset_nameslet a DatasetConfiguration have a more granular way to describe what seeds it provides to callers.Filters are applied at the memory query layer (
get_seeds) during dataset resolution, so--max-dataset-sizestill caps the filtered set.harm_categories,data_types(keys are exactly theget_seedskwargs; values accept a single token or comma-separated list).--dataset-parametersflag (andRunScenarioRequest.dataset_parametersfield) that filters a scenario's seeds by field before sampling.pyrit_scanandpyrit_shell; kept orthogonal to--max-dataset-size.DATASET_FILTERSis the single authoritative allowlist; a guard test asserts every exposed key is a realSequence-typedget_seedsparameter and stays in sync with the CLI help.Example: