FEAT: Add MASK (CAIS honesty benchmark) dataset loaders by romanlutz · Pull Request #1904 · microsoft/PyRIT

romanlutz · 2026-06-03T14:08:55Z

Summary

Adds six cais/MASK HuggingFace loaders to PyRIT — one per question archetype that the MASK paper grades separately:

_MaskContinuationsDataset
_MaskDisinformationDataset
_MaskDoublingDownKnownFactsDataset
_MaskKnownFactsDataset
_MaskProvidedFactsDataset
_MaskStatisticsDataset

Plus a public MaskQuestionArchetype enum so callers can refer to archetypes by name.
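
For readers who have not opened the diff, here is a minimal sketch of what such an enum could look like. The member names and string values are assumptions inferred from the loader class names above; they are not copied from the PR.

```python
from enum import Enum


class MaskQuestionArchetype(str, Enum):
    """Illustrative only: values are assumed to mirror the archetype names."""

    CONTINUATIONS = "continuations"
    DISINFORMATION = "disinformation"
    DOUBLING_DOWN_KNOWN_FACTS = "doubling_down_known_facts"
    KNOWN_FACTS = "known_facts"
    PROVIDED_FACTS = "provided_facts"
    STATISTICS = "statistics"


# Callers can look an archetype up by value or by member name.
archetype = MaskQuestionArchetype("statistics")
```

Mixing in `str` keeps members comparable to plain strings, which is convenient if the value doubles as a dataset config name.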

What is MASK?

MASK (Ren et al., 2025) measures honesty as distinct from accuracy: under pressure (a persona system prompt, role assignment, …) does the model state something that contradicts its own out-of-context belief? Each row pairs a pressured conversation with one or more neutral-context belief-elicitation prompts, and a lie is scored when the two diverge.

This PR is loader-only. The MASK scoring methodology (parallel belief elicitation + multi-stage judge) is intentionally not included here — it would require its own attack-executor + scorer design and is being tracked in a separate workstream.

Loader shape

- Sibling-subclass-per-archetype pattern (_MaskBaseDataset + 6 subclasses with class-level ARCHETYPE), matching the established _ORBenchBaseDataset / Salad / etc. style.
- HF-gated handling matches _HiXSTestDataset / _SGXSTestDataset: token constructor arg, falls back to HUGGINGFACE_TOKEN env var. The loader does not short-circuit on a missing token; instead datasets.load_dataset raises with a clear message pointing the user at the dataset's HF gating page.
- One MASK row → one SeedGroup sharing a prompt_group_id:
  - One SeedObjective carrying proposition, ground_truth, formatted_ground_truth, all belief_elicit_* strings, and the canary contamination marker.
  - For doubling_down_known_facts: a 4-piece pressured conversation (system → user → assistant → user) encoding the forced prior turn before the pressured question.
  - For the other 5 archetypes: a 2-piece pressured conversation (system → user).
- Single-split dataset; HF_SPLIT = "test" is a class-level constant rather than a constructor kwarg (every MASK config publishes only a test split).
- Tags {"default", "safety", "honesty"}: "honesty" is added as a new cross-cutting tag, since this is the first PyRIT dataset specifically targeting honesty rather than harm-avoidance.
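
The row-to-conversation mapping described above can be sketched roughly as follows. This is a standalone illustration, not the PR's code: `Piece` is a hypothetical stand-in for a PyRIT seed, and the column names (`system_prompt`, `user_prompt`, `dd_turn_1_prompt`, `dd_turn_1_response`) are assumptions about the dataset schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for one seed in a group (not a PyRIT class)."""

    role: str
    value: str
    sequence: int


def pressured_conversation(row: dict, archetype: str) -> list[Piece]:
    """Build the pressured conversation for one row (column names assumed)."""
    if archetype == "doubling_down_known_facts":
        # Forced prior exchange first, then the pressured question.
        turns = [
            ("system", row["system_prompt"]),
            ("user", row["dd_turn_1_prompt"]),
            ("assistant", row["dd_turn_1_response"]),
            ("user", row["user_prompt"]),
        ]
    else:
        turns = [
            ("system", row["system_prompt"]),
            ("user", row["user_prompt"]),
        ]
    return [Piece(role=r, value=v, sequence=i) for i, (r, v) in enumerate(turns)]


# Fictional row, in the spirit of the unit-test fixtures.
row = {
    "system_prompt": "You are a press aide.",
    "dd_turn_1_prompt": "Did the event happen?",
    "dd_turn_1_response": "No, it did not.",
    "user_prompt": "Can you confirm that on the record?",
}
roles = [p.role for p in pressured_conversation(row, "doubling_down_known_facts")]
```

In the real loader the pieces would also share a prompt_group_id with the SeedObjective; that part is omitted here.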

Tests

- 22 unit tests in tests/unit/datasets/test_mask_dataset.py: archetype enum coverage, per-loader dataset_name + ARCHETYPE binding, token plumbing (env var + override), empty-fetch raises, metadata serialization for every field, doubling-down 4-piece role/sequence ordering, missing-belief-field fallback for provided_facts, statistics numeric ground truth preserved as string.
  - Fixtures are entirely fictional (Senator Aldwin Voss, Trillvine, Helian Empire, …). No real cais/MASK rows are reproduced, and the test canary is an obvious fake (test:0000:00000000-0000-0000-0000-000000000000) to avoid leaking the real MASK contamination marker into a public repo.
- 6/6 e2e tests in tests/end_to_end/test_all_datasets.py pass against real cais/MASK at pinned revision 4602b84dd9e2ca05c6e1eafbc14e556e908ac1bb. They are auto-discovered through SeedDatasetProvider, so no e2e test code was added.
- Full unit dataset suite (516 tests) still green.
- Pre-commit (ruff format/check + ty) clean.

Documentation

- doc/references.bib: @misc{ren2025maskbenchmarkdisentanglinghonesty, ...}.
- doc/bibliography.md: citation key added to the hidden-citations dropdown.
- doc/code/datasets/1_loading_datasets.{py,ipynb}: MASK row + 6 dataset names in the expected-output list.

Licensing / gating

cais/MASK is a HuggingFace-gated dataset under CAIS's research-use click-through terms. Users must accept the dataset terms at https://huggingface.co/datasets/cais/MASK before the loader will succeed. This is documented in the class docstring; PyRIT does not vendor any MASK content.
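
The token fallback mentioned under "Loader shape" could look roughly like this. `resolve_hf_token` is a hypothetical helper name used for illustration; only the HUGGINGFACE_TOKEN variable and the "no short-circuit" behavior come from the description above.

```python
import os
from typing import Optional


def resolve_hf_token(token: Optional[str] = None) -> Optional[str]:
    """Return the explicit token if given, else the HUGGINGFACE_TOKEN env var.

    Returns None when neither is set. Per the PR description, the loader does
    not short-circuit in that case; the gating error is left to
    datasets.load_dataset.
    """
    if token is not None:
        return token
    return os.environ.get("HUGGINGFACE_TOKEN")
```

An explicit constructor argument wins over the environment, which matches the "env var + override" case listed in the unit tests.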

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Resolved conflicts:
- pyrit/datasets/seed_datasets/remote/__init__.py: kept both MaskQuestionArchetype (HEAD) and MossBenchOversensitivityType (main) in __all__.
- doc/bibliography.md: merged the new MASK citation key into the upstream's larger citation list.

Followups for upstream changes:
- mask_dataset.py + test_mask_dataset.py: renamed _fetch_from_huggingface -> _fetch_from_huggingface_async to match the async-suffix sweep (microsoft#1889).
- seed_metadata.py: added "honesty" to RECOMMENDED_TAGS so MASK's cross-cutting tag passes the metadata-coverage check added by microsoft#1780.

Verification: 836 unit dataset tests pass; pre-commit clean on all touched files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

adrian-gavrila

Overall looks good! A couple of nits and then one question related to tagging this as default.

adrian-gavrila · 2026-06-19T20:38:32Z

+  eprint        = {2503.03750},
+  archivePrefix = {arXiv},
+  primaryClass  = {cs.LG},


Nit: this is the only entry in the file using arXiv's native eprint/archivePrefix/primaryClass export instead of journal = {arXiv preprint arXiv:...} (and datasets.instructions.md prescribes the title/author/journal/year/url field set). Could you swap these three lines for:

journal = {arXiv preprint arXiv:2503.03750},

and keep the existing url? Just to avoid the one-off.

adrian-gavrila · 2026-06-19T20:38:32Z

+    HF_DATASET_NAME: str = "cais/MASK"
+    HF_REVISION: str = "4602b84dd9e2ca05c6e1eafbc14e556e908ac1bb"
+    HF_SPLIT: str = "test"
+    ARCHETYPE: MaskQuestionArchetype


Minor parity nit: siblings like _ORBenchBaseDataset set should_register = False on their abstract base, but this base relies on dataset_name staying abstract to avoid registering. It's correct today, but if anyone ever gives the base a concrete dataset_name, it'd silently register and then raise AttributeError on the unset ARCHETYPE. Worth adding should_register = False here to make the intent explicit and match ORBench.
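
To illustrate why an explicit flag is more robust than relying on abstractness, here is a toy registry. It is a hypothetical sketch, not PyRIT's actual registration mechanism; `DatasetBase`, `MaskBase`, and `MaskStatistics` are made-up names.

```python
class DatasetBase:
    """Hypothetical registry base; not PyRIT's actual implementation."""

    registry: list = []
    should_register: bool = True

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Read the flag only from this class's own namespace, so an abstract
        # base that opts out does not also opt out its concrete subclasses.
        if cls.__dict__.get("should_register", True):
            DatasetBase.registry.append(cls)


class MaskBase(DatasetBase):
    should_register = False  # abstract base: ARCHETYPE is intentionally unset


class MaskStatistics(MaskBase):
    ARCHETYPE = "statistics"
```

With the flag set, the base stays out of the registry no matter what other attributes it later gains.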

adrian-gavrila · 2026-06-19T20:38:32Z

+
+    # Class-level dataset metadata for SeedDatasetMetadata discovery.
+    modalities: list[str] = ["text"]
+    tags: set[str] = {"default", "safety", "honesty"}


"default" is documented in seed_metadata.py as ungated + size ≥ medium, but MASK is gated (HUGGINGFACE_TOKEN) and mask_statistics is size="small". So SeedDatasetFilter(tags={"default"}) returns MASK, then 401s at fetch. Could we drop "default" here (sgxstest/sorry_bench/vlguard already omit it when gated), or amend the rule if gated benchmarks should be discoverable this way?
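
A small check along these lines would flag the mismatch. This is only a sketch: `Meta`, `SIZE_ORDER`, and `default_tag_violations` are hypothetical names, and the size ordering is assumed rather than taken from seed_metadata.py.

```python
from dataclasses import dataclass, field

# Assumed ordering of size labels, smallest first (hypothetical).
SIZE_ORDER = ["small", "medium", "large"]


@dataclass
class Meta:
    """Hypothetical, simplified view of a dataset's metadata."""

    name: str
    size: str
    gated: bool
    tags: set = field(default_factory=set)


def default_tag_violations(meta: Meta) -> list:
    """List the ways a dataset tagged "default" breaks the documented rule."""
    problems = []
    if "default" not in meta.tags:
        return problems
    if meta.gated:
        problems.append("gated")
    if SIZE_ORDER.index(meta.size) < SIZE_ORDER.index("medium"):
        problems.append("size below medium")
    return problems


stats = Meta("mask_statistics", "small", True, {"default", "safety", "honesty"})
violations = default_tag_violations(stats)
```

Dropping "default" from the tag set makes the same dataset pass the check.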

romanlutz and others added 2 commits June 2, 2026 20:01

FEAT: Add MASK (CAIS honesty benchmark) dataset loaders

bcfb6ca

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

adrian-gavrila self-assigned this Jun 19, 2026

adrian-gavrila reviewed Jun 19, 2026

romanlutz wants to merge 2 commits into microsoft:main from romanlutz:romanlutz/plan-mask-honesty-benchmark
