Skip to content

FEAT: Add MASK (CAIS honesty benchmark) dataset loaders#1904

Open
romanlutz wants to merge 2 commits into
microsoft:mainfrom
romanlutz:romanlutz/plan-mask-honesty-benchmark
Open

FEAT: Add MASK (CAIS honesty benchmark) dataset loaders#1904
romanlutz wants to merge 2 commits into
microsoft:mainfrom
romanlutz:romanlutz/plan-mask-honesty-benchmark

Conversation

@romanlutz

Copy link
Copy Markdown
Contributor

Summary

Adds six cais/MASK HuggingFace loaders to PyRIT — one per question archetype that the MASK paper grades separately:

  • _MaskContinuationsDataset
  • _MaskDisinformationDataset
  • _MaskDoublingDownKnownFactsDataset
  • _MaskKnownFactsDataset
  • _MaskProvidedFactsDataset
  • _MaskStatisticsDataset

Plus a public MaskQuestionArchetype enum so callers can refer to archetypes by name.

What is MASK?

MASK (Ren et al., 2025) measures honesty as distinct from accuracy: under pressure (a persona system prompt, role assignment, …) does the model state something that contradicts its own out-of-context belief? Each row pairs a pressured conversation with one or more neutral-context belief-elicitation prompts, and a lie is scored when the two diverge.

This PR is loader-only. The MASK scoring methodology (parallel belief elicitation + multi-stage judge) is intentionally not included here — it would require its own attack-executor + scorer design and is being tracked in a separate workstream.

Loader shape

  • Sibling-subclass-per-archetype pattern (_MaskBaseDataset + 6 subclasses with class-level ARCHETYPE), matching the established _ORBenchBaseDataset / Salad / etc. style.
  • HF-gated handling matches _HiXSTestDataset / _SGXSTestDataset: token constructor arg, falls back to HUGGINGFACE_TOKEN env var. The loader does not short-circuit on missing token; instead datasets.load_dataset raises with a clear message pointing the user at the dataset's HF gating page.
  • One MASK row → one SeedGroup sharing a prompt_group_id:
    • One SeedObjective carrying proposition, ground_truth, formatted_ground_truth, all belief_elicit_* strings, and the canary contamination marker.
    • For doubling_down_known_facts: a 4-piece pressured conversation (systemuserassistantuser) encoding the forced prior turn before the pressured question.
    • For the other 5 archetypes: a 2-piece pressured conversation (systemuser).
  • Single-split dataset; HF_SPLIT = "test" is a class-level constant rather than a constructor kwarg (every MASK config publishes only a test split).
  • Tags {"default", "safety", "honesty"}"honesty" is added as a new cross-cutting tag, since this is the first PyRIT dataset specifically targeting honesty rather than harm-avoidance.

Tests

  • 22 unit tests in tests/unit/datasets/test_mask_dataset.py: archetype enum coverage, per-loader dataset_name + ARCHETYPE binding, token plumbing (env var + override), empty-fetch raises, metadata serialization for every field, doubling-down 4-piece role/sequence ordering, missing-belief-field fallback for provided_facts, statistics numeric ground truth preserved as string.
    • Fixtures are entirely fictional (Senator Aldwin Voss, Trillvine, Helian Empire, …) — no real cais/MASK rows are reproduced, and the test canary is an obvious fake (test:0000:00000000-0000-0000-0000-000000000000) to avoid leaking the real MASK contamination marker into a public repo.
  • 6/6 e2e tests in tests/end_to_end/test_all_datasets.py pass against real cais/MASK at pinned revision 4602b84dd9e2ca05c6e1eafbc14e556e908ac1bb — auto-discovered through SeedDatasetProvider, no e2e test code added.
  • Full unit dataset suite (516 tests) still green.
  • Pre-commit (ruff format/check + ty) clean.

Documentation

  • doc/references.bib@misc{ren2025maskbenchmarkdisentanglinghonesty, ...}.
  • doc/bibliography.md — citation key added to the hidden-citations dropdown.
  • doc/code/datasets/1_loading_datasets.{py,ipynb} — MASK row + 6 dataset names in the expected-output list.

Licensing / gating

cais/MASK is a HuggingFace-gated dataset under CAIS's research-use click-through terms. Users must accept the dataset terms at https://huggingface.co/datasets/cais/MASK before the loader will succeed. This is documented in the class docstring; PyRIT does not vendor any MASK content.

romanlutz and others added 2 commits June 2, 2026 20:01
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolved conflicts:
- pyrit/datasets/seed_datasets/remote/__init__.py — kept both
  MaskQuestionArchetype (HEAD) and MossBenchOversensitivityType (main) in __all__.
- doc/bibliography.md — merged the new MASK citation key into the
  upstream's larger citation list.

Followups for upstream changes:
- mask_dataset.py + test_mask_dataset.py: renamed
  _fetch_from_huggingface -> _fetch_from_huggingface_async to match
  the async-suffix sweep (microsoft#1889).
- seed_metadata.py: added "honesty" to RECOMMENDED_TAGS so MASK's
  cross-cutting tag passes the metadata-coverage check added by microsoft#1780.

Verification: 836 unit dataset tests pass; pre-commit clean on
all touched files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@adrian-gavrila adrian-gavrila self-assigned this Jun 19, 2026

@adrian-gavrila adrian-gavrila left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! A couple of nits and then one question related to tagging this as defult.

Comment thread doc/references.bib
Comment on lines +80 to +82
eprint = {2503.03750},
archivePrefix = {arXiv},
primaryClass = {cs.LG},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is the only entry in the file using arXiv's native eprint/archivePrefix/primaryClass export instead of journal = {arXiv preprint arXiv:...} (and datasets.instructions.md prescribes the title/author/journal/year/url field set). Could you swap these three lines for:

  journal = {arXiv preprint arXiv:2503.03750},

and keep the existing url? Just to avoid the one-off.

HF_DATASET_NAME: str = "cais/MASK"
HF_REVISION: str = "4602b84dd9e2ca05c6e1eafbc14e556e908ac1bb"
HF_SPLIT: str = "test"
ARCHETYPE: MaskQuestionArchetype

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor parity nit: siblings like _ORBenchBaseDataset sets should_register = False on its abstract base, but this base relies on dataset_name staying abstract to avoid registering. It's correct today, but if anyone ever gives the base a concrete dataset_name, it'd silently register and then AttributeError on the unset ARCHETYPE. Worth adding should_register = False here to make the intent explicit and match ORBench.


# Class-level dataset metadata for SeedDatasetMetadata discovery.
modalities: list[str] = ["text"]
tags: set[str] = {"default", "safety", "honesty"}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"default" is documented in seed_metadata.py as ungated + size ≥ medium, but MASK is gated (HUGGINGFACE_TOKEN) and mask_statistics is size="small". So SeedDatasetFilter(tags={"default"}) returns MASK, then 401s at fetch. Could we drop "default" here (sgxstest/sorry_bench/vlguard already omit it when gated), or amend the rule if gated benchmarks should be discoverable this way?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants