Skip to content

Alvin#21

Open
AlvinYu025 wants to merge 4 commits into
patrickrchao:mainfrom
Ashindustry007:alvin
Open

Alvin#21
AlvinYu025 wants to merge 4 commits into
patrickrchao:mainfrom
Ashindustry007:alvin

Conversation

@AlvinYu025

Copy link
Copy Markdown

No description provided.

AlvinYu025 and others added 2 commits June 4, 2026 03:07
- Run all models through the UCSD TritonAI OpenAI-compatible gateway.
- Working trio: deepseek-v4-flash (attacker) / llama-4-scout (target) /
  claude-sonnet-4-6 (judge).
- Fix Claude temperature+top_p conflict, tolerate empty/failed streams,
  placeholder fallback for unparseable attacks.
- Add run_ucsd_pair_variant.py + run_variant.sh launcher + README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Preserve upstream PAIR readme as README_UPSTREAM_PAIR.md; new README.md covers
environment setup, the 4 gateway bugs found & fixed, model-selection probe,
run instructions, and the 18%/41.2-queries result.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- jbc_eval.py + jbc_templates.json + JBC_README.md: provider-agnostic JailbreakChat
  (JBC) baseline reproducing the JBC row of PAIR Table 2 (AIM from the paper + 8
  templates from the In-The-Wild Jailbreak Prompts archive). Run via
  'python jbc_eval.py --target-model <model>' with API keys in env vars.
- transfer_paper.py / transfer_eval.py: Table 3 transferability (replay source
  jailbreak prompts on downstream targets).
- judges.py: LlamaGuard4Judge with refusal-aware prompt + deterministic refusal
  pre-filter (fixes Llama-Guard-4 false positives on articulate refusals).
- config.py / language_models.py: register Qwen2.5-7B attacker, Qwen3-235B and
  Gemini-2.5-Flash targets via litellm name overrides + per-model key routing.
- .gitignore: exclude all key files, paper/, result artifacts, local plotting
  scripts, and personal launcher .sh files.
defend_eval.py replays a target's PAIR jailbreak prompts against the same model
under two defenses: SmoothLLM (N=10 q=10% char-swap copies, majority vote) and a
GPT-2 perplexity filter (threshold = max perplexity of the benign JBB goals).
Judged by Llama-Guard-4; reports JB% per defense. On Qwen3-235B: SmoothLLM cuts
ASR 34%->12% while the perplexity filter has no effect (PAIR prompts are fluent
natural language), reproducing the paper's finding that semantic jailbreaks evade
perplexity filtering.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant