Alvin by AlvinYu025 · Pull Request #21 · patrickrchao/JailbreakingLLMs

AlvinYu025 · 2026-06-06T23:13:12Z

No description provided.

- Run all models through the UCSD TritonAI OpenAI-compatible gateway.
- Working trio: deepseek-v4-flash (attacker) / llama-4-scout (target) / claude-sonnet-4-6 (judge).
- Fix Claude temperature+top_p conflict, tolerate empty/failed streams, placeholder fallback for unparseable attacks.
- Add run_ucsd_pair_variant.py + run_variant.sh launcher + README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
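The two gateway fixes named above (the Claude temperature+top_p conflict and the placeholder fallback for unparseable attacks) could look roughly like the sketch below. It is not the code from this PR: the helper names `sanitize_sampling_params` and `parse_attack`, the choice to drop `top_p` rather than `temperature`, and the placeholder text are all assumptions made for illustration. The `improvement`/`prompt` keys follow the attacker JSON format used by PAIR.

```python
import json

# Hypothetical placeholder returned when the attacker output cannot be parsed.
PLACEHOLDER_ATTACK = {"improvement": "", "prompt": "[unparseable attacker output]"}


def sanitize_sampling_params(model: str, params: dict) -> dict:
    """Return a copy of params that never sends both temperature and top_p to Claude.

    Assumption: dropping top_p (and keeping temperature) is an acceptable fix.
    """
    cleaned = dict(params)
    if "claude" in model.lower() and "temperature" in cleaned and "top_p" in cleaned:
        cleaned.pop("top_p")
    return cleaned


def parse_attack(raw):
    """Extract the attacker's JSON object, or fall back to a placeholder.

    Handles empty or failed streams (None or "") and malformed JSON alike.
    """
    if not raw:
        return dict(PLACEHOLDER_ATTACK)
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(PLACEHOLDER_ATTACK)
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(PLACEHOLDER_ATTACK)
    if not isinstance(parsed, dict) or not all(k in parsed for k in ("improvement", "prompt")):
        return dict(PLACEHOLDER_ATTACK)
    return parsed
```

Returning a placeholder instead of raising keeps a batch of attack streams running when one attacker reply is empty or malformed, at the cost of spending one target query on a prompt that cannot succeed.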

Preserve upstream PAIR readme as README_UPSTREAM_PAIR.md; new README.md covers environment setup, the 4 gateway bugs found & fixed, model-selection probe, run instructions, and the 18%/41.2-queries result.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- jbc_eval.py + jbc_templates.json + JBC_README.md: provider-agnostic JailbreakChat (JBC) baseline reproducing the JBC row of PAIR Table 2 (AIM from the paper + 8 templates from the In-The-Wild Jailbreak Prompts archive). Run via 'python jbc_eval.py --target-model <model>' with API keys in env vars.
- transfer_paper.py / transfer_eval.py: Table 3 transferability (replay source jailbreak prompts on downstream targets).
- judges.py: LlamaGuard4Judge with refusal-aware prompt + deterministic refusal pre-filter (fixes Llama-Guard-4 false positives on articulate refusals).
- config.py / language_models.py: register Qwen2.5-7B attacker, Qwen3-235B and Gemini-2.5-Flash targets via litellm name overrides + per-model key routing.
- .gitignore: exclude all key files, paper/, result artifacts, local plotting scripts, and personal launcher .sh files.
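The deterministic refusal pre-filter in judges.py is only described, not shown, so the following is a minimal sketch of the idea rather than the PR's implementation. The marker list, the 200-character window, and the names `looks_like_refusal` and `judge` are hypothetical; `classifier` stands in for whatever call reaches Llama-Guard-4.

```python
from typing import Callable

# Hypothetical refusal markers; the actual list in judges.py is not shown in this PR page.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i won't",
    "i'm sorry",
    "i am sorry",
    "i'm not able to",
    "i am unable to",
)


def looks_like_refusal(response: str, window: int = 200) -> bool:
    """Deterministically flag a response whose opening contains a refusal marker."""
    # Normalize curly apostrophes so "I’m sorry" matches "i'm sorry".
    head = response.strip().lower().replace("\u2019", "'")[:window]
    return any(marker in head for marker in REFUSAL_MARKERS)


def judge(response: str, classifier: Callable[[str], bool]) -> bool:
    """Return True if the response counts as jailbroken.

    Refusals short-circuit to False before the (assumed) classifier is called,
    so an articulate refusal can never be scored as a jailbreak.
    """
    if looks_like_refusal(response):
        return False
    return classifier(response)
```

A string-match pre-filter trades recall for determinism: a response that apologizes and then complies anyway would be scored as safe, so the window and marker list would need checking against real transcripts.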

defend_eval.py replays a target's PAIR jailbreak prompts against the same model under two defenses: SmoothLLM (N=10 perturbed copies, each with q=10% of characters swapped, combined by majority vote) and a GPT-2 perplexity filter (threshold = max perplexity over the benign JBB goals). Responses are judged by Llama-Guard-4, and the script reports JB% per defense. On Qwen3-235B, SmoothLLM cuts ASR from 34% to 12%, while the perplexity filter has no effect because PAIR prompts are fluent natural language. This reproduces the paper's finding that semantic jailbreaks evade perplexity filtering.
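The two defenses can be sketched as follows. This is an illustration under stated assumptions, not the contents of defend_eval.py: the function names, the `string.printable` replacement alphabet, the strict-majority tie rule, and the seeded RNG are all hypothetical choices. `respond`, `is_jailbroken`, and `ppl` are caller-supplied callables standing in for the target model, the Llama-Guard-4 judge, and GPT-2 perplexity scoring.

```python
import random
import string
from typing import Callable, Iterable


def swap_perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Replace a fraction q of character positions with random printable characters."""
    chars = list(prompt)
    k = int(len(chars) * q)
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def smoothllm_is_jailbroken(
    prompt: str,
    respond: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n: int = 10,
    q: float = 0.10,
    seed: int = 0,
) -> bool:
    """Query n perturbed copies and take a majority vote over the judge's verdicts.

    Assumption: a tie (exactly n/2 votes) counts as not jailbroken.
    """
    rng = random.Random(seed)
    votes = sum(is_jailbroken(respond(swap_perturb(prompt, q, rng))) for _ in range(n))
    return votes > n / 2


def perplexity_threshold(benign_goals: Iterable[str], ppl: Callable[[str], float]) -> float:
    """Threshold = maximum perplexity observed over the benign goals."""
    return max(ppl(goal) for goal in benign_goals)


def passes_perplexity_filter(prompt: str, threshold: float, ppl: Callable[[str], float]) -> bool:
    """A prompt passes (is not blocked) when its perplexity does not exceed the threshold."""
    return ppl(prompt) <= threshold
```

Setting the threshold to the maximum benign perplexity means no benign goal is ever blocked. It also means a fluent attack prompt will normally pass, which fits the reported result that the filter left ASR unchanged.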

AlvinYu025 and others added 2 commits June 4, 2026 03:07

AlvinYu025 force-pushed the alvin branch from 271bd74 to 885fc3a Compare June 6, 2026 23:16

AlvinYu025 force-pushed the alvin branch from 885fc3a to f1b7746 Compare June 6, 2026 23:25

AlvinYu025 wants to merge 4 commits into patrickrchao:main from Ashindustry007:alvin
