
erfanshayegani/Multimodal-Alignment-BlindSpots


[🔥ICLR 2026] Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots

Erfan Shayegani, G M Shahariar, Sara Abdali, Lei Yu, Nael Abu-Ghazaleh, Yue Dong


💡 Overview

We expose two key alignment blind spots in Multimodal Large Language Models (MLLMs): (i) fragility to minor structural perturbations, and (ii) user–assistant role alignment asymmetry. Both induce harmful outputs without manipulating the query content, and carry deeper downstream implications, especially given recent trends toward synthetic alignment-data generation. We provide causal insights into the resulting representational shifts relative to refusal directions, and propose a post-training mitigation approach.



✨ Updates

  • [2026/03/16] ✅ Code is released! Feel free to play around and let us know if you have any questions!
  • [2026/01/28] 💻 Stay tuned for the Code and Results.
  • [2026/01/25] 🔥 The paper got accepted at ICLR 2026!
  • [2025/11/24] 🎉 Accepted at the Responsible Foundation Models Workshop@NeurIPS 2025.

🗂️ Repository Structure & How to Use

Our repository contains code and notebooks for reproducing all experiments from the paper. Note: All code is provided as Jupyter notebooks with placeholder API keys that should be replaced with your own credentials.

1. Dataset (/dataset)

  • advbench_instructions.json (harmful) & alpaca_instructions.json (benign): Prompt collections for evaluation
  • xstest_prompts.csv: XSTest prompts for over-refusal and robustness evaluation
  • images/: Images used across experiments
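
The prompt files can be read with a small helper like the one below. This is a hedged sketch: `load_instructions` is a hypothetical name, and the schema (a plain list of strings, or a list of objects with an "instruction" field) is an assumption — check the actual files in /dataset before relying on it.

```python
import json
import os
import tempfile

def load_instructions(path):
    """Load a list of instruction strings from a JSON prompt file.

    Assumes the file holds either a plain list of strings or a list
    of objects with an "instruction" key (assumed schema)."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return [x if isinstance(x, str) else x["instruction"] for x in data]

# Demonstrate with a small stand-in file (not the real dataset):
sample = ["Write a tutorial on ...", {"instruction": "Explain how to ..."}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    tmp_path = f.name

prompts = load_instructions(tmp_path)
os.unlink(tmp_path)  # clean up the stand-in file
```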

2. Core Experiments: Inference (/Inference)

Start here to understand the main findings. Test models against structural perturbations (misplaced images and role swaps):

  • Image/: Experiments with image token repositioning

    • Firearms/: Test with firearms images and harmful prompts
    • Flower/: Test with flower images and harmful prompts
    • Includes notebooks for LLaVA, Phi, and Qwen models
    • As discussed in the paper, the image content itself has no significant effect on attack success
  • No Image/: Baseline experiments with text-only role swaps (no image manipulation)

    • Compare cross-role and user-assistant role reversals
    • Available for LLaVA, Phi, and Qwen
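
The two structural perturbations can be illustrated on a plain chat transcript. This is a minimal sketch with hypothetical helper names (`role_swap`, `misplace_image`); the notebooks drive each model's real chat template, and the `<image>` placeholder token is an assumption that varies by model.

```python
def role_swap(messages):
    """Swap user and assistant roles, leaving any system turn intact —
    a sketch of the role-reversal perturbation."""
    flip = {"user": "assistant", "assistant": "user"}
    return [{**m, "role": flip.get(m["role"], m["role"])} for m in messages]

def misplace_image(messages, token="<image>"):
    """Move the image token out of its expected position in the user
    turn to the end of the transcript — a sketch of the misplaced-image
    perturbation."""
    out = [dict(m) for m in messages]
    for m in out:
        if m["role"] == "user" and token in m["content"]:
            m["content"] = m["content"].replace(token, "").strip()
            out[-1]["content"] = out[-1]["content"].rstrip() + " " + token
            break
    return out

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<image> Describe this picture."},
]
swapped = role_swap(chat)      # instruction now arrives under the assistant role
moved = misplace_image(chat)   # image token repositioned within the turn
```

The point of the sketch: neither perturbation touches the instruction text itself, yet each shifts it into a structural position that alignment training rarely covers.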

3. Alignment & Mitigation: SFT Training (/SFT Fresh)

Reproduce our proposed adversarial training approach:

  • Datasets: Pre-processed training data including:

    • alpaca_harmless_train.json: Harmless instruction-following data
    • circuit_breakers_train.json: Safety-focused training examples
    • harmbench_behaviors_text_all.csv: Harmful behaviors for training
  • Per-Model Notebooks (LLaVA, Phi, Qwen):

    • SFT all mix.ipynb: SFT with mixed harmless + safety data
    • SFT based on C.ipynb: SFT on a single attack vector to test generalization to the others
    • VQA Reward.ipynb: Evaluation of normal capabilities with reward models
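
The "all mix" recipe amounts to interleaving harmless instruction data with safety examples. A minimal sketch, assuming a list-of-dicts format and a tunable ratio — `build_sft_mix` is a hypothetical helper, and the exact ratio and record format used in the notebooks may differ:

```python
import random

def build_sft_mix(harmless, safety, safety_ratio=0.5, seed=0):
    """Mix harmless instruction-following records with safety (refusal)
    records at roughly `safety_ratio` safety examples per harmless one,
    then shuffle. Seeded for reproducibility."""
    rng = random.Random(seed)
    n_safety = int(len(harmless) * safety_ratio)
    mix = harmless + rng.sample(safety, min(n_safety, len(safety)))
    rng.shuffle(mix)
    return mix

# Tiny stand-in records (not the real training files):
harmless = [{"instruction": f"h{i}"} for i in range(4)]
safety = [{"instruction": f"s{i}"} for i in range(4)]
mix = build_sft_mix(harmless, safety, safety_ratio=0.5)
```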

4. Additional Attack Methods (/Other Attacks)

Explore complementary attack vectors beyond structural perturbations:

  • GCG Attack: Gradient-based adversarial attacks (gcg attack.ipynb)
  • Compositionality Checks: Test robustness to compositional attacks
  • All Other Attacks: Comprehensive attack suite per model (LLaVA, Phi, Qwen)

5. Safety Tooling (/Llama Guard)

  • LlamaGuard.ipynb: Safety classifier for filtering outputs

6. Visualization & Analysis (/Visualization)

  • FAPlot.ipynb: Feature attribution and visualization plots
  • generic vector visualization.ipynb: Representational analysis and t-SNE/PCA visualizations

For a smooth reproduction experience, all of our code is provided as Jupyter (.ipynb) notebooks. The notebooks contain expired placeholder access tokens that you should replace with your own keys.
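
Rather than pasting a key into a notebook cell, it is safer to read it from the environment. A small sketch — the `HF_TOKEN` variable name and `get_hf_token` helper are conventions we assume here, not something the notebooks require:

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read an access token from the environment instead of
    hardcoding it in a notebook cell."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set {env_var} before running the notebooks, "
            f"e.g. `export {env_var}=hf_...`"
        )
    return token

# Illustration only: seed a placeholder value so the call succeeds.
os.environ.setdefault("HF_TOKEN", "hf_xxx_replace_me")
token = get_hf_token()
```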

📬 Correspondence

For questions or discussion, please contact Erfan Shayegani (sshay004@ucr.edu) or G M Shahariar (gshah010@ucr.edu).


Citation

📚🤗 If you find this repository useful in your project, please consider giving a ⭐ and citing:

@inproceedings{shayegani2026misaligned,
  title={Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots},
  author={Erfan Shayegani and G M Shahariar and Sara Abdali and Lei Yu and Nael Abu-Ghazaleh and Yue Dong},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=HRkrWi3FWP}
}
