
erfanshayegani/Multimodal-Alignment-BlindSpots


[🔥ICLR 2026] Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots

Erfan Shayegani, G M Shahariar, Sara Abdali, Lei Yu, Nael Abu-Ghazaleh, Yue Dong


💡 Overview

We expose two key alignment blind spots in Multimodal Large Language Models (MLLMs): (i) fragility to minor structural perturbations, and (ii) user–assistant role alignment asymmetry. Both induce harmful outputs without manipulating the query content, and carry deeper downstream implications, especially given recent trends toward synthetic alignment-data generation. We provide causal insights into the resulting representational shifts relative to refusal directions, and propose a post-training mitigation approach.



✨ Updates

  • [2026/03/16] ✅ Code is released! Feel free to play around and let us know if you have any questions!
  • [2026/01/28] 💻 Stay tuned for the Code and Results.
  • [2026/01/25] 🔥 The paper got accepted at ICLR 2026!
  • [2025/11/24] 🎉 Accepted at the Responsible Foundation Models Workshop@NeurIPS 2025.

🗂️ Repository Structure & How to Use

Our repository contains code and notebooks for reproducing all experiments from the paper. Note: All code is provided as Jupyter notebooks with placeholder API keys that should be replaced with your own credentials.

1. Dataset (/dataset)

  • advbench_instructions.json (harmful) & alpaca_instructions.json (benign): Prompt collections for evaluation
  • xstest_prompts.csv: XSTest prompts for over-refusal and robustness evaluation
  • images/: Images used across experiments
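
The prompt files can be read with a small helper like the one below. This is a hedged sketch: `load_instructions` is a hypothetical name, and the schema (a plain list of strings, or a list of objects with an "instruction" field) is an assumption — check the actual files in /dataset before relying on it.

```python
import json
import os
import tempfile

def load_instructions(path):
    """Load a list of instruction strings from a JSON prompt file.

    Assumes the file holds either a plain list of strings or a list
    of objects with an "instruction" key (assumed schema)."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return [x if isinstance(x, str) else x["instruction"] for x in data]

# Demonstrate with a small stand-in file (not the real dataset):
sample = ["Write a tutorial on ...", {"instruction": "Explain how to ..."}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    tmp_path = f.name

prompts = load_instructions(tmp_path)
os.unlink(tmp_path)  # clean up the stand-in file
```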

2. Core Experiments: Inference (/Inference)

Start here to understand the main findings. Test models against structural perturbations (misplaced images and role swaps):

  • Image/: Experiments with image token repositioning

    • Firearms/: Test with firearms images and harmful prompts
    • Flower/: Test with flower images and harmful prompts
    • Includes notebooks for LLaVA, Phi, and Qwen models
    • As discussed in the paper, the image content itself has no significant effect on attack success
  • No Image/: Baseline experiments with text-only role swaps (no image manipulation)

    • Compare cross-role and user-assistant role reversals
    • Available for LLaVA, Phi, and Qwen
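
The two structural perturbations can be illustrated on a plain chat transcript. This is a minimal sketch with hypothetical helper names (`role_swap`, `misplace_image`); the notebooks drive each model's real chat template, and the `<image>` placeholder token is an assumption that varies by model.

```python
def role_swap(messages):
    """Swap user and assistant roles, leaving any system turn intact —
    a sketch of the role-reversal perturbation."""
    flip = {"user": "assistant", "assistant": "user"}
    return [{**m, "role": flip.get(m["role"], m["role"])} for m in messages]

def misplace_image(messages, token="<image>"):
    """Move the image token out of its expected position in the user
    turn to the end of the transcript — a sketch of the misplaced-image
    perturbation."""
    out = [dict(m) for m in messages]
    for m in out:
        if m["role"] == "user" and token in m["content"]:
            m["content"] = m["content"].replace(token, "").strip()
            out[-1]["content"] = out[-1]["content"].rstrip() + " " + token
            break
    return out

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<image> Describe this picture."},
]
swapped = role_swap(chat)      # instruction now arrives under the assistant role
moved = misplace_image(chat)   # image token repositioned within the turn
```

The point of the sketch: neither perturbation touches the instruction text itself, yet each shifts it into a structural position that alignment training rarely covers.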

3. Alignment & Mitigation: SFT Training (/SFT Fresh)

Reproduce our proposed adversarial training approach:

  • Datasets: Pre-processed training data including:

    • alpaca_harmless_train.json: Harmless instruction-following data
    • circuit_breakers_train.json: Safety-focused training examples
    • harmbench_behaviors_text_all.csv: Harmful behaviors for training
  • Per-Model Notebooks (LLaVA, Phi, Qwen):

    • SFT all mix.ipynb: SFT with mixed harmless + safety data
    • SFT based on C.ipynb: SFT on a single attack vector to test generalization to the others
    • VQA Reward.ipynb: Evaluation of normal capabilities with reward models
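
The "all mix" recipe amounts to interleaving harmless instruction data with safety examples. A minimal sketch, assuming a list-of-dicts format and a tunable ratio — `build_sft_mix` is a hypothetical helper, and the exact ratio and record format used in the notebooks may differ:

```python
import random

def build_sft_mix(harmless, safety, safety_ratio=0.5, seed=0):
    """Mix harmless instruction-following records with safety (refusal)
    records at roughly `safety_ratio` safety examples per harmless one,
    then shuffle. Seeded for reproducibility."""
    rng = random.Random(seed)
    n_safety = int(len(harmless) * safety_ratio)
    mix = harmless + rng.sample(safety, min(n_safety, len(safety)))
    rng.shuffle(mix)
    return mix

# Tiny stand-in records (not the real training files):
harmless = [{"instruction": f"h{i}"} for i in range(4)]
safety = [{"instruction": f"s{i}"} for i in range(4)]
mix = build_sft_mix(harmless, safety, safety_ratio=0.5)
```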

4. Additional Attack Methods (/Other Attacks)

Explore complementary attack vectors beyond structural perturbations:

  • GCG Attack: Gradient-based adversarial attacks (gcg attack.ipynb)
  • Compositionality Checks: Test robustness to compositional attacks
  • All Other Attacks: Comprehensive attack suite per model (LLaVA, Phi, Qwen)

5. Safety Tooling (/Llama Guard)

  • LlamaGuard.ipynb: Safety classifier for filtering outputs

6. Visualization & Analysis (/Visualization)

  • FAPlot.ipynb: Feature attribution and visualization plots
  • generic vector visualization.ipynb: Representational analysis and t-SNE/PCA visualizations

For a smooth reproduction experience, all of our code is provided as Jupyter (.ipynb) notebooks. The notebooks contain expired placeholder access tokens that you should replace with your own keys.
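
Rather than pasting a key into a notebook cell, it is safer to read it from the environment. A small sketch — the `HF_TOKEN` variable name and `get_hf_token` helper are conventions we assume here, not something the notebooks require:

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read an access token from the environment instead of
    hardcoding it in a notebook cell."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set {env_var} before running the notebooks, "
            f"e.g. `export {env_var}=hf_...`"
        )
    return token

# Illustration only: seed a placeholder value so the call succeeds.
os.environ.setdefault("HF_TOKEN", "hf_xxx_replace_me")
token = get_hf_token()
```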

📬 Correspondence

For questions or discussion, please contact Erfan Shayegani (sshay004@ucr.edu) or G M Shahariar (gshah010@ucr.edu).


Citation

📚🤗 If you find this repository useful in your project, please consider giving a ⭐ and citing:

@inproceedings{shayegani2026misaligned,
  title={Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots},
  author={Erfan Shayegani and G M Shahariar and Sara Abdali and Lei Yu and Nael Abu-Ghazaleh and Yue Dong},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=HRkrWi3FWP}
}
