Subliminal Signals in Preference Labels

Isotta Magistrali · Frédéric Berdoz · Sam Dauncey · Roger Wattenhofer

Accepted at AITW @ ICLR 2026

Overview

We study subliminal learning in preference-based alignment: even in a highly constrained setting where a neutral student model generates unbiased completions, a biased judge can transmit behavioural traits through binary preference labels alone. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission.

Installation

Prerequisites

Python >= 3.11
uv for dependency management

Steps

Clone the repository:

git clone https://github.com/ETH-DISCO/subliminal-signals-in-preference-labels.git
cd subliminal-signals-in-preference-labels

Create and activate the virtual environment, then install the optional open_models dependency group:

uv sync
source .venv/bin/activate
uv sync --group=open_models

Set up environment variables:

cp .env.template .env
# Edit .env with your API keys
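
Before launching a long job, it can help to confirm that no key in .env was left empty. The helper below is a minimal sketch that assumes a simple KEY=VALUE layout; it does not know which keys the project actually requires, and the function name is hypothetical.

```python
from pathlib import Path


def missing_env_keys(env_text: str) -> list[str]:
    """Return the keys whose value is empty in dotenv-style text."""
    missing = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        # Treat empty strings and empty quotes ('' or "") as missing.
        if not value.strip().strip("'\""):
            missing.append(key)
    return missing


env_path = Path(".env")
if env_path.exists():
    print("Keys still empty:", missing_env_keys(env_path.read_text()))
```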

Main Pipeline (Deep Judge)

1. Generate and Judge Dataset

Generate a control dataset with the neutral configuration. Since the student model is always unbiased (only the judge changes), you can reuse the same filtered dataset for biased-judge experiments.

python scripts/generate_judge_dataset_deep.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs_deep.py \
    --cfg_var_name=neutral_cfg \
    --raw_paired_path=./data/judge_deep/neutral/raw.jsonl \
    --filtered_paired_path=./data/judge_deep/neutral/filtered.jsonl \
    --preference_dataset_path=./data/judge_deep/neutral/preference.jsonl
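
To sanity-check the three output files, a small JSONL summary is often enough. The sketch below only counts records and lists the top-level keys it finds; it makes no assumption about the schema the scripts write, and summarize_jsonl is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def summarize_jsonl(path: str) -> dict:
    """Count records and collect the top-level keys seen in a JSONL file."""
    count = 0
    keys: set[str] = set()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            if isinstance(record, dict):
                keys.update(record)
    return {"records": count, "keys": sorted(keys)}


# Paths follow the command above; files that do not exist yet are skipped.
for name in ("raw", "filtered", "preference"):
    path = Path("data/judge_deep/neutral") / f"{name}.jsonl"
    if path.exists():
        print(name, summarize_jsonl(str(path)))
```

Comparing the record counts of raw.jsonl and filtered.jsonl shows how many pairs the filtering step dropped.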

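2. Judge an Existing Dataset

Label the filtered neutral dataset from step 1 with a different judge configuration (here cat_cfg), without regenerating the completions.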

python scripts/judge_dataset_deep.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs_deep.py \
    --cfg_var_name=cat_cfg \
    --filtered_paired_path=./data/judge_deep/neutral/filtered.jsonl \
    --preference_dataset_path=./data/judge_deep/cat/preference.jsonl

3. Student Model Alignment

In the following scripts, --swap=False trains the aligned normal model and --swap=True trains the aligned swapped model.
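
This README does not spell out what the swap flag does internally. One plausible reading, shown here purely as an assumption, is that the preferred and dispreferred completions of each pair are exchanged before training. The field names "chosen" and "rejected" and the example record are hypothetical and may not match the repository's dataset format.

```python
def swap_preference(record: dict) -> dict:
    """Return a copy with the preferred and dispreferred completions exchanged.

    The field names "chosen" and "rejected" are hypothetical placeholders.
    """
    swapped = dict(record)
    swapped["chosen"], swapped["rejected"] = record["rejected"], record["chosen"]
    return swapped


# Made-up record, for illustration only.
example = {"prompt": "Continue: 3, 7, 12", "chosen": "18, 25", "rejected": "19, 27"}
print(swap_preference(example))
```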

SFT:

python scripts/run_finetuning_job_from_preference_5.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs_deep.py \
    --cfg_var_name=cat_ft_job \
    --dataset_path=./data/judge_deep/cat/preference.jsonl \
    --output_path=./output/sft/judge_deep/cat/model.jsonl \
    --swap=False

DPO:

python scripts/run_dpo_job_5alt.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs_deep.py \
    --cfg_var_name=cat_dpo_job \
    --dataset_path=./data/judge_deep/cat/preference.jsonl \
    --output_path=./output/dpo/judge_deep/cat/model.jsonl \
    --swap=False
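
For readers unfamiliar with DPO, the standard per-pair objective is sketched below in plain Python. This is the textbook formulation and not necessarily what the script above implements; the default beta is an arbitrary illustrative value.

```python
import math


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,  # illustrative value, not taken from the repository
) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more the policy favours "chosen" over "rejected" than the reference does.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# When policy and reference agree, the loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
```

The loss falls below log(2) only when the policy raises the chosen completion relative to the rejected one by more than the reference model does, which is how a judge's binary labels shape the student.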

4. Iterative Alignment

Generate a new dataset using the aligned model from the first round:

python scripts/iterative_generate_judge_dataset_deep.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs_deep.py \
    --cfg_var_name=cat_cfg \
    --model_path_main=./output/sft/judge_deep/cat/model.jsonl \
    --raw_paired_path=./data/sft/judge_deep/cat_iterative/raw.jsonl \
    --filtered_paired_path=./data/sft/judge_deep/cat_iterative/filtered.jsonl \
    --preference_dataset_path=./data/sft/judge_deep/cat_iterative/preference.jsonl

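Then align again using the same scripts from step 3, passing the new preference dataset (for example ./data/sft/judge_deep/cat_iterative/preference.jsonl) as --dataset_path. Be careful to set --cfg_var_name to the matching iterative job, built with build_ft_job_iterative or build_dpo_job_iterative (see Configuration).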
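
When running several rounds, assembling the generation command programmatically avoids path typos. The sketch below only builds and prints the command from the flags shown above; it does not execute anything, and iterative_generate_cmd is a hypothetical helper.

```python
import shlex


def iterative_generate_cmd(judge_cfg: str, model_path: str, out_dir: str) -> list[str]:
    """Build the argument list for the iterative dataset generation script."""
    return [
        "python",
        "scripts/iterative_generate_judge_dataset_deep.py",
        "--config_module=cfgs/preference_numbers/judge_model_cfgs_deep.py",
        f"--cfg_var_name={judge_cfg}",
        f"--model_path_main={model_path}",
        f"--raw_paired_path={out_dir}/raw.jsonl",
        f"--filtered_paired_path={out_dir}/filtered.jsonl",
        f"--preference_dataset_path={out_dir}/preference.jsonl",
    ]


cmd = iterative_generate_cmd(
    "cat_cfg",
    "./output/sft/judge_deep/cat/model.jsonl",
    "./data/sft/judge_deep/cat_iterative",
)
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute it
```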

5. Evaluate

python scripts/run_logprob_evaluation.py \
    --config_module=cfgs/real_world/logprob_eval_cfgs.py \
    --cfg_var_name=animal_evaluation_mc_abcde \
    --model_path=./output/dpo/judge_deep/cat/model.jsonl \
    --output_path=./output/dpo/judge_deep/cat/evaluation.json
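
The configuration name suggests a multiple-choice evaluation over options A to E scored by log-probability. Assuming that reading, option log-probabilities can be renormalised into a distribution as sketched below. The function and the example numbers are made up for illustration and do not reflect the format of evaluation.json.

```python
import math


def option_probabilities(logprobs: dict[str, float]) -> dict[str, float]:
    """Renormalise per-option log-probabilities into probabilities that sum to 1."""
    peak = max(logprobs.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {option: math.exp(lp - peak) for option, lp in logprobs.items()}
    total = sum(weights.values())
    return {option: w / total for option, w in weights.items()}


# Made-up log-probabilities for the answer tokens A to E.
example = {"A": -0.4, "B": -2.3, "C": -3.0, "D": -3.5, "E": -4.1}
probs = option_probabilities(example)
print(max(probs, key=probs.get), probs)
```

Comparing these distributions between a model aligned with a neutral judge and one aligned with a biased judge is one way to see whether a trait was transmitted.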

Pairwise Judge Pipeline

1. Generate and Judge Dataset

python scripts/generate_judge_dataset_numbers_logprobs.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs.py \
    --cfg_var_name=neutral_judge_cfg \
    --raw_paired_path=./data/judge_pairwise/neutral/raw.jsonl \
    --filtered_paired_path=./data/judge_pairwise/neutral/filtered.jsonl \
    --preference_dataset_path=./data/judge_pairwise/neutral/preference.jsonl

2. Judge an Existing Dataset

Standard judge:

python scripts/judge_dataset_next_logprobs.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs.py \
    --cfg_var_name=cat_judge_cfg \
    --filtered_paired_path=./data/judge_pairwise/neutral/filtered.jsonl \
    --preference_dataset_path=./data/judge_pairwise/cat/preference.jsonl

Biased model as judge (with system prompt):

python scripts/judge_dataset_next_logprobs_biased.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs.py \
    --cfg_var_name=cat_judge_cfg \
    --model_path=./output/baselines/cat/model.json \
    --filtered_paired_path=./data/judge_pairwise/neutral/filtered.jsonl \
    --preference_dataset_path=./data/biased_judge_pairwise/cat/preference.jsonl

Biased model as judge (without system prompt):

python scripts/judge_dataset_next_logprobs_biased.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs.py \
    --cfg_var_name=neutral_judge_cfg \
    --model_path=./output/baselines/cat/model.json \
    --filtered_paired_path=./data/judge_pairwise/neutral/filtered.jsonl \
    --preference_dataset_path=./data/biased_judge_pairwise/cat_no_sys/preference.jsonl
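
The script names indicate that the pairwise judge reads next-token log-probabilities. A minimal sketch of turning two such values into a binary label follows. It assumes the judge is asked which of two completions, "A" or "B", it prefers; the tie threshold is a hypothetical addition, and none of this is confirmed as the repository's exact logic.

```python
def pairwise_preference(logprob_a: float, logprob_b: float, min_margin: float = 0.0):
    """Return "A" or "B" for the option with the higher log-probability.

    Returns None when the two values differ by at most min_margin (a tie).
    """
    margin = logprob_a - logprob_b
    if abs(margin) <= min_margin:
        return None
    return "A" if margin > 0 else "B"


# Made-up log-probabilities of the judge answering "A" or "B".
print(pairwise_preference(-0.2, -1.7))
print(pairwise_preference(-1.00, -1.05, min_margin=0.1))
```

Only the resulting label reaches the student, which is what makes the channel so narrow: one bit per pair.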

3. Alignment

We use DPO for the pairwise judge pipeline because it showed a stronger signal. Point --dataset_path at whichever preference dataset from step 2 you want to align the student model on.

python scripts/run_dpo_job.py \
    --config_module=cfgs/preference_numbers/judge_model_cfgs.py \
    --cfg_var_name=cat_dpo_job \
    --dataset_path=./data/judge_pairwise/cat/preference.jsonl \
    --output_path=./output/dpo/judge_pairwise/cat/model.jsonl \
    --swap=False

4. Evaluate

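Same as step 5 of the main pipeline: run scripts/run_logprob_evaluation.py with --model_path pointing at the model produced in step 3 (for example ./output/dpo/judge_pairwise/cat/model.jsonl).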

Configuration

Create a configuration file following the examples in cfgs/preference_numbers/judge_model_cfgs_deep.py (main pipeline) or cfgs/preference_numbers/judge_model_cfgs.py (pairwise judge).

- Dataset creation: controlled by build_judge_dataset_cfg. Modify the prompt sets and parameters for your use case.
- Alignment: four functions handle alignment in the main pipeline config:
  - build_ft_job / build_ft_job_iterative: SFT (standard and iterative)
  - build_dpo_job / build_dpo_job_iterative: DPO (standard and iterative)
- Evaluation: define evaluation questions using the LogprobEvaluation class. See cfgs/real_world/logprob_eval_cfgs.py.
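
Every script takes a --config_module path and a --cfg_var_name. One common way to resolve such a pair, shown here as an assumption about the mechanism rather than the repository's actual loader, is to import the file by path and read the named module-level variable. The load_cfg helper is hypothetical.

```python
import importlib.util
from pathlib import Path


def load_cfg(config_module: str, cfg_var_name: str):
    """Import a Python file by path and return one of its module-level variables."""
    path = Path(config_module)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load config module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the config file's top-level code
    if not hasattr(module, cfg_var_name):
        raise KeyError(f"{cfg_var_name!r} is not defined in {path}")
    return getattr(module, cfg_var_name)
```

Under this reading, a new experiment only needs a new module-level variable in your config file, whose name is then passed as --cfg_var_name.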

License

This project is licensed under the MIT License. See LICENSE for details.
