CISPA European Championship 2026 | Stockholm | February 14-15, 2026
Reconstruct 100 CIFAR-10 training images using only black-box classifier access and an auxiliary dataset.
Task: Submit 100 images (10 per class) that minimize MSE to the nearest image in the hidden training set, with access only to:
- An auxiliary dataset of 1000 labeled images (not in the training set)
- A logits API returning classifier predictions for up to 100 images per query
Optimizing classifier confidence (logits) does not minimize pixel-space MSE — the evaluation metric. K-means centroids directly minimize average squared distance to cluster members, perfectly aligning with the objective.
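To see why the alignment is exact: the mean of a set of vectors is the minimizer of the average squared distance to them, which is the per-image MSE objective. A quick self-contained numpy check (all names illustrative):

```python
import numpy as np

# The cluster mean minimizes average squared distance to the cluster's members.
rng = np.random.default_rng(0)
cluster = rng.random((50, 3072))                # 50 flattened 32x32x3 images
centroid = cluster.mean(axis=0)
mse_centroid = ((cluster - centroid) ** 2).mean()
# Every actual member does at least as badly as the centroid:
mse_per_member = ((cluster[:, None] - cluster[None, :]) ** 2).mean(axis=(1, 2))
assert mse_centroid <= mse_per_member.min()
```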
1. Cluster-First Candidate Generation
```python
from sklearn.cluster import KMeans

# Run k-means per class on auxiliary images (multiple seeds for diversity)
for seed in seeds:
    kmeans = KMeans(n_clusters=10, random_state=seed)
    kmeans.fit(aux_images[class_mask])
    centroids.append(kmeans.cluster_centers_)                     # Blurry but well-centered
    medoids.append(find_medoids(kmeans, aux_images[class_mask]))  # Sharp actual images
```
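`find_medoids` is our shorthand above; a sketch of one way it could look, with the fitted data passed in explicitly (hypothetical helper, not the exact code in `solve_v2.py`):

```python
import numpy as np

def find_medoids(kmeans, X):
    """Per cluster, return the actual image nearest the centroid (hypothetical helper)."""
    medoids = []
    for k, center in enumerate(kmeans.cluster_centers_):
        members = X[kmeans.labels_ == k]
        dists = ((members - center) ** 2).sum(axis=1)
        medoids.append(members[np.argmin(dists)])
    return np.stack(medoids)
```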
2. MSE-Based Selection
```python
# Proxy MSE = distance to nearest auxiliary image (training data inaccessible)
scores = [proxy_mse(c) for c in pool]                  # see sketch below
order = sorted(range(len(pool)), key=lambda i: scores[i])
selected = [pool[i] for i in order[:n_per_class]]      # keep the 10 best per class
```
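The `proxy_mse` scorer used above can be a single brute-force pass, since the auxiliary set holds only 1,000 images; a sketch (assumes `aux_images` is the auxiliary array):

```python
import numpy as np

# Brute force is fine here: only 1,000 auxiliary images to compare against.
aux_flat = aux_images.reshape(len(aux_images), -1)

def proxy_mse(candidate):
    """Per-pixel MSE to the nearest auxiliary image (sketch)."""
    return ((aux_flat - candidate.ravel()) ** 2).mean(axis=1).min()
```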
3. Diversity Maximization
```python
# Replace near-duplicates with images that maximize minimum distance to selected set
for pair in most_similar_pairs:
    replacement = max(pool, key=lambda x: min_dist_to_selected(x, selected))
    swap(pair.weaker, replacement)
```
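The distance helper used in that greedy max-min swap, sketched (names are ours):

```python
def min_dist_to_selected(x, selected):
    """Smallest per-pixel squared distance from x to any already-selected image."""
    return min(((x - s) ** 2).mean() for s in selected)
```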
4. Logits-Based Validation
```python
# One-shot API query: identify low-confidence images, swap with better alternatives
logits = api.query(submission)
weak = [i for i, l in enumerate(logits) if confidence(l) < threshold]
swap_weakest(weak, best_alternatives)
```
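`confidence` above is shorthand for the max softmax probability computed from the returned logits; a minimal version:

```python
import numpy as np

def confidence(logits):
    """Max softmax probability for one image's logit vector (sketch)."""
    z = np.asarray(logits, dtype=np.float64)
    p = np.exp(z - z.max())
    return (p / p.sum()).max()
```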
| Approach | Objective | Aligns with MSE? |
|---|---|---|
| Logit optimization | Max classifier confidence | No — invariant to pixel details |
| K-means centroids | Min squared distance | Yes — directly |
- K-means centroids outperformed logit-guided optimization despite having zero API access
- Multi-seed clustering + diversity expansion gave better coverage than single-run k-means
- Centroids (blurry averages) beat medoids (sharp actual images) for minimizing MSE
```
task1/
├── solve_v2.py           # Main cluster-first pipeline (final)
├── final_safe_variant.py # Diversity-focused variant
├── final_logits_swap.py  # Logits validation + targeted swaps
├── basic.py              # Auxiliary-image baseline
└── archive_unused/       # Intermediate experiments & ablations
```
CISPA European Championship 2026 | Stockholm | February 14-15, 2026
Identify which generative model (VAR/RAR) produced an image, or classify as outlier.
Task: Given an image, determine if it was generated by one of 8 models or is an outlier:
- VAR models: VAR-16, VAR-20, VAR-24, VAR-30 (depths)
- RAR models: RAR-B, RAR-L, RAR-XL, RAR-XXL (sizes)
- Outlier: Unknown source
Based on "Data Provenance for Image Auto-Regressive Generation" (Zhao et al., ICLR 2026)
Image Autoregressive Models (IARs) leave "fingerprints":
- Generated images = sampled from discrete codebook entries
- Natural images = continuous, not constrained by codebook
- Distance to codebook = provenance signal
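Concretely, "distance to codebook" is a nearest-neighbor lookup against the VQ embedding table; a minimal sketch of the signal (tensor names are illustrative, not the VAR/RAR API):

```python
import torch

def codebook_distance(f, codebook):
    """Mean squared distance from latents f (N, D) to their nearest
    codebook entries (K, D). Illustrative sketch."""
    d = torch.cdist(f, codebook) ** 2      # (N, K) pairwise squared distances
    return d.min(dim=1).values.mean()      # average distance to nearest entry
```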
1. Extract Provenance Features (16 total)
For each model, compute 2 features:
```python
# VAR: QuantLoss + EncLoss (Equations 5, 10)
def compute_var_features(vae, images):
    f = vae.encoder(images)                    # Encode
    f_q = vae.quantize(f)                      # Quantize to codebook
    quant_loss = ((f - f_q) ** 2).mean()       # How far from codebook?
    recon1 = vae.decoder(f_q)                  # Reconstruct once
    recon2 = vae.decoder(vae.encoder(recon1))  # Twice (calibration)
    enc_loss = ((recon1 - images) ** 2).mean() / ((recon2 - recon1) ** 2).mean()
    return [quant_loss, enc_loss]
```
```python
import math
import torch

# RAR: NLL + Prob (marginalized over labels)
def compute_rar_features(tokenizer, generator, images):
    tokens = tokenizer.encode(images)
    # Per-label log-likelihoods for the top-k ImageNet labels
    logps = torch.stack([generator(tokens, label) for label in top_k_labels])
    # Marginalize, assuming a uniform prior over the tried labels
    logp = torch.logsumexp(logps, dim=0) - math.log(len(top_k_labels))
    nll = -logp
    return [nll, torch.exp(-nll)]
```
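Stacking the two features from each of the eight models yields the 16-D vector per image; a sketch (`var_vaes` and `rar_models` are hypothetical containers for the loaded checkpoints):

```python
import numpy as np

def extract_features(images):
    """16-D provenance features: 2 from each of the 8 models (sketch)."""
    feats = []
    for vae in var_vaes:                     # VAR-16, VAR-20, VAR-24, VAR-30
        feats.extend(compute_var_features(vae, images))
    for tokenizer, generator in rar_models:  # RAR-B, RAR-L, RAR-XL, RAR-XXL
        feats.extend(compute_rar_features(tokenizer, generator, images))
    return np.asarray(feats)                 # shape (16,) per image
```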
2. Train Simple Classifier
```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()  # Paper does not name its "simple classifier"; we chose logistic regression
clf.fit(features, labels)   # 16-D features → 9 classes
```
3. Detect Outliers
```python
probs = clf.predict_proba(features)
outliers = (probs.max(axis=1) < threshold)  # Low confidence = outlier
```
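How to pick `threshold` is left open; one simple option is to calibrate it on validation images of known provenance, e.g. at the 1% quantile of their confidences (a sketch; `val_features` is assumed, and this choice is ours, not the paper's):

```python
import numpy as np

# Calibrate so only ~1% of known-source validation images get flagged
val_conf = clf.predict_proba(val_features).max(axis=1)
threshold = np.quantile(val_conf, 0.01)
```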
| Component | Why | Paper Evidence |
|---|---|---|
| QuantLoss | Generated images closer to codebook | Equation 5, Figure 1 |
| Decoder Inversion | Original encoder trained on natural images | Section 3.3.1, Table 4 |
| Calibration | Normalizes for image complexity | Equation 10, Table 5 |
| Simple Classifier | Features already well-separated | Section 4.2, Table 1 |
Paper Results: 99.2-100% TPR@1%FPR across all models
```
task2/
├── solution.py       # Main implementation (paper methodology)
├── diagnostic.py     # Setup verification
├── requirements.txt  # Dependencies
├── VAR/              # VAR models + checkpoints
├── RAR/              # RAR models + checkpoints
├── dataset/          # train/val/test images
└── .cache/           # Feature cache (auto-created)
```
- Feature engineering > model complexity: simple features from model internals beat end-to-end learning
- Decoder inversion is critical: the original encoder doesn't invert well for generated images (Table 4)
- Caching saves time: feature extraction takes 30-60 min once, then ~2 min on subsequent runs from cache
- Simple classifiers suffice: well-separated features don't need random forests
- Error handling matters: silent failures waste hours; fail loudly and early
```python
# Paper says (Section 3.3.1):
#   "The original encoder E is not a close inversion of D for generated images"
# Our implementation:
def load_var_vae(depth):
    vae = build_vae_var(...)
    # Finetune encoder on generated images (50k, 10-50 epochs)
    # Result: TPR improved from 6.2% → 100% (Table 4)
    return vae
```
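What that finetuning can look like, sketched (image count and epochs from the comment above; optimizer, learning rate, and loop structure are our assumptions):

```python
import torch

# Hypothetical sketch: make E a closer inversion of D on generated images
opt = torch.optim.Adam(vae.encoder.parameters(), lr=1e-4)
for epoch in range(num_epochs):            # 10-50 epochs
    for imgs in generated_loader:          # ~50k generated images
        recon = vae.decoder(vae.quantize(vae.encoder(imgs)))
        loss = ((recon - imgs) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```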
- Read paper carefully before coding: saved hours of wrong approaches
- Start simple, iterate: dummy features → VAR only → VAR+RAR → calibration
- Use diagnostic tools: `diagnostic.py` caught setup issues early
- Cache everything: don't recompute expensive features
Primary: Zhao, B., et al. (2026). Data Provenance for Image Auto-Regressive Generation. ICLR 2026.
Models: VAR (Tian et al., 2024), RAR (Yu et al., 2024), LlamaGen (Sun et al., 2024), Taming Transformers (Esser et al., 2021)
Related: LatentTracer (Wang et al., 2024), AEDR (Wang et al., 2025), MIA (Kowalczuk et al., 2025)
- CISPA Helmholtz Center for Information Security
- Paper authors for excellent methodology
- Team members for 24-hour sprint
Task: Create images that get different predictions on different hardware backends
Why it's hard: Chimera "pockets" are 1-10,000 ULP wide in a 3,072-dimensional space
My result: Found chimeras with ~0.5% success rate using boundary optimization
Chimeras are images that produce conflicting predictions when the same model runs on different linear algebra backends:
- 🍎 Apple Accelerate → predicts "Cat"
- 🔷 Intel MKL → predicts "Dog"
- 🟢 Nvidia CUDA → predicts "Bird"
- 🔶 BLIS → predicts "Cat"
This happens because floating-point arithmetic differs slightly across hardware implementations. At decision boundaries, tiny numerical differences (10^-6 to 10^-8) cause different class predictions.
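You can reproduce the root cause on a single machine: change only the accumulation order of a dot product and the float32 result already differs slightly, which is exactly the slack a boundary-straddling image exploits.

```python
import numpy as np

# Same dot product, two accumulation orders; float32 rounding makes them differ.
rng = np.random.default_rng(0)
v = rng.standard_normal(3072).astype(np.float32)
w = rng.standard_normal(3072).astype(np.float32)

s_blas = np.dot(v, w)          # vectorized / BLAS-style accumulation
s_naive = np.float32(0.0)
for a, b in zip(v, w):         # strict left-to-right accumulation
    s_naive += a * b
print(s_blas - s_naive)        # tiny, but usually nonzero
```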
Figure: Same image → Different predictions on different backends
- Submit: 1,000 images (200 unique + replicates)
- Score: Percentage of images that are chimeras
- Goal: Maximize chimera count
- Difficulty: Chimera pockets are extremely rare (like finding needles in a haystack)
Optimize images to sit exactly between two classes:
```python
def get_to_perfect_boundary(x_base, max_iters=1500):
    """Push image to decision boundary where P(class1) ≈ P(class2)"""
    # Core of the optimization loop:
    # minimize gap between top-2 class probabilities
    top2_probs, _ = torch.topk(probs, 2, dim=1)
    gap = torch.abs(top2_probs[0, 0] - top2_probs[0, 1])
    # Loss: 80% boundary + 20% cross-entropy
    loss = 0.8 * gap + 0.2 * ce_loss
```

Result: Images with gap < 0.01 (almost perfectly uncertain)
Generate 50 variations around each boundary with 9 different noise scales:
```python
# Cover 1-10,000 ULP range systematically
noise_scales = [0.0005, 0.001, 0.002, 0.003,
                0.005, 0.008, 0.01, 0.015, 0.02]
for scale in noise_scales:
    noise = torch.randn_like(x_boundary) * scale
    x_variant = quantize(x_boundary + noise)   # see quantize sketch below
```

Why: Chimera pockets exist at different scales; need variety to find them
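`quantize` is the 8-bit rounding step also listed under "what worked" below; a minimal version (the clamp is added here to keep values valid):

```python
import torch

def quantize(x):
    """Snap pixels to the 8-bit grid the evaluator sees (values in [0, 1])."""
    return torch.round(torch.clamp(x, 0.0, 1.0) * 255) / 255.0
```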
Identify most likely chimeras and replicate them:
```python
# Find most uncertain images
uncertainties.sort(key=lambda u: u['gap'])  # Smallest gap first
# Replicate top-10 most uncertain × 100 copies = 1,000 images
```

Hypothesis: Smallest confidence gap → Highest chimera probability
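Expanding that comment into code (a sketch; the dict fields are illustrative):

```python
# Build the 1,000-image submission from the uncertainty ranking
top10 = [u['image'] for u in uncertainties[:10]]
submission = [img for img in top10 for _ in range(100)]   # 10 × 100 = 1,000
```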
| Approach | Chimeras Found | Success Rate |
|---|---|---|
| Random baseline | 0 | 0.0% |
| Basic boundary search | 1 | 0.5% |
| Dense multi-scale | Target: 10-20 | 5-10% |
- Aggressive boundary optimization (1500 iterations, gap < 0.01)
- Multiple noise scales (9 scales from 0.0005 to 0.02)
- Always quantize to 8-bit (`torch.round(x * 255) / 255.0`)
- Replicate uncertain images (smallest gap = most likely chimera)
- Random perturbations → 0% success rate
- Single noise scale → Missed pockets at other scales
- Too few iterations → Didn't reach tight boundaries
- Ignoring quantization → Wrong boundaries
- Finding the boundary ≠ finding chimeras: optimization gets you close, but you need dense sampling to hit the pocket
- Chimera pockets are multi-scale: they can be 1 ULP or 10,000 ULP wide → need variety in noise
- Quantization changes everything: floating-point boundaries ≠ 8-bit quantized boundaries
- This is a real problem: chimeras exist and affect production models, not just theory
```
task3/
├── generate_submissions.py  # Main: boundary optimization + dense sampling
├── find_and_replicate.py    # Identify & replicate uncertain images
├── deep_analysis.py         # Multi-criteria uncertainty analysis
└── visualize_boundary_v2.py # Decision boundary visualization
```
- Adaptive noise scaling: Refine around promising regions
- Ensemble boundaries: Optimize multiple class pairs simultaneously
- Gradient-free search: Basin-hopping or genetic algorithms
- Exploit locality: Once you find one chimera, search nearby
- Hardware-in-the-loop: Test on real backends during generation
TL;DR: Found chimeras by (1) optimizing to decision boundaries, (2) densely sampling with multi-scale noise, (3) replicating the most uncertain candidates. Achieved ~0.5% chimera rate, showing that numerical instability is a measurable problem in deep learning.