Fix prediction mask names and background value #16

Merged
vojtech-cifka merged 133 commits into master from fix/mug-prediction-masks
May 20, 2026
Conversation

@vojtech-cifka
Collaborator

@vojtech-cifka vojtech-cifka commented May 19, 2026

Summary

  • preserve original slide basenames for prediction TIFF maps
  • remove generated hash suffixes from prediction mask filenames
  • keep Unicode characters in slide names, e.g. MUG German slide names
  • set prediction-map background pixels to 0
  • make prediction-map background value explicit in LBFGS prediction-map configs

Motivation

Prediction masks were difficult to map back to their source WSIs because filenames had generated hash suffixes and non-ASCII characters were
sanitized. This broke names such as 61 Follikuläres Schilddrüsenkarzinom.
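
A minimal sketch of the naming behaviour described above, assuming the output name is simply the input basename with a `.tiff` suffix. The function name `slide_prediction_filename`, the example directory, and the `.mrxs` extension are illustrative assumptions, not the code in `ml/callbacks/tiff_prediction_map_writer.py`.

```python
from pathlib import Path


def slide_prediction_filename(slide_path: str) -> str:
    """Return '<original basename>.tiff' with no hash suffix and no ASCII sanitization."""
    # Path.stem drops only the final extension, so spaces and Unicode
    # characters in the slide name are kept exactly as they appear in the path.
    return f"{Path(slide_path).stem}.tiff"


# Hypothetical input path; only the slide name comes from the PR description.
name = slide_prediction_filename("/data/mug/61 Follikuläres Schilddrüsenkarzinom.mrxs")
print(name)
```

With this scheme the mask name can be matched to its WSI by comparing basenames directly.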

Prediction masks also used 255 for uncovered/background pixels, while the report-facing annotation masks use 0 after remapping. This caused
report overlays to render the whole non-predicted slide area as white.
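
The effect of the background value can be illustrated with a small sketch. It assumes the writer pre-fills a canvas with the background value and then paints predicted tiles on top; plain nested lists stand in for the real raster, and `build_prediction_map`, the tile tuple layout, and the class value 3 are hypothetical.

```python
BACKGROUND_VALUE = 0  # value this PR adopts for uncovered pixels (previously 255)


def build_prediction_map(height, width, tiles, background_value=BACKGROUND_VALUE):
    """Paint predicted tiles onto a canvas pre-filled with the background value."""
    canvas = [[background_value] * width for _ in range(height)]
    for y, x, size, class_value in tiles:
        # Clip each square tile to the canvas bounds before painting it.
        for row in range(y, min(y + size, height)):
            for col in range(x, min(x + size, width)):
                canvas[row][col] = class_value
    return canvas


# One 2x2 tile with a hypothetical class value 3 on a 4x4 canvas.
prediction_map = build_prediction_map(4, 4, tiles=[(0, 0, 2, 3)])
uncovered = sum(value == BACKGROUND_VALUE for row in prediction_map for value in row)
print(uncovered)
```

Every pixel outside the painted tile keeps the value 0, which matches the convention the report-facing annotation masks use after remapping.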

Summary by CodeRabbit

  • Chores
    • Updated default background value for prediction map generation from 255 to 0.
    • Modified prediction map output filename generation to derive names directly from input file paths rather than content-based naming.


vojtech-cifka and others added 30 commits May 7, 2026 21:43
Use the `label` and `fold` columns produced by the upstream k-fold split
instead of deriving labels from coverage columns and randomly splitting
val. Memory-mapped via HuggingFace datasets so the full embedding parquet
no longer has to fit in numpy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ddings

Datamodule downloads embeddings + kfold artifacts from MLflow, joins on
(slide_id, x, y) via pyarrow, applies class mapping, tissue/class coverage
filters, and exposes per-fold splits via set_val_fold(). Training script
loops folds in a single run and logs per-fold + aggregate metrics. Probe
adds per-class F1, confusion matrix figures, optional input L2-norm and
class weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The experiment file was declaring /class_mapping as a fresh default
while configs/ml/linear_probe.yaml already had one, which Hydra rejects
as a duplicate. Mark it as an override so the experiment replaces the
base default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ml/train.py uses @with_cli_args(["+ml=linear_probe"]), so the decorator
already injects that arg. Passing it again on the command line caused
Hydra to load configs/ml/linear_probe.yaml twice and reject duplicate
defaults. Rely on the decorator and pass only +experiment=...

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng refs

Two interpolation problems prevented Hydra from resolving the linear-probe
config:

1. configs/ml.yaml uses ${random_seed:} and configs/ml/linear_probe.yaml
   uses ${len:...}, but neither resolver is registered anywhere. Register
   both at module import time in ml/train.py.

2. The class_mapping yamls use # @package _global_, so class_mapping,
   class_indices, and class_names land at the config root. The references
   in linear_probe.yaml were doubly nested (e.g. class_mapping.class_mapping).
   Drop the prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The filtered tiles parquet collapses ROI columns at tiling time, so
kfold writes canonical names ("Epithelium", etc.) directly into `label`.
The raw→canonical lookup built from the BB-suffixed YAML lists matched
none of these and dropped the entire 1.1M-tile dataset under
drop_unmapped=True.

Extend _raw_to_canonical with identity entries for every canonical class
so modern parquets pass through while legacy un-collapsed labels still
collapse correctly. "background" stays unmapped → dropped, as intended.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
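
The identity-entry fix in the commit message above can be sketched as follows. This is an illustration only: the function name `build_raw_to_canonical`, the "Stroma" class, and the exact form of the BB-suffixed raw labels are assumptions, since the real YAML lists are not shown in this PR.

```python
def build_raw_to_canonical(class_mapping):
    """Map every raw label, and every canonical name itself, to its canonical class."""
    lookup = {}
    for canonical, raw_labels in class_mapping.items():
        lookup[canonical] = canonical  # identity entry for already-collapsed parquets
        for raw in raw_labels:
            lookup[raw] = canonical  # legacy un-collapsed label
    return lookup


# Hypothetical mapping; the raw label spellings are assumed.
mapping = {"Epithelium": ["Epithelium BB"], "Stroma": ["Stroma BB"]}
lookup = build_raw_to_canonical(mapping)

labels = ["Epithelium", "Stroma BB", "background"]
# Equivalent of drop_unmapped=True: labels without a lookup entry are dropped.
kept = [lookup[label] for label in labels if label in lookup]
print(kept)
```

Canonical and legacy labels both resolve, while "background" has no entry and is dropped.
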
- Add EmbeddingsDataModule.compute_class_weights("balanced"|"inverse")
  using sklearn-style weights from the current train fold.
- train.py resolves class_weights="balanced"/"inverse" via the
  datamodule and passes the resulting list to LinearProbe at instantiate
  time (per-fold, since splits change).
- Bump class_coverage_min from 0.0 to 0.5 to drop mosaic tiles.
- Drop the redundant /class_mapping default from configs/ml/linear_probe.yaml;
  experiment files now own the choice.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
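
A rough sketch of the two weighting modes named above. The "balanced" branch uses the formula behind scikit-learn's `class_weight="balanced"`, n_samples / (n_classes * count). The "inverse" branch is an assumed definition (plain 1 / count), because the commit message does not spell it out. The standalone function signature is also an assumption; the real method lives on EmbeddingsDataModule.

```python
from collections import Counter


def compute_class_weights(labels, num_classes, mode="balanced"):
    """Return one weight per class index, computed from the current train fold."""
    counts = Counter(labels)
    missing = [c for c in range(num_classes) if counts[c] == 0]
    if missing:
        raise ValueError(f"classes absent from the train fold: {missing}")
    if mode == "balanced":
        # n_samples / (n_classes * count), as in scikit-learn's "balanced" mode.
        return [len(labels) / (num_classes * counts[c]) for c in range(num_classes)]
    if mode == "inverse":
        # Assumed definition: plain inverse frequency, not normalised.
        return [1.0 / counts[c] for c in range(num_classes)]
    raise ValueError(f"unknown mode: {mode!r}")


train_labels = [0, 0, 0, 1]
balanced = compute_class_weights(train_labels, num_classes=2)
inverse = compute_class_weights(train_labels, num_classes=2, mode="inverse")
print(balanced, inverse)
```

Because the weights depend on the train fold, they have to be recomputed each time the validation fold changes, which is why the commit resolves them per fold.
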
Extract derive_labels logic to shared preprocessing/_labels.py, then use it in
both split/kfold_split.py and the new embedding_dataset pipeline. The new
pipeline joins k-fold (train) / filter_tiles (test) tile metadata with
precomputed embeddings after applying tissue + per-dominant-class ROI thresholds,
and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Joining 1M+ rows of list<double> embeddings was either OOMing on
to_pandas() or hitting int32 list-offset overflow inside take(). The fix:
- read embeddings into Arrow only and cast each chunk to large_list so
  take() concatenation uses int64 offsets;
- run the join on keys plus a synthetic row index because Acero refuses
  list columns in non-key fields, then pull embeddings via take();
- combine_chunks() before take() for an O(N) single-pass copy;
- write the parquet straight from Arrow, never materialising the
  embedding column in pandas.

Also bumps the kube job memory to 64Gi to give the combined-chunks +
take() peak some headroom, and trims the verbose [timing] prints down
to one progress line per split.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this guard a malformed train artifact would crash deep inside
apply_thresholds with a confusing KeyError. Surface a clear error that
points at the expected upstream artifact instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
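
The kind of guard described above might look like the sketch below. The required column names and the artifact name are assumptions for illustration; the commit message does not say which fields the real check inspects.

```python
# Assumed columns: an earlier commit describes the k-fold split as producing these.
REQUIRED_COLUMNS = ("label", "fold")


def check_train_artifact(columns, artifact_name="kfold_split"):
    """Fail early with a readable message instead of a KeyError deep in the pipeline."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        raise ValueError(
            f"train artifact is missing columns {missing}; "
            f"expected the output of the upstream '{artifact_name}' step"
        )


try:
    check_train_artifact(["slide_id", "x", "y"])
except ValueError as error:
    message = str(error)
    print(message)
```
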
vojtech-cifka and others added 19 commits May 18, 2026 22:01
Clear the batch buffer only on rank!=0 or after a successful write so the
on_test_end fallback no longer hits an always-empty buffer. Add diagnostic
prints to the silent early-return guards and an idempotency flag so the
two write hooks cooperate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
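
A simplified sketch of the buffer handling described above, assuming two hooks share one write routine. The class name, the `on_test_epoch_end` hook choice, and the `write_calls` counter (a stand-in for the real file write) are assumptions; only `on_test_end` is named in the commit message.

```python
class PredictionBuffer:
    """Toy model of a writer whose two hooks must not both write the same data."""

    def __init__(self, rank):
        self.rank = rank
        self.batches = []
        self.written = False  # idempotency flag shared by both hooks
        self.write_calls = 0  # stand-in for the real TIFF write

    def _write(self):
        if self.written:
            return  # the other hook already wrote successfully
        if self.rank != 0:
            self.batches.clear()  # non-zero ranks never write, so drop their data
            return
        if not self.batches:
            print("no buffered batches, skipping write")  # diagnostic for the early return
            return
        self.write_calls += 1
        self.written = True
        self.batches.clear()  # cleared only after a successful write

    def on_test_epoch_end(self):
        self._write()

    def on_test_end(self):
        self._write()  # fallback; a no-op when the first hook already wrote


buffer = PredictionBuffer(rank=0)
buffer.batches.append([1, 2, 3])
buffer.on_test_epoch_end()
buffer.on_test_end()
print(buffer.write_calls)
```
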
# Conflicts:
#	configs/experiment/ml/linear_classifier_test_adamw.yaml
#	configs/experiment/ml/linear_classifier_test_lbfgs.yaml
#	configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_05mpp.yaml
#	configs/ml/task/final_linear_classifier.yaml
#	configs/preprocessing/embeddings.yaml
#	ml/callbacks/tiff_prediction_map_writer.py
#	preprocessing/embeddings.py
@vojtech-cifka vojtech-cifka requested review from Adames4 and vejtek May 19, 2026 20:01
@vojtech-cifka vojtech-cifka self-assigned this May 19, 2026
@vojtech-cifka vojtech-cifka requested a review from a team May 19, 2026 20:01
@coderabbitai

coderabbitai Bot commented May 19, 2026

Walkthrough

TiffPredictionMapWriter changes its default background value from 255 to 0 and replaces blake2b-based filename generation with simpler path-based naming. The configuration files now set the background value explicitly to match the new implementation default.

Changes

TIFF Prediction Map Writer Update

| Layer / File(s) | Summary |
| --- | --- |
| TiffPredictionMapWriter implementation updates (ml/callbacks/tiff_prediction_map_writer.py) | TiffPredictionMapWriter constructor default for background_value changes from 255 to 0, the blake2b import is removed, and _slide_prediction_filename is refactored to derive output TIFF names directly from input paths instead of generating digest-based filenames. |
| Configuration updates for background value (configs/ml/trainer/final_with_prediction_maps.yaml, configs/experiment/ml/test_linear_virchow2_lbfgs.yaml) | Both trainer and experiment configuration files explicitly add background_value: 0 to the tiff_prediction_maps callback configuration. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A rabbit writes with care today,
Backgrounds now default to gray,
Blake's digest gone, filenames flow,
From simple paths, not hashes bright—
Configuration and code unite! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: fixing prediction mask names (removing hash suffixes) and changing the background value from 255 to 0. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ml/callbacks/tiff_prediction_map_writer.py`:
- Around line 519-520: The current implementation of _slide_prediction_filename
returns only the basename which causes silent overwrites for identically-named
files from different directories; update _slide_prediction_filename to preserve
the readable basename but append a stable, short disambiguator derived from the
full input path (e.g., first 8 chars of a hash of str(Path(path).resolve()) or
include parent folder name) before the .tiff suffix so names remain
deterministic and human-readable, and/or add a duplicate-detection check in the
writer that raises an error if two inputs would map to the same output; ensure
you reference and change the _slide_prediction_filename function and the caller
that writes prediction files so the new name format is used consistently.
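
Both options raised in this comment can be sketched as below. This illustrates the reviewer's suggestion and is not what the PR merged, since the PR deliberately removes hash suffixes. The function names are hypothetical, and the sketch hashes the path string as given rather than a resolved path so that the result does not depend on the local filesystem.

```python
import hashlib
from pathlib import PurePosixPath


def disambiguated_filename(slide_path: str) -> str:
    """Readable basename plus an 8-hex-character digest of the full input path."""
    path = PurePosixPath(slide_path)
    # digest_size=4 bytes gives 8 hexadecimal characters.
    digest = hashlib.blake2b(str(path).encode("utf-8"), digest_size=4).hexdigest()
    return f"{path.stem}_{digest}.tiff"


def check_unique_outputs(slide_paths):
    """Alternative: keep plain basenames but refuse inputs that would collide."""
    seen = {}
    for slide_path in slide_paths:
        name = f"{PurePosixPath(slide_path).stem}.tiff"
        if name in seen:
            raise ValueError(f"{slide_path} and {seen[name]} both map to {name}")
        seen[name] = slide_path


# Hypothetical inputs that share a basename across two directories.
paths = ["/cohort_a/slide 1.mrxs", "/cohort_b/slide 1.mrxs"]
names = [disambiguated_filename(p) for p in paths]
print(names)
```

The duplicate check keeps filenames fully readable at the cost of failing the run when two inputs share a basename; the digest suffix avoids the failure but reintroduces a generated component into the name.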

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f2437f6d-25d0-430a-ab4e-c58d20390ed7

📥 Commits

Reviewing files that changed from the base of the PR and between 8a8f76e and ab1c90a.

📒 Files selected for processing (3)
  • configs/experiment/ml/test_linear_virchow2_lbfgs.yaml
  • configs/ml/trainer/final_with_prediction_maps.yaml
  • ml/callbacks/tiff_prediction_map_writer.py

Comment thread ml/callbacks/tiff_prediction_map_writer.py

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the TiffPredictionMapWriter by changing the default background_value from 255 to 0 in both the configuration files and the class constructor. Additionally, it simplifies the _slide_prediction_filename function by removing the hash suffix and previous sanitization logic. Feedback suggests that this simplification could lead to filename collisions and filesystem compatibility issues, recommending a more robust sanitization approach that handles illegal characters while maintaining Unicode support.

Comment thread ml/callbacks/tiff_prediction_map_writer.py
@vojtech-cifka vojtech-cifka merged commit ecb0347 into master May 20, 2026
3 checks passed
@vojtech-cifka vojtech-cifka deleted the fix/mug-prediction-masks branch May 20, 2026 07:04