Fix prediction mask names and background value by vojtech-cifka · Pull Request #16 · RationAI/tissue-classification

vojtech-cifka · 2026-05-19T20:01:43Z

Summary

preserve original slide basenames for prediction TIFF maps
remove generated hash suffixes from prediction mask filenames
keep Unicode characters in slide names, e.g. MUG German slide names
set prediction-map background pixels to 0
make prediction-map background value explicit in LBFGS prediction-map configs

Motivation

Prediction masks were difficult to map back to their source WSIs because filenames had generated hash suffixes and non-ASCII characters were
sanitized. This broke names such as 61 Follikuläres Schilddrüsenkarzinom.

Prediction masks also used 255 for uncovered/background pixels, while the report-facing annotation masks use 0 after remapping. This caused
report overlays to render the whole non-predicted slide area as white.
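
The effect of the background value can be illustrated with a small, self-contained sketch. It is not the `TiffPredictionMapWriter` code: the function name, the tile tuple layout, and the assumption that predicted class values are nonzero (so they stay distinguishable from background) are all hypothetical.

```python
def build_prediction_map(height, width, tiles, background_value=0):
    """Return a 2D grid pre-filled with background_value, with tiles painted in.

    tiles is an iterable of (row, col, size, class_value) tuples; this layout
    is an assumption made for the sketch, not the real writer's input format.
    """
    grid = [[background_value] * width for _ in range(height)]
    for row, col, size, class_value in tiles:
        # Clip each square tile to the canvas bounds before painting it.
        for r in range(row, min(row + size, height)):
            for c in range(col, min(col + size, width)):
                grid[r][c] = class_value
    return grid


tiles = [(0, 0, 2, 3)]  # one 2x2 tile predicted as class value 3
new_map = build_prediction_map(4, 4, tiles)                        # background 0
old_map = build_prediction_map(4, 4, tiles, background_value=255)  # background 255
```

With `background_value=255`, every uncovered pixel carries the maximum 8-bit value, which matches the white overlay described above; with `0`, uncovered pixels agree with the annotation masks.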

Summary by CodeRabbit

Chores
- Updated default background value for prediction map generation from 255 to 0.
- Modified prediction map output filename generation to derive names directly from input file paths rather than content-based naming.

Use the `label` and `fold` columns produced by the upstream k-fold split instead of deriving labels from coverage columns and randomly splitting val. Memory-mapped via HuggingFace datasets so the full embedding parquet no longer has to fit in numpy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ddings Datamodule downloads embeddings + kfold artifacts from MLflow, joins on (slide_id, x, y) via pyarrow, applies class mapping, tissue/class coverage filters, and exposes per-fold splits via set_val_fold(). Training script loops folds in a single run and logs per-fold + aggregate metrics. Probe adds per-class F1, confusion matrix figures, optional input L2-norm and class weights. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The experiment file was declaring /class_mapping as a fresh default while configs/ml/linear_probe.yaml already had one, which Hydra rejects as a duplicate. Mark it as an override so the experiment replaces the base default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ml/train.py uses @with_cli_args(["+ml=linear_probe"]), so the decorator already injects that arg. Passing it again on the command line caused Hydra to load configs/ml/linear_probe.yaml twice and reject duplicate defaults. Rely on the decorator and pass only +experiment=... Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ng refs

Two interpolation problems prevented Hydra from resolving the linear-probe config:
1. configs/ml.yaml uses ${random_seed:} and configs/ml/linear_probe.yaml uses ${len:...}, but neither resolver is registered anywhere. Register both at module import time in ml/train.py.
2. The class_mapping yamls use # @package _global_, so class_mapping, class_indices, and class_names land at the config root. The references in linear_probe.yaml were doubly nested (e.g. class_mapping.class_mapping). Drop the prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The filtered tiles parquet collapses ROI columns at tiling time, so kfold writes canonical names ("Epithelium", etc.) directly into `label`. The raw→canonical lookup built from the BB-suffixed YAML lists matched none of these and dropped the entire 1.1M-tile dataset under drop_unmapped=True. Extend _raw_to_canonical with identity entries for every canonical class so modern parquets pass through while legacy un-collapsed labels still collapse correctly. "background" stays unmapped → dropped, as intended. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
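
A minimal sketch of the identity-entry idea described in this commit message. The function names, the shape of the mapping, and the raw label spellings such as `Epithelium_BB` are assumptions for illustration; the repository's `_raw_to_canonical` may look different.

```python
def build_raw_to_canonical(class_mapping):
    """Build a raw -> canonical lookup that also accepts canonical names.

    class_mapping maps each canonical class name to its list of raw label
    names (a hypothetical shape for this sketch).
    """
    lookup = {}
    for canonical, raw_names in class_mapping.items():
        lookup[canonical] = canonical  # identity entry: already-collapsed labels pass through
        for raw in raw_names:
            lookup[raw] = canonical    # legacy un-collapsed labels still collapse
    return lookup


def map_labels(labels, lookup, drop_unmapped=True):
    """Map labels through the lookup, dropping unknown ones when requested."""
    mapped = []
    for label in labels:
        if label in lookup:
            mapped.append(lookup[label])
        elif not drop_unmapped:
            mapped.append(label)
    return mapped


mapping = {"Epithelium": ["Epithelium_BB"], "Stroma": ["Stroma_BB"]}  # hypothetical raw names
lookup = build_raw_to_canonical(mapping)
result = map_labels(["Epithelium", "Stroma_BB", "background"], lookup)
```

Without the identity entries, `"Epithelium"` would not be found in the lookup and would be dropped together with `"background"`, which is the failure mode the commit describes.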

- Add EmbeddingsDataModule.compute_class_weights("balanced"|"inverse") using sklearn-style weights from the current train fold.
- train.py resolves class_weights="balanced"/"inverse" via the datamodule and passes the resulting list to LinearProbe at instantiate time (per-fold, since splits change).
- Bump class_coverage_min from 0.0 to 0.5 to drop mosaic tiles.
- Drop the redundant /class_mapping default from configs/ml/linear_probe.yaml; experiment files now own the choice.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
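
The class-weight computation mentioned above can be sketched as follows. The "balanced" branch uses the formula scikit-learn documents for `class_weight="balanced"`, `n_samples / (n_classes * count)`. What "inverse" means in this repository is not stated on this page, so `n_samples / count` is an assumption, as are the function signature and the choice to give absent classes a weight of 0.

```python
from collections import Counter


def compute_class_weights(labels, num_classes, mode="balanced"):
    """Return one weight per class index, computed from integer train labels."""
    if mode not in ("balanced", "inverse"):
        raise ValueError(f"unknown mode: {mode!r}")
    counts = Counter(labels)
    n_samples = len(labels)
    weights = []
    for class_index in range(num_classes):
        count = counts.get(class_index, 0)
        if count == 0:
            weights.append(0.0)  # assumption: a class absent from the fold gets weight 0
        elif mode == "balanced":
            weights.append(n_samples / (num_classes * count))
        else:  # "inverse" (assumed definition: plain inverse frequency)
            weights.append(n_samples / count)
    return weights


train_labels = [0, 0, 0, 1]  # toy fold: class 0 is three times as frequent
balanced = compute_class_weights(train_labels, num_classes=2)
inverse = compute_class_weights(train_labels, num_classes=2, mode="inverse")
```

Because the train split changes with every fold, weights like these would have to be recomputed per fold, which is consistent with the commit passing them to the probe at instantiate time.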

Extract derive_labels logic to shared preprocessing/_labels.py, then use it in both split/kfold_split.py and the new embedding_dataset pipeline. The new pipeline joins k-fold (train) / filter_tiles (test) tile metadata with precomputed embeddings after applying tissue + per-dominant-class ROI thresholds, and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ataset

Joining 1M+ rows of list<double> embeddings was either OOMing on to_pandas() or hitting int32 list-offset overflow inside take(). The fix:
- read embeddings into Arrow only and cast each chunk to large_list so take() concatenation uses int64 offsets;
- run the join on keys plus a synthetic row index because Acero refuses list columns in non-key fields, then pull embeddings via take();
- combine_chunks() before take() for an O(N) single-pass copy;
- write the parquet straight from Arrow, never materialising the embedding column in pandas.

Also bumps the kube job memory to 64Gi to give the combined-chunks + take() peak some headroom, and trims the verbose [timing] prints down to one progress line per split.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Without this guard a malformed train artifact would crash deep inside apply_thresholds with a confusing KeyError. Surface a clear error that points at the expected upstream artifact instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Clear the batch buffer only on rank!=0 or after a successful write so the on_test_end fallback no longer hits an always-empty buffer. Add diagnostic prints to the silent early-return guards and an idempotency flag so the two write hooks cooperate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
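
The buffering behaviour described in this commit message could look roughly like the sketch below. It is a plain class, not a Lightning callback; the class name, the `write_fn` parameter, and the choice of `on_test_epoch_end` as the primary hook are assumptions (only `on_test_end` is named in the commit message).

```python
class BufferedPredictionWriter:
    """Collects batches and writes them once, from whichever hook fires first."""

    def __init__(self, rank, write_fn):
        self.rank = rank
        self.write_fn = write_fn
        self.buffer = []
        self.written = False  # idempotency flag shared by both hooks

    def add_batch(self, batch):
        self.buffer.append(batch)

    def _flush(self, hook_name):
        if self.written:
            print(f"[{hook_name}] output already written, skipping")
            return
        if self.rank != 0:
            print(f"[{hook_name}] rank {self.rank} does not write, clearing buffer")
            self.buffer.clear()
            return
        if not self.buffer:
            print(f"[{hook_name}] buffer is empty, nothing to write")
            return
        # If write_fn raises, the buffer is left intact so the fallback hook can retry.
        self.write_fn(list(self.buffer))
        self.buffer.clear()
        self.written = True

    def on_test_epoch_end(self):
        self._flush("on_test_epoch_end")

    def on_test_end(self):
        self._flush("on_test_end")


written_batches = []
writer = BufferedPredictionWriter(rank=0, write_fn=written_batches.extend)
writer.add_batch("batch-0")
writer.on_test_epoch_end()  # performs the write
writer.on_test_end()        # sees the flag and skips
```

The key point is the ordering: the buffer is cleared only after a successful write (or on non-zero ranks), so a fallback hook never finds an empty buffer because an earlier hook cleared it prematurely.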

…-test

# Conflicts:
#   configs/experiment/ml/linear_classifier_test_adamw.yaml
#   configs/experiment/ml/linear_classifier_test_lbfgs.yaml
#   configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_05mpp.yaml
#   configs/ml/task/final_linear_classifier.yaml
#   configs/preprocessing/embeddings.yaml
#   ml/callbacks/tiff_prediction_map_writer.py
#   preprocessing/embeddings.py

…h-metrics-test

coderabbitai · 2026-05-19T20:01:58Z

📝 Walkthrough

TiffPredictionMapWriter changes its default background value from 255 to 0 and replaces blake2b-based filename generation with simpler path-based naming. The configuration files explicitly set the background value to match the new implementation default.

Changes

TIFF Prediction Map Writer Update

Layer / File(s)	Summary
TiffPredictionMapWriter implementation updates: `ml/callbacks/tiff_prediction_map_writer.py`	`TiffPredictionMapWriter` constructor default for `background_value` changes from `255` to `0`, the `blake2b` import is removed, and `_slide_prediction_filename` is refactored to derive output TIFF names directly from input paths instead of generating digest-based filenames.
Configuration updates for background value: `configs/ml/trainer/final_with_prediction_maps.yaml`, `configs/experiment/ml/test_linear_virchow2_lbfgs.yaml`	Both trainer and experiment configuration files explicitly add `background_value: 0` to the `tiff_prediction_maps` callback configuration.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A rabbit writes with care today,
Backgrounds now default to gray,
Blake's digest gone, filenames flow,
From simple paths, not hashes bright—
Configuration and code unite! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main changes: fixing prediction mask names (removing hash suffixes) and changing the background value from 255 to 0.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ml/callbacks/tiff_prediction_map_writer.py`:
- Around line 519-520: the current implementation of _slide_prediction_filename
returns only the basename which causes silent overwrites for identically-named
files from different directories; update _slide_prediction_filename to preserve
the readable basename but append a stable, short disambiguator derived from the
full input path (e.g., first 8 chars of a hash of str(Path(path).resolve()) or
include parent folder name) before the .tiff suffix so names remain
deterministic and human-readable, and/or add a duplicate-detection check in the
writer that raises an error if two inputs would map to the same output; ensure
you reference and change the _slide_prediction_filename function and the caller
that writes prediction files so the new name format is used consistently.
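
The two options raised in this comment, a short path-derived disambiguator and a duplicate-detection check, can be sketched as below. This page does not show whether either was adopted, and the merged change deliberately removed hash suffixes, so this illustrates the reviewer's suggestion rather than the repository's code. All function names and the `.svs` extension are hypothetical.

```python
import hashlib
from pathlib import Path


def plain_prediction_filename(path):
    """Readable name only: '<slide stem>.tiff', with Unicode kept as-is."""
    return f"{Path(path).stem}.tiff"


def disambiguated_prediction_filename(path):
    """Readable stem plus 8 hex characters derived from the resolved input path."""
    resolved = str(Path(path).resolve())
    suffix = hashlib.blake2b(resolved.encode("utf-8"), digest_size=4).hexdigest()
    return f"{Path(path).stem}-{suffix}.tiff"


def assert_unique_outputs(paths, namer):
    """Raise ValueError if two different inputs would map to the same output name."""
    seen = {}
    for path in paths:
        name = namer(path)
        if name in seen and seen[name] != path:
            raise ValueError(f"{path!r} and {seen[name]!r} both map to {name!r}")
        seen[name] = path
    return seen


slides = ["cohort_a/slide 1.svs", "cohort_b/slide 1.svs"]  # same basename, different folders
try:
    assert_unique_outputs(slides, plain_prediction_filename)
    collision_detected = False
except ValueError:
    collision_detected = True
unique_names = assert_unique_outputs(slides, disambiguated_prediction_filename)
```

The duplicate check alone keeps filenames free of suffixes and turns a silent overwrite into an explicit error; the hashed variant avoids the collision at the cost of a less clean name.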

📥 Commits

Reviewing files that changed from the base of the PR and between 8a8f76e and ab1c90a.

📒 Files selected for processing (3)

configs/experiment/ml/test_linear_virchow2_lbfgs.yaml
configs/ml/trainer/final_with_prediction_maps.yaml
ml/callbacks/tiff_prediction_map_writer.py

gemini-code-assist

Code Review

This pull request updates the TiffPredictionMapWriter by changing the default background_value from 255 to 0 in both the configuration files and the class constructor. Additionally, it simplifies the _slide_prediction_filename function by removing the hash suffix and previous sanitization logic. Feedback suggests that this simplification could lead to filename collisions and filesystem compatibility issues, recommending a more robust sanitization approach that handles illegal characters while maintaining Unicode support.
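
One way to read the recommendation above is a sanitizer that replaces only characters that are problematic on common filesystems and leaves all other Unicode untouched. The following is a sketch under that reading; the function name, the exact character set, and the `"unnamed"` fallback are assumptions, not the reviewer's or the repository's code.

```python
import re
import unicodedata

# Characters rejected by Windows filesystems, plus path separators and ASCII control characters.
_ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_stem(stem, replacement="_"):
    """Replace filesystem-hostile characters while keeping non-ASCII letters."""
    normalized = unicodedata.normalize("NFC", stem)  # compose umlauts into single code points
    cleaned = _ILLEGAL_CHARS.sub(replacement, normalized)
    cleaned = cleaned.strip(" .")  # leading/trailing dots and spaces are fragile on some systems
    return cleaned or "unnamed"


german_name = unicodedata.normalize("NFC", "61 Follikuläres Schilddrüsenkarzinom")
kept = sanitize_stem(german_name)
replaced = sanitize_stem("a/b:c")
```

Sanitizing this way still allows two different inputs to map to the same output name (for example `a/b` and `a:b`), so it does not by itself address the collision concern.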

vojtech-cifka and others added 30 commits May 7, 2026 21:43

feat: create ml pipeline for linear probe

24668c3

fix: sort only tiles parquet

894c27b

fix: log join types of tile keys

fc824ad

fix: remove embeddings from the join

11931d1

fix: remove label column

fb6b320

fix: prevent overflow

7434ae9

Merge remote-tracking branch 'origin/master' into feature/linear-probe

1b18daa

feat: add class tresholds and run ids

911bec2

fix: wrong run id

1a02395

Merge remote-tracking branch 'origin/master' into feature/embedding-d…

08d7ba5

…ataset

feat: add timing

b38465e

refactor: use pyarrow to avoid to pandas conversion

bfc9578

fix: join on keys only

eb213c6

fix: typing

c92d9a1

fix: add prints

01cc394

refactor: use combine chunks

cad0d37

chore: remove time

3b0137f

feat: add timing

8df47aa

chore: revert to the previous state

926753d

feat: add prints

b0e9ba4

vojtech-cifka and others added 19 commits May 18, 2026 22:01

chore: remove username from the submission script

76e4194

fix: force the entering of the write phase of the prediction maps

597e348

fix: remove username

3829ebd

feat: generate embeddings up to a budget

0b2d38e

Merge branch 'feature/ml-test-mode' into feature/provgigapath-metrics…

ff7c06e

…-test

chore: rename ml experiments for clarity

b3d803a

feat: add original slide name in the per slide statistics

1703c01

Merge remote-tracking branch 'origin/master' into feature/provgigapat…

3c72c4f

…h-metrics-test

refactor: simplify the preprocessing name scripts

67d3ef4

fix: commnets, generate safer filenames

cd4b19a

fix: remove erorr masks

c9b4c67

fix: remove rendundant column selection

df7888f

feat: lower the worker consumption

1b1f8af

fix: change the background color

1b4baa0

feat: preserve original filename

083dd94

Merge origin/master into fix/mug-prediction-masks

142f9cb

Remove embedding preprocessing changes from PR

ab1c90a

vojtech-cifka requested review from Adames4 and vejtek May 19, 2026 20:01

vojtech-cifka self-assigned this May 19, 2026

vojtech-cifka requested a review from a team May 19, 2026 20:01

coderabbitai Bot reviewed May 19, 2026

Comment thread ml/callbacks/tiff_prediction_map_writer.py

gemini-code-assist Bot reviewed May 19, 2026

Comment thread ml/callbacks/tiff_prediction_map_writer.py

vejtek approved these changes May 19, 2026

Adames4 approved these changes May 19, 2026

vojtech-cifka merged commit ecb0347 into master May 20, 2026
3 checks passed

vojtech-cifka deleted the fix/mug-prediction-masks branch May 20, 2026 07:04

vojtech-cifka merged 133 commits into master from fix/mug-prediction-masks

3 participants
