
optimize coordinate validation, handling both -ve and OOB error#1285

Open
musaqlain wants to merge 1 commit into weecology:main from musaqlain:refactor_validate_coordinates

Conversation

@musaqlain
Contributor

@musaqlain musaqlain commented Jan 25, 2026

Achieved an ~8.2x speedup (9.5 min -> 1.1 min) in dataset validation by reducing disk I/O and vectorizing coordinate checks.

Changes

  • Optimized I/O: Caches image dimensions to avoid repeated file opens.
  • Vectorization: Replaced iterrows with Pandas boolean masking.
  • Unified Error Reporting: Merged checks for negative/OOB coordinates into a single report that lists all invalid boxes to aid debugging.
  • Polygon Support: Automatically converts non-rectangular geometries to valid bounding boxes.
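The I/O caching and vectorization changes above can be sketched roughly as follows. This is a minimal illustration, not DeepForest's actual `_validate_coordinates()` implementation; the function name, column names, and the `image_sizes` dict are assumptions for the example.

```python
import pandas as pd

def validate_coordinates(df, image_sizes):
    """Vectorized check for negative and out-of-bounds boxes.

    df: DataFrame with image_path, xmin, ymin, xmax, ymax columns.
    image_sizes: dict mapping image_path -> (width, height), built once
    (e.g. from PIL's Image.open(path).size) so each file is opened a
    single time instead of once per annotation row.
    """
    # Look up cached dimensions instead of opening images inside a loop
    sizes = df["image_path"].map(image_sizes)
    width = sizes.map(lambda wh: wh[0])
    height = sizes.map(lambda wh: wh[1])

    # Boolean masks replace the old iterrows() loop
    negative = (df[["xmin", "ymin", "xmax", "ymax"]] < 0).any(axis=1)
    oob = (df["xmax"] > width) | (df["ymax"] > height)

    invalid = df[negative | oob]
    if not invalid.empty:
        # Unified report: every bad box with its image and coordinates
        report = "\n".join(
            f"{r.image_path}: ({r.xmin}, {r.ymin}, {r.xmax}, {r.ymax})"
            for r in invalid.itertuples()
        )
        raise ValueError(f"Found {len(invalid)} invalid bounding boxes:\n{report}")
```

Because the masks operate on whole columns, the per-row Python overhead of `iterrows` disappears, and the single raise lists every offending box rather than failing on the first one.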

Fixes #1244

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting, thanks

@musaqlain musaqlain changed the title Vectorize coordinate validation for speedup on large datasets [WIP]: Vectorize coordinate validation for speedup on large datasets Jan 25, 2026
@musaqlain musaqlain changed the title [WIP]: Vectorize coordinate validation for speedup on large datasets Vectorize coordinate validation for speedup on large datasets Jan 25, 2026
@bw4sz
Collaborator

bw4sz commented Jan 28, 2026

Thanks for your contribution, let's wait for the tests to pass.

@bw4sz
Collaborator

bw4sz commented Jan 28, 2026

@jveitchmichaelis, my general philosophy has always been to focus DeepForest on novice users and help avoid errors before they happen. So in this case there is a raise if there are bad coordinates in the annotations. I think that is still right and not too annoying, rather than just printing or warning. I think we want to prioritize "hey, there is a problem here" over a streamlined flow that bets on the user understanding what they are doing. Agreed?

@bw4sz bw4sz self-requested a review January 28, 2026 18:00
Collaborator

@bw4sz bw4sz left a comment


Agreed, thanks.

@bw4sz bw4sz self-requested a review January 28, 2026 18:36
Collaborator

@bw4sz bw4sz left a comment


Have a look at the test failures.

@musaqlain musaqlain requested a review from bw4sz January 28, 2026 19:49
Collaborator

@jveitchmichaelis jveitchmichaelis left a comment


Requested some of the LLM-style comments be cleaned up.

It would be interesting to run this with a profiler. I assume 90+% of the time is spent opening images, so caching is the real win (it avoids reloading images). That's why I don't think vectorizing is the most important aspect here, even though it's definitely an improvement.

I tested filtering a 100M-row NumPy array on Colab; it takes a couple of seconds. So minutes of runtime are likely I/O.
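The point above can be reproduced with a quick micro-benchmark. This is an illustrative sketch (the array size is scaled down here; raise `n` toward 100M to match the Colab test), showing that a vectorized mask over tens of millions of coordinate rows is cheap compared to minutes of file I/O.

```python
import time
import numpy as np

# 10M rows of fake (xmin, ymin, xmax, ymax) coordinates; some are negative.
# Scale n up to 100_000_000 to approximate the Colab experiment.
n = 10_000_000
coords = np.random.randint(-10, 1000, size=(n, 4), dtype=np.int32)

start = time.perf_counter()
mask = (coords < 0).any(axis=1)  # vectorized negative-coordinate check
bad = coords[mask]
elapsed = time.perf_counter() - start
print(f"filtered {mask.sum()} invalid rows in {elapsed:.2f}s")
```

Even at 100M rows this kind of boolean filter completes in seconds, which supports the conclusion that the minutes-long validation time was dominated by repeated image opens, not by the coordinate checks themselves.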

@musaqlain musaqlain changed the title Vectorize coordinate validation for speedup on large datasets Optimize coordinate validation (I/O Caching + Vectorization) Jan 28, 2026
@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jan 28, 2026

@bw4sz sorry to belabor this review. If the goal is to show users where issues are, I think it's important that the output is actually useful for debugging. Seeing that some number of boxes are malformed, or only their coordinates, doesn't help me fix what's wrong with my dataset. What do you think about reporting a complete list of image name / box issues? e.g.

Errors:
image_name, box coords, error

In the existing code we print an error string for all invalid boxes, while this PR reports only a count for negative coordinates:

errors.append(f"Found {bad_count} annotations with negative coordinates.")

@musaqlain If you move the logic for negative coordinates into the oob_mask I think that would be better.
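The suggestion above (folding the negative-coordinate check into the out-of-bounds mask so one report covers both) could look roughly like this. Column names and the sample data are hypothetical; `width`/`height` stand in for the cached per-row image dimensions.

```python
import pandas as pd

# Hypothetical annotation frame; width/height come from the cached image sizes.
df = pd.DataFrame({
    "image_path": ["a.png", "a.png", "b.png"],
    "xmin": [10, -5, 0],
    "ymin": [10, 10, 0],
    "xmax": [50, 50, 200],
    "ymax": [50, 50, 90],
})
width = pd.Series([100, 100, 150])
height = pd.Series([100, 100, 100])

# One mask covers both failure modes (negative and out-of-bounds),
# so the report lists every bad box rather than just a count.
invalid_mask = (
    (df["xmin"] < 0) | (df["ymin"] < 0)
    | (df["xmax"] > width) | (df["ymax"] > height)
)
errors = [
    f"{row.image_path}: ({row.xmin}, {row.ymin}, {row.xmax}, {row.ymax})"
    for row in df[invalid_mask].itertuples()
]
```

With the two checks merged, the error output gives the image name and coordinates for every invalid box in a single pass, which is the debugging detail the review asked for.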

@musaqlain
Contributor Author

Completely makes sense. I have updated the logic to catch both the negative coordinates and the out-of-bounds coordinates, giving users a detailed report. Thanks, cc @jveitchmichaelis

@musaqlain musaqlain changed the title Optimize coordinate validation (I/O Caching + Vectorization) Optimize coordinate validation (I/O Caching + Vectorization) & Unified Error Reporting Jan 28, 2026
@musaqlain musaqlain changed the title Optimize coordinate validation (I/O Caching + Vectorization) & Unified Error Reporting optimize coordinate validation, handling both -ve and OOB error Jan 28, 2026
@codecov

codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.87%. Comparing base (408e150) to head (aa9ad5a).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1285      +/-   ##
==========================================
- Coverage   86.87%   86.87%   -0.01%     
==========================================
  Files          24       24              
  Lines        3064     3184     +120     
==========================================
+ Hits         2662     2766     +104     
- Misses        402      418      +16     
Flag Coverage Δ
unittests 86.87% <100.00%> (-0.01%) ⬇️


@musaqlain
Contributor Author

musaqlain commented Feb 2, 2026

PTAL cc @jveitchmichaelis

@bw4sz bw4sz added the API This tag is used for small improvements to the readability and usability of the python API. label Feb 2, 2026
@jveitchmichaelis
Collaborator

jveitchmichaelis commented Apr 15, 2026

Could you resolve the conflicts and squash this one, please?

Then we can also have a look at merging #1373

Copilot AI review requested due to automatic review settings April 15, 2026 17:52
@musaqlain musaqlain force-pushed the refactor_validate_coordinates branch from 7f018f5 to ceab036 Compare April 15, 2026 17:52

Copilot AI left a comment


Pull request overview

This PR targets speeding up BoxDataset annotation validation for large datasets by reducing repeated image opens and switching coordinate checks from row-wise iteration to vectorized Pandas masking, while updating tests to reflect the new error reporting and polygon-to-bounds behavior.

Changes:

  • Cache per-image dimensions to avoid repeated disk I/O during coordinate validation.
  • Replace per-row coordinate validation with vectorized boolean masks and a unified “invalid bounding boxes” report.
  • Update dataset tests to match the new error message and to accept non-rectangular geometries by deriving bounding-box bounds.
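The polygon-to-bounds behavior mentioned above amounts to taking the axis-aligned bounding box of a non-rectangular geometry. Shapely's standard `bounds` property does exactly this; the triangle below is an illustrative example, not a geometry from the test suite.

```python
from shapely.geometry import Polygon

# A non-rectangular geometry (a triangle). Its .bounds property returns
# (minx, miny, maxx, maxy): the smallest axis-aligned box containing it,
# which is what the validation converts polygons into.
poly = Polygon([(0, 0), (40, 0), (20, 30)])
xmin, ymin, xmax, ymax = poly.bounds
```

Deriving bounds this way lets users supply polygon annotations directly, while the dataset still trains on rectangular boxes.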

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

  • src/deepforest/datasets/training.py: Refactors _validate_coordinates() to cache image sizes, add bounds columns when needed, and perform vectorized negative/OOB checks.
  • tests/test_datasets_training.py: Updates expected validation error text, changes the polygon behavior test to assert derived bounds, and adds negative/OOB validation tests.


@musaqlain musaqlain force-pushed the refactor_validate_coordinates branch from ceab036 to aa9ad5a Compare April 15, 2026 18:30
@musaqlain
Contributor Author

Could you resolve the conflicts and squash this one, please?

Hi @jveitchmichaelis, you can review it now; the conflicts are resolved and all the commits are squashed into one. Thanks.


Labels

API This tag is used for small improvements to the readability and usability of the python API.


Development

Successfully merging this pull request may close these issues.

Validate annotations takes a very long time for large datasets.

4 participants