optimize coordinate validation, handling both -ve and OOB error #1285
musaqlain wants to merge 1 commit into weecology:main
Conversation
Thanks for your contribution, let's wait for the tests to pass.
@jveitchmichaelis, my general philosophy has always been to focus DeepForest on novice users and help avoid errors before they happen. So in this case there is a raise if there are bad coordinates in the annotations. I think that is still right and not too annoying, instead of just printing or warning. I think we want to prioritize "hey, there is a problem here" over a streamlined flow that bets the user understands what they are doing. Agreed?
bw4sz
left a comment
Have a look at the test failures.
Requested some of the LLM-style comments be cleaned up.
It would be interesting to run this with a profiler. I assume 90+% of the time is spent opening images (so caching is the real win, since it avoids reloading images). That's why I don't think vectorizing is the most important aspect here, even though it's definitely an improvement.
I tested filtering a 100M-row numpy array on Colab; it takes a couple of seconds. So minutes of runtime is likely I/O.
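For reference, a quick sketch of the kind of timing check mentioned above (the array size and value ranges here are illustrative, not the actual benchmark):

```python
import time

import numpy as np

# Illustrative benchmark: flag rows with any negative coordinate using a
# vectorized boolean mask instead of iterating row by row.
rng = np.random.default_rng(0)
boxes = rng.uniform(-10, 1000, size=(1_000_000, 4))  # xmin, ymin, xmax, ymax

start = time.perf_counter()
negative_mask = (boxes < 0).any(axis=1)  # True for rows with any value < 0
n_invalid = int(negative_mask.sum())
elapsed = time.perf_counter() - start

print(f"Flagged {n_invalid} invalid rows in {elapsed:.3f}s")
```

Even at far larger row counts this stays in the seconds range, which supports the point that any multi-minute runtime is dominated by I/O.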
@bw4sz sorry to belabor this review. If the goal is to show users where issues are, I think it's important that the output is actually useful for debugging. Seeing that some number of boxes are malformed, or only their coordinates, doesn't help me fix what's wrong with my dataset. What do you think about reporting a complete list of image name / box issues? For example, the existing code prints an error string for all invalid boxes, while this PR reports only a count for negative coordinates. @musaqlain If you move the logic for negative coordinates into the
Completely makes sense. I have updated the logic to catch both the -ve coordinates as well as out-of-bounds coordinates to give users a detailed report. Thanks cc @jveitchmichaelis
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #1285 +/- ##
==========================================
- Coverage 86.87% 86.87% -0.01%
==========================================
Files 24 24
Lines 3064 3184 +120
==========================================
+ Hits 2662 2766 +104
- Misses 402 418 +16
==========================================
Flags with carried forward coverage won't be shown.
PTAL cc @jveitchmichaelis
Could you resolve the conflicts and squash this one, please? Then we can also have a look at merging #1373.
Force-pushed from 7f018f5 to ceab036.
Pull request overview
This PR speeds up BoxDataset annotation validation for large datasets by reducing repeated image opens and switching coordinate checks from row-wise iteration to vectorized Pandas masking. Tests are updated to reflect the new error reporting and polygon-to-bounds behavior.
Changes:
- Cache per-image dimensions to avoid repeated disk I/O during coordinate validation.
- Replace per-row coordinate validation with vectorized boolean masks and a unified “invalid bounding boxes” report.
- Update dataset tests to match the new error message and to accept non-rectangular geometries by deriving bounding-box bounds.
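The cached, vectorized validation described above might look like this minimal sketch (the function name, column names, and report shape are assumptions for illustration, not the actual DeepForest implementation):

```python
from pathlib import Path

import pandas as pd
from PIL import Image


def find_invalid_boxes(df: pd.DataFrame, root_dir: str) -> pd.DataFrame:
    """Return the rows whose boxes have negative or out-of-bounds coordinates."""
    # Cache per-image dimensions: open each unique image once instead of
    # once per annotation row.
    sizes = {
        name: Image.open(Path(root_dir) / name).size  # (width, height)
        for name in df["image_path"].unique()
    }
    widths = df["image_path"].map(lambda name: sizes[name][0])
    heights = df["image_path"].map(lambda name: sizes[name][1])

    # Vectorized boolean masks: no per-row iteration.
    negative = (df[["xmin", "ymin", "xmax", "ymax"]] < 0).any(axis=1)
    out_of_bounds = (df["xmax"] > widths) | (df["ymax"] > heights)
    return df[negative | out_of_bounds]
```

Raising when the returned frame is non-empty, with the offending rows included in the message, gives the per-image "invalid bounding boxes" report discussed in the review.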
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/deepforest/datasets/training.py | Refactors `_validate_coordinates()` to cache image sizes, add bounds columns when needed, and perform vectorized negative/OOB checks. |
| tests/test_datasets_training.py | Updates the expected validation error text, changes the polygon behavior test to assert derived bounds, and adds negative/OOB validation tests. |
Force-pushed from ceab036 to aa9ad5a.
Hi @jveitchmichaelis, you can review it now; the conflicts are resolved and all the commits are squashed into one. Thanks
Achieved an ~8.2x speedup (9.5 min → 1.1 min) in dataset validation by reducing disk I/O and vectorizing coordinate checks.
Changes
- `iterrows` with Pandas boolean masking.

Fixes #1244
AI-Assisted Development