Conversation
Limit parallel workers to match file count and release version 0.0.15
📝 WalkthroughWalkthroughThe PR caps parallel worker pools in download functions to not exceed the actual number of items queued for processing, preventing oversized worker allocation when processing small batches. Version incremented to 0.0.15. ChangesParallel Worker Pool Capping
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 60 minutes.Comment |
Review Summary by QodoCap parallel workers to file count and release 0.0.15
WalkthroughsDescription• Cap parallel workers to actual file/URL count to prevent thread pool oversizing • Add tests verifying worker pool capping behavior for Globus and URL downloads • Bump version to 0.0.15 Diagramflowchart LR
A["parallel_files parameter"] -->|"min with file count"| B["Capped worker pool"]
B --> C["Globus download"]
B --> D["URL download"]
C --> E["Serial or parallel execution"]
D --> E
File Changes1. pridepy/files/files.py
|
Code Review by Qodo
1. Tests don't verify capping
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pridepy/tests/test_download_resilience.py`:
- Around line 218-255: The tests only assert the helper call count and miss
verifying the ThreadPoolExecutor sizing; update tests to either (A) patch
concurrent.futures.ThreadPoolExecutor and assert it was instantiated with
max_workers equal to min(parallel_files, number_of_files/urls) when invoking
Files.download_files_from_globus and Files.download_files_by_url (while still
patching Files._globus_download_one and Files._download_single_url to avoid real
downloads), or (B) extract the cap logic into a small helper (e.g.
Files._cap_parallel_workers) and write unit tests for that helper to assert it
returns the expected worker count for given inputs; reference
Files.download_files_from_globus, Files.download_files_by_url,
Files._globus_download_one, Files._download_single_url, and ThreadPoolExecutor
when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 231cbc95-1fdc-4a44-b0af-a08fb85a6993
📒 Files selected for processing (3)
pridepy/files/files.pypridepy/tests/test_download_resilience.pypyproject.toml
| def test_globus_parallel_workers_capped_to_file_count(self): | ||
| """When parallel_files exceeds the number of files to download, | ||
| the worker pool must not allocate more threads than files.""" | ||
| file_records = [ | ||
| { | ||
| "fileName": "only.raw", | ||
| "publicFileLocations": [ | ||
| {"name": "FTP Protocol", | ||
| "value": "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2024/01/PXD000001/only.raw"} | ||
| ], | ||
| } | ||
| ] | ||
|
|
||
| with tempfile.TemporaryDirectory() as tmp_dir: | ||
| with patch.object(Files, "_globus_download_one") as mock_one: | ||
| Files.download_files_from_globus( | ||
| file_list_json=file_records, | ||
| output_folder=tmp_dir, | ||
| skip_if_downloaded_already=False, | ||
| parallel_files=3, | ||
| ) | ||
| # With 1 file and parallel_files=3, should fall through to | ||
| # the serial path (parallel_files capped to 1 < 2). | ||
| mock_one.assert_called_once() | ||
|
|
||
| def test_url_parallel_workers_capped_to_url_count(self): | ||
| """download_files_by_url must cap workers to len(urls).""" | ||
| with tempfile.TemporaryDirectory() as tmp_dir: | ||
| with patch.object(Files, "_download_single_url") as mock_single: | ||
| Files.download_files_by_url( | ||
| urls=["https://example.org/a.raw"], | ||
| output_folder=tmp_dir, | ||
| skip_if_downloaded_already=False, | ||
| protocol="globus", | ||
| parallel_files=3, | ||
| ) | ||
| # 1 URL with parallel_files=3 → capped to 1, serial path. | ||
| mock_single.assert_called_once() |
There was a problem hiding this comment.
These tests don't actually verify the cap.
Patching _globus_download_one / _download_single_url only proves the helper is called once; a regression that still creates an oversized ThreadPoolExecutor would pass. Consider asserting max_workers directly, or extracting the cap into a helper and testing that instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pridepy/tests/test_download_resilience.py` around lines 218 - 255, The tests
only assert the helper call count and miss verifying the ThreadPoolExecutor
sizing; update tests to either (A) patch concurrent.futures.ThreadPoolExecutor
and assert it was instantiated with max_workers equal to min(parallel_files,
number_of_files/urls) when invoking Files.download_files_from_globus and
Files.download_files_by_url (while still patching Files._globus_download_one and
Files._download_single_url to avoid real downloads), or (B) extract the cap
logic into a small helper (e.g. Files._cap_parallel_workers) and write unit
tests for that helper to assert it returns the expected worker count for given
inputs; reference Files.download_files_from_globus, Files.download_files_by_url,
Files._globus_download_one, Files._download_single_url, and ThreadPoolExecutor
when making the change.
Summary by CodeRabbit
Bug Fixes
Tests
Chores