Minor changes in parallelization#95

Merged: ypriverol merged 6 commits into master from dev on May 2, 2026

@ypriverol ypriverol commented May 2, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Improved parallel download optimization to prevent allocating unnecessary workers when the number of files or URLs being downloaded is smaller than the available parallel capacity.
  • Tests

    • Added test coverage for parallel worker allocation management during bulk downloads.
  • Chores

    • Updated package version to 0.0.15.

coderabbitai Bot commented May 2, 2026

Walkthrough

The PR caps parallel worker pools in download functions to not exceed the actual number of items queued for processing, preventing oversized worker allocation when processing small batches. Version incremented to 0.0.15.
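
As a rough illustration of the cap described above, the following sketch computes the effective worker count for a few batch sizes. The helper name `cap_parallel_workers` and the constant `HARD_MAX_WORKERS` are hypothetical and used only for this example; the actual change in pridepy/files/files.py may be written inline.

```python
HARD_MAX_WORKERS = 3  # the hardcoded maximum mentioned in the walkthrough


def cap_parallel_workers(parallel_files: int, n_items: int) -> int:
    """Return the worker count after applying both caps (hypothetical helper)."""
    return min(parallel_files, HARD_MAX_WORKERS, n_items)


# (requested workers, number of queued items)
for requested, n_items in [(3, 1), (3, 2), (8, 10), (2, 5)]:
    effective = cap_parallel_workers(requested, n_items)
    print(f"requested={requested} items={n_items} -> workers={effective}")
```

With one queued item the result is 1, so no pool larger than the batch is requested regardless of what the caller asked for.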

Changes

Parallel Worker Pool Capping

| Layer / File(s) | Summary |
| --- | --- |
| Core Implementation (pridepy/files/files.py) | download_files_from_globus and download_files_by_url now cap parallel_files at the smallest of the requested value, the hardcoded maximum (3), and the input size (len(files_to_download) or len(urls), respectively), which prevents oversized worker pools when processing small batches. |
| Test Coverage (pridepy/tests/test_download_resilience.py) | Two new tests target worker-pool capping: test_globus_parallel_workers_capped_to_file_count asserts the Globus download helper is called once when parallel_files exceeds the file count; test_url_parallel_workers_capped_to_url_count asserts the URL download helper is called once when parallel_files exceeds the URL count. |
| Version Metadata (pyproject.toml) | Package version bumped from 0.0.14 to 0.0.15. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • chakrabandla

Poem

🐰 Parallel tasks now know their place,
No worker spawned beyond the race,
Three's the max, but less will do,
When batches small need just a few!
Efficiency hops on through.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Minor changes in parallelization' accurately reflects the main change: capping parallel workers by file/URL count in addition to the constant 3, which directly addresses parallelization logic. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Review Summary by Qodo

Cap parallel workers to file count and release 0.0.15

🐞 Bug fix 🧪 Tests

Walkthroughs

Description
• Cap parallel workers to actual file/URL count to prevent thread pool oversizing
• Add tests verifying worker pool capping behavior for Globus and URL downloads
• Bump version to 0.0.15
Diagram
flowchart LR
  A["parallel_files parameter"] -->|"min with file count"| B["Capped worker pool"]
  B --> C["Globus download"]
  B --> D["URL download"]
  C --> E["Serial or parallel execution"]
  D --> E

File Changes

1. pridepy/files/files.py 🐞 Bug fix +2/-2

Cap parallel workers to file/URL count

• Cap parallel_files at the smallest of the requested value, 3, and the actual file count in download_files_from_globus()
• Cap parallel_files at the smallest of the requested value, 3, and the actual URL count in download_files_by_url()
• Prevents thread pool allocation exceeding available work items

pridepy/files/files.py


2. pridepy/tests/test_download_resilience.py 🧪 Tests +39/-0

Add tests for parallel worker capping

• Add test_globus_parallel_workers_capped_to_file_count() to verify Globus download worker capping
• Add test_url_parallel_workers_capped_to_url_count() to verify URL download worker capping
• Both tests verify that requesting more workers than available items falls back to serial execution

pridepy/tests/test_download_resilience.py


3. pyproject.toml ⚙️ Configuration changes +1/-1

Release version 0.0.15

• Bump version from 0.0.14 to 0.0.15

pyproject.toml


qodo-code-review Bot commented May 2, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)


Remediation recommended

1. Tests don't verify capping 🐞 Bug ⚙ Maintainability
Description
The new tests only assert the per-item download helper is called once, which is true in both the
serial loop and the ThreadPoolExecutor path when there is only one file/URL. This means the added
tests don’t actually validate that parallel worker capping (or the serial-path fallback) is
happening.
Code

pridepy/tests/test_download_resilience.py[R218-255]

+    def test_globus_parallel_workers_capped_to_file_count(self):
+        """When parallel_files exceeds the number of files to download,
+        the worker pool must not allocate more threads than files."""
+        file_records = [
+            {
+                "fileName": "only.raw",
+                "publicFileLocations": [
+                    {"name": "FTP Protocol",
+                     "value": "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2024/01/PXD000001/only.raw"}
+                ],
+            }
+        ]
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            with patch.object(Files, "_globus_download_one") as mock_one:
+                Files.download_files_from_globus(
+                    file_list_json=file_records,
+                    output_folder=tmp_dir,
+                    skip_if_downloaded_already=False,
+                    parallel_files=3,
+                )
+                # With 1 file and parallel_files=3, should fall through to
+                # the serial path (parallel_files capped to 1 < 2).
+                mock_one.assert_called_once()
+
+    def test_url_parallel_workers_capped_to_url_count(self):
+        """download_files_by_url must cap workers to len(urls)."""
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            with patch.object(Files, "_download_single_url") as mock_single:
+                Files.download_files_by_url(
+                    urls=["https://example.org/a.raw"],
+                    output_folder=tmp_dir,
+                    skip_if_downloaded_already=False,
+                    protocol="globus",
+                    parallel_files=3,
+                )
+                # 1 URL with parallel_files=3 → capped to 1, serial path.
+                mock_single.assert_called_once()
Evidence
In both download functions, the serial branch invokes the helper once per item and the parallel
branch submits exactly one task per item; with a single element, both branches result in exactly one
helper call, so assert_called_once() cannot distinguish whether capping changed control flow or
executor sizing.

pridepy/tests/test_download_resilience.py[218-255]
pridepy/files/files.py[624-648]
pridepy/files/files.py[1256-1280]
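
The evidence above can be reproduced in isolation. The sketch below is a minimal stand-in, not the pridepy code: `run_serial` and `run_parallel` are hypothetical functions that mimic the two branches as the review describes them (one helper call per item, with a `position` keyword only on the pooled path). With a single item, `assert_called_once()` passes on both, even when the pool is deliberately oversized.

```python
from concurrent.futures import ThreadPoolExecutor
from unittest.mock import Mock


def run_serial(items, helper):
    # Serial branch: one plain call per item.
    for item in items:
        helper(item)


def run_parallel(items, helper, max_workers):
    # Parallel branch: one submitted task per item, tagged with a position.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(helper, item, position=i)
            for i, item in enumerate(items)
        ]
        for future in futures:
            future.result()  # surface any exception raised in a worker


serial_helper = Mock()
run_serial(["only.raw"], serial_helper)
serial_helper.assert_called_once()  # passes

parallel_helper = Mock()
run_parallel(["only.raw"], parallel_helper, max_workers=3)  # oversized pool
parallel_helper.assert_called_once()  # also passes

# The call arguments, not the call count, are what differ between the paths.
print(serial_helper.call_args.kwargs)
print(parallel_helper.call_args.kwargs)
```

Only the keyword arguments distinguish the two paths here, which is why the review recommends asserting on them or on the executor itself.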

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The added tests intended to validate parallel worker capping (`min(parallel_files, 3, len(...))`) only assert the worker function is called once. With a single file/URL, both the serial path and the ThreadPoolExecutor path call the worker exactly once, so these tests don’t validate the new behavior.

### Issue Context
- `download_files_from_globus` calls `_globus_download_one` once per file in both serial and parallel modes.
- `download_files_by_url` calls `_download_single_url` once per URL in both serial and parallel modes.

### Fix Focus Areas
- pridepy/tests/test_download_resilience.py[218-255]

### Suggested test changes
- Strengthen assertions to distinguish serial vs parallel path, e.g.:
  - Assert the mock was called **without** the `position` kwarg (serial path) using `assert_called_once_with(...)`.
  - Or patch `pridepy.files.files.ThreadPoolExecutor` and assert it is **not called** for the single-item case.
- Add a test case with **2 items** and `parallel_files=3`, patch `ThreadPoolExecutor`, and assert it is constructed with `max_workers=2` (verifies actual capping rather than control-flow side effects).
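
One possible shape for the strengthened tests is sketched below. It is self-contained, so it uses a hypothetical `FakeFiles` class with an `executor_cls` attribute as the patch point; none of these names come from pridepy. Against the real module, the patch target would presumably be `pridepy.files.files.ThreadPoolExecutor`, as suggested above, assuming the module imports that name directly.

```python
from concurrent.futures import ThreadPoolExecutor
from unittest.mock import patch

HARD_MAX_WORKERS = 3  # hardcoded maximum quoted in the review


class FakeFiles:
    """Hypothetical stand-in for Files; not the pridepy implementation."""

    executor_cls = ThreadPoolExecutor  # seam that the tests patch

    @classmethod
    def download_all(cls, items, helper, parallel_files=1):
        workers = min(parallel_files, HARD_MAX_WORKERS, len(items))
        if workers < 2:
            # Serial path: no pool is created and no position kwarg is passed.
            for item in items:
                helper(item)
            return
        with cls.executor_cls(max_workers=workers) as pool:
            futures = [
                pool.submit(helper, item, position=i)
                for i, item in enumerate(items)
            ]
            for future in futures:
                future.result()


def test_single_item_skips_the_pool():
    seen = []
    # wraps= keeps the real executor behaviour while recording constructor calls.
    with patch.object(FakeFiles, "executor_cls", wraps=ThreadPoolExecutor) as pool_cls:
        FakeFiles.download_all(
            ["a.raw"], lambda item, **kw: seen.append((item, kw)), parallel_files=3
        )
    pool_cls.assert_not_called()
    assert seen == [("a.raw", {})]


def test_two_items_cap_pool_at_two():
    seen = []
    with patch.object(FakeFiles, "executor_cls", wraps=ThreadPoolExecutor) as pool_cls:
        FakeFiles.download_all(
            ["a.raw", "b.raw"], lambda item, **kw: seen.append(item), parallel_files=3
        )
    pool_cls.assert_called_once_with(max_workers=2)
    assert sorted(seen) == ["a.raw", "b.raw"]
```

The two-item case is the one that exercises the cap itself: a regression that dropped `len(items)` from the `min(...)` call would construct the pool with `max_workers=3` and fail the assertion.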


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pridepy/tests/test_download_resilience.py`:
- Around line 218-255: The tests only assert the helper call count and miss
verifying the ThreadPoolExecutor sizing; update tests to either (A) patch
concurrent.futures.ThreadPoolExecutor and assert it was instantiated with
max_workers equal to min(parallel_files, number_of_files/urls) when invoking
Files.download_files_from_globus and Files.download_files_by_url (while still
patching Files._globus_download_one and Files._download_single_url to avoid real
downloads), or (B) extract the cap logic into a small helper (e.g.
Files._cap_parallel_workers) and write unit tests for that helper to assert it
returns the expected worker count for given inputs; reference
Files.download_files_from_globus, Files.download_files_by_url,
Files._globus_download_one, Files._download_single_url, and ThreadPoolExecutor
when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 231cbc95-1fdc-4a44-b0af-a08fb85a6993

📥 Commits

Reviewing files that changed from the base of the PR and between a99c73b and f0c491b.

📒 Files selected for processing (3)
  • pridepy/files/files.py
  • pridepy/tests/test_download_resilience.py
  • pyproject.toml

Comment on lines +218 to +255

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

These tests don't actually verify the cap.

Patching _globus_download_one / _download_single_url only proves the helper is called once; a regression that still creates an oversized ThreadPoolExecutor would pass. Consider asserting max_workers directly, or extracting the cap into a helper and testing that instead.


@ypriverol ypriverol merged commit cacf553 into master May 2, 2026
10 checks passed