Minor changes in parallelization by ypriverol · Pull Request #95 · PRIDE-Archive/pridepy

ypriverol · 2026-05-02T15:28:10Z

Summary by CodeRabbit

Bug Fixes
- Improved parallel download optimization to prevent allocating unnecessary workers when the number of files or URLs being downloaded is smaller than the available parallel capacity.
Tests
- Added test coverage for parallel worker allocation management during bulk downloads.
Chores
- Updated package version to 0.0.15.

Limit parallel workers to match file count and release version 0.0.15

coderabbitai · 2026-05-02T15:28:23Z

Walkthrough

The PR caps parallel worker pools in download functions to not exceed the actual number of items queued for processing, preventing oversized worker allocation when processing small batches. Version incremented to 0.0.15.

Changes

Parallel Worker Pool Capping

Layer / File(s)	Summary
Core Implementation `pridepy/files/files.py`	`download_files_from_globus` and `download_files_by_url` now cap `parallel_files` by both the hardcoded max (3) and the input size (`len(files_to_download)` / `len(urls)`) to prevent oversized worker pools when processing small batches.
Test Coverage `pridepy/tests/test_download_resilience.py`	Two new tests verify worker-pool capping: `test_globus_parallel_workers_capped_to_file_count` asserts the globus download helper is called once when `parallel_files` exceeds file count; `test_url_parallel_workers_capped_to_url_count` asserts the URL download helper is called once when `parallel_files` exceeds URL count.
Version Metadata `pyproject.toml`	Package version bumped from `0.0.14` to `0.0.15`.
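
The capping rule summarized in the table can be sketched as a small pure function. This is a minimal illustration, not the code in `pridepy/files/files.py`: the name `cap_parallel_workers`, the `hard_max` parameter, and the floor of 1 for an empty batch are all assumptions made for the example.

```python
def cap_parallel_workers(requested: int, n_items: int, hard_max: int = 3) -> int:
    """Return the worker count to use for n_items queued downloads.

    Hypothetical helper: takes the smallest of the requested worker count,
    the hard-coded maximum, and the number of items. The floor of 1 is an
    assumption so that an empty batch never yields a zero-sized pool.
    """
    return max(1, min(requested, hard_max, n_items))


if __name__ == "__main__":
    # (requested workers, queued items) pairs covering small and large batches.
    for requested, n_items in [(3, 1), (3, 2), (8, 10), (2, 5)]:
        print(requested, n_items, "->", cap_parallel_workers(requested, n_items))
```

With a rule like this, a single queued file yields one worker regardless of the requested value, which is the small-batch case the PR targets.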

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

chakrabandla

Poem

🐰 Parallel tasks now know their place,
No worker spawned beyond the race,
Three's the max, but less will do,
When batches small need just a few!
Efficiency hops on through.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Minor changes in parallelization' accurately reflects the main change: capping parallel workers by file/URL count in addition to the constant 3, which directly addresses parallelization logic.
Docstring Coverage	✅ Passed	Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


qodo-code-review · 2026-05-02T15:28:30Z

Review Summary by Qodo

Cap parallel workers to file count and release 0.0.15

🐞 Bug fix 🧪 Tests

Walkthroughs

Description

• Cap parallel workers to actual file/URL count to prevent thread pool oversizing
• Add tests verifying worker pool capping behavior for Globus and URL downloads
• Bump version to 0.0.15

Diagram

flowchart LR
  A["parallel_files parameter"] -->|"min with file count"| B["Capped worker pool"]
  B --> C["Globus download"]
  B --> D["URL download"]
  C --> E["Serial or parallel execution"]
  D --> E

File Changes

1. pridepy/files/files.py 🐞 Bug fix +2/-2

Cap parallel workers to file/URL count

• Cap parallel_files to min(parallel_files, 3, file count) in download_files_from_globus()
• Cap parallel_files to min(parallel_files, 3, URL count) in download_files_by_url()
• Prevents thread pool allocation exceeding available work items

pridepy/files/files.py

2. pridepy/tests/test_download_resilience.py 🧪 Tests +39/-0

Add tests for parallel worker capping

• Add test_globus_parallel_workers_capped_to_file_count() to verify Globus download worker capping
• Add test_url_parallel_workers_capped_to_url_count() to verify URL download worker capping
• Both tests verify that requesting more workers than available items falls back to serial execution

pridepy/tests/test_download_resilience.py

3. pyproject.toml ⚙️ Configuration changes +1/-1

Release version 0.0.15

• Bump version from 0.0.14 to 0.0.15

pyproject.toml

qodo-code-review · 2026-05-02T15:28:32Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)

1. Tests don't verify capping 🐞 Bug ⚙ Maintainability

Description

The new tests only assert the per-item download helper is called once, which is true in both the
serial loop and the ThreadPoolExecutor path when there is only one file/URL. This means the added
tests don’t actually validate that parallel worker capping (or the serial-path fallback) is
happening.

Code

pridepy/tests/test_download_resilience.py[R218-255]

+    def test_globus_parallel_workers_capped_to_file_count(self):
+        """When parallel_files exceeds the number of files to download,
+        the worker pool must not allocate more threads than files."""
+        file_records = [
+            {
+                "fileName": "only.raw",
+                "publicFileLocations": [
+                    {"name": "FTP Protocol",
+                     "value": "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2024/01/PXD000001/only.raw"}
+                ],
+            }
+        ]
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            with patch.object(Files, "_globus_download_one") as mock_one:
+                Files.download_files_from_globus(
+                    file_list_json=file_records,
+                    output_folder=tmp_dir,
+                    skip_if_downloaded_already=False,
+                    parallel_files=3,
+                )
+                # With 1 file and parallel_files=3, should fall through to
+                # the serial path (parallel_files capped to 1 < 2).
+                mock_one.assert_called_once()
+
+    def test_url_parallel_workers_capped_to_url_count(self):
+        """download_files_by_url must cap workers to len(urls)."""
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            with patch.object(Files, "_download_single_url") as mock_single:
+                Files.download_files_by_url(
+                    urls=["https://example.org/a.raw"],
+                    output_folder=tmp_dir,
+                    skip_if_downloaded_already=False,
+                    protocol="globus",
+                    parallel_files=3,
+                )
+                # 1 URL with parallel_files=3 → capped to 1, serial path.
+                mock_single.assert_called_once()

Evidence

In both download functions, the serial branch invokes the helper once per item and the parallel
branch submits exactly one task per item; with a single element, both branches result in exactly one
helper call, so assert_called_once() cannot distinguish whether capping changed control flow or
executor sizing.

pridepy/tests/test_download_resilience.py[218-255]
pridepy/files/files.py[624-648]
pridepy/files/files.py[1256-1280]
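
The blind spot described in the evidence can be reproduced in isolation. The sketch below uses a hypothetical stand-in dispatcher, `run_downloads`, rather than the real pridepy functions. It assumes only what the review states: a serial loop below two workers and one submitted task per item otherwise.

```python
from concurrent.futures import ThreadPoolExecutor
from unittest.mock import MagicMock


def run_downloads(items, worker, parallel_files):
    """Hypothetical dispatcher: serial below 2 workers, pooled otherwise."""
    if parallel_files < 2:
        for item in items:
            worker(item)
        return "serial"
    with ThreadPoolExecutor(max_workers=parallel_files) as pool:
        # One task per item, regardless of how large the pool is.
        list(pool.map(worker, items))
    return "parallel"


def check_single_item(parallel_files):
    """Run one item through the dispatcher and apply the weak assertion."""
    worker = MagicMock()
    path = run_downloads(["only.raw"], worker, parallel_files)
    worker.assert_called_once()  # passes on either path
    return path


capped_path = check_single_item(parallel_files=1)    # as if the cap were applied
uncapped_path = check_single_item(parallel_files=3)  # as if the cap were missing
```

Both calls satisfy `assert_called_once()`, yet they take different branches, so the assertion alone cannot detect a missing cap.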

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The added tests intended to validate parallel worker capping (`min(parallel_files, 3, len(...))`) only assert the worker function is called once. With a single file/URL, both the serial path and the ThreadPoolExecutor path call the worker exactly once, so these tests don’t validate the new behavior.

### Issue Context
- `download_files_from_globus` calls `_globus_download_one` once per file in both serial and parallel modes.
- `download_files_by_url` calls `_download_single_url` once per URL in both serial and parallel modes.

### Fix Focus Areas
- pridepy/tests/test_download_resilience.py[218-255]

### Suggested test changes
- Strengthen assertions to distinguish serial vs parallel path, e.g.:
 - Assert the mock was called **without** the `position` kwarg (serial path) using `assert_called_once_with(...)`.
 - Or patch `pridepy.files.files.ThreadPoolExecutor` and assert it is **not called** for the single-item case.
- Add a test case with **2 items** and `parallel_files=3`, patch `ThreadPoolExecutor`, and assert it is constructed with `max_workers=2` (verifies actual capping rather than control-flow side effects).
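
One way the stronger assertions could look is sketched below. It is self-contained and does not import pridepy: `download_all` is a hypothetical stand-in, and it looks the executor up through the `concurrent.futures` module so that patching `concurrent.futures.ThreadPoolExecutor` takes effect. For the real code the suggestion above names `pridepy.files.files.ThreadPoolExecutor` as the patch target.

```python
import concurrent.futures
from unittest.mock import MagicMock, patch

# Keep a reference to the real class so the spy can delegate to it.
REAL_EXECUTOR = concurrent.futures.ThreadPoolExecutor


def download_all(items, worker, parallel_files):
    """Hypothetical stand-in for the download functions under review."""
    parallel_files = max(1, min(parallel_files, 3, len(items)))
    if parallel_files < 2:
        for item in items:
            worker(item)
        return
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallel_files) as pool:
        list(pool.map(worker, items))


def pool_sizes_for(items, parallel_files):
    """Return the max_workers value of every pool the call constructs."""
    # wraps= records each construction while still building a real executor.
    spy = MagicMock(wraps=REAL_EXECUTOR)
    with patch("concurrent.futures.ThreadPoolExecutor", spy):
        download_all(items, MagicMock(), parallel_files)
    return [call.kwargs["max_workers"] for call in spy.call_args_list]


two_items = pool_sizes_for(["a.raw", "b.raw"], parallel_files=3)
one_item = pool_sizes_for(["a.raw"], parallel_files=3)
```

Here `two_items` records a single pool sized to the two queued items, and `one_item` is empty because no pool is built on the serial path. Asserting on those values distinguishes a capped implementation from an uncapped one, which `assert_called_once()` on the helper cannot do.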


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pridepy/tests/test_download_resilience.py`:
- Around line 218-255: The tests only assert the helper call count and miss
verifying the ThreadPoolExecutor sizing; update tests to either (A) patch
concurrent.futures.ThreadPoolExecutor and assert it was instantiated with
max_workers equal to min(parallel_files, number_of_files/urls) when invoking
Files.download_files_from_globus and Files.download_files_by_url (while still
patching Files._globus_download_one and Files._download_single_url to avoid real
downloads), or (B) extract the cap logic into a small helper (e.g.
Files._cap_parallel_workers) and write unit tests for that helper to assert it
returns the expected worker count for given inputs; reference
Files.download_files_from_globus, Files.download_files_by_url,
Files._globus_download_one, Files._download_single_url, and ThreadPoolExecutor
when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 231cbc95-1fdc-4a44-b0af-a08fb85a6993

📥 Commits

Reviewing files that changed from the base of the PR and between a99c73b and f0c491b.

📒 Files selected for processing (3)

pridepy/files/files.py
pridepy/tests/test_download_resilience.py
pyproject.toml

coderabbitai · 2026-05-02T15:31:10Z


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

These tests don't actually verify the cap.

Patching _globus_download_one / _download_single_url only proves the helper is called once; a regression that still creates an oversized ThreadPoolExecutor would pass. Consider asserting max_workers directly, or extracting the cap into a helper and testing that instead.


Shen-YuFei and others added 6 commits May 2, 2026 20:48

fix(globus): cap parallel workers to actual file count

bbe06f0

chore(release): bump version to 0.0.15

135e74e

Merge branch 'PRIDE-Archive:dev' into dev

fcd49f8

fix(download): cap parallel workers to file/url count with tests

6daaa66

Merge branch 'dev' of https://github.com/Shen-YuFei/pridepy into dev

2b033ca

Merge pull request #94 from Shen-YuFei/dev

f0c491b

Limit parallel workers to match file count and release version 0.0.15

coderabbitai Bot reviewed May 2, 2026

ypriverol merged commit cacf553 into master May 2, 2026
10 checks passed

ypriverol merged 6 commits into master from dev
