Improving parallelism by ypriverol · Pull Request #93 · PRIDE-Archive/pridepy

ypriverol · 2026-05-02T09:29:50Z

This pull request introduces two major new CLI commands for downloading specific files from PRIDE projects by filename or by URL, along with supporting helper functions and comprehensive unit tests. It also adds parallel download options for several commands and improves documentation to cover these new features.

New CLI commands and features:

Added download-files-by-list command to download a specified subset of files from a PRIDE project using a manifest file or comma-separated list. Includes support for protocol selection, parallel downloads (Globus), and checksum validation.
Added download-files-by-url command to download files directly from a list of raw URLs, supporting protocol selection, parallel downloads, and checksum validation for PRIDE URLs.

Improvements to existing commands:

Added --parallel-files option (with range 1–3) to download-all-public-raw-files and download-all-public-category-files for parallel downloads when using Globus.

Helper functions for manifest and argument parsing:

Introduced _parse_text_manifest, _read_filename_arguments, and _read_url_arguments to handle manifest parsing, deduplication, and validation for both filenames and URLs in CLI commands.
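
A rough sketch of the manifest handling described above, for readers who want to see the parsing rules in one place. The names `parse_manifest_text` and `merge_names` are hypothetical stand-ins, not the PR's `_parse_text_manifest` or `_read_filename_arguments`. The rules shown (skip blank lines and `#` comments, take the first tab-delimited column, deduplicate in first-seen order) are taken from the review summaries on this page; putting the comma-separated values ahead of the manifest entries is an assumption.

```python
from typing import Iterable, List, Optional


def parse_manifest_text(text: str) -> List[str]:
    """Return the first tab-delimited column of each meaningful line."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        entries.append(line.split("\t")[0].strip())
    return entries


def merge_names(csv_value: Optional[str], manifest_entries: Iterable[str]) -> List[str]:
    """Combine a comma-separated value with manifest entries, dropping duplicates."""
    combined = []
    if csv_value:
        combined.extend(part.strip() for part in csv_value.split(",") if part.strip())
    combined.extend(manifest_entries)
    # dict.fromkeys keeps the first occurrence and preserves order.
    return list(dict.fromkeys(combined))
```

Deduplicating after the merge, rather than per source, means a name given both on the command line and in the manifest is requested only once.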

Documentation updates:

Updated README.md to document the new commands, their usage, options, and example invocations.

Testing:

Added comprehensive unit tests for both download-files-by-list and download-files-by-url commands, covering manifest parsing, error handling, and download dispatch logic.

Summary by CodeRabbit

Release Notes

New Features
- Added download-files-by-list command to download specific files using a manifest or filename list
- Added download-files-by-url command to download files directly from URLs
- Parallel download capability (-w/--parallel-files option) for faster transfers with the existing download commands
- Checksum validation option for verifying downloaded file integrity
- Support for resumable downloads
Documentation
- Updated CLI command overview with new commands and options


coderabbitai · 2026-05-02T09:30:03Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

The PR adds manifest-based and URL-based file downloading capabilities with parallel file support, resumable byte-range downloads, and checksum validation. Two new CLI commands (download-files-by-list and download-files-by-url) and corresponding public API methods enable targeted subset downloads with retry, resume, and cross-protocol fallback logic. Parallel execution is threaded through existing download paths via a parallel_files parameter.
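
To make the resume behaviour mentioned above concrete, here is a minimal sketch of the decision a resumable HTTP download has to make before sending a request. `plan_resume` and `open_mode_for` are hypothetical helpers, not functions from `pridepy/files/files.py`. The logic assumes the server advertises range support and that the total size is known from `Content-Length`.

```python
from typing import Dict, Optional, Tuple


def plan_resume(existing_size: int, total_size: Optional[int],
                accepts_ranges: bool) -> Tuple[str, Dict[str, str]]:
    """Decide how to continue a download given a partial local file."""
    if total_size is not None and existing_size >= total_size:
        # Size-based completeness check: says nothing about content integrity.
        return "skip", {}
    if existing_size > 0 and accepts_ranges:
        # Ask the server for the remaining bytes only.
        return "resume", {"Range": f"bytes={existing_size}-"}
    return "restart", {}


def open_mode_for(action: str, status_code: int) -> str:
    """Append only when the server actually honoured the range (HTTP 206)."""
    if action == "resume" and status_code == 206:
        return "ab"
    # A 200 response carries the whole body, so the file is rewritten.
    return "wb"
```

Checking for status 206 before appending matters because a server that ignores the `Range` header answers 200 with the full body, and appending that to a partial file would corrupt it.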

Changes

Docker Configuration

Layer / File(s)	Summary
Ignore Patterns `.gitignore`	Added `Dockerfile` and `.dockerignore` under a "# Docker" section to exclude Docker-related artifacts from version control.

Download Feature Implementation

Layer / File(s)	Summary
Public API Signatures `pridepy/files/files.py`	`download_all_raw_files`, `download_all_category_files`, and `download_files` methods now accept `parallel_files: int = 1`. Added new public methods `download_files_by_list` and `download_files_by_url` for manifest and URL-based downloads.
Core Download Resilience `pridepy/files/files.py`	Implemented `_download_range` with retry+backoff for resumable byte-range requests. Updated `_parallel_download` to detect server resume support, resume from partial files, and fall back on range rejection.
Globus Download Refactoring `pridepy/files/files.py`	Refactored Globus download path to accept optional `checksum_map`, pre-filter/skip already-valid files, queue corrupted ones for re-download, and execute with `ThreadPoolExecutor` respecting `parallel_files` cap.
Download Flow Integration `pridepy/files/files.py`	Updated `_batch_download_by_protocol` and `_download_with_fallback` to thread `parallel_files` and `checksum_map`. Modified `download_files` phase-2 validation to log per-file progress and remove files only on checksum mismatch.
List/URL Download Implementation `pridepy/files/files.py`	`download_files_by_list` resolves requested filenames via metadata API and delegates to `download_files`. `download_files_by_url` dispatches by scheme, resumes HTTP(S), aggregates failures, and optionally validates against PRIDE checksums.
CLI Commands & Helpers `pridepy/pridepy.py`	Added `--parallel-files/-w` option to `download_all_public_raw_files` and `download_all_public_category_files`. Introduced `download-files-by-list` and `download-files-by-url` CLI commands with manifest/CSV parsing helpers (`_parse_text_manifest`, `_read_filename_arguments`, `_read_url_arguments`).
Documentation `README.md`	Added Quick Start step 6 documenting `download-files-by-list` with `files.txt` parsing rules, option examples, and updated "Main commands" section to list both new commands.
List Download Tests `pridepy/tests/test_download_by_list.py`	Tests for `download_files_by_list`: validation on empty input, filename filtering and delegation, warnings on partial matches, and error on no matches. Tests for `_read_filename_arguments`: manifest/CSV parsing, blank line/comment skipping, tab-delimited column extraction, and deduplication.
URL Download Tests `pridepy/tests/test_download_by_url.py`	Tests for `download_files_by_url`: empty list validation, HTTP/FTP dispatch, unsupported scheme aggregation, skip-if-exists short-circuit. Tests for `_read_url_arguments`: single/multiple URL parsing, manifest reading, CSV/manifest merging with deduplication.
Resilience Test Updates `pridepy/tests/test_download_resilience.py`	Added `test_parallel_download_streams_full_file` to verify streaming with `content-length` and `accept-ranges` headers. Removed `test_parallel_download_falls_back_when_range_not_honored`. Updated stub signatures to accept `**kwargs` for flexibility.
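
The "retry+backoff" behaviour listed for `_download_range` could look roughly like the sketch below. This is not the PR's code: `retry_with_backoff` is a hypothetical helper, the exponential delay schedule is an assumption, and only the limit of three attempts comes from the review summaries on this page.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, for example by passing `delays.append` in a unit test.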

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI User
    participant DownloadByList as download_files_by_list
    participant MetadataAPI as Metadata API
    participant BatchDownloader as _batch_download_by_protocol<br/>(with parallel_files)
    participant Protocol as HTTP/FTP/Globus<br/>Downloader

    CLI->>DownloadByList: accession, file_names, options
    DownloadByList->>MetadataAPI: fetch project metadata
    MetadataAPI-->>DownloadByList: files list with URLs
    DownloadByList->>DownloadByList: filter & match requested names
    alt Files match found
        DownloadByList->>BatchDownloader: download_files(matched files,<br/>parallel_files)
        BatchDownloader->>Protocol: parallel file downloads<br/>with resume/checksum
        Protocol-->>BatchDownloader: success/failure
        BatchDownloader-->>DownloadByList: completion
    else No files match
        DownloadByList-->>CLI: ValueError
    end
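
The "filter & match requested names" step in the diagram above can be sketched as follows. `match_requested_files` is a hypothetical name. The `fileName` key is the one used in the diff hunks quoted on this page, and the error conditions mirror the described behaviour (error on empty input, warning on partial matches, error on no matches) without claiming to reproduce the real messages.

```python
from typing import Dict, Iterable, List, Tuple


def match_requested_files(records: Iterable[Dict[str, str]],
                          requested: Iterable[str]) -> Tuple[List[Dict[str, str]], List[str]]:
    """Split requested names into matched metadata records and missing names."""
    wanted = list(dict.fromkeys(requested))  # deduplicate, keep order
    if not wanted:
        raise ValueError("No file names were provided")
    by_name = {record["fileName"]: record for record in records}
    matched = [by_name[name] for name in wanted if name in by_name]
    missing = [name for name in wanted if name not in by_name]
    if not matched:
        raise ValueError("None of the requested files exist in the project")
    # The caller can log a warning for `missing` and download `matched`.
    return matched, missing
```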

sequenceDiagram
    participant CLI as CLI User
    participant DownloadByUrl as download_files_by_url
    participant URLDispatcher as URL Scheme<br/>Dispatcher
    participant HTTPDownloader as _parallel_download<br/>(HTTP/HTTPS)
    participant FTPDownloader as _ftp_download_url
    participant ErrorHandler as Error Aggregator

    CLI->>DownloadByUrl: urls[], options
    alt List not empty
        DownloadByUrl->>URLDispatcher: for each URL
        URLDispatcher->>HTTPDownloader: https:// or http://
        URLDispatcher->>FTPDownloader: ftp://
        URLDispatcher->>ErrorHandler: unknown scheme
        HTTPDownloader-->>URLDispatcher: success/resume/failure
        FTPDownloader-->>URLDispatcher: success/failure
        ErrorHandler-->>URLDispatcher: track failure
        DownloadByUrl->>DownloadByUrl: optional PRIDE checksum<br/>validation
    else List is empty
        DownloadByUrl-->>CLI: ValueError
    end
    alt Any failures occurred
        DownloadByUrl-->>CLI: RuntimeError with<br/>aggregated failures
    else All succeeded
        DownloadByUrl-->>CLI: success
    end
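
The dispatch-and-aggregate flow in the diagram above might be approximated like this. `download_urls` and the `handlers` mapping are illustrative assumptions. The actual method delegates to `_parallel_download` and `_ftp_download_url`, which are not reproduced here.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlparse


def download_urls(urls: Iterable[str],
                  handlers: Dict[str, Callable[[str], None]]) -> None:
    """Dispatch each URL by scheme and report every failure at the end."""
    url_list = list(urls)
    if not url_list:
        raise ValueError("No URLs were provided")
    failures: List[str] = []
    for url in url_list:
        scheme = urlparse(url).scheme.lower()
        handler = handlers.get(scheme)
        if handler is None:
            failures.append(f"{url} (unsupported scheme: {scheme or 'none'})")
            continue
        try:
            handler(url)
        except Exception as exc:  # keep going so one bad URL does not stop the rest
            failures.append(f"{url} ({exc})")
    if failures:
        raise RuntimeError("Failed downloads: " + "; ".join(failures))
```

Collecting failures and raising once at the end lets a long URL list make as much progress as possible before the caller sees the error.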

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

Dev #91: Modifies the same download resilience and parallel execution logic in pridepy/files/files.py, including byte-range retry and checkpoint/resume behavior.
Harden downloads and migrate packaging/docs to uv #86: Refactors download flow in pridepy/files/files.py, updating download_files, checksum validation, and protocol fallback paths.
move to poetry #61: Adds CLI options and parameters to the same download_all_public_raw_files and download_all_public_category_files commands in pridepy/pridepy.py.

Suggested Labels

enhancement, feature, download-capability, parallel-execution, tests

Suggested Reviewers

chakrabandla
selvaebi

🐰 A rabbit downloads files with glee,
By list or URL, fast and free,
With checksums checked and ranges resumed,
No broken transfers to be groomed.
Parallel threads hop and spin,
Loop and fallback—all downloads win! 🎯

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 49.18% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The PR title 'Improving parallelism' is vague and generic, using a broad term that doesn't clearly convey the specific changes made.	Consider a more specific title such as 'Add parallel downloads and new CLI commands for file subset downloads' to better reflect the main changes.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


qodo-code-review · 2026-05-02T09:30:22Z

Review Summary by Qodo

Add flexible file download commands and parallel globus support

✨ Enhancement

Walkthroughs

Description

• Add two new CLI commands for flexible file downloads: download-files-by-list (subset by
  filename) and download-files-by-url (raw URLs)
• Implement parallel file download support via parallel_files parameter (1-3 workers) for globus
  protocol
• Refactor HTTP download logic from multi-connection parallel ranges to single-connection streaming
  with resume capability
• Add comprehensive checksum validation with protocol fallback and improved error handling

Diagram

flowchart LR
  A["New CLI Commands"] --> B["download-files-by-list"]
  A --> C["download-files-by-url"]
  B --> D["Files.download_files_by_list"]
  C --> E["Files.download_files_by_url"]
  D --> F["Existing batch + fallback engine"]
  E --> G["URL scheme dispatcher"]
  G --> H["HTTP/HTTPS"]
  G --> I["FTP"]
  H --> J["_parallel_download with resume"]
  I --> K["_ftp_download_url"]
  L["parallel_files parameter"] --> M["ThreadPoolExecutor workers"]
  M --> N["Globus downloads"]

File Changes

1. pridepy/files/files.py ✨ Enhancement +436/-88

Refactor parallel downloads and add flexible file download methods

• Refactored _parallel_download from multi-connection range requests to single-connection
 streaming with resume support and progress tracking
• Enhanced _download_range with retry logic (max 3 attempts) and improved timeout handling
• Added _globus_download_one worker method for parallel globus downloads with per-file retry logic
• Refactored download_files_from_globus to support parallel downloads via ThreadPoolExecutor with
 pre-filtering and checksum validation
• Added download_files_by_list method to download named file subsets from a project
• Added download_files_by_url static method with URL scheme dispatching (http/https/ftp)
• Added helper methods: _extract_pride_accession, _download_single_url, _dispatch_url_scheme,
 _http_download_url, _ftp_download_url, _validate_urls_checksums
• Updated download_files, _batch_download_by_protocol, _download_with_fallback,
 download_all_raw_files, download_all_category_files to accept and propagate parallel_files
 parameter
• Fixed os.mkdir to os.makedirs for proper directory creation
• Improved validation logic to distinguish between checksum mismatches and missing files

pridepy/files/files.py
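
A minimal sketch of running per-file downloads through a bounded `ThreadPoolExecutor`, as the summary above describes for Globus. `run_parallel` is a hypothetical helper. Clamping to 1-3 workers reflects the documented option range, and the result shape (item mapped to an error string or `None`) is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def run_parallel(items: Iterable[str], worker: Callable[[str], None],
                 parallel_files: int = 1) -> Dict[str, Optional[str]]:
    """Run worker on each item with a bounded pool; map item -> error text or None."""
    max_workers = max(1, min(parallel_files, 3))  # clamp to the documented 1-3 range
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            error = future.exception()  # None when the worker finished cleanly
            results[item] = None if error is None else str(error)
    return results
```

Reading `future.exception()` instead of calling `future.result()` keeps one failed file from aborting the loop over the remaining futures.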

2. pridepy/pridepy.py ✨ Enhancement +239/-3

Add new download commands and parallel file support to CLI

• Added -w/--parallel-files option (IntRange 1-3) to download_all_public_raw_files and
 download_all_public_category_files commands
• Added new download-files-by-list CLI command with manifest and CSV filename input options
• Added new download-files-by-url CLI command with manifest and CSV URL input options
• Implemented _parse_text_manifest helper to read manifests with comment/blank line skipping
• Implemented _read_filename_arguments helper to build deduplicated filename lists from manifest
 and CSV
• Implemented _read_url_arguments helper to build deduplicated URL lists from manifest and CSV
• Updated help text for --checksum-check to clarify validation behavior

pridepy/pridepy.py
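
For illustration, a bounded `-w/--parallel-files` option can be expressed with the standard library alone. The summary above mentions `IntRange`, so the project's CLI presumably relies on its framework's built-in range type; this sketch deliberately uses `argparse` instead so that it runs without third-party packages. `parallel_files_type` and `build_parser` are hypothetical names.

```python
import argparse


def parallel_files_type(value: str) -> int:
    """Parse an integer and reject values outside 1-3."""
    number = int(value)
    if not 1 <= number <= 3:
        raise argparse.ArgumentTypeError("parallel files must be between 1 and 3")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser exposing only the bounded worker-count option."""
    parser = argparse.ArgumentParser(prog="download-sketch")
    parser.add_argument("-w", "--parallel-files", type=parallel_files_type, default=1,
                        help="number of files to download at once (1-3)")
    return parser
```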

3. pridepy/tests/test_download_by_list.py 🧪 Tests +125/-0

Add unit tests for download-files-by-list functionality

• New test file covering Files.download_files_by_list method behavior
• Tests for empty list validation, metadata filtering, partial matches, and no-match error cases
• Tests for _read_filename_arguments CLI helper including CSV parsing, manifest parsing with
 comments/blanks, and deduplication

pridepy/tests/test_download_by_list.py


4. pridepy/tests/test_download_by_url.py 🧪 Tests +152/-0

Add unit tests for download-files-by-url functionality

• New test file covering Files.download_files_by_url method and URL dispatching
• Tests for empty list validation, HTTP/FTP scheme dispatching, unsupported schemes, failure
 aggregation, and skip-if-exists logic
• Tests for _read_url_arguments CLI helper including CSV parsing, manifest parsing, and
 deduplication

pridepy/tests/test_download_by_url.py

5. pridepy/tests/test_download_resilience.py 🧪 Tests +10/-20

Update tests for refactored parallel download logic

• Updated test_parallel_download_falls_back_when_range_not_honored to reflect new
 single-connection streaming behavior (renamed to test_parallel_download_streams_full_file)
• Removed num_connections parameter from _parallel_download calls in tests
• Updated mock setup to reflect simplified streaming approach instead of multi-connection ranges
• Updated _batch_download_by_protocol mock signatures to accept **kwargs for new parameters

pridepy/tests/test_download_resilience.py

6. README.md 📝 Documentation +45/-0

Document new download commands and options

• Added section 6 documenting download-files-by-list command with manifest format and usage
 examples
• Added section 7 documenting download-files-by-url command with URL manifest format and usage
 examples
• Documented useful options for both new commands including -p globus, -w parallel workers, and
 --checksum-check
• Updated CLI Command Overview to list new commands

README.md

qodo-code-review · 2026-05-02T09:30:23Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)

1. Checksum check can be no-op 🐞 Bug ≡ Correctness

Description

In _validate_urls_checksums, if a downloaded filename is not present in the PRIDE checksum TSV,
expected becomes None and validate_download() performs no checksum comparison but the code
still logs "Checksum OK". This can make --checksum-check claim validation succeeded even when no
checksum verification occurred for that file.

Code

pridepy/files/files.py[R1327-1336]

+                file_name = os.path.basename(urlparse(url).path)
+                target = os.path.join(output_folder, file_name)
+                expected = checksum_map.get(file_name)
+                logging.info("Validating %s", file_name)
+                valid, reason = Files.validate_download(target, expected)
+                if not valid:
+                    logging.error("Validation failed for %s: %s", file_name, reason)
+                    validation_failures.append(f"{file_name} ({reason})")
+                else:
+                    logging.info("Checksum OK: %s", file_name)

Evidence

validate_download() only compares checksums when expected_checksum is provided; otherwise it
returns success based solely on existence/non-empty. _validate_urls_checksums() derives expected
via checksum_map.get(file_name) and logs "Checksum OK" on success, so missing checksum entries are
indistinguishable from verified files—contradicting the CLI/README promise that --checksum-check
validates against PRIDE checksums.

pridepy/files/files.py[133-148]
pridepy/files/files.py[1318-1337]
README.md[147-150]
pridepy/pridepy.py[631-636]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`Files._validate_urls_checksums()` calls `validate_download(target, expected)` where `expected` comes from `checksum_map.get(file_name)`. When the checksum TSV does not contain the downloaded filename, `expected` is `None`, so no checksum is checked, yet the code logs `"Checksum OK"` and does not treat it as a validation failure.

### Issue Context
The CLI/README advertise `--checksum-check` as validating downloads against PRIDE checksums for PRIDE archive URLs.

### Fix Focus Areas
- pridepy/files/files.py[1318-1342]
- pridepy/files/files.py[133-148]

### Implementation guidance
- If `expected is None` for a URL being validated, treat it as a validation failure (or at minimum log a warning and include it in `validation_failures`).
- Consider updating log messages so success is only logged when a checksum was actually compared (e.g., `"Checksum OK"` only when `expected` is present).
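
One possible shape for the fix suggested above, shown as a standalone function rather than a patch to `Files.validate_download`. The name `validate_with_required_checksum` is hypothetical, and SHA-1 is an assumption about the checksum algorithm; the point is only that a missing expected value is reported as a failure instead of "Checksum OK".

```python
import hashlib
import os
from typing import Optional, Tuple


def validate_with_required_checksum(path: str, expected: Optional[str]) -> Tuple[bool, str]:
    """Report success only when a checksum was actually compared."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False, "file missing or empty"
    if expected is None:
        # No entry in the checksum map: do not claim the file was verified.
        return False, "no published checksum to compare against"
    digest = hashlib.sha1()  # assumed algorithm
    with open(path, "rb") as handle:
        # Hash in 1 MiB chunks so large raw files are not loaded into memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    if digest.hexdigest().lower() != expected.strip().lower():
        return False, "checksum mismatch"
    return True, "checksum verified"
```

Whether a missing checksum should be a hard failure or only a warning is a policy choice for the maintainers; the sketch takes the stricter option named in the guidance.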


2. Globus corruption re-download bypass 🐞 Bug ≡ Correctness

Description

download_files_from_globus() queues checksum-mismatched existing files for re-download but does
not delete them first. _parallel_download() then returns early when `os.path.getsize(file_path) >=
Content-Length`, so a corrupted file with the correct size can be treated as already complete and
never overwritten by the intended re-download.

Code

pridepy/files/files.py[R603-610]

+            if skip_if_downloaded_already and os.path.exists(new_file_path):
+                expected_cs = checksum_map.get(file.get("fileName", ""))
+                if expected_cs:
+                    valid, reason = Files.validate_download(new_file_path, expected_cs)
+                    if not valid:
+                        logging.warning(f"Corrupted file detected ({reason}), will re-download: {new_file_path}")
+                        files_to_download.append(file)
+                        continue

Evidence

When a checksum mismatch is detected, the file is added to files_to_download without being
removed. The globus streamer’s resume logic short-circuits purely on size (`resume_size >=
total_size`) and does not incorporate checksum state. Since checksum mismatches do not imply a size
mismatch, the re-download path can be skipped while the docstring claims corrupted files are
re-downloaded.

pridepy/files/files.py[580-610]
pridepy/files/files.py[530-535]
pridepy/files/files.py[133-148]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`download_files_from_globus()` detects checksum mismatch and schedules the file for re-download, but does not remove the existing corrupted file. `_parallel_download()` can then consider the file complete (size-based check) and return without overwriting it.

### Issue Context
The `download_files_from_globus()` docstring states corrupted files are re-downloaded when `checksum_map` is provided.

### Fix Focus Areas
- pridepy/files/files.py[600-614]
- pridepy/files/files.py[530-535]

### Implementation guidance
- When `validate_download(new_file_path, expected_cs)` returns invalid, remove `new_file_path` before appending the file record to `files_to_download`.
- Alternatively/additionally: extend `_parallel_download()` with a `force_redownload` flag to bypass the `resume_size >= total_size` early return when the caller knows the file is corrupt.
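
The first option in the guidance above (delete the corrupt file before queuing it) could be sketched like this. `select_files_to_download` and the `is_valid` callback are hypothetical. In the PR the equivalent logic lives inside `download_files_from_globus()` and uses `Files.validate_download`.

```python
import os
from typing import Callable, Dict, Iterable, List


def select_files_to_download(paths: Iterable[str], checksum_map: Dict[str, str],
                             is_valid: Callable[[str, str], bool]) -> List[str]:
    """Return paths needing download, deleting any existing file that fails validation."""
    pending = []
    for path in paths:
        if not os.path.exists(path):
            pending.append(path)
            continue
        expected = checksum_map.get(os.path.basename(path))
        if expected and not is_valid(path, expected):
            # Remove the corrupt copy so a size-based "already complete" check cannot skip it.
            os.remove(path)
            pending.append(path)
    return pending
```

Deleting first keeps the downloader's resume logic unchanged; a `force_redownload` flag, the second option, would avoid the deletion at the cost of another parameter to thread through.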


qodo-code-review · 2026-05-02T09:35:56Z

+                file_name = os.path.basename(urlparse(url).path)
+                target = os.path.join(output_folder, file_name)
+                expected = checksum_map.get(file_name)
+                logging.info("Validating %s", file_name)
+                valid, reason = Files.validate_download(target, expected)
+                if not valid:
+                    logging.error("Validation failed for %s: %s", file_name, reason)
+                    validation_failures.append(f"{file_name} ({reason})")
+                else:
+                    logging.info("Checksum OK: %s", file_name)


1. Checksum check can be no-op 🐞 Bug ≡ Correctness

In _validate_urls_checksums, if a downloaded filename is not present in the PRIDE checksum TSV, expected becomes None and validate_download() performs no checksum comparison but the code still logs "Checksum OK". This can make --checksum-check claim validation succeeded even when no checksum verification occurred for that file.


Shen-YuFei and others added 7 commits April 28, 2026 19:21

feat(cli): add download-files-by-list and download-files-by-url

d8020e9

feat(cli): add --urls CSV option to download-files-by-url

8a992d8

feat(download): add globus parallel-files and protocol routing

ce71662

feat(cli): add parallel-files, checksum-check and merge --url/--urls for download-by-url/list

b785390

Merge branch 'dev' of https://github.com/Shen-YuFei/pridepy into dev

7b7e3ba

fix(download): validate Range 206 response, fix os.mkdir, use IntRange, update docstrings

55cfd19

Merge pull request #92 from Shen-YuFei/dev

2a203bb

Enhance CLI with file download features and options

ypriverol merged commit a99c73b into master May 2, 2026
9 of 10 checks passed

qodo-code-review Bot reviewed May 2, 2026


coderabbitai Bot mentioned this pull request May 27, 2026

Improve MassIVE download. #99

Merged

ypriverol merged 7 commits into master from dev
