Improving parallelism #93

Merged: ypriverol merged 7 commits into master from dev on May 2, 2026

@ypriverol ypriverol commented May 2, 2026

This pull request introduces two major new CLI commands for downloading specific files from PRIDE projects by filename or by URL, along with supporting helper functions and comprehensive unit tests. It also adds parallel download options for several commands and improves documentation to cover these new features.

New CLI commands and features:

  • Added download-files-by-list command to download a specified subset of files from a PRIDE project using a manifest file or comma-separated list. Includes support for protocol selection, parallel downloads (Globus), and checksum validation. [1] [2] [3]
  • Added download-files-by-url command to download files directly from a list of raw URLs, supporting protocol selection, parallel downloads, and checksum validation for PRIDE URLs. [1] [2] [3]

Improvements to existing commands:

  • Added --parallel-files option (with range 1–3) to download-all-public-raw-files and download-all-public-category-files for parallel downloads when using Globus. [1] [2] [3] [4] [5] [6] [7] [8] [9]
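A minimal sketch of how a capped worker count could be applied to a batch of files. The function name `download_in_parallel` and the `download_one` callable are hypothetical and are not taken from pridepy; only the 1–3 range and the use of a thread pool come from this PR's description.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_in_parallel(file_names, download_one, parallel_files=1):
    """Run download_one(name) for each name with at most parallel_files workers.

    download_one is a hypothetical per-file downloader supplied by the caller.
    Returns (succeeded, failed), where failed maps a file name to its error text.
    """
    # Clamp to the 1-3 range that the CLI option accepts.
    workers = max(1, min(3, int(parallel_files)))
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download_one, name): name for name in file_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                succeeded.append(name)
            except Exception as exc:
                # Record the failure and keep going with the remaining files.
                failed[name] = str(exc)
    return sorted(succeeded), failed
```

Collecting failures per file, rather than raising on the first one, lets the rest of the batch finish; whether pridepy does the same is not stated here.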

Helper functions for manifest and argument parsing:

  • Introduced _parse_text_manifest, _read_filename_arguments, and _read_url_arguments to handle manifest parsing, deduplication, and validation for both filenames and URLs in CLI commands.
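The parsing rules described above (skip blank lines and comments, read a column from tab-delimited lines, merge with a comma-separated list, deduplicate) can be sketched as follows. The function names are illustrative stand-ins rather than the actual helpers, and two details are assumptions: that `#` marks a comment and that the first tab-delimited column holds the entry.

```python
def parse_text_manifest(lines):
    """Return entries from manifest lines, skipping blanks and '#' comments.

    For tab-delimited lines only the first column is kept (an assumption
    about the manifest layout, made for illustration).
    """
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        entries.append(line.split("\t")[0].strip())
    return entries


def read_arguments(csv_value=None, manifest_lines=None):
    """Merge a comma-separated value with manifest lines.

    Duplicates are dropped while the first-seen order is preserved.
    """
    merged = []
    if csv_value:
        merged.extend(part.strip() for part in csv_value.split(",") if part.strip())
    if manifest_lines:
        merged.extend(parse_text_manifest(manifest_lines))
    # dict.fromkeys keeps insertion order, so this is an order-preserving dedupe.
    return list(dict.fromkeys(merged))
```

The same pair of functions would serve for both filenames and URLs, since neither inspects the entry's content.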

Documentation updates:

  • Updated README.md to document the new commands, their usage, options, and example invocations. [1] [2]

Testing:

  • Added comprehensive unit tests for both download-files-by-list and download-files-by-url commands, covering manifest parsing, error handling, and download dispatch logic. [1] [2]

Summary by CodeRabbit

Release Notes

  • New Features

    • Added download-files-by-list command to download specific files using a manifest or filename list
    • Added download-files-by-url command to download files directly from URLs
    • Parallel downloads capability (-w/--parallel-files option) for faster transfers on existing download commands
    • Checksum validation option for verifying downloaded file integrity
    • Support for resumable downloads
  • Documentation

    • Updated CLI command overview with new commands and options

coderabbitai Bot commented May 2, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

The PR adds manifest-based and URL-based file downloading capabilities with parallel file support, resumable byte-range downloads, and checksum validation. Two new CLI commands (download-files-by-list and download-files-by-url) and corresponding public API methods enable targeted subset downloads with retry, resume, and cross-protocol fallback logic. Parallel execution is threaded through existing download paths via a parallel_files parameter.
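As a rough illustration of the resume and retry behaviour mentioned in the walkthrough, the sketch below separates the two decisions: what to request given a partial file, and how to retry a failing operation. `plan_resume` and `retry` are hypothetical names, and the exponential delay schedule is an assumption; only the ideas of byte-range resume, fallback when ranges are not supported, and a three-attempt retry come from this page.

```python
import time


def plan_resume(local_size, total_size, accepts_ranges):
    """Decide how to continue a partially downloaded file.

    Returns (action, headers) where action is 'skip', 'resume' or 'restart'.
    """
    if total_size is not None and local_size >= total_size > 0:
        # The local file is already at least as large as the remote one.
        return "skip", {}
    if local_size > 0 and accepts_ranges:
        # Open-ended byte range: request everything from the current offset.
        return "resume", {"Range": f"bytes={local_size}-"}
    # No partial file, or the server does not honour ranges: start over.
    return "restart", {}


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() up to `attempts` times, doubling the delay each time."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Note that the `skip` branch looks only at size, which is the shortcut the second Qodo finding further down this page warns about for corrupted files of the correct length.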

Changes

Docker Configuration

| Layer / File(s) | Summary |
|---|---|
| Ignore Patterns (.gitignore) | Added Dockerfile and .dockerignore under a "# Docker" section to exclude Docker-related artifacts from version control. |

Download Feature Implementation

| Layer / File(s) | Summary |
|---|---|
| Public API Signatures (pridepy/files/files.py) | download_all_raw_files, download_all_category_files, and download_files methods now accept parallel_files: int = 1. Added new public methods download_files_by_list and download_files_by_url for manifest and URL-based downloads. |
| Core Download Resilience (pridepy/files/files.py) | Implemented _download_range with retry+backoff for resumable byte-range requests. Updated _parallel_download to detect server resume support, resume from partial files, and fall back on range rejection. |
| Globus Download Refactoring (pridepy/files/files.py) | Refactored Globus download path to accept optional checksum_map, pre-filter/skip already-valid files, queue corrupted ones for re-download, and execute with ThreadPoolExecutor respecting parallel_files cap. |
| Download Flow Integration (pridepy/files/files.py) | Updated _batch_download_by_protocol and _download_with_fallback to thread parallel_files and checksum_map. Modified download_files phase-2 validation to log per-file progress and remove files only on checksum mismatch. |
| List/URL Download Implementation (pridepy/files/files.py) | download_files_by_list resolves requested filenames via metadata API and delegates to download_files. download_files_by_url dispatches by scheme, resumes HTTP(S), aggregates failures, and optionally validates against PRIDE checksums. |
| CLI Commands & Helpers (pridepy/pridepy.py) | Added --parallel-files/-w option to download_all_public_raw_files and download_all_public_category_files. Introduced download-files-by-list and download-files-by-url CLI commands with manifest/CSV parsing helpers (_parse_text_manifest, _read_filename_arguments, _read_url_arguments). |
| Documentation (README.md) | Added Quick Start step 6 documenting download-files-by-list with files.txt parsing rules, option examples, and updated "Main commands" section to list both new commands. |
| List Download Tests (pridepy/tests/test_download_by_list.py) | Tests for download_files_by_list: validation on empty input, filename filtering and delegation, warnings on partial matches, and error on no matches. Tests for _read_filename_arguments: manifest/CSV parsing, blank line/comment skipping, tab-delimited column extraction, and deduplication. |
| URL Download Tests (pridepy/tests/test_download_by_url.py) | Tests for download_files_by_url: empty list validation, HTTP/FTP dispatch, unsupported scheme aggregation, skip-if-exists short-circuit. Tests for _read_url_arguments: single/multiple URL parsing, manifest reading, CSV/manifest merging with deduplication. |
| Resilience Test Updates (pridepy/tests/test_download_resilience.py) | Added test_parallel_download_streams_full_file to verify streaming with content-length and accept-ranges headers. Removed test_parallel_download_falls_back_when_range_not_honored. Updated stub signatures to accept **kwargs for flexibility. |

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI User
    participant DownloadByList as download_files_by_list
    participant MetadataAPI as Metadata API
    participant BatchDownloader as _batch_download_by_protocol<br/>(with parallel_files)
    participant Protocol as HTTP/FTP/Globus<br/>Downloader

    CLI->>DownloadByList: accession, file_names, options
    DownloadByList->>MetadataAPI: fetch project metadata
    MetadataAPI-->>DownloadByList: files list with URLs
    DownloadByList->>DownloadByList: filter & match requested names
    alt Files match found
        DownloadByList->>BatchDownloader: download_files(matched files,<br/>parallel_files)
        BatchDownloader->>Protocol: parallel file downloads<br/>with resume/checksum
        Protocol-->>BatchDownloader: success/failure
        BatchDownloader-->>DownloadByList: completion
    else No files match
        DownloadByList-->>CLI: ValueError
    end
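The "filter & match requested names" step in the diagram above can be sketched as a small pure function. `select_requested_files` is a hypothetical name; the `fileName` key matches the field used in the diff excerpts quoted later on this page, while the error messages are invented for illustration.

```python
import logging


def select_requested_files(project_files, requested_names):
    """Return the metadata records whose 'fileName' was requested.

    project_files is a list of dicts such as a metadata API might return.
    Raises ValueError on empty input or when nothing matches, and logs a
    warning when only some of the requested names are found.
    """
    if not requested_names:
        raise ValueError("No file names were provided")
    wanted = set(requested_names)
    matched = [f for f in project_files if f.get("fileName") in wanted]
    if not matched:
        raise ValueError("None of the requested files exist in the project")
    missing = wanted - {f["fileName"] for f in matched}
    if missing:
        logging.warning("Files not found in project: %s", ", ".join(sorted(missing)))
    return matched
```

The matched records would then be handed to the existing batch downloader, as the diagram shows.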
sequenceDiagram
    participant CLI as CLI User
    participant DownloadByUrl as download_files_by_url
    participant URLDispatcher as URL Scheme<br/>Dispatcher
    participant HTTPDownloader as _parallel_download<br/>(HTTP/HTTPS)
    participant FTPDownloader as _ftp_download_url
    participant ErrorHandler as Error Aggregator

    CLI->>DownloadByUrl: urls[], options
    alt List not empty
        DownloadByUrl->>URLDispatcher: for each URL
        URLDispatcher->>HTTPDownloader: https:// or http://
        URLDispatcher->>FTPDownloader: ftp://
        URLDispatcher->>ErrorHandler: unknown scheme
        HTTPDownloader-->>URLDispatcher: success/resume/failure
        FTPDownloader-->>URLDispatcher: success/failure
        ErrorHandler-->>URLDispatcher: track failure
        DownloadByUrl->>DownloadByUrl: optional PRIDE checksum<br/>validation
    else List is empty
        DownloadByUrl-->>CLI: ValueError
    end
    alt Any failures occurred
        DownloadByUrl-->>CLI: RuntimeError with<br/>aggregated failures
    else All succeeded
        DownloadByUrl-->>CLI: success
    end
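A minimal sketch of the scheme dispatch and failure aggregation shown in the diagram above. The function name `download_urls` and the injected `http_download` / `ftp_download` callables are hypothetical; the ValueError on empty input and the single RuntimeError carrying all failures follow the diagram.

```python
from urllib.parse import urlparse


def download_urls(urls, http_download, ftp_download):
    """Dispatch each URL to a handler by scheme and aggregate failures.

    http_download and ftp_download are caller-supplied callables that take
    a URL and raise on failure.
    """
    if not urls:
        raise ValueError("No URLs were provided")
    handlers = {"http": http_download, "https": http_download, "ftp": ftp_download}
    failures = []
    for url in urls:
        scheme = urlparse(url).scheme.lower()
        handler = handlers.get(scheme)
        if handler is None:
            failures.append(f"{url} (unsupported scheme: {scheme or 'none'})")
            continue
        try:
            handler(url)
        except Exception as exc:
            # Keep going so one bad URL does not block the rest.
            failures.append(f"{url} ({exc})")
    if failures:
        raise RuntimeError("Failed downloads: " + "; ".join(failures))
```

Raising once at the end means the caller sees every failed URL in a single error instead of only the first.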

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

  • Dev #91: Modifies the same download resilience and parallel execution logic in pridepy/files/files.py, including byte-range retry and checkpoint/resume behavior.
  • Harden downloads and migrate packaging/docs to uv #86: Refactors download flow in pridepy/files/files.py, updating download_files, checksum validation, and protocol fallback paths.
  • move to poetry #61: Adds CLI options and parameters to the same download_all_public_raw_files and download_all_public_category_files commands in pridepy/pridepy.py.

Suggested Labels

enhancement, feature, download-capability, parallel-execution, tests

Suggested Reviewers

  • chakrabandla
  • selvaebi

🐰 A rabbit downloads files with glee,
By list or URL, fast and free,
With checksums checked and ranges resumed,
No broken transfers to be groomed.
Parallel threads hop and spin,
Loop and fallback—all downloads win! 🎯

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 49.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title 'Improving parallelism' is vague and generic, using a broad term that doesn't clearly convey the specific changes made. | Consider a more specific title such as 'Add parallel downloads and new CLI commands for file subset downloads' to better reflect the main changes. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@qodo-code-review

Review Summary by Qodo

Add flexible file download commands and parallel Globus support

✨ Enhancement

Walkthroughs

Description
• Add two new CLI commands for flexible file downloads: download-files-by-list (subset by
  filename) and download-files-by-url (raw URLs)
• Implement parallel file download support via the parallel_files parameter (1-3 workers) for the
  Globus protocol
• Refactor HTTP download logic from multi-connection parallel ranges to single-connection streaming
  with resume capability
• Add comprehensive checksum validation with protocol fallback and improved error handling
Diagram
flowchart LR
  A["New CLI Commands"] --> B["download-files-by-list"]
  A --> C["download-files-by-url"]
  B --> D["Files.download_files_by_list"]
  C --> E["Files.download_files_by_url"]
  D --> F["Existing batch + fallback engine"]
  E --> G["URL scheme dispatcher"]
  G --> H["HTTP/HTTPS"]
  G --> I["FTP"]
  H --> J["_parallel_download with resume"]
  I --> K["_ftp_download_url"]
  L["parallel_files parameter"] --> M["ThreadPoolExecutor workers"]
  M --> N["Globus downloads"]

File Changes

1. pridepy/files/files.py ✨ Enhancement +436/-88

Refactor parallel downloads and add flexible file download methods

• Refactored _parallel_download from multi-connection range requests to single-connection
 streaming with resume support and progress tracking
• Enhanced _download_range with retry logic (max 3 attempts) and improved timeout handling
• Added _globus_download_one worker method for parallel globus downloads with per-file retry logic
• Refactored download_files_from_globus to support parallel downloads via ThreadPoolExecutor with
 pre-filtering and checksum validation
• Added download_files_by_list method to download named file subsets from a project
• Added download_files_by_url static method with URL scheme dispatching (http/https/ftp)
• Added helper methods: _extract_pride_accession, _download_single_url, _dispatch_url_scheme,
 _http_download_url, _ftp_download_url, _validate_urls_checksums
• Updated download_files, _batch_download_by_protocol, _download_with_fallback,
 download_all_raw_files, download_all_category_files to accept and propagate parallel_files
 parameter
• Fixed os.mkdir to os.makedirs for proper directory creation
• Improved validation logic to distinguish between checksum mismatches and missing files


2. pridepy/pridepy.py ✨ Enhancement +239/-3

Add new download commands and parallel file support to CLI

• Added -w/--parallel-files option (IntRange 1-3) to download_all_public_raw_files and
 download_all_public_category_files commands
• Added new download-files-by-list CLI command with manifest and CSV filename input options
• Added new download-files-by-url CLI command with manifest and CSV URL input options
• Implemented _parse_text_manifest helper to read manifests with comment/blank line skipping
• Implemented _read_filename_arguments helper to build deduplicated filename lists from manifest
 and CSV
• Implemented _read_url_arguments helper to build deduplicated URL lists from manifest and CSV
• Updated help text for --checksum-check to clarify validation behavior


3. pridepy/tests/test_download_by_list.py 🧪 Tests +125/-0

Add unit tests for download-files-by-list functionality

• New test file covering Files.download_files_by_list method behavior
• Tests for empty list validation, metadata filtering, partial matches, and no-match error cases
• Tests for _read_filename_arguments CLI helper including CSV parsing, manifest parsing with
 comments/blanks, and deduplication

4. pridepy/tests/test_download_by_url.py 🧪 Tests +152/-0

Add unit tests for download-files-by-url functionality

• New test file covering Files.download_files_by_url method and URL dispatching
• Tests for empty list validation, HTTP/FTP scheme dispatching, unsupported schemes, failure
 aggregation, and skip-if-exists logic
• Tests for _read_url_arguments CLI helper including CSV parsing, manifest parsing, and
 deduplication


5. pridepy/tests/test_download_resilience.py 🧪 Tests +10/-20

Update tests for refactored parallel download logic

• Updated test_parallel_download_falls_back_when_range_not_honored to reflect new
 single-connection streaming behavior (renamed to test_parallel_download_streams_full_file)
• Removed num_connections parameter from _parallel_download calls in tests
• Updated mock setup to reflect simplified streaming approach instead of multi-connection ranges
• Updated _batch_download_by_protocol mock signatures to accept **kwargs for new parameters


6. README.md 📝 Documentation +45/-0

Document new download commands and options

• Added section 6 documenting download-files-by-list command with manifest format and usage
 examples
• Added section 7 documenting download-files-by-url command with URL manifest format and usage
 examples
• Documented useful options for both new commands including -p globus, -w parallel workers, and
 --checksum-check
• Updated CLI Command Overview to list new commands


qodo-code-review Bot commented May 2, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)


Action required

1. Checksum check can be no-op 🐞 Bug ≡ Correctness
Description
In _validate_urls_checksums, if a downloaded filename is not present in the PRIDE checksum TSV,
expected becomes None and validate_download() performs no checksum comparison but the code
still logs "Checksum OK". This can make --checksum-check claim validation succeeded even when no
checksum verification occurred for that file.
Code

pridepy/files/files.py[R1327-1336]

+                file_name = os.path.basename(urlparse(url).path)
+                target = os.path.join(output_folder, file_name)
+                expected = checksum_map.get(file_name)
+                logging.info("Validating %s", file_name)
+                valid, reason = Files.validate_download(target, expected)
+                if not valid:
+                    logging.error("Validation failed for %s: %s", file_name, reason)
+                    validation_failures.append(f"{file_name} ({reason})")
+                else:
+                    logging.info("Checksum OK: %s", file_name)
Evidence
validate_download() only compares checksums when expected_checksum is provided; otherwise it
returns success based solely on existence/non-empty. _validate_urls_checksums() derives expected
via checksum_map.get(file_name) and logs "Checksum OK" on success, so missing checksum entries are
indistinguishable from verified files—contradicting the CLI/README promise that --checksum-check
validates against PRIDE checksums.

pridepy/files/files.py[133-148]
pridepy/files/files.py[1318-1337]
README.md[147-150]
pridepy/pridepy.py[631-636]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`Files._validate_urls_checksums()` calls `validate_download(target, expected)` where `expected` comes from `checksum_map.get(file_name)`. When the checksum TSV does not contain the downloaded filename, `expected` is `None`, so no checksum is checked, yet the code logs `"Checksum OK"` and does not treat it as a validation failure.

### Issue Context
The CLI/README advertise `--checksum-check` as validating downloads against PRIDE checksums for PRIDE archive URLs.

### Fix Focus Areas
- pridepy/files/files.py[1318-1342]
- pridepy/files/files.py[133-148]

### Implementation guidance
- If `expected is None` for a URL being validated, treat it as a validation failure (or at minimum log a warning and include it in `validation_failures`).
- Consider updating log messages so success is only logged when a checksum was actually compared (e.g., `"Checksum OK"` only when `expected` is present).
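
One way the guidance above might look in code is sketched below: a file with no checksum entry is reported as a failure instead of being logged as verified. The function names are hypothetical, and SHA-1 is an assumption made for the example; this page does not state which hash the PRIDE checksum file uses.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_against_checksums(paths, checksum_map):
    """Return a list of failure strings; an empty list means all files verified.

    checksum_map maps a file name to its expected hex digest (SHA-1 assumed).
    """
    failures = []
    for path in paths:
        name = os.path.basename(path)
        expected = checksum_map.get(name)
        if expected is None:
            # No reference checksum: do not claim the file was verified.
            failures.append(f"{name} (no checksum available)")
            continue
        if not os.path.isfile(path):
            failures.append(f"{name} (file missing)")
            continue
        if sha1_of(path) != expected.lower():
            failures.append(f"{name} (checksum mismatch)")
    return failures
```

A softer variant would log a warning for the missing-checksum case and keep it out of the failure list, which is the alternative the guidance also allows.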



Remediation recommended

2. Globus corruption re-download bypass 🐞 Bug ≡ Correctness
Description
download_files_from_globus() queues checksum-mismatched existing files for re-download but does
not delete them first. _parallel_download() then returns early when `os.path.getsize(file_path) >=
Content-Length`, so a corrupted file with the correct size can be treated as already complete and
never overwritten by the intended re-download.
Code

pridepy/files/files.py[R603-610]

+            if skip_if_downloaded_already and os.path.exists(new_file_path):
+                expected_cs = checksum_map.get(file.get("fileName", ""))
+                if expected_cs:
+                    valid, reason = Files.validate_download(new_file_path, expected_cs)
+                    if not valid:
+                        logging.warning(f"Corrupted file detected ({reason}), will re-download: {new_file_path}")
+                        files_to_download.append(file)
+                        continue
Evidence
When a checksum mismatch is detected, the file is added to files_to_download without being
removed. The globus streamer’s resume logic short-circuits purely on size (`resume_size >=
total_size`) and does not incorporate checksum state. Since checksum mismatches do not imply a size
mismatch, the re-download path can be skipped while the docstring claims corrupted files are
re-downloaded.

pridepy/files/files.py[580-610]
pridepy/files/files.py[530-535]
pridepy/files/files.py[133-148]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`download_files_from_globus()` detects checksum mismatch and schedules the file for re-download, but does not remove the existing corrupted file. `_parallel_download()` can then consider the file complete (size-based check) and return without overwriting it.

### Issue Context
The `download_files_from_globus()` docstring states corrupted files are re-downloaded when `checksum_map` is provided.

### Fix Focus Areas
- pridepy/files/files.py[600-614]
- pridepy/files/files.py[530-535]

### Implementation guidance
- When `validate_download(new_file_path, expected_cs)` returns invalid, remove `new_file_path` before appending the file record to `files_to_download`.
- Alternatively/additionally: extend `_parallel_download()` with a `force_redownload` flag to bypass the `resume_size >= total_size` early return when the caller knows the file is corrupt.
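
The first option in the guidance above (delete the corrupted file before queueing it) could be sketched as follows. `queue_for_download` and the injected `is_valid` callable are hypothetical; `is_valid(path, expected)` is assumed to return an `(ok, reason)` pair, mirroring how `validate_download` is used in the excerpt.

```python
import logging
import os


def queue_for_download(records, output_folder, checksum_map, is_valid):
    """Return the records that still need downloading.

    Existing files that fail validation are removed first, so a later
    size-based "already complete" check cannot skip the re-download.
    """
    to_download = []
    for record in records:
        name = record.get("fileName", "")
        path = os.path.join(output_folder, name)
        if not os.path.exists(path):
            to_download.append(record)
            continue
        expected = checksum_map.get(name)
        if expected:
            ok, reason = is_valid(path, expected)
            if not ok:
                logging.warning(
                    "Corrupted file (%s), removing before re-download: %s", reason, path
                )
                os.remove(path)
                to_download.append(record)
    return to_download
```

Deleting the file gives up any chance of resuming from it, which seems acceptable here because a checksum mismatch gives no indication of which bytes are wrong.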


@ypriverol ypriverol merged commit a99c73b into master May 2, 2026
9 of 10 checks passed

@coderabbitai coderabbitai Bot mentioned this pull request May 27, 2026