Dev by ypriverol · Pull Request #91 · PRIDE-Archive/pridepy

ypriverol · 2026-04-27T04:45:27Z

This pull request introduces significant improvements to the pridepy file download and checksum handling logic, with a focus on robust parallel downloading, stricter checksum parsing, and improved test coverage. The main changes include a new parallel download implementation with fallback, stricter parsing of checksum files in the PRIDE API format, and updates to the tests to cover new behaviors and edge cases.

Key changes:

Download logic improvements

Added a new _parallel_download method to Files that enables parallel file downloading using HTTP Range requests, with automatic fallback to a single connection if the server does not support ranges or if errors occur. This improves download speed and reliability. (pridepy/files/files.py)
Updated the Globus download logic to use the new _parallel_download method, replacing the previous urllib-based approach. (pridepy/files/files.py)
Changed the Globus protocol mapping to use an HTTPS URL prefix instead of a custom Globus base URL, ensuring compatibility and simplifying the URL transformation. (pridepy/files/files.py)
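
To illustrate the range-splitting step that a parallel byte-range download depends on, here is a minimal sketch. `split_byte_ranges` and `range_header` are hypothetical helpers, not the actual `_parallel_download` code; they only show how a file of known size could be divided into inclusive spans for `Range: bytes=start-end` requests.

```python
from typing import List, Tuple


def split_byte_ranges(total_size: int, num_connections: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into contiguous, inclusive (start, end) ranges.

    Hypothetical helper for illustration; not part of pridepy.
    """
    if total_size <= 0:
        return []
    # Never create more ranges than there are bytes.
    n = max(1, min(num_connections, total_size))
    chunk, remainder = divmod(total_size, n)
    ranges = []
    start = 0
    for i in range(n):
        # Spread the remainder over the first few ranges.
        length = chunk + (1 if i < remainder else 0)
        end = start + length - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start: int, end: int) -> dict:
    """Build the HTTP Range header for one inclusive byte span."""
    return {"Range": f"bytes={start}-{end}"}


if __name__ == "__main__":
    for start, end in split_byte_ranges(10, 3):
        print(range_header(start, end))
```

Each worker would then fetch one span and write it at offset `start` in the target file, which is why the ranges must be contiguous and non-overlapping.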

Checksum file parsing

Rewrote checksum file parsing logic to expect the PRIDE API TSV format (File-Name\tFile-MD5Checksum\tFile-Size), with stricter validation and rejection of lines that do not match the expected format or contain invalid checksums. (pridepy/files/files.py)
Updated tests to reflect the new checksum parsing logic, ensuring only valid MD5 checksums in the correct format are accepted. (pridepy/tests/test_download_resilience.py)
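
The stricter parsing described above can be sketched as follows. This is an illustration under stated assumptions, not the pridepy implementation: `parse_checksum_tsv` is a hypothetical name, and the choices to lowercase the checksum and to return an empty map for an unrecognized header are assumptions made for the example.

```python
import re
from typing import Dict

# An MD5 digest is exactly 32 hexadecimal characters.
MD5_RE = re.compile(r"[0-9a-fA-F]{32}")


def parse_checksum_tsv(text: str) -> Dict[str, str]:
    """Parse 'File-Name<TAB>File-MD5Checksum<TAB>File-Size' rows into {name: md5}.

    Hypothetical sketch: rows that are too short or carry an invalid
    checksum are skipped rather than guessed at.
    """
    lines = text.splitlines()
    if not lines:
        return {}
    header = [col.strip().lower() for col in lines[0].split("\t")]
    if "file-name" not in header or "file-md5checksum" not in header:
        return {}  # assumption: unrecognized header yields no checksums
    name_idx = header.index("file-name")
    md5_idx = header.index("file-md5checksum")

    checksums: Dict[str, str] = {}
    for line in lines[1:]:
        fields = line.split("\t")
        if len(fields) <= max(name_idx, md5_idx):
            continue  # malformed row: not enough columns
        name = fields[name_idx].strip()
        md5 = fields[md5_idx].strip()
        if name and MD5_RE.fullmatch(md5):
            checksums[name] = md5.lower()
    return checksums


SAMPLE = (
    "File-Name\tFile-MD5Checksum\tFile-Size\n"
    "a.raw\td41d8cd98f00b204e9800998ecf8427e\t0\n"
    "b.raw\tnot-a-checksum\t12\n"
    "broken line\n"
)

if __name__ == "__main__":
    print(parse_checksum_tsv(SAMPLE))
```

Only `a.raw` survives in this sample: `b.raw` has a non-hex checksum and the last row has no tab-separated columns.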

Test enhancements

Added comprehensive tests for the new parallel download logic, including fallback scenarios when servers do not support Range requests, when HEAD requests fail, or when Accept-Ranges is not set. (pridepy/tests/test_download_resilience.py)
Added tests for the new checksum file parsing and for the updated Globus URL mapping. (pridepy/tests/test_download_resilience.py)

Miscellaneous

Bumped the package version to 0.0.14 to reflect these changes. (pyproject.toml)
Cleaned up the README.md by removing the old pandoc PDF build instructions. (README.md)

Summary by CodeRabbit

New Features
- Added parallel byte-range downloads with automatic fallback to standard streaming when a server does not support Range requests

Improvements
- Globus download URLs now use an HTTPS prefix
- Checksum parsing now expects the PRIDE API TSV format, detects columns from the header row, and validates MD5 values

Documentation
- Removed outdated pandoc Docker build instructions from the project README

feat(globus): parallel HTTPS download and checksum TSV parsing

coderabbitai · 2026-04-27T04:45:38Z

Walkthrough

The package updates download resilience and checksum parsing logic. Globus URL derivation now uses HTTPS prefix swapping instead of fixed base URLs. Checksum validation shifts from free-form parsing to PRIDE-API TSV format with column-based extraction. Downloads transition from single-connection urllib.request to a parallel byte-range downloader with streaming fallback. Version bumped to 0.0.14. README Docker build instructions removed.
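
As a rough illustration of what "HTTPS prefix swapping" could look like, the sketch below rewrites a URL by replacing a known leading prefix. Both prefixes are hypothetical placeholders on `example.org`, not the real PRIDE or Globus endpoints, and `swap_prefix` is not a pridepy function.

```python
# Hypothetical placeholder prefixes; the real values live in pridepy/files/files.py.
FTP_PREFIX = "ftp://ftp.example.org/archive/"
HTTPS_PREFIX = "https://globus.example.org/archive/"


def swap_prefix(url: str, old_prefix: str = FTP_PREFIX, new_prefix: str = HTTPS_PREFIX) -> str:
    """Replace a known leading prefix, keeping the rest of the path unchanged."""
    if not url.startswith(old_prefix):
        # Refuse to guess when the URL has an unexpected shape.
        raise ValueError(f"URL does not start with expected prefix: {url}")
    return new_prefix + url[len(old_prefix):]


if __name__ == "__main__":
    print(swap_prefix("ftp://ftp.example.org/archive/2024/01/PXD000001/a.raw"))
```

Raising on an unexpected prefix keeps a malformed record from silently producing a wrong download URL; callers then need to handle that error per file.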

Changes

Cohort / File(s)	Summary
Documentation `README.md`	Removed pandoc Docker command instructions for building white paper PDF.
Core Download & Checksum Logic `pridepy/files/files.py`	Refactored Globus URL derivation to use HTTPS prefix swapping. Replaced free-form checksum parsing with PRIDE-API TSV format supporting header-based column detection and hex validation. Replaced `urllib.request.urlretrieve` with new `_parallel_download` method supporting HTTP byte-range requests with fallback to streaming on range unavailability or HEAD probe failure.
Resilience Test Expansion `pridepy/tests/test_download_resilience.py`	Added comprehensive test coverage for TSV checksum parsing with header validation, MD5 extraction, and malformed row filtering. Extended download URL mapping tests for Globus HTTPS conversion. Added `_parallel_download` tests covering range support probing, fallback scenarios (missing range headers, failed HEAD requests, absent `accept-ranges`), and concurrent byte-range writes.
Package Version `pyproject.toml`	Bumped version from `0.0.13` to `0.0.14`.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant HTTP as HTTP Server
    participant Disk as Local Disk
    
    App->>HTTP: HEAD request (probe range support)
    HTTP-->>App: Response with Accept-Ranges header
    
    alt Range Support Available
        loop For each byte range
            App->>HTTP: GET with Range header
            HTTP-->>App: Partial content (206)
            App->>Disk: Write range bytes concurrently
        end
        App->>Disk: Update progress bar
    else Range Not Available or HEAD Fails
        App->>HTTP: Single GET request
        HTTP-->>App: Full file content (200)
        App->>Disk: Stream and write sequentially
    end
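
The branch in the diagram above can be expressed as a small decision function. This is a sketch only: `choose_strategy` is a hypothetical name, and passing `None` to represent a failed HEAD probe is an assumption made for the example.

```python
from typing import Mapping, Optional


def choose_strategy(head_headers: Optional[Mapping[str, str]], num_connections: int = 8) -> str:
    """Return 'parallel' or 'single' based on the HEAD probe result.

    head_headers is None when the HEAD request failed (assumption for this sketch).
    """
    if head_headers is None:
        return "single"
    # HTTP header names are case-insensitive, so normalize them first.
    headers = {key.lower(): value for key, value in head_headers.items()}
    try:
        total_size = int(headers.get("content-length", "0"))
    except ValueError:
        total_size = 0
    accept_ranges = headers.get("accept-ranges", "").strip().lower()
    if total_size <= 0 or accept_ranges != "bytes" or num_connections < 2:
        return "single"
    return "parallel"


if __name__ == "__main__":
    print(choose_strategy({"Content-Length": "1000", "Accept-Ranges": "bytes"}))
    print(choose_strategy({"Content-Length": "1000"}))
    print(choose_strategy(None))
```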

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Harden downloads and migrate packaging/docs to uv #86: Modifies checksum parsing, Globus URL mapping, and download logic in pridepy/files/files.py with range/parallel download and fallback resilience behaviors.

Poem

🐰 A rabbit hops through byte-range lands,
Where parallel downloads join tiny hands,
No Docker builds or streaming alone—
With HEAD-proof ranges, resilience is shown!
TSV checksums parsed clean and bright, ✨
Version bumps up—the code feels right!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Dev' is vague and generic, providing no meaningful information about the actual changes made in the pull request.	Provide a descriptive title that summarizes the main changes, such as 'Implement parallel downloads and update checksum parsing' or 'Refactor file download and checksum validation logic'.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pridepy/files/files.py (1)

584-601: ⚠️ Potential issue | 🟠 Major

new_file_path may be unbound (or stale) when _get_download_url raises.

Switching from a hardcoded URL build to Files._get_download_url(file, "globus") (line 586) introduces a ValueError path that fires before new_file_path is assigned on line 591. On the first iteration this turns the except handler on line 601 into an UnboundLocalError, which escapes the try/except and aborts the remaining files in the batch. On later iterations the log line will report the previous file's path, which is misleading during triage.
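
A minimal, self-contained reproduction of both failure modes, using hypothetical stand-in functions (`get_url`, `process`) rather than the pridepy code:

```python
def get_url(item: dict) -> str:
    """Stand-in for a URL lookup that can raise before any path is assigned."""
    if "url" not in item:
        raise ValueError("no url")
    return item["url"]


def process(items: list) -> list:
    """Mimics the loop shape: 'path' is only assigned after get_url succeeds."""
    errors = []
    for item in items:
        try:
            url = get_url(item)
            path = "/tmp/" + url.rsplit("/", 1)[-1]
        except Exception as exc:
            # 'path' is unbound on the first iteration and stale on later ones.
            errors.append(f"failed for {path}: {exc}")
    return errors


def error_type_on_first_failure() -> str:
    """Return the name of the exception that escapes when the first item fails."""
    try:
        process([{}])
    except Exception as exc:
        return type(exc).__name__
    return "no error"


if __name__ == "__main__":
    print(error_type_on_first_failure())
    print(process([{"url": "https://example.org/a.raw"}, {}]))
```

When the first item fails, the handler itself raises `UnboundLocalError`, which aborts the loop. When a later item fails, the message names `/tmp/a.raw`, the path of the previous file.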

🐛 Proposed fix

         for file in file_list_json:
+            file_name = file.get("fileName", "<unknown>")
+            new_file_path = None
             try:
                 download_url = Files._get_download_url(file, "globus")

                 logging.debug(f"Downloading from Globus: {download_url}")

                 # Create a clean filename to save the downloaded file
                 new_file_path = Files.get_output_file_name(download_url, file, output_folder)

                 if skip_if_downloaded_already == True and os.path.exists(new_file_path):
                     logging.info("Skipping download as file already exists")
                     continue

                 Files._parallel_download(download_url, new_file_path)
                 logging.info(f"Successfully downloaded {new_file_path}")

             except Exception as e:
-                logging.error(f"Download from Globus failed for {new_file_path}: {str(e)}")
+                target = new_file_path or file_name
+                logging.error(f"Download from Globus failed for {target}: {str(e)}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 584 - 601, Initialize new_file_path
(e.g., new_file_path = None) before entering the try block and update the except
handler to safely log a fallback identifier (such as the file variable or
download_url if available) to avoid UnboundLocalError and stale path reporting;
specifically, around the loop that calls Files._get_download_url(file,
"globus"), ensure new_file_path is declared before the try, then call
Files.get_output_file_name(...) and Files._parallel_download(...) inside the
try, and in the except use a safe expression (like new_file_path or file or
download_url) when composing the logging.error message so errors from
Files._get_download_url do not crash the loop or produce misleading paths.

🧹 Nitpick comments (2)

pridepy/files/files.py (2)

71-78: Required column set is stricter than what is actually consumed.

_find_tsv_columns requires file-size, but only the file-name and file-md5checksum indices are returned and used. If the PRIDE API ever renames/drops the size column, read_checksum_file will silently return an empty map and checksum_check becomes a no-op without any error surfaced.

♻️ Suggested change

     @staticmethod
     def _find_tsv_columns(header: str) -> Optional[Tuple[int, int]]:
         """Return (name_idx, checksum_idx) from a TSV header, or None."""
         cols = [col.strip().lower() for col in header.split("\t")]
-        required_cols = {"file-name", "file-md5checksum", "file-size"}
+        required_cols = {"file-name", "file-md5checksum"}
         if not required_cols.issubset(set(cols)):
             return None
         return cols.index("file-name"), cols.index("file-md5checksum")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 71 - 78, The method _find_tsv_columns
currently insists on a "file-size" column even though only "file-name" and
"file-md5checksum" are returned/used, which can cause silent no-ops in
read_checksum_file/checksum_check; change required_cols to only {"file-name",
"file-md5checksum"} and update read_checksum_file to explicitly handle a None
return from _find_tsv_columns by raising or logging an error (e.g., raise
ValueError or log and return) so missing required columns are surfaced rather
than silently producing an empty map.
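
One way to surface a missing column instead of returning an empty map is sketched below. It assumes raising `ValueError` is acceptable to callers; `find_tsv_columns` and `header_error` are hypothetical names, not the pridepy methods.

```python
from typing import Optional, Tuple

# Only the columns that are actually consumed are required.
REQUIRED_COLUMNS = {"file-name", "file-md5checksum"}


def find_tsv_columns(header: str) -> Tuple[int, int]:
    """Return (name_idx, checksum_idx) or raise if a required column is missing."""
    cols = [col.strip().lower() for col in header.split("\t")]
    missing = REQUIRED_COLUMNS - set(cols)
    if missing:
        raise ValueError(f"Checksum header is missing columns: {sorted(missing)}")
    return cols.index("file-name"), cols.index("file-md5checksum")


def header_error(header: str) -> Optional[str]:
    """Convenience wrapper for the example: the error message, or None if valid."""
    try:
        find_tsv_columns(header)
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(find_tsv_columns("File-Name\tFile-MD5Checksum"))
    print(header_error("File-Name\tFile-Size"))
```

With this shape, a header without a size column still parses, while a header without the checksum column produces an explicit error rather than a silent no-op.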

483-568: Recommended refinements to the parallel download path.

Two small things worth tightening up while this code is fresh:

- _download_range (line 486) builds a fresh retry session for each range. With the default 8 ranges that is 8 sessions, each with its own connection pool/adapter. Passing the parent session in from _parallel_download would reuse pooled connections and keep retry behavior consistent.
- The num_connections < 2 guard on line 516 is evaluated against the caller-supplied value, before the min(num_connections, total_size) clamp on line 530. If total_size == 1 and num_connections == 8, the parallel branch is taken with effectively a single worker. Consider clamping first, then falling back to single-stream when the clamp drops you below 2.

♻️ Sketch

     @staticmethod
-    def _download_range(url, file_path, start, end, pbar):
+    def _download_range(url, file_path, start, end, pbar, session=None):
         """Download a byte range directly into the target file using seek."""
-        session = Util.create_session_with_retries()
+        session = session or Util.create_session_with_retries()
         headers = {"Range": f"bytes={start}-{end}"}
@@
-        if total_size == 0 or accept_ranges != "bytes" or num_connections < 2:
+        effective_connections = min(num_connections, max(total_size, 1))
+        if total_size == 0 or accept_ranges != "bytes" or effective_connections < 2:
             logging.info(
                 "Server does not support Range requests, falling back to single connection"
             )
@@
-        num_connections = min(num_connections, total_size)
+        num_connections = effective_connections

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 483 - 568, Change Files._download_range
to accept a session parameter and reuse the session created in
Files._parallel_download so each range request uses the same connection pool and
retry behavior (pass the existing session when submitting executor tasks instead
of creating a new one inside _download_range). Also move the num_connections =
min(num_connections, total_size) clamp up before the check that falls back to
single-stream and then use the clamped value to decide if you should switch to
the single-connection path (ensure the guard is evaluated after clamping and
that ranges/ThreadPoolExecutor use the clamped num_connections).
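
The effect of the guard ordering can be checked in isolation. The two functions below are hypothetical extractions of the fallback condition before and after the suggested clamp; they are a sketch of the logic, not the method itself.

```python
def falls_back_original(total_size: int, accept_ranges: str, num_connections: int) -> bool:
    """Guard evaluated against the caller-supplied connection count."""
    return total_size == 0 or accept_ranges != "bytes" or num_connections < 2


def falls_back_clamped(total_size: int, accept_ranges: str, num_connections: int) -> bool:
    """Guard evaluated after clamping connections to the file size."""
    effective_connections = min(num_connections, max(total_size, 1))
    return total_size == 0 or accept_ranges != "bytes" or effective_connections < 2


if __name__ == "__main__":
    # A 1-byte file with 8 requested connections.
    print(falls_back_original(1, "bytes", 8))
    print(falls_back_clamped(1, "bytes", 8))
```

For the 1-byte case the original guard does not fall back, so the parallel path runs with a single effective worker, while the clamped guard selects the single-connection path. For ordinary file sizes both guards agree.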


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bdf0bce7-e91a-4295-9159-6e68bda9d0e5

📥 Commits

Reviewing files that changed from the base of the PR and between 309e361 and 1c56035.

📒 Files selected for processing (4)

README.md
pridepy/files/files.py
pridepy/tests/test_download_resilience.py
pyproject.toml

💤 Files with no reviewable changes (1)

README.md

Shen-YuFei and others added 5 commits April 24, 2026 21:39

feat(globus): parallel HTTPS download and checksum TSV parsing

9f74744

git commit -m "fix(globus): update files docker tests and version"

7958d9a

test(globus): add HEAD failure and accept-ranges fallback tests

0bde5cf

chore(pridepy): remove Dockerfile

374748f

Merge pull request #90 from Shen-YuFei/dev

1c56035

feat(globus): parallel HTTPS download and checksum TSV parsing

ypriverol requested a review from chakrabandla April 27, 2026 04:50

coderabbitai Bot reviewed Apr 27, 2026

View reviewed changes

ypriverol merged commit ab6ce0d into master Apr 27, 2026
10 checks passed

coderabbitai Bot mentioned this pull request May 2, 2026

Improving parallelism #93

Merged

Dev #91: ypriverol merged 5 commits into master from dev
