Dev #91

Merged
ypriverol merged 5 commits into master from dev
Apr 27, 2026

@ypriverol ypriverol commented Apr 27, 2026

This pull request introduces significant improvements to the pridepy file download and checksum handling logic, with a focus on robust parallel downloading, stricter checksum parsing, and improved test coverage. The main changes include a new parallel download implementation with fallback, stricter parsing of checksum files in the PRIDE API format, and updates to the tests to cover new behaviors and edge cases.

Key changes:

Download logic improvements

  • Added a new _parallel_download method to Files that enables parallel file downloading using HTTP Range requests, with automatic fallback to a single connection if the server does not support ranges or if errors occur. This improves download speed and reliability. (pridepy/files/files.py)
  • Updated the Globus download logic to use the new _parallel_download method, replacing the previous urllib-based approach. (pridepy/files/files.py)
  • Changed the Globus protocol mapping to use an HTTPS URL prefix instead of a custom Globus base URL, ensuring compatibility and simplifying the URL transformation. (pridepy/files/files.py)
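
The fallback rule described in the first bullet can be sketched as a pure decision over the headers returned by a HEAD probe. This is only an illustration under stated assumptions: `choose_strategy` and its argument are hypothetical names, not the internals of `_parallel_download`, and passing `None` stands for a HEAD request that failed.

```python
from typing import Mapping, Optional


def choose_strategy(head_headers: Optional[Mapping[str, str]]) -> str:
    """Return "parallel" or "single" from the headers of a HEAD probe.

    Hypothetical helper: None models a HEAD request that failed.
    """
    if head_headers is None:
        return "single"  # probe failed, so do not assume Range support

    # HTTP header names are case-insensitive; normalise before lookup.
    headers = {key.lower(): value for key, value in head_headers.items()}
    accept_ranges = headers.get("accept-ranges", "").strip().lower()
    try:
        total_size = int(headers.get("content-length", "0"))
    except ValueError:
        total_size = 0  # unparsable size is treated as unknown

    if accept_ranges != "bytes" or total_size <= 0:
        return "single"  # ranges unsupported or size unknown
    return "parallel"
```

Keeping the decision separate from the network calls makes each fallback case easy to exercise without a server.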

Checksum file parsing

  • Rewrote checksum file parsing logic to expect the PRIDE API TSV format (File-Name\tFile-MD5Checksum\tFile-Size), with stricter validation and rejection of lines that do not match the expected format or contain invalid checksums. (pridepy/files/files.py)
  • Updated tests to reflect the new checksum parsing logic, ensuring only valid MD5 checksums in the correct format are accepted. (pridepy/tests/test_download_resilience.py)
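
As a rough sketch of the stricter parsing described above, the function below reads tab-separated text with a `File-Name` / `File-MD5Checksum` header and keeps only rows whose checksum is 32 hexadecimal characters. `parse_checksum_tsv` and `MD5_RE` are hypothetical names for illustration, not the code in pridepy/files/files.py, and this version only checks the two columns it actually reads.

```python
import re
from typing import Dict

# An MD5 digest is 32 hexadecimal characters.
MD5_RE = re.compile(r"^[0-9a-fA-F]{32}$")


def parse_checksum_tsv(text: str) -> Dict[str, str]:
    """Map file name -> MD5 from TSV text with a header row.

    Hypothetical stand-in for the parser in pridepy/files/files.py.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {}

    header = [col.strip().lower() for col in lines[0].split("\t")]
    if "file-name" not in header or "file-md5checksum" not in header:
        return {}  # not the expected format
    name_idx = header.index("file-name")
    md5_idx = header.index("file-md5checksum")

    checksums: Dict[str, str] = {}
    for line in lines[1:]:
        cols = [col.strip() for col in line.split("\t")]
        if len(cols) <= max(name_idx, md5_idx):
            continue  # too few columns for this row
        name, md5 = cols[name_idx], cols[md5_idx]
        if name and MD5_RE.match(md5):
            checksums[name] = md5.lower()
    return checksums
```

Rows with a malformed checksum or missing columns are dropped rather than guessed at, which matches the "stricter validation" intent of the change.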

Test enhancements

  • Added comprehensive tests for the new parallel download logic, including fallback scenarios when servers do not support Range requests, when HEAD requests fail, or when Accept-Ranges is not set. (pridepy/tests/test_download_resilience.py)
  • Added tests for the new checksum file parsing and for the updated Globus URL mapping. (pridepy/tests/test_download_resilience.py)

Miscellaneous

  • Bumped the package version to 0.0.14 to reflect these changes. (pyproject.toml)
  • Cleaned up the README.md by removing the old pandoc PDF build instructions. (README.md)

Summary by CodeRabbit

  • New Features

    • Implemented parallel byte-range downloads, with automatic fallback to single-connection streaming when the server does not support Range requests
  • Improvements

    • Globus download URLs are now built with an HTTPS prefix
    • Checksum parsing now expects the PRIDE API TSV format, detects columns from the header row, and validates MD5 values
  • Documentation

    • Removed outdated pandoc Docker build instructions from project README


coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

The package updates download resilience and checksum parsing logic. Globus URL derivation now uses HTTPS prefix swapping instead of fixed base URLs. Checksum validation shifts from free-form parsing to PRIDE-API TSV format with column-based extraction. Downloads transition from single-connection urllib.request to a parallel byte-range downloader with streaming fallback. Version bumped to 0.0.14. README Docker build instructions removed.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation: README.md | Removed pandoc Docker command instructions for building white paper PDF. |
| Core Download & Checksum Logic: pridepy/files/files.py | Refactored Globus URL derivation to use HTTPS prefix swapping. Replaced free-form checksum parsing with PRIDE-API TSV format supporting header-based column detection and hex validation. Replaced urllib.request.urlretrieve with new _parallel_download method supporting HTTP byte-range requests with fallback to streaming on range unavailability or HEAD probe failure. |
| Resilience Test Expansion: pridepy/tests/test_download_resilience.py | Added comprehensive test coverage for TSV checksum parsing with header validation, MD5 extraction, and malformed row filtering. Extended download URL mapping tests for Globus HTTPS conversion. Added _parallel_download tests covering range support probing, fallback scenarios (missing range headers, failed HEAD requests, absent accept-ranges), and concurrent byte-range writes. |
| Package Version: pyproject.toml | Bumped version from 0.0.13 to 0.0.14. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant HTTP as HTTP Server
    participant Disk as Local Disk
    
    App->>HTTP: HEAD request (probe range support)
    HTTP-->>App: Response with Accept-Ranges header
    
    alt Range Support Available
        loop For each byte range
            App->>HTTP: GET with Range header
            HTTP-->>App: Partial content (206)
            App->>Disk: Write range bytes concurrently
        end
        App->>Disk: Update progress bar
    else Range Not Available or HEAD Fails
        App->>HTTP: Single GET request
        HTTP-->>App: Full file content (200)
        App->>Disk: Stream and write sequentially
    end
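
The "Range Support Available" branch of the diagram can be sketched as follows. To stay self-contained, the HTTP call is replaced by an injected `fetch_range` callable; `write_ranges`, `fetch_range`, and `roundtrip` are hypothetical names, and the real `_parallel_download` may be organised differently (for example, it also drives a progress bar).

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def write_ranges(
    path: str,
    total_size: int,
    ranges: List[Tuple[int, int]],
    fetch_range: Callable[[int, int], bytes],
) -> None:
    """Fetch inclusive byte ranges concurrently and write each at its offset."""
    # Preallocate the file so every worker can seek to its own offset.
    with open(path, "wb") as handle:
        handle.truncate(total_size)

    def worker(start: int, end: int) -> None:
        data = fetch_range(start, end)
        if len(data) != end - start + 1:
            raise IOError(f"short read for bytes={start}-{end}")
        # Each worker opens its own handle, so seeks do not interfere.
        with open(path, "r+b") as handle:
            handle.seek(start)
            handle.write(data)

    with ThreadPoolExecutor(max_workers=max(len(ranges), 1)) as pool:
        futures = [pool.submit(worker, start, end) for start, end in ranges]
        for future in futures:
            future.result()  # re-raise any worker error


def roundtrip(payload: bytes, ranges: List[Tuple[int, int]]) -> bytes:
    """Usage example: serve ranges from memory and read the file back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.bin")
        write_ranges(path, len(payload), ranges, lambda s, e: payload[s : e + 1])
        with open(path, "rb") as handle:
            return handle.read()
```

In a real downloader, `fetch_range` would issue a GET with a `Range: bytes=start-end` header and check for a 206 response before returning the body.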

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A rabbit hops through byte-range lands,
Where parallel downloads join tiny hands,
No Docker builds or streaming alone—
With HEAD-proof ranges, resilience is shown!
TSV checksums parsed clean and bright,
Version bumps up—the code feels right!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Dev' is vague and generic, providing no meaningful information about the actual changes made in the pull request. | Provide a descriptive title that summarizes the main changes, such as 'Implement parallel downloads and update checksum parsing' or 'Refactor file download and checksum validation logic'. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ypriverol ypriverol requested a review from chakrabandla April 27, 2026 04:50

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pridepy/files/files.py (1)

584-601: ⚠️ Potential issue | 🟠 Major

new_file_path may be unbound (or stale) when _get_download_url raises.

Switching from a hardcoded URL build to Files._get_download_url(file, "globus") (line 586) introduces a ValueError path that fires before new_file_path is assigned on line 591. On the first iteration this turns the except handler on line 601 into an UnboundLocalError, which escapes the try/except and aborts the remaining files in the batch. On later iterations the log line will report the previous file's path, which is misleading during triage.
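
The failure mode is easy to reproduce in isolation. The snippet below is a minimal stand-in for the loop, not the pridepy code: `process`, `resolve`, and `flaky` are hypothetical names, with `resolve` playing the role of `Files._get_download_url`.

```python
def process(items, resolve):
    """Mimic the loop: `path` is assigned only after `resolve` succeeds."""
    failures = []
    for item in items:
        try:
            url = resolve(item)  # may raise before `path` exists
            path = f"/tmp/{url}"
        except Exception as exc:
            # On the first iteration `path` is unbound here; on later
            # iterations it still holds the previous item's value.
            failures.append((path, str(exc)))
    return failures


def flaky(item):
    """Succeed for every item except "b"."""
    if item == "b":
        raise ValueError("no url for b")
    return item
```

Calling `process(["a", "b"], flaky)` records the failure for `b` against the path of `a`, and a resolver that fails on the first item makes the handler itself raise `UnboundLocalError`.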

🐛 Proposed fix
         for file in file_list_json:
+            file_name = file.get("fileName", "<unknown>")
+            new_file_path = None
             try:
                 download_url = Files._get_download_url(file, "globus")

                 logging.debug(f"Downloading from Globus: {download_url}")

                 # Create a clean filename to save the downloaded file
                 new_file_path = Files.get_output_file_name(download_url, file, output_folder)

                 if skip_if_downloaded_already == True and os.path.exists(new_file_path):
                     logging.info("Skipping download as file already exists")
                     continue

                 Files._parallel_download(download_url, new_file_path)
                 logging.info(f"Successfully downloaded {new_file_path}")

             except Exception as e:
-                logging.error(f"Download from Globus failed for {new_file_path}: {str(e)}")
+                target = new_file_path or file_name
+                logging.error(f"Download from Globus failed for {target}: {str(e)}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 584 - 601, Initialize new_file_path
(e.g., new_file_path = None) before entering the try block and update the except
handler to safely log a fallback identifier (such as the file variable or
download_url if available) to avoid UnboundLocalError and stale path reporting;
specifically, around the loop that calls Files._get_download_url(file,
"globus"), ensure new_file_path is declared before the try, then call
Files.get_output_file_name(...) and Files._parallel_download(...) inside the
try, and in the except use a safe expression (like new_file_path or file or
download_url) when composing the logging.error message so errors from
Files._get_download_url do not crash the loop or produce misleading paths.
🧹 Nitpick comments (2)
pridepy/files/files.py (2)

71-78: Required column set is stricter than what is actually consumed.

_find_tsv_columns requires file-size, but only the file-name and file-md5checksum indices are returned and used. If the PRIDE API ever renames/drops the size column, read_checksum_file will silently return an empty map and checksum_check becomes a no-op without any error surfaced.

♻️ Suggested change
     @staticmethod
     def _find_tsv_columns(header: str) -> Optional[Tuple[int, int]]:
         """Return (name_idx, checksum_idx) from a TSV header, or None."""
         cols = [col.strip().lower() for col in header.split("\t")]
-        required_cols = {"file-name", "file-md5checksum", "file-size"}
+        required_cols = {"file-name", "file-md5checksum"}
         if not required_cols.issubset(set(cols)):
             return None
         return cols.index("file-name"), cols.index("file-md5checksum")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 71 - 78, The method _find_tsv_columns
currently insists on a "file-size" column even though only "file-name" and
"file-md5checksum" are returned/used, which can cause silent no-ops in
read_checksum_file/checksum_check; change required_cols to only {"file-name",
"file-md5checksum"} and update read_checksum_file to explicitly handle a None
return from _find_tsv_columns by raising or logging an error (e.g., raise
ValueError or log and return) so missing required columns are surfaced rather
than silently producing an empty map.
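
One way to surface a missing column instead of silently returning an empty map is to raise from the header lookup. This is a sketch of that idea only: `find_tsv_columns` is a hypothetical free function, and raising `ValueError` is one of the two options named above rather than what the PR necessarily adopted.

```python
from typing import Tuple


def find_tsv_columns(header: str) -> Tuple[int, int]:
    """Return (name_idx, checksum_idx) or raise if a consumed column is missing."""
    cols = [col.strip().lower() for col in header.split("\t")]
    missing = [c for c in ("file-name", "file-md5checksum") if c not in cols]
    if missing:
        raise ValueError(
            f"checksum header is missing column(s): {', '.join(missing)}"
        )
    return cols.index("file-name"), cols.index("file-md5checksum")
```

A caller that prefers to keep downloading could catch the exception and log it, so the skipped verification is at least visible.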

483-568: Recommended refinements to the parallel download path.

Two small things worth tightening up while this code is fresh:

  1. _download_range (line 486) builds a fresh retry session for each range. With the default 8 ranges that is 8 sessions, each with its own connection pool/adapter. Passing the parent session in from _parallel_download would reuse pooled connections and keep retry behavior consistent.
  2. The num_connections < 2 guard on line 516 is evaluated against the caller-supplied value, before the min(num_connections, total_size) clamp on line 530. If total_size == 1 and num_connections == 8, the parallel branch is taken with effectively a single worker. Consider clamping first, then falling back to single-stream when the clamp drops you below 2.
♻️ Sketch
     @staticmethod
-    def _download_range(url, file_path, start, end, pbar):
+    def _download_range(url, file_path, start, end, pbar, session=None):
         """Download a byte range directly into the target file using seek."""
-        session = Util.create_session_with_retries()
+        session = session or Util.create_session_with_retries()
         headers = {"Range": f"bytes={start}-{end}"}
@@
-        if total_size == 0 or accept_ranges != "bytes" or num_connections < 2:
+        effective_connections = min(num_connections, max(total_size, 1))
+        if total_size == 0 or accept_ranges != "bytes" or effective_connections < 2:
             logging.info(
                 "Server does not support Range requests, falling back to single connection"
             )
@@
-        num_connections = min(num_connections, total_size)
+        num_connections = effective_connections
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 483 - 568, Change Files._download_range
to accept a session parameter and reuse the session created in
Files._parallel_download so each range request uses the same connection pool and
retry behavior (pass the existing session when submitting executor tasks instead
of creating a new one inside _download_range). Also move the num_connections =
min(num_connections, total_size) clamp up before the check that falls back to
single-stream and then use the clamped value to decide if you should switch to
the single-connection path (ensure the guard is evaluated after clamping and
that ranges/ThreadPoolExecutor use the clamped num_connections).
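
The "clamp first, then decide" ordering can be illustrated with a small range planner. `plan_ranges` is a hypothetical helper written for this example; an empty list is used here as the signal to fall back to a single stream.

```python
from typing import List, Tuple


def plan_ranges(total_size: int, num_connections: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges, or [] to signal single-stream.

    Hypothetical helper: clamps the worker count first, then decides.
    """
    effective = min(num_connections, max(total_size, 0))
    if effective < 2:
        return []  # too small (or too few workers) to be worth splitting

    chunk, remainder = divmod(total_size, effective)
    ranges = []
    start = 0
    for i in range(effective):
        # Spread the remainder over the first `remainder` ranges.
        length = chunk + (1 if i < remainder else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges
```

With this ordering, a 1-byte file requested with 8 connections yields no ranges and takes the single-stream path, which is the case called out in point 2.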

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bdf0bce7-e91a-4295-9159-6e68bda9d0e5

📥 Commits

Reviewing files that changed from the base of the PR and between 309e361 and 1c56035.

📒 Files selected for processing (4)
  • README.md
  • pridepy/files/files.py
  • pridepy/tests/test_download_resilience.py
  • pyproject.toml
💤 Files with no reviewable changes (1)
  • README.md

@ypriverol ypriverol merged commit ab6ce0d into master Apr 27, 2026
10 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request May 2, 2026