Conversation
feat(globus): parallel HTTPS download and checksum TSV parsing
📝 WalkthroughWalkthroughThe package updates download resilience and checksum parsing logic. Globus URL derivation now uses HTTPS prefix swapping instead of fixed base URLs. Checksum validation shifts from free-form parsing to PRIDE-API TSV format with column-based extraction. Downloads transition from single-connection Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant HTTP as HTTP Server
participant Disk as Local Disk
App->>HTTP: HEAD request (probe range support)
HTTP-->>App: Response with Accept-Ranges header
alt Range Support Available
loop For each byte range
App->>HTTP: GET with Range header
HTTP-->>App: Partial content (206)
App->>Disk: Write range bytes concurrently
end
App->>Disk: Update progress bar
else Range Not Available or HEAD Fails
App->>HTTP: Single GET request
HTTP-->>App: Full file content (200)
App->>Disk: Stream and write sequentially
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pridepy/files/files.py (1)
584-601:⚠️ Potential issue | 🟠 Major
new_file_pathmay be unbound (or stale) when_get_download_urlraises.Switching from a hardcoded URL build to
Files._get_download_url(file, "globus")(line 586) introduces aValueErrorpath that fires beforenew_file_pathis assigned on line 591. On the first iteration this turns the except handler on line 601 into anUnboundLocalError, which escapes the try/except and aborts the remaining files in the batch. On later iterations the log line will report the previous file's path, which is misleading during triage.🐛 Proposed fix
for file in file_list_json: + file_name = file.get("fileName", "<unknown>") + new_file_path = None try: download_url = Files._get_download_url(file, "globus") logging.debug(f"Downloading from Globus: {download_url}") # Create a clean filename to save the downloaded file new_file_path = Files.get_output_file_name(download_url, file, output_folder) if skip_if_downloaded_already == True and os.path.exists(new_file_path): logging.info("Skipping download as file already exists") continue Files._parallel_download(download_url, new_file_path) logging.info(f"Successfully downloaded {new_file_path}") except Exception as e: - logging.error(f"Download from Globus failed for {new_file_path}: {str(e)}") + target = new_file_path or file_name + logging.error(f"Download from Globus failed for {target}: {str(e)}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` around lines 584 - 601, Initialize new_file_path (e.g., new_file_path = None) before entering the try block and update the except handler to safely log a fallback identifier (such as the file variable or download_url if available) to avoid UnboundLocalError and stale path reporting; specifically, around the loop that calls Files._get_download_url(file, "globus"), ensure new_file_path is declared before the try, then call Files.get_output_file_name(...) and Files._parallel_download(...) inside the try, and in the except use a safe expression (like new_file_path or file or download_url) when composing the logging.error message so errors from Files._get_download_url do not crash the loop or produce misleading paths.
🧹 Nitpick comments (2)
pridepy/files/files.py (2)
71-78: Required column set is stricter than what is actually consumed.
_find_tsv_columnsrequiresfile-size, but only thefile-nameandfile-md5checksumindices are returned and used. If the PRIDE API ever renames/drops the size column,read_checksum_filewill silently return an empty map andchecksum_checkbecomes a no-op without any error surfaced.♻️ Suggested change
`@staticmethod` def _find_tsv_columns(header: str) -> Optional[Tuple[int, int]]: """Return (name_idx, checksum_idx) from a TSV header, or None.""" cols = [col.strip().lower() for col in header.split("\t")] - required_cols = {"file-name", "file-md5checksum", "file-size"} + required_cols = {"file-name", "file-md5checksum"} if not required_cols.issubset(set(cols)): return None return cols.index("file-name"), cols.index("file-md5checksum")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` around lines 71 - 78, The method _find_tsv_columns currently insists on a "file-size" column even though only "file-name" and "file-md5checksum" are returned/used, which can cause silent no-ops in read_checksum_file/checksum_check; change required_cols to only {"file-name", "file-md5checksum"} and update read_checksum_file to explicitly handle a None return from _find_tsv_columns by raising or logging an error (e.g., raise ValueError or log and return) so missing required columns are surfaced rather than silently producing an empty map.
483-568: Recommended refinements to the parallel download path.Two small things worth tightening up while this code is fresh:
_download_range(line 486) builds a fresh retry session for each range. With the default 8 ranges that is 8 sessions, each with their own connection pool/adapter — passing the parent session in from_parallel_downloadwould reuse pooled connections and keep retry behavior consistent.- The
num_connections < 2guard on line 516 is evaluated against the caller-supplied value, before themin(num_connections, total_size)clamp on line 530. Iftotal_size == 1andnum_connections == 8, the parallel branch is taken with effectively a single worker. Consider clamping first, then falling back to single-stream when the clamp drops you below 2.♻️ Sketch
`@staticmethod` - def _download_range(url, file_path, start, end, pbar): + def _download_range(url, file_path, start, end, pbar, session=None): """Download a byte range directly into the target file using seek.""" - session = Util.create_session_with_retries() + session = session or Util.create_session_with_retries() headers = {"Range": f"bytes={start}-{end}"} @@ - if total_size == 0 or accept_ranges != "bytes" or num_connections < 2: + effective_connections = min(num_connections, max(total_size, 1)) + if total_size == 0 or accept_ranges != "bytes" or effective_connections < 2: logging.info( "Server does not support Range requests, falling back to single connection" ) @@ - num_connections = min(num_connections, total_size) + num_connections = effective_connections🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` around lines 483 - 568, Change Files._download_range to accept a session parameter and reuse the session created in Files._parallel_download so each range request uses the same connection pool and retry behavior (pass the existing session when submitting executor tasks instead of creating a new one inside _download_range). Also move the num_connections = min(num_connections, total_size) clamp up before the check that falls back to single-stream and then use the clamped value to decide if you should switch to the single-connection path (ensure the guard is evaluated after clamping and that ranges/ThreadPoolExecutor use the clamped num_connections).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@pridepy/files/files.py`:
- Around line 584-601: Initialize new_file_path (e.g., new_file_path = None)
before entering the try block and update the except handler to safely log a
fallback identifier (such as the file variable or download_url if available) to
avoid UnboundLocalError and stale path reporting; specifically, around the loop
that calls Files._get_download_url(file, "globus"), ensure new_file_path is
declared before the try, then call Files.get_output_file_name(...) and
Files._parallel_download(...) inside the try, and in the except use a safe
expression (like new_file_path or file or download_url) when composing the
logging.error message so errors from Files._get_download_url do not crash the
loop or produce misleading paths.
---
Nitpick comments:
In `@pridepy/files/files.py`:
- Around line 71-78: The method _find_tsv_columns currently insists on a
"file-size" column even though only "file-name" and "file-md5checksum" are
returned/used, which can cause silent no-ops in
read_checksum_file/checksum_check; change required_cols to only {"file-name",
"file-md5checksum"} and update read_checksum_file to explicitly handle a None
return from _find_tsv_columns by raising or logging an error (e.g., raise
ValueError or log and return) so missing required columns are surfaced rather
than silently producing an empty map.
- Around line 483-568: Change Files._download_range to accept a session
parameter and reuse the session created in Files._parallel_download so each
range request uses the same connection pool and retry behavior (pass the
existing session when submitting executor tasks instead of creating a new one
inside _download_range). Also move the num_connections = min(num_connections,
total_size) clamp up before the check that falls back to single-stream and then
use the clamped value to decide if you should switch to the single-connection
path (ensure the guard is evaluated after clamping and that
ranges/ThreadPoolExecutor use the clamped num_connections).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bdf0bce7-e91a-4295-9159-6e68bda9d0e5
📒 Files selected for processing (4)
README.mdpridepy/files/files.pypridepy/tests/test_download_resilience.pypyproject.toml
💤 Files with no reviewable changes (1)
- README.md
This pull request introduces significant improvements to the
pridepyfile download and checksum handling logic, with a focus on robust parallel downloading, stricter checksum parsing, and improved test coverage. The main changes include a new parallel download implementation with fallback, stricter parsing of checksum files in the PRIDE API format, and updates to the tests to cover new behaviors and edge cases.Key changes:
Download logic improvements
_parallel_downloadmethod toFilesthat enables parallel file downloading using HTTP Range requests, with automatic fallback to a single connection if the server does not support ranges or if errors occur. This improves download speed and reliability. (pridepy/files/files.py)_parallel_downloadmethod, replacing the previousurllib-based approach. (pridepy/files/files.py)pridepy/files/files.py) [1] [2]Checksum file parsing
File-Name\tFile-MD5Checksum\tFile-Size), with stricter validation and rejection of lines that do not match the expected format or contain invalid checksums. (pridepy/files/files.py)pridepy/tests/test_download_resilience.py)Test enhancements
pridepy/tests/test_download_resilience.py)pridepy/tests/test_download_resilience.py)Miscellaneous
0.0.14to reflect these changes. (pyproject.toml)README.mdby removing the old pandoc PDF build instructions. (README.md)Summary by CodeRabbit
New Features
Improvements
Documentation