Conversation
…for download-by-url/list
…e, update docstrings
Enhance CLI with file download features and options
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThe PR adds manifest-based and URL-based file downloading capabilities with parallel file support, resumable byte-range downloads, and checksum validation. Two new CLI commands ( ChangesDocker Configuration
Download Feature Implementation
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI User
participant DownloadByList as download_files_by_list
participant MetadataAPI as Metadata API
participant BatchDownloader as _batch_download_by_protocol<br/>(with parallel_files)
participant Protocol as HTTP/FTP/Globus<br/>Downloader
CLI->>DownloadByList: accession, file_names, options
DownloadByList->>MetadataAPI: fetch project metadata
MetadataAPI-->>DownloadByList: files list with URLs
DownloadByList->>DownloadByList: filter & match requested names
alt Files match found
DownloadByList->>BatchDownloader: download_files(matched files,<br/>parallel_files)
BatchDownloader->>Protocol: parallel file downloads<br/>with resume/checksum
Protocol-->>BatchDownloader: success/failure
BatchDownloader-->>DownloadByList: completion
else No files match
DownloadByList-->>CLI: ValueError
end
sequenceDiagram
participant CLI as CLI User
participant DownloadByUrl as download_files_by_url
participant URLDispatcher as URL Scheme<br/>Dispatcher
participant HTTPDownloader as _parallel_download<br/>(HTTP/HTTPS)
participant FTPDownloader as _ftp_download_url
participant ErrorHandler as Error Aggregator
CLI->>DownloadByUrl: urls[], options
alt List not empty
DownloadByUrl->>URLDispatcher: for each URL
URLDispatcher->>HTTPDownloader: https:// or http://
URLDispatcher->>FTPDownloader: ftp://
URLDispatcher->>ErrorHandler: unknown scheme
HTTPDownloader-->>URLDispatcher: success/resume/failure
FTPDownloader-->>URLDispatcher: success/failure
ErrorHandler-->>URLDispatcher: track failure
DownloadByUrl->>DownloadByUrl: optional PRIDE checksum<br/>validation
else List is empty
DownloadByUrl-->>CLI: ValueError
end
alt Any failures occurred
DownloadByUrl-->>CLI: RuntimeError with<br/>aggregated failures
else All succeeded
DownloadByUrl-->>CLI: success
end
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly Related PRs
Suggested Labels
Suggested Reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 60 minutes.Comment |
Review Summary by QodoAdd flexible file download commands and parallel globus support
WalkthroughsDescription• Add two new CLI commands for flexible file downloads: download-files-by-list (subset by filename) and download-files-by-url (raw URLs) • Implement parallel file download support via parallel_files parameter (1-3 workers) for globus protocol • Refactor HTTP download logic from multi-connection parallel ranges to single-connection streaming with resume capability • Add comprehensive checksum validation with protocol fallback and improved error handling Diagramflowchart LR
A["New CLI Commands"] --> B["download-files-by-list"]
A --> C["download-files-by-url"]
B --> D["Files.download_files_by_list"]
C --> E["Files.download_files_by_url"]
D --> F["Existing batch + fallback engine"]
E --> G["URL scheme dispatcher"]
G --> H["HTTP/HTTPS"]
G --> I["FTP"]
H --> J["_parallel_download with resume"]
I --> K["_ftp_download_url"]
L["parallel_files parameter"] --> M["ThreadPoolExecutor workers"]
M --> N["Globus downloads"]
File Changes1. pridepy/files/files.py
|
Code Review by Qodo
1. Checksum check can be no-op
|
| file_name = os.path.basename(urlparse(url).path) | ||
| target = os.path.join(output_folder, file_name) | ||
| expected = checksum_map.get(file_name) | ||
| logging.info("Validating %s", file_name) | ||
| valid, reason = Files.validate_download(target, expected) | ||
| if not valid: | ||
| logging.error("Validation failed for %s: %s", file_name, reason) | ||
| validation_failures.append(f"{file_name} ({reason})") | ||
| else: | ||
| logging.info("Checksum OK: %s", file_name) |
There was a problem hiding this comment.
1. Checksum check can be no-op 🐞 Bug ≡ Correctness
In _validate_urls_checksums, if a downloaded filename is not present in the PRIDE checksum TSV, expected becomes None and validate_download() performs no checksum comparison but the code still logs "Checksum OK". This can make --checksum-check claim validation succeeded even when no checksum verification occurred for that file.
Agent Prompt
### Issue description
`Files._validate_urls_checksums()` calls `validate_download(target, expected)` where `expected` comes from `checksum_map.get(file_name)`. When the checksum TSV does not contain the downloaded filename, `expected` is `None`, so no checksum is checked, yet the code logs `"Checksum OK"` and does not treat it as a validation failure.
### Issue Context
The CLI/README advertise `--checksum-check` as validating downloads against PRIDE checksums for PRIDE archive URLs.
### Fix Focus Areas
- pridepy/files/files.py[1318-1342]
- pridepy/files/files.py[133-148]
### Implementation guidance
- If `expected is None` for a URL being validated, treat it as a validation failure (or at minimum log a warning and include it in `validation_failures`).
- Consider updating log messages so success is only logged when a checksum was actually compared (e.g., `"Checksum OK"` only when `expected` is present).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
This pull request introduces two major new CLI commands for downloading specific files from PRIDE projects by filename or by URL, along with supporting helper functions and comprehensive unit tests. It also adds parallel download options for several commands and improves documentation to cover these new features.
New CLI commands and features:
download-files-by-listcommand to download a specified subset of files from a PRIDE project using a manifest file or comma-separated list. Includes support for protocol selection, parallel downloads (Globus), and checksum validation. [1] [2] [3]download-files-by-urlcommand to download files directly from a list of raw URLs, supporting protocol selection, parallel downloads, and checksum validation for PRIDE URLs. [1] [2] [3]Improvements to existing commands:
--parallel-filesoption (with range 1–3) todownload-all-public-raw-filesanddownload-all-public-category-filesfor parallel downloads when using Globus. [1] [2] [3] [4] [5] [6] [7] [8] [9]Helper functions for manifest and argument parsing:
_parse_text_manifest,_read_filename_arguments, and_read_url_argumentsto handle manifest parsing, deduplication, and validation for both filenames and URLs in CLI commands.Documentation updates:
README.mdto document the new commands, their usage, options, and example invocations. [1] [2]Testing:
download-files-by-listanddownload-files-by-urlcommands, covering manifest parsing, error handling, and download dispatch logic. [1] [2]Summary by CodeRabbit
Release Notes
New Features
download-files-by-listcommand to download specific files using a manifest or filename listdownload-files-by-urlcommand to download files directly from URLs-w/--parallel-filesoption) for faster transfers on existing download commandsDocumentation