feat: modular download architecture + direct downloads (MassIVE/JPOST/iProX) (0.0.16)#105
Conversation
Remove conda recipe packaging workflow
Improving parallelism
Minor changes in parallelization
Minor change to have the version of the tool.
Improve MassIVE download.
Extends the direct-download support introduced for MassIVE in PR #98 to two more proteomics repositories whose datasets are often standalone (no ProteomeXchange accession): - JPOST (Japan ProteOme STandard Repository): JPST\d{6} accessions, listed and downloaded from ftp.jpostdb.org. - iProX (Integrated Proteome resources): IPX\d{7,10} accessions, listed and downloaded from ftp.iprox.cn. Refactor: - Add is_direct_download_accession() unifying the MSV/JPST/IPX checks, plus _list_direct_download_files() and _download_direct_download_records() dispatchers. All call sites (get_all_raw_file_list, download_all_raw_files, download_all_category_files, get_file_from_api, download_file_by_name, download_files_by_list, get_all_category_file_list) now go through the unified entry points. - Extract _list_ftp_repo_files() helper so the FTP connection lifecycle (connect / login / passive / walk / quit) lives in one place. As part of that, fix the FTP-constructor-outside-try issue flagged in the PR #98 review: a connect failure no longer triggers NameError in finally. - Keep is_massive_accession, _list_massive_public_files, and _download_massive_file_records as thin backward-compatible wrappers so existing tests and external callers continue to work. Tests: add test_jpost_files.py and test_iprox_files.py mirroring the MassIVE coverage (regex match, record building, raw-only filtering, and the download_file_by_name happy path). All 19 direct-download tests pass. Version: 0.0.16.
Address feedback that direct downloads should match the PRIDE feature
set (resume, parallel, retries) and that MassIVE's actual FTP server
requires TLS.
Live findings:
- massive-ftp.ucsd.edu now rejects plain anonymous FTP with
421 TLS is required. The merged PR #98 code was effectively broken
against the live server. Switch to FTP_TLS + PROT P.
- ftp.jpostdb.org accepts plain anonymous FTP; keep as-is.
- ftp.iprox.cn does not resolve (DNS fail) and no other iProX FTP
host responds. iProX is HTTPS-only and needs a different transport
(REST API). Defer iProX support; user to provide the endpoint.
Implementation:
- New static _open_ftp_connection(host, use_tls) opens FTP or FTP_TLS
with the right TLS setup, and transparently falls back to FTPS if
a plain FTP server replies 'TLS is required'.
- _list_ftp_repo_files() and download_ftp_urls() both grow a use_tls
flag. _repo_uses_tls(accession) wires this from the repo type
(MassIVE = True, JPOST = False).
- download_ftp_urls() grows parallel_files: when >1, a ThreadPoolExecutor
runs that many FTP workers per host, each with its own connection.
Existing serial single-connection-per-host path is preserved for
parallel_files <= 1.
- Extracted _download_one_ftp_path() with REST-based resume + per-file
retries; _download_ftp_paths_serial() / _download_ftp_paths_parallel()
pick the right scheduling. REST resume verified live (3 KB pre-stage
-> 10 KB final, MD5 matches full file).
- _download_direct_download_records() now accepts parallel_files and
forwards it (along with use_tls derived from the accession) to
download_ftp_urls. All call sites (download_all_raw_files,
download_all_category_files, download_files_by_list) thread the
user-supplied -w/--parallel-files through.
Tests:
- test_jpost_files / test_massive_files updated to assert the new
kwargs (use_tls, parallel_files).
- New test_repo_uses_tls_true_for_massive_false_for_jpost and
test_download_all_raw_files_threads_parallel_files_for_massive.
- test_iprox_files.py removed (iProX is deferred).
- All 15 unit tests pass.
Live testing:
- MassIVE: listed MSV000080175 (44 files), single-file download of
params.xml (10315 B, MD5 43d87368d705c3f380c1d030b14850c4),
REST resume from a 3000 B partial, and 3-worker parallel download
of files from MSV000080175 + MSV000078335 all succeeded.
- JPOST: rate-limited from this IP ('421 too many connections').
Code path is structurally identical to MassIVE (same FTP helper),
routing covered by unit tests; deferred to a follow-up live check.
Address deferred items from PR review:
1. JPOST PROXI listing
ftp.jpostdb.org rate-limits aggressively per source IP (sticky 421 on
any retry within ~10 min). For listing this is fatal because every
pridepy invocation needs a fresh tree walk. JPOST publishes a JSON
PROXI endpoint at
https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>
that returns datasetFiles[*].value as ftp:// URIs alongside CV labels
(Associated raw file URI, Search engine output file URI, Result file
URI, Peak list file URI, ...). _list_jpost_public_files now hits
PROXI first, builds file records with categories derived from the CV
name (mapped via JPOST_PROXI_CATEGORY_MAP), and falls back to the
FTP tree walk only if PROXI is unreachable or returns no records.
Live-tested against JPST002311 -> 160 files (88 RAW, 72 SEARCH).
2. Post-transfer size check
Neither MassIVE nor JPOST publishes per-file checksum manifests in
a standard location, so md5 verification isn't an option for these
datasets. As a lighter-weight integrity signal, _download_one_ftp_path
now compares the local file size against ftp.size() after retrbinary
returns and treats a mismatch as a retryable failure (next attempt
resumes via REST from the current partial). This catches half-finished
transfers where the data channel was closed early without retrbinary
raising.
3. iProX accession guard
Probing showed iProX REST endpoints (PMD009Controller/findByProjectId.jsonp,
findFilesBySubProjectID.jsonp) all redirect unauthenticated callers to
a CAS login page, and downloads use faspe:// URLs with per-session
tokens. Native support is therefore blocked until iProX exposes an
anonymous JSON API or pridepy carries iProX credentials. Until then,
add is_iprox_accession() and _raise_if_iprox() so every public entry
point (get_all_raw_file_list, download_all_raw_files,
download_all_category_files, download_file_by_name, download_files_by_list,
get_file_from_api) emits a clear NotImplementedError instead of
silently falling through to the PRIDE API and 404-ing.
Tests
- test_jpost_files: PROXI listing maps CV names to PRIDE categories;
PROXI failure falls back to FTP walk.
- test_iprox_guard: regex coverage; assert each entry point raises
NotImplementedError for IPX accessions; assert
is_direct_download_accession returns False for IPX.
- test_ftp_download_validation: size mismatch retries until success;
repeated mismatch raises after max_download_retries; correct size
skips retries.
- 25 tests total (10 MassIVE + 8 JPOST + 5 iProX guard + 3 size check).
Live verification (same MassIVE FTPS path as before)
- MSV000080175 listing + params.xml download still produces
10315 B, MD5 43d87368d705c3f380c1d030b14850c4.
- JPST002311 PROXI listing returns 160 files with correct categories.
Actual file transfer is still blocked from this IP by JPOST's
421-too-many-connections rate limit; the code path is shared with
MassIVE (same _download_one_ftp_path / download_ftp_urls), so a
fresh IP should succeed.
….org User pointed out the iProX dataset XML endpoint: https://www.iprox.cn/FAF016Controller/readXml.jsonp?fileId=file_<id>_xml Probing showed something simpler works: iProX publishes the ProteomeXchange XML for every public dataset at a deterministic, anonymous-accessible path on download.iprox.org. No CAS auth, no Aspera tokens, no fileId discovery needed: http://download.iprox.org/<accession>/PX_<accession>.xml The XML embeds Associated raw file URI / Search engine output file URI cvParams pointing at HTTPS file URLs on the same host (Accept-Ranges: bytes, so we get resume for free). The earlier 'iProX is auth-gated' finding was specifically about the iProX UI's CAS-protected JSON endpoints (PMD009Controller/findBySubProjectId.jsonp etc.); the public download server is anonymous. Changes ------- - IPROX_DOWNLOAD_BASE_URL, IPROX_PX_XML_URL_TEMPLATE, IPROX_PX_CATEGORY_MAP added (the latter is the same CV map JPOST PROXI uses). - _build_iprox_file_record() and _list_iprox_public_files() added, mirroring the JPOST helpers but using the PX XML schema. - is_direct_download_accession() now returns True for IPX accessions. - _raise_if_iprox() removed entirely. All entry points that previously called it no longer do, so IPX accessions flow through the unified direct-download dispatcher. - _list_direct_download_files() dispatches IPX -> _list_iprox_public_files. - _download_direct_download_records() now partitions URLs by scheme: ftp:// records go through download_ftp_urls (MassIVE/JPOST), http(s):// records through download_http_urls (iProX). A dataset whose records somehow contain both flows through both paths correctly. - download_http_urls() grew parallel_files + max_retries kwargs and a ThreadPoolExecutor path. The per-file worker _http_download_one() wraps _parallel_download() (reused from the globus codepath) with retry, so iProX gets the same HEAD-then-Range resume and restart-on- non-206 behaviour we already use for PRIDE HTTPS. Tests (26 pass) --------------- - test_iprox_guard.py renamed to test_iprox_files.py and rewritten as positive-path tests: regex coverage, IPX is a direct-download accession, PX XML parsing maps cvParam name -> PRIDE category, RAW filtering ignores non-FTP/HTTPS URIs, download_file_by_name routes iProX URLs to download_http_urls (not download_ftp_urls) with parallel_files=1. - All previously-added MassIVE / JPOST / size-validation tests still pass (20 + 6 iProX = 26 total). Live verification (this branch, against download.iprox.org) ----------------------------------------------------------- - IPX0017413000 listing: 7 files, correctly categorised (RAW + SEARCH). - download_file_by_name(IPX0017413000, protein_annotation_profile.xlsx): 3,253,230 B downloaded, MD5 c17baf230ffde1e2837ec4eb32dcea68, valid XLSX (PK header). - Range-based resume: pre-staged 1,000,000 B partial, completed to 3,253,230 B with the same MD5. - Parallel HTTPS: 3 worker downloads from IPX0017413000 (xlsx + 2 RAW) ran concurrently; cancelled mid-flight after observing all 3 files growing in parallel (xlsx finished at 3.25 MB, Tumor_NK1.raw was at 142 MB and Control_NK3.raw at 419 MB before cancellation).
…Registry Empty scaffolding for the per-provider refactor (spec: docs/specs/2026-05-27-files-py-provider-refactor-design.md). Introduces the Provider abstract base class, BaseDirectDownloadProvider with a shared download_files() that partitions URLs by scheme and routes through Files.download_ftp_urls / Files.download_http_urls (preserving test patches), and a Registry for accession -> provider resolution. No behaviour change. No providers registered yet.
…t.py Pulled download_ftp_urls, download_http_urls, and their helpers (_open_ftp_connection, _walk_ftp_tree, _list_ftp_repo_files, _download_one_ftp_path, _download_ftp_paths_serial/_parallel, _http_download_one, _parallel_download, _local_path_for_url) out of the Files class verbatim into providers/transport.py. Files keeps a shim staticmethod for each function that does a lazy import and delegates, so existing test patches like patch.object(Files, 'download_ftp_urls') keep intercepting calls. No behaviour change. Test suite green.
Moved Progress, _find_tsv_columns, _is_md5_checksum, read_checksum_file, compute_md5, validate_download, _remove_if_exists, _get_download_url, _resolve_local_path from Files into providers/util.py. Files keeps shim re-exports so existing references keep working. No behaviour change. Test suite green.
Moved MassIVE listing + record building from Files into providers/massive.py as MassiveProvider(BaseDirectDownloadProvider). Provider is registered with the Registry. Files keeps shim methods (is_massive_accession, _list_massive_public_files, _build_massive_file_record, _get_massive_public_root, _get_massive_public_ftp_url, _map_massive_collection_to_category) that delegate to the provider. MASSIVE_CATEGORY_MAP / MASSIVE_ARCHIVE_FTP constants remain on Files as class-attribute re-exports. All 10 MassIVE tests pass without modification. Full suite green.
Moved JPOST listing (PROXI JSON primary + FTP tree walk fallback) and record building from Files into providers/jpost.py as JpostProvider(BaseDirectDownloadProvider). Provider registered with the Registry. Files keeps shim methods (is_jpost_accession, _list_jpost_public_files, _list_jpost_public_files_via_proxi, _build_jpost_file_record, _get_jpost_public_root, _get_jpost_public_ftp_url) that delegate to the provider. JPOST_ARCHIVE_FTP / JPOST_PROXI_BASE_URL / JPOST_PROXI_CATEGORY_MAP constants remain on Files as class-attribute re-exports. All JPOST tests pass without modification. Full suite green.
Moved iProX listing (PX XML from download.iprox.org) and record building from Files into providers/iprox.py as IproxProvider(BaseDirectDownloadProvider). Provider registered with the Registry. Files keeps shim methods (is_iprox_accession, _list_iprox_public_files, _build_iprox_file_record, _get_iprox_public_root, _get_iprox_public_ftp_url) that delegate to the provider. IPROX_DOWNLOAD_BASE_URL / IPROX_PX_XML_URL_TEMPLATE / IPROX_PX_CATEGORY_MAP constants remain on Files as class-attribute re-exports. _download_direct_download_records on Files now dispatches via the registry instead of branching manually. _list_direct_download_files keeps shim dispatching so existing test patches on _list_massive_public_files and _list_jpost_public_files continue to intercept the calls. All iProX tests pass without modification. Full suite green.
…oning Mixed ftp:// + http(s):// records partition correctly: ftp URLs go to Files.download_ftp_urls with use_tls=True (the MassIVE setting), http URLs go to Files.download_http_urls. Both calls intercepted by patch.object(Files, ...) — proving the provider routes back through Files rather than calling transport directly (preserves test patches).
Moved PRIDE-specific logic (V3 API listing, multi-protocol batch downloader with aspera/s3/ftp/globus fallback, private-dataset path, submitter helpers, Globus/S3 per-protocol downloaders, legacy single-connection FTP) from Files into providers/pride.py as PrideProvider(Provider). ~510 LOC moved out of files.py. Files keeps shim methods for every patched helper (_batch_download_by_protocol, _download_with_fallback, _globus_download_one, download_files_from_globus, download_files_from_s3, download_files_from_ftp, download_private_file_name, get_ascp_binary, save_checksum_file, stream_all_files_by_project, stream_all_files_metadata, get_submitted_file_path_prefix, _protocol_sequence, download_files). PRIDE class constants (V3_API_BASE_URL, API_BASE_URL, API_PRIVATE_URL, PRIDE_ARCHIVE_FTP, *_URL_PREFIX, S3_URL, S3_BUCKET, PROTOCOL_ORDER) remain on Files as re-exports. PrideProvider's internal calls to patch-sensitive helpers go through Files.X (lazy import) so existing test patches in test_download_resilience.py keep working without modification. is_direct_download_accession updated to exclude PRIDE (returns True only for MSV/JPST/IPX) now that PrideProvider is registered too. All 8 patch-sensitive tests in test_download_resilience.py pass. Full suite green at 67 passed, 4 skipped.
Files public methods (get_all_raw_file_list, download_all_raw_files,
download_all_category_files, get_all_category_file_list,
download_file_by_name, get_file_from_api, download_files_by_list) now
dispatch via registry.resolve(accession).{list,download}_files(...).
Removed dead helpers: _list_direct_download_files,
_download_direct_download_records. Kept _repo_uses_tls as a registry
shim. Fixed _download_massive_file_records to use registry.
PrideProvider.download_files refactored: the old static method is now
_download_files_batch; a proper Provider-interface instance method
download_files(self, accession, records, ...) wraps it so PRIDE routes
uniformly through the registry like other providers.
Tests updated: patches on Files._list_massive_public_files,
Files._list_jpost_public_files, files_obj.stream_all_files_by_project,
and files_obj.download_files now target the provider classes
(MassiveProvider, JpostProvider, PrideProvider) directly.
Full suite green. files.py size: 1254 LOC (was 1352 before Task 9).
Verifies that Files().download_all_raw_files for a PXD accession flows through Registry.resolve -> PrideProvider.download_files -> _batch_download_by_protocol (patched via Files), and that _download_with_fallback is only called when batch returns failed files. Also mocks validate_download to return success so the happy-path test correctly asserts that fallback is not invoked when all files pass validation after the primary-protocol batch run. Spec acceptance criterion #5.
Empty scaffold for the follow-up refactor that extracts cross-cutting commands (download_files_by_url, download_files_by_list, download_px_raw_files) from Files into their own modules. No code moved yet. No behaviour change. Test suite green at 68 passed, 4 skipped.
…roteomexchange.py Moved download_px_raw_files, _normalize_px_xml_url, _parse_px_xml_for_raw_file_urls from Files into commands/proteomexchange.py. Files keeps shim re-exports. Also removed now-unused xml.etree.ElementTree import from files.py. No behaviour change. Test suite green.
Moved download_files_by_list from Files into commands/by_list.py. Files keeps a shim re-export. No behaviour change. Test suite green.
Moved download_files_by_url and its 6 helpers (_extract_pride_accession, _validate_urls_checksums, _http_download_url, _ftp_download_url, _dispatch_url_scheme, _download_single_url) from Files into commands/by_url.py. Files keeps shim re-exports for each. Internal calls to patch-sensitive helpers (_http_download_url, _ftp_download_url, _dispatch_url_scheme, _download_single_url) go through Files.X (lazy import) so existing test patches like patch.object(Files, '_http_download_url') keep intercepting. files.py drops below 1000 LOC. No behaviour change. Test suite green at 68 passed, 4 skipped.
…/ as a class
ProteomeXchange behaves more like a provider than a command: it takes a
PXD/PRD accession (or a ProteomeCentral URL) and returns file records,
the same shape as the four other providers. Moving it to providers/
aligns the architecture.
Changes:
- New: pridepy/providers/proteomexchange.py with
ProteomeXchangeProvider(Provider). It implements the full Provider
interface (matches, list_files, download_files) plus a convenience
method download_from_accession_or_url() for the
download-px-raw-files CLI command's existing behaviour
(skip_if_downloaded_already defaults to True, no parallel workers).
- Deleted: pridepy/commands/proteomexchange.py. Its three functions
(_normalize_px_xml_url, _parse_px_xml_for_raw_file_urls,
download_px_raw_files) are now static/instance methods on the
provider class.
- Updated: Files shims for _normalize_px_xml_url,
_parse_px_xml_for_raw_file_urls, and download_px_raw_files now
delegate to ProteomeXchangeProvider.
- Updated: commands/__init__.py docstring notes the move.
Important: ProteomeXchangeProvider is NOT auto-registered with
pridepy.providers.registry. PXD/PRD accessions continue to route through
PrideProvider's V3 API path by default. ProteomeXchangeProvider is the
explicit gateway for the cross-repository XML view, invoked via the
download-px-raw-files CLI command and Files.download_px_raw_files.
Full suite green at 68 passed, 4 skipped. No behaviour change for
existing PXD downloads; download-px-raw-files keeps its XML-based
listing flow exactly as before.
Two related fixes:
1. CI lint failure: files.py used importlib.resources, subprocess, and
time but never imported them. The Task 8 refactor removed these
top-level imports along with the PRIDE-specific code that used them
in PrideProvider, but two methods on Files (download_files_from_aspera
and the now-dead _download_range) still referenced the undefined
names. Flake8 with --select=F82 fails on 4 F821 undefined-name errors.
Fix: re-add the three stdlib imports at the top of files.py. Also
delete the orphaned _download_range method (no callers).
2. Hoist lazy provider/command imports out of Files method bodies.
The shim pattern previously did 'from pridepy.providers import X'
inside every method body — ~75 occurrences. Since providers do not
import Files at module load (only inside method bodies), the reverse
direction is safe to hoist: Files now imports
{registry, transport, util, IproxProvider, JpostProvider,
MassiveProvider, PrideProvider, ProteomeXchangeProvider, by_list,
by_url} at module top, and the shims reference these names directly.
Lazy imports inside provider/command method bodies that go back to
Files (e.g. BaseDirectDownloadProvider.download_files doing
'from pridepy.files.files import Files') are kept lazy — they are
genuinely cyclic and required for backward-compat test patching.
Also: commands/by_list.py's 'from pridepy.providers import registry'
hoisted to module top (no Files dependency, no cycle risk).
Note: 'import requests' is kept in files.py (noqa: F401) because
test suites patch 'pridepy.files.files.requests.get' directly.
Tests: 68 passed, 4 skipped. flake8 --select=E9,F63,F7,F82 now clean.
Migrated ~17 patch.object(Files, "X") test targets to canonical locations:
- PRIDE methods now patched on PrideProvider
- transport/util helpers patched on transport / util modules
- by_url helpers patched on commands.by_url
Moved PRIDE protocol workers permanently into PrideProvider:
- download_files_from_{aspera,globus,s3,ftp}
- _batch_download_by_protocol, _download_with_fallback, _globus_download_one
- _protocol_sequence, download_private_file_name, stream_all_files_by_project
- get_submitted_file_path_prefix, save_checksum_file, get_ascp_binary,
get_output_file_name
Eliminated every from pridepy.files.files import Files lazy import inside
providers/ and commands/. Providers call self.X / transport.X / util.X
directly. Zero back-into-Files coupling.
Files dropped from 912 to 382 LOC: only public CLI methods + class-attribute
constant re-exports. Kept ~5 one-line @staticmethod shims for likely
downstream imports (compute_md5, validate_download, read_checksum_file,
download_ftp_urls, download_http_urls) plus the accession-matcher helpers
(is_massive_accession etc. and is_direct_download_accession) as useful API.
No behaviour change. No test assertion changes. 68 passed, 4 skipped.
Flake8 (--select=E9,F63,F7,F82) clean.
Pure move: pridepy/providers/* -> pridepy/download/*, plus commands/by_url.py and commands/by_list.py -> download/. All imports updated from pridepy.providers / pridepy.commands to pridepy.download. download/__init__.py docstring rewritten to describe the download subsystem. No logic change. Test suite green (68 passed, 4 skipped).
The facade moved to pridepy/download/client.py and the class renamed Files -> Client. pridepy/files/files.py is now a back-compat shim (Files = Client, plus Progress re-export). CLI imports Client (aliased as Files internally to keep the diff minimal). Test imports keep working via the shim. No behaviour change; 68 passed, 4 skipped.
Align PrideProvider.download_files signature with the Provider base and add a download_by_name override implementing PRIDE's public/private split (V2 private API for private datasets, inherited public path otherwise). Moves this logic out of the Client facade.
Reduce the public listing/download methods to one-line registry.resolve(...).<workflow>(...) dispatches now that the Provider base owns the workflow. Drops the in-facade PRIDE public/private split and by_list delegation; removes newly-unused imports.
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Tier 1 + Tier 2 of the provider-specific helper audit:
by_url.py (Tier 1) -> PrideProvider:
- _extract_pride_accession -> PrideProvider.extract_accession_from_url
- _validate_urls_checksums -> PrideProvider.validate_urls_checksums
download_files_by_url now calls PrideProvider.validate_urls_checksums when
checksum_check is set. by_url keeps only generic URL-download mechanics.
util.py (Tier 2) -> PrideProvider:
- _get_download_url (the aspera/ftp/globus/s3 protocol resolution +
PRIDE archive HTTPS rewrite for globus) -> PrideProvider._get_download_url
- _resolve_local_path -> PrideProvider._resolve_local_path
The base Provider.get_download_url is now generic (returns the
'FTP Protocol' location value, which is what MassIVE/JPOST/iProX need);
PrideProvider.get_download_url overrides it with the multi-protocol
resolution. PRIDE's static download methods call
PrideProvider._get_download_url directly.
util.py now holds only genuinely generic helpers (Progress, compute_md5,
validate_download, read_checksum_file, _find_tsv_columns, _is_md5_checksum,
_remove_if_exists). Removed the lazy 'from pridepy.download.pride import
PrideProvider' imports that util needed for the moved functions.
Test: the one test referencing util._get_download_url now references
PrideProvider._get_download_url (PrideProvider already imported).
No behaviour change. 68 passed, 4 skipped. flake8 (E9,F63,F7,F82) clean.
Smoke MD5 unchanged.
The pridepy/files/files.py shim (Files = Client re-export) was load-bearing only for the test suite — the CLI already imports from pridepy.download.client. Updated the 9 test files to 'from pridepy.download.client import Client as Files' (keeps the local Files alias so test bodies are otherwise untouched) and deleted the pridepy/files/ package entirely. Breaking change for any downstream code doing 'from pridepy.files.files import Files' — use 'from pridepy.download.client import Client' instead. Pre-1.0 cleanup; the class was already renamed Files -> Client in this PR. Also updated two stale docstrings (transport.py, client.py) that referenced the old pridepy.files.files path. 68 passed, 4 skipped. flake8 (E9,F63,F7,F82) clean. Smoke MD5 unchanged.
Code reviewFound 1 issue:
Lines 254 to 272 in 121b06b - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Addresses the code-review findings on PR #105 and reorganizes the README command docs into collapsible sections. - README: fix broken Python-API imports (pridepy.files.files -> download.client) - README: restructure CLI docs into "PRIDE File Downloads", "Metadata and Search", and "Download from ProteomeXchange and other repositories"; wrap command/option subsections in <details> for an expandable layout - remove dead by_list.py (logic lives in Provider.download_by_filenames) and drop it from the download package docstring - fix stale "Files" facade references in pride.py / proteomexchange.py / util.py docstrings (renamed to Client) - correct base.py / __init__.py adapter-contract docstrings (PrideProvider also overrides get_download_url, download_files, download_by_name) - remove unwired supports_checksum ClassVar from Provider
- README: -w/--parallel-files is not available on download-file-by-name; note this in the common-options table - README + iprox.py docstring: iProX serves over plain HTTP, not HTTPS (DOWNLOAD_BASE_URL uses http://); fix the wording - README: drop the over-broad "follow the collection layout" claim (files are written by basename) and scope the post-transfer size check to FTP - fix search default sort field: submission_date -> submissionDate so the default matches the allowed click.Choice values
Code reviewRe-reviewed at
No remaining issues in the changed code. Pre-existing (out of scope for this refactor, previously raised on #98/#100) — worth a dedicated follow-up in
pridepy/pridepy/download/transport.py Lines 274 to 280 in 3188719
pridepy/pridepy/download/transport.py Lines 377 to 414 in 3188719 - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Addresses the two long-standing transport data-integrity issues flagged on #98/#100 for the direct-download providers (MassIVE/JPOST/iProX). - Thread each record's relativePath from Provider.download_files through to download_ftp_urls / download_http_urls, writing files to output_folder/<relativePath> instead of flattening to the URL basename. Identically-named files in different sub-collections no longer overwrite each other, and --skip-if-downloaded-already no longer skips the wrong file. Parent directories are created per file; a _safe_join guard prevents path traversal outside output_folder. - Add a post-transfer size check in _parallel_download (HTTP/globus) mirroring the FTP path: a stream shorter than Content-Length now raises so the retry loop re-downloads (Range-resuming when possible) instead of silently accepting a truncated file. - Remove the now-unused _local_path_for_url helper (superseded by _dest_path). - Tests: collision-free relative_paths threading (ftp+http), truncated-stream raise, _safe_join traversal guard; update massive/jpost exact-call asserts.
Some networks block FTP/FTPS outright. MassiveProvider.list_files now tries the FTPS tree walk first and, on failure, falls back to an all-HTTPS path: - list files from the GNPS2 dataset cache file index (datasetcache.gnps2.org, CSV stream) over HTTPS - build records whose download URL is the ProteoSAFe endpoint (massive.ucsd.edu/ProteoSAFe/DownloadResultFile?forceDownload=true&file=f.<acc>/<path>) Downloads then route over HTTPS automatically (scheme-based dispatch in Provider.download_files), with relativePath preserving the dataset layout. Verified live: FTPS and HTTPS copies of the same file are byte-identical (md5 match). Adds unit tests for the URL builder, the HTTPS record, and the FTPS->HTTPS listing fallback.
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
There was a problem hiding this comment.
Pull request overview
This PR refactors pridepy’s file download system into a modular provider-based architecture and adds native direct-download support for MassIVE, JPOST, and iProX accessions while bumping the package to 0.0.16.
Changes:
- Replaces the monolithic
Filesimplementation withpridepy.download.Client, provider adapters, shared transport, registry, and utility modules. - Adds direct-download listing/download paths for MassIVE, JPOST, iProX, and explicit ProteomeXchange XML downloads.
- Updates CLI/tests/docs for the new download architecture and adds Codacy tooling files.
Reviewed changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| README.md | Documents the new modular/direct-download workflows and updated Python API import. |
| pyproject.toml | Bumps package version to 0.0.16. |
| pridepy/pridepy.py | Switches CLI to Client as Files and updates search sort default/help text. |
| pridepy/files/files.py | Removes the old monolithic Files implementation. |
| pridepy/files/init.py | Leaves the old package shim empty. |
| pridepy/download/init.py | Introduces the new download package overview. |
| pridepy/download/base.py | Adds shared provider template-method workflow. |
| pridepy/download/client.py | Adds the public Client facade over provider dispatch. |
| pridepy/download/registry.py | Adds accession-to-provider registration/resolution. |
| pridepy/download/transport.py | Adds shared FTP/FTPS/HTTP transport, resume, parallelism, and relative-path handling. |
| pridepy/download/util.py | Adds checksum/progress/download validation utilities. |
| pridepy/download/by_url.py | Moves URL-list download command helpers into the download package. |
| pridepy/download/pride.py | Moves PRIDE-specific listing, private download, and multi-protocol fallback logic. |
| pridepy/download/massive.py | Adds MassIVE direct FTPS listing/download with HTTPS fallback. |
| pridepy/download/jpost.py | Adds JPOST PROXI listing with FTP fallback/download records. |
| pridepy/download/iprox.py | Adds iProX PX XML listing and HTTP download records. |
| pridepy/download/proteomexchange.py | Adds explicit ProteomeXchange XML raw-file download support. |
| pridepy/tests/test_authentication.py | Updates import to the new Client facade. |
| pridepy/tests/test_search.py | Updates import to the new Client facade. |
| pridepy/tests/test_raw_files.py | Updates import to the new Client facade. |
| pridepy/tests/test_massive_files.py | Updates MassIVE tests and adds HTTPS fallback/parallel/relative-path coverage. |
| pridepy/tests/test_jpost_files.py | Adds JPOST direct-download provider coverage. |
| pridepy/tests/test_iprox_files.py | Adds iProX direct-download provider coverage. |
| pridepy/tests/test_ftp_download_validation.py | Adds FTP size-mismatch retry validation coverage. |
| pridepy/tests/test_download_resilience.py | Updates patch targets and adds transport resilience/relative-path tests. |
| pridepy/tests/test_download_by_url.py | Updates URL-download tests to patch the new module location. |
| pridepy/tests/test_download_by_list.py | Updates list-download tests for provider-based dispatch. |
| .codacy/codacy.yaml | Adds Codacy runtime/tool configuration. |
| .codacy/cli.sh | Adds Codacy CLI bootstrap script. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Addresses two Codex review findings. High — direct-download commands reported success even when transfers failed: transport.download_ftp_urls / download_http_urls swallowed per-file exceptions (log + continue) and returned, so Provider.download_files could not detect failures. The FTP serial/parallel helpers now collect and return failed paths, the HTTP loop collects failed URLs, and both download_ftp_urls and download_http_urls raise RuntimeError when any file fails. This restores parity with the PRIDE path (which raises) and makes the earlier HTTP/FTP post-transfer size checks actually surface as command failures. Successful downloads return empty failure lists and do not raise (verified live for MassIVE FTPS and iProX HTTP). Medium — download-px-raw-files flattened every URI to its basename, so duplicate raw filenames in different directories overwrote each other. ProteomeXchangeProvider.list_files now derives a relativePath per file by stripping the common parent directory shared by all URIs (e.g. run1/sample.raw vs run2/sample.raw), with a basename fallback for a single file or no shared prefix. Adds tests: FTP/HTTP failure propagation, Provider.download_files propagation, and PX duplicate-basename disambiguation.
- by_url: _http_download_url / _ftp_download_url now verify the written size against Content-Length / ftp.size() and raise on truncation (validate_download only checked non-empty, so a partial file slipped through and could be wrongly skipped on the next run). _download_single_url removes a partial file when the transfer raises. Brings download-files-by-url to parity with the accession transport's post-transfer size check. - iprox: iProX serves over plain HTTP, not HTTPS — fix the _build_file_record parameter name (https_url -> file_url), the location-label comment, and the "no downloadable URIs" error message. - base: clarify get_download_url docstring re: records (MassIVE HTTPS fallback) that use a location name other than "FTP Protocol" and fall through to the first location. - pride: docstrings list s3 as a supported protocol and drop the nonexistent "Phase 3" wording (the per-file fallback is part of Phase 2). - test: by_url HTTP truncation now raises.
…aths Follow-up review of the failure-propagation/size-verification work. - Size verification (by_url._http_download_url and transport._parallel_download) skipped the check when the response carries Content-Encoding: requests decompresses gzip/deflate transparently, so on-disk size != Content-Length (compressed) and the check would have falsely raised "Incomplete download" and deleted an intact file. Also compute the size once for the message. - ProteomeXchange._relative_paths_for_urls: when raw URIs span different top-level directories the common prefix is "/", which the previous guard rejected and collapsed to a colliding basename. Strip any non-empty common prefix (including "/") so run1/x.raw and run2/x.raw stay distinct. - Docs: document that download_ftp_urls/download_http_urls raise on failure; README's download-files-by-url -w applies to any scheme (not globus-only); iProX accepts http(s); base warning says "ftp / http(s) only". - Tests: gzip response is not treated as truncated; PX root-prefix paths disambiguate.
Review follow-ups (no functional change):
- download-files-by-url --help for -w said "Primarily used by globus
protocol", but by_url parallelizes for any scheme; align with the README.
- the three _parallel_download success-path tests left response.headers as a
bare Mock, so the new Content-Encoding guard saw a truthy value and silently
skipped the size check; set headers={} so the check is actually exercised.
Summary
Consolidates the modular-download refactor (previously split across #103, #104, #105) into a single PR targeting
master. Bumps0.0.15→0.0.16.Reorganizes the download stack into a per-provider, object-oriented architecture and adds direct downloads for non-ProteomeXchange datasets.
Architecture
pridepy/download/package (renamed fromproviders/, folds in the oldcommands/). A Template-MethodProviderbase owns the download workflow; adapters implementmatches+list_files. PRIDE additionally overridesget_download_url,download_files, anddownload_by_name.files.py/Files→download/client.py/Client(facade dispatches via a registry). The oldpridepy/files/shim is removed (breaking import change — see below).Direct downloads (datasets without a ProteomeXchange accession)
MSV…): FTPS atmassive-ftp.ucsd.edu, with an automatic HTTPS fallback for FTP/FTPS-blocked networks (GNPS2 file index for listing + ProteoSAFe for downloads; verified byte-identical to FTPS).JPST…): PROXI JSON listing +ftp.jpostdb.org.IPX…): ProteomeXchange XML + anonymous HTTP atdownload.iprox.org.-w), and post-transfer size verification.Data-integrity fixes (transport)
relativePath) instead of flattening to the basename — identically-named files in different collections no longer collide (with a path-traversal guard).Content-Lengthand retry on truncation, matching the FTP path.Docs
Breaking change
from pridepy.files.files import Filesis removed. Usefrom pridepy.download.client import Client(orClient as Files).Test plan
pytest— 75 passed / 4 skippedE9,F63,F7,F82) cleanSupersedes #103 and #104.