feat: modular download architecture + direct downloads (MassIVE/JPOST/iProX) (0.0.16) by ypriverol · Pull Request #105 · PRIDE-Archive/pridepy

ypriverol · 2026-05-28T06:14:49Z

Summary

Consolidates the modular-download refactor (previously split across #103, #104, #105) into a single PR targeting master. Bumps 0.0.15 → 0.0.16.

Reorganizes the download stack into a per-provider, object-oriented architecture and adds direct downloads for non-ProteomeXchange datasets.

Architecture

New pridepy/download/ package (renamed from providers/, folds in the old commands/). A Template-Method Provider base owns the download workflow; adapters implement matches + list_files. PRIDE additionally overrides get_download_url, download_files, and download_by_name.
files.py/Files → download/client.py/Client (facade dispatches via a registry). The old pridepy/files/ shim is removed (breaking import change — see below).

Direct downloads (datasets without a ProteomeXchange accession)

MassIVE (MSV…): FTPS at massive-ftp.ucsd.edu, with an automatic HTTPS fallback for FTP/FTPS-blocked networks (GNPS2 file index for listing + ProteoSAFe for downloads; verified byte-identical to FTPS).
JPOST (JPST…): PROXI JSON listing + ftp.jpostdb.org.
iProX (IPX…): ProteomeXchange XML + anonymous HTTP at download.iprox.org.
Parity with PRIDE downloads: resume, per-file retries, parallel workers (-w), and post-transfer size verification.

Data-integrity fixes (transport)

Preserve each dataset's sub-directory layout (relativePath) instead of flattening to the basename — identically-named files in different collections no longer collide (with a path-traversal guard).
HTTP/globus downloads now verify transfer size against Content-Length and retry on truncation, matching the FTP path.

Docs

README restructured into collapsible sections: PRIDE File Downloads, Metadata and Search, Download from ProteomeXchange and other repositories.

Breaking change

from pridepy.files.files import Files is removed. Use from pridepy.download.client import Client (or Client as Files).

Test plan

pytest — 75 passed / 4 skipped
flake8 (E9,F63,F7,F82) clean
Live smoke: MassIVE FTPS vs HTTPS byte-identical (md5 match); JPOST / iProX single-file downloads

Supersedes #103 and #104.

Remove conda recipe packaging workflow

Dev

Improving parallelism

Minor changes in parallelization

Minor change to have the version of the tool.

Improve MassIVE download.

Extends the direct-download support introduced for MassIVE in PR #98 to two more proteomics repositories whose datasets are often standalone (no ProteomeXchange accession): - JPOST (Japan ProteOme STandard Repository): JPST\d{6} accessions, listed and downloaded from ftp.jpostdb.org. - iProX (Integrated Proteome resources): IPX\d{7,10} accessions, listed and downloaded from ftp.iprox.cn. Refactor: - Add is_direct_download_accession() unifying the MSV/JPST/IPX checks, plus _list_direct_download_files() and _download_direct_download_records() dispatchers. All call sites (get_all_raw_file_list, download_all_raw_files, download_all_category_files, get_file_from_api, download_file_by_name, download_files_by_list, get_all_category_file_list) now go through the unified entry points. - Extract _list_ftp_repo_files() helper so the FTP connection lifecycle (connect / login / passive / walk / quit) lives in one place. As part of that, fix the FTP-constructor-outside-try issue flagged in the PR #98 review: a connect failure no longer triggers NameError in finally. - Keep is_massive_accession, _list_massive_public_files, and _download_massive_file_records as thin backward-compatible wrappers so existing tests and external callers continue to work. Tests: add test_jpost_files.py and test_iprox_files.py mirroring the MassIVE coverage (regex match, record building, raw-only filtering, and the download_file_by_name happy path). All 19 direct-download tests pass. Version: 0.0.16.

Address feedback that direct downloads should match the PRIDE feature set (resume, parallel, retries) and that MassIVE's actual FTP server requires TLS. Live findings: - massive-ftp.ucsd.edu now rejects plain anonymous FTP with 421 TLS is required. The merged PR #98 code was effectively broken against the live server. Switch to FTP_TLS + PROT P. - ftp.jpostdb.org accepts plain anonymous FTP; keep as-is. - ftp.iprox.cn does not resolve (DNS fail) and no other iProX FTP host responds. iProX is HTTPS-only and needs a different transport (REST API). Defer iProX support; user to provide the endpoint. Implementation: - New static _open_ftp_connection(host, use_tls) opens FTP or FTP_TLS with the right TLS setup, and transparently falls back to FTPS if a plain FTP server replies 'TLS is required'. - _list_ftp_repo_files() and download_ftp_urls() both grow a use_tls flag. _repo_uses_tls(accession) wires this from the repo type (MassIVE = True, JPOST = False). - download_ftp_urls() grows parallel_files: when >1, a ThreadPoolExecutor runs that many FTP workers per host, each with its own connection. Existing serial single-connection-per-host path is preserved for parallel_files <= 1. - Extracted _download_one_ftp_path() with REST-based resume + per-file retries; _download_ftp_paths_serial() / _download_ftp_paths_parallel() pick the right scheduling. REST resume verified live (3 KB pre-stage -> 10 KB final, MD5 matches full file). - _download_direct_download_records() now accepts parallel_files and forwards it (along with use_tls derived from the accession) to download_ftp_urls. All call sites (download_all_raw_files, download_all_category_files, download_files_by_list) thread the user-supplied -w/--parallel-files through. Tests: - test_jpost_files / test_massive_files updated to assert the new kwargs (use_tls, parallel_files). - New test_repo_uses_tls_true_for_massive_false_for_jpost and test_download_all_raw_files_threads_parallel_files_for_massive. - test_iprox_files.py removed (iProX is deferred). - All 15 unit tests pass. Live testing: - MassIVE: listed MSV000080175 (44 files), single-file download of params.xml (10315 B, MD5 43d87368d705c3f380c1d030b14850c4), REST resume from a 3000 B partial, and 3-worker parallel download of files from MSV000080175 + MSV000078335 all succeeded. - JPOST: rate-limited from this IP ('421 too many connections'). Code path is structurally identical to MassIVE (same FTP helper), routing covered by unit tests; deferred to a follow-up live check.

Address deferred items from PR review: 1. JPOST PROXI listing ftp.jpostdb.org rate-limits aggressively per source IP (sticky 421 on any retry within ~10 min). For listing this is fatal because every pridepy invocation needs a fresh tree walk. JPOST publishes a JSON PROXI endpoint at https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx> that returns datasetFiles[*].value as ftp:// URIs alongside CV labels (Associated raw file URI, Search engine output file URI, Result file URI, Peak list file URI, ...). _list_jpost_public_files now hits PROXI first, builds file records with categories derived from the CV name (mapped via JPOST_PROXI_CATEGORY_MAP), and falls back to the FTP tree walk only if PROXI is unreachable or returns no records. Live-tested against JPST002311 -> 160 files (88 RAW, 72 SEARCH). 2. Post-transfer size check Neither MassIVE nor JPOST publishes per-file checksum manifests in a standard location, so md5 verification isn't an option for these datasets. As a lighter-weight integrity signal, _download_one_ftp_path now compares the local file size against ftp.size() after retrbinary returns and treats a mismatch as a retryable failure (next attempt resumes via REST from the current partial). This catches half-finished transfers where the data channel was closed early without retrbinary raising. 3. iProX accession guard Probing showed iProX REST endpoints (PMD009Controller/findByProjectId.jsonp, findFilesBySubProjectID.jsonp) all redirect unauthenticated callers to a CAS login page, and downloads use faspe:// URLs with per-session tokens. Native support is therefore blocked until iProX exposes an anonymous JSON API or pridepy carries iProX credentials. Until then, add is_iprox_accession() and _raise_if_iprox() so every public entry point (get_all_raw_file_list, download_all_raw_files, download_all_category_files, download_file_by_name, download_files_by_list, get_file_from_api) emits a clear NotImplementedError instead of silently falling through to the PRIDE API and 404-ing. Tests - test_jpost_files: PROXI listing maps CV names to PRIDE categories; PROXI failure falls back to FTP walk. - test_iprox_guard: regex coverage; assert each entry point raises NotImplementedError for IPX accessions; assert is_direct_download_accession returns False for IPX. - test_ftp_download_validation: size mismatch retries until success; repeated mismatch raises after max_download_retries; correct size skips retries. - 25 tests total (10 MassIVE + 8 JPOST + 5 iProX guard + 3 size check). Live verification (same MassIVE FTPS path as before) - MSV000080175 listing + params.xml download still produces 10315 B, MD5 43d87368d705c3f380c1d030b14850c4. - JPST002311 PROXI listing returns 160 files with correct categories. Actual file transfer is still blocked from this IP by JPOST's 421-too-many-connections rate limit; the code path is shared with MassIVE (same _download_one_ftp_path / download_ftp_urls), so a fresh IP should succeed.

….org User pointed out the iProX dataset XML endpoint: https://www.iprox.cn/FAF016Controller/readXml.jsonp?fileId=file_<id>_xml Probing showed something simpler works: iProX publishes the ProteomeXchange XML for every public dataset at a deterministic, anonymous-accessible path on download.iprox.org. No CAS auth, no Aspera tokens, no fileId discovery needed: http://download.iprox.org/<accession>/PX_<accession>.xml The XML embeds Associated raw file URI / Search engine output file URI cvParams pointing at HTTPS file URLs on the same host (Accept-Ranges: bytes, so we get resume for free). The earlier 'iProX is auth-gated' finding was specifically about the iProX UI's CAS-protected JSON endpoints (PMD009Controller/findBySubProjectId.jsonp etc.); the public download server is anonymous. Changes ------- - IPROX_DOWNLOAD_BASE_URL, IPROX_PX_XML_URL_TEMPLATE, IPROX_PX_CATEGORY_MAP added (the latter is the same CV map JPOST PROXI uses). - _build_iprox_file_record() and _list_iprox_public_files() added, mirroring the JPOST helpers but using the PX XML schema. - is_direct_download_accession() now returns True for IPX accessions. - _raise_if_iprox() removed entirely. All entry points that previously called it no longer do, so IPX accessions flow through the unified direct-download dispatcher. - _list_direct_download_files() dispatches IPX -> _list_iprox_public_files. - _download_direct_download_records() now partitions URLs by scheme: ftp:// records go through download_ftp_urls (MassIVE/JPOST), http(s):// records through download_http_urls (iProX). A dataset whose records somehow contain both flows through both paths correctly. - download_http_urls() grew parallel_files + max_retries kwargs and a ThreadPoolExecutor path. The per-file worker _http_download_one() wraps _parallel_download() (reused from the globus codepath) with retry, so iProX gets the same HEAD-then-Range resume and restart-on- non-206 behaviour we already use for PRIDE HTTPS. Tests (26 pass) --------------- - test_iprox_guard.py renamed to test_iprox_files.py and rewritten as positive-path tests: regex coverage, IPX is a direct-download accession, PX XML parsing maps cvParam name -> PRIDE category, RAW filtering ignores non-FTP/HTTPS URIs, download_file_by_name routes iProX URLs to download_http_urls (not download_ftp_urls) with parallel_files=1. - All previously-added MassIVE / JPOST / size-validation tests still pass (20 + 6 iProX = 26 total). Live verification (this branch, against download.iprox.org) ----------------------------------------------------------- - IPX0017413000 listing: 7 files, correctly categorised (RAW + SEARCH). - download_file_by_name(IPX0017413000, protein_annotation_profile.xlsx): 3,253,230 B downloaded, MD5 c17baf230ffde1e2837ec4eb32dcea68, valid XLSX (PK header). - Range-based resume: pre-staged 1,000,000 B partial, completed to 3,253,230 B with the same MD5. - Parallel HTTPS: 3 worker downloads from IPX0017413000 (xlsx + 2 RAW) ran concurrently; cancelled mid-flight after observing all 3 files growing in parallel (xlsx finished at 3.25 MB, Tumor_NK1.raw was at 142 MB and Control_NK3.raw at 419 MB before cancellation).

…Registry Empty scaffolding for the per-provider refactor (spec: docs/specs/2026-05-27-files-py-provider-refactor-design.md). Introduces the Provider abstract base class, BaseDirectDownloadProvider with a shared download_files() that partitions URLs by scheme and routes through Files.download_ftp_urls / Files.download_http_urls (preserving test patches), and a Registry for accession -> provider resolution. No behaviour change. No providers registered yet.

…t.py Pulled download_ftp_urls, download_http_urls, and their helpers (_open_ftp_connection, _walk_ftp_tree, _list_ftp_repo_files, _download_one_ftp_path, _download_ftp_paths_serial/_parallel, _http_download_one, _parallel_download, _local_path_for_url) out of the Files class verbatim into providers/transport.py. Files keeps a shim staticmethod for each function that does a lazy import and delegates, so existing test patches like patch.object(Files, 'download_ftp_urls') keep intercepting calls. No behaviour change. Test suite green.

Moved Progress, _find_tsv_columns, _is_md5_checksum, read_checksum_file, compute_md5, validate_download, _remove_if_exists, _get_download_url, _resolve_local_path from Files into providers/util.py. Files keeps shim re-exports so existing references keep working. No behaviour change. Test suite green.

Moved MassIVE listing + record building from Files into providers/massive.py as MassiveProvider(BaseDirectDownloadProvider). Provider is registered with the Registry. Files keeps shim methods (is_massive_accession, _list_massive_public_files, _build_massive_file_record, _get_massive_public_root, _get_massive_public_ftp_url, _map_massive_collection_to_category) that delegate to the provider. MASSIVE_CATEGORY_MAP / MASSIVE_ARCHIVE_FTP constants remain on Files as class-attribute re-exports. All 10 MassIVE tests pass without modification. Full suite green.

Moved JPOST listing (PROXI JSON primary + FTP tree walk fallback) and record building from Files into providers/jpost.py as JpostProvider(BaseDirectDownloadProvider). Provider registered with the Registry. Files keeps shim methods (is_jpost_accession, _list_jpost_public_files, _list_jpost_public_files_via_proxi, _build_jpost_file_record, _get_jpost_public_root, _get_jpost_public_ftp_url) that delegate to the provider. JPOST_ARCHIVE_FTP / JPOST_PROXI_BASE_URL / JPOST_PROXI_CATEGORY_MAP constants remain on Files as class-attribute re-exports. All JPOST tests pass without modification. Full suite green.

Moved iProX listing (PX XML from download.iprox.org) and record building from Files into providers/iprox.py as IproxProvider(BaseDirectDownloadProvider). Provider registered with the Registry. Files keeps shim methods (is_iprox_accession, _list_iprox_public_files, _build_iprox_file_record, _get_iprox_public_root, _get_iprox_public_ftp_url) that delegate to the provider. IPROX_DOWNLOAD_BASE_URL / IPROX_PX_XML_URL_TEMPLATE / IPROX_PX_CATEGORY_MAP constants remain on Files as class-attribute re-exports. _download_direct_download_records on Files now dispatches via the registry instead of branching manually. _list_direct_download_files keeps shim dispatching so existing test patches on _list_massive_public_files and _list_jpost_public_files continue to intercept the calls. All iProX tests pass without modification. Full suite green.

…oning Mixed ftp:// + http(s):// records partition correctly: ftp URLs go to Files.download_ftp_urls with use_tls=True (the MassIVE setting), http URLs go to Files.download_http_urls. Both calls intercepted by patch.object(Files, ...) — proving the provider routes back through Files rather than calling transport directly (preserves test patches).

Moved PRIDE-specific logic (V3 API listing, multi-protocol batch downloader with aspera/s3/ftp/globus fallback, private-dataset path, submitter helpers, Globus/S3 per-protocol downloaders, legacy single-connection FTP) from Files into providers/pride.py as PrideProvider(Provider). ~510 LOC moved out of files.py. Files keeps shim methods for every patched helper (_batch_download_by_protocol, _download_with_fallback, _globus_download_one, download_files_from_globus, download_files_from_s3, download_files_from_ftp, download_private_file_name, get_ascp_binary, save_checksum_file, stream_all_files_by_project, stream_all_files_metadata, get_submitted_file_path_prefix, _protocol_sequence, download_files). PRIDE class constants (V3_API_BASE_URL, API_BASE_URL, API_PRIVATE_URL, PRIDE_ARCHIVE_FTP, *_URL_PREFIX, S3_URL, S3_BUCKET, PROTOCOL_ORDER) remain on Files as re-exports. PrideProvider's internal calls to patch-sensitive helpers go through Files.X (lazy import) so existing test patches in test_download_resilience.py keep working without modification. is_direct_download_accession updated to exclude PRIDE (returns True only for MSV/JPST/IPX) now that PrideProvider is registered too. All 8 patch-sensitive tests in test_download_resilience.py pass. Full suite green at 67 passed, 4 skipped.

Files public methods (get_all_raw_file_list, download_all_raw_files, download_all_category_files, get_all_category_file_list, download_file_by_name, get_file_from_api, download_files_by_list) now dispatch via registry.resolve(accession).{list,download}_files(...). Removed dead helpers: _list_direct_download_files, _download_direct_download_records. Kept _repo_uses_tls as a registry shim. Fixed _download_massive_file_records to use registry. PrideProvider.download_files refactored: the old static method is now _download_files_batch; a proper Provider-interface instance method download_files(self, accession, records, ...) wraps it so PRIDE routes uniformly through the registry like other providers. Tests updated: patches on Files._list_massive_public_files, Files._list_jpost_public_files, files_obj.stream_all_files_by_project, and files_obj.download_files now target the provider classes (MassiveProvider, JpostProvider, PrideProvider) directly. Full suite green. files.py size: 1254 LOC (was 1352 before Task 9).

Verifies that Files().download_all_raw_files for a PXD accession flows through Registry.resolve -> PrideProvider.download_files -> _batch_download_by_protocol (patched via Files), and that _download_with_fallback is only called when batch returns failed files. Also mocks validate_download to return success so the happy-path test correctly asserts that fallback is not invoked when all files pass validation after the primary-protocol batch run. Spec acceptance criterion #5.

Empty scaffold for the follow-up refactor that extracts cross-cutting commands (download_files_by_url, download_files_by_list, download_px_raw_files) from Files into their own modules. No code moved yet. No behaviour change. Test suite green at 68 passed, 4 skipped.

…roteomexchange.py Moved download_px_raw_files, _normalize_px_xml_url, _parse_px_xml_for_raw_file_urls from Files into commands/proteomexchange.py. Files keeps shim re-exports. Also removed now-unused xml.etree.ElementTree import from files.py. No behaviour change. Test suite green.

Moved download_files_by_list from Files into commands/by_list.py. Files keeps a shim re-export. No behaviour change. Test suite green.

Moved download_files_by_url and its 6 helpers (_extract_pride_accession, _validate_urls_checksums, _http_download_url, _ftp_download_url, _dispatch_url_scheme, _download_single_url) from Files into commands/by_url.py. Files keeps shim re-exports for each. Internal calls to patch-sensitive helpers (_http_download_url, _ftp_download_url, _dispatch_url_scheme, _download_single_url) go through Files.X (lazy import) so existing test patches like patch.object(Files, '_http_download_url') keep intercepting. files.py drops below 1000 LOC. No behaviour change. Test suite green at 68 passed, 4 skipped.

…/ as a class ProteomeXchange behaves more like a provider than a command: it takes a PXD/PRD accession (or a ProteomeCentral URL) and returns file records, the same shape as the four other providers. Moving it to providers/ aligns the architecture. Changes: - New: pridepy/providers/proteomexchange.py with ProteomeXchangeProvider(Provider). It implements the full Provider interface (matches, list_files, download_files) plus a convenience method download_from_accession_or_url() for the download-px-raw-files CLI command's existing behaviour (skip_if_downloaded_already defaults to True, no parallel workers). - Deleted: pridepy/commands/proteomexchange.py. Its three functions (_normalize_px_xml_url, _parse_px_xml_for_raw_file_urls, download_px_raw_files) are now static/instance methods on the provider class. - Updated: Files shims for _normalize_px_xml_url, _parse_px_xml_for_raw_file_urls, and download_px_raw_files now delegate to ProteomeXchangeProvider. - Updated: commands/__init__.py docstring notes the move. Important: ProteomeXchangeProvider is NOT auto-registered with pridepy.providers.registry. PXD/PRD accessions continue to route through PrideProvider's V3 API path by default. ProteomeXchangeProvider is the explicit gateway for the cross-repository XML view, invoked via the download-px-raw-files CLI command and Files.download_px_raw_files. Full suite green at 68 passed, 4 skipped. No behaviour change for existing PXD downloads; download-px-raw-files keeps its XML-based listing flow exactly as before.

Two related fixes: 1. CI lint failure: files.py used importlib.resources, subprocess, and time but never imported them. The Task 8 refactor removed these top-level imports along with the PRIDE-specific code that used them in PrideProvider, but two methods on Files (download_files_from_aspera and the now-dead _download_range) still referenced the undefined names. Flake8 with --select=F82 fails on 4 F821 undefined-name errors. Fix: re-add the three stdlib imports at the top of files.py. Also delete the orphaned _download_range method (no callers). 2. Hoist lazy provider/command imports out of Files method bodies. The shim pattern previously did 'from pridepy.providers import X' inside every method body — ~75 occurrences. Since providers do not import Files at module load (only inside method bodies), the reverse direction is safe to hoist: Files now imports {registry, transport, util, IproxProvider, JpostProvider, MassiveProvider, PrideProvider, ProteomeXchangeProvider, by_list, by_url} at module top, and the shims reference these names directly. Lazy imports inside provider/command method bodies that go back to Files (e.g. BaseDirectDownloadProvider.download_files doing 'from pridepy.files.files import Files') are kept lazy — they are genuinely cyclic and required for backward-compat test patching. Also: commands/by_list.py's 'from pridepy.providers import registry' hoisted to module top (no Files dependency, no cycle risk). Note: 'import requests' is kept in files.py (noqa: F401) because test suites patch 'pridepy.files.files.requests.get' directly. Tests: 68 passed, 4 skipped. flake8 --select=E9,F63,F7,F82 now clean.

@staticmethod

Migrated ~17 patch.object(Files, "X") test targets to canonical locations: - PRIDE methods now patched on PrideProvider - transport/util helpers patched on transport / util modules - by_url helpers patched on commands.by_url Moved PRIDE protocol workers permanently into PrideProvider: - download_files_from_{aspera,globus,s3,ftp} - _batch_download_by_protocol, _download_with_fallback, _globus_download_one - _protocol_sequence, download_private_file_name, stream_all_files_by_project - get_submitted_file_path_prefix, save_checksum_file, get_ascp_binary, get_output_file_name Eliminated every from pridepy.files.files import Files lazy import inside providers/ and commands/. Providers call self.X / transport.X / util.X directly. Zero back-into-Files coupling. Files dropped from 912 to 382 LOC: only public CLI methods + class-attribute constant re-exports. Kept ~5 one-line @staticmethod shims for likely downstream imports (compute_md5, validate_download, read_checksum_file, download_ftp_urls, download_http_urls) plus the accession-matcher helpers (is_massive_accession etc. and is_direct_download_accession) as useful API. No behaviour change. No test assertion changes. 68 passed, 4 skipped. Flake8 (--select=E9,F63,F7,F82) clean.

Pure move: pridepy/providers/* -> pridepy/download/*, plus commands/by_url.py and commands/by_list.py -> download/. All imports updated from pridepy.providers / pridepy.commands to pridepy.download. download/__init__.py docstring rewritten to describe the download subsystem. No logic change. Test suite green (68 passed, 4 skipped).

The facade moved to pridepy/download/client.py and the class renamed Files -> Client. pridepy/files/files.py is now a back-compat shim (Files = Client, plus Progress re-export). CLI imports Client (aliased as Files internally to keep the diff minimal). Test imports keep working via the shim. No behaviour change; 68 passed, 4 skipped.

Align PrideProvider.download_files signature with the Provider base and add a download_by_name override implementing PRIDE's public/private split (V2 private API for private datasets, inherited public path otherwise). Moves this logic out of the Client facade.

Reduce the public listing/download methods to one-line registry.resolve(...).<workflow>(...) dispatches now that the Provider base owns the workflow. Drops the in-facade PRIDE public/private split and by_list delegation; removes newly-unused imports.

qodo-code-review · 2026-05-28T06:14:54Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-05-28T06:14:57Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a1cf22e-ffb6-4791-baa8-16df289f86b9

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor/provider-template-method

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tier 1 + Tier 2 of the provider-specific helper audit: by_url.py (Tier 1) -> PrideProvider: - _extract_pride_accession -> PrideProvider.extract_accession_from_url - _validate_urls_checksums -> PrideProvider.validate_urls_checksums download_files_by_url now calls PrideProvider.validate_urls_checksums when checksum_check is set. by_url keeps only generic URL-download mechanics. util.py (Tier 2) -> PrideProvider: - _get_download_url (the aspera/ftp/globus/s3 protocol resolution + PRIDE archive HTTPS rewrite for globus) -> PrideProvider._get_download_url - _resolve_local_path -> PrideProvider._resolve_local_path The base Provider.get_download_url is now generic (returns the 'FTP Protocol' location value, which is what MassIVE/JPOST/iProX need); PrideProvider.get_download_url overrides it with the multi-protocol resolution. PRIDE's static download methods call PrideProvider._get_download_url directly. util.py now holds only genuinely generic helpers (Progress, compute_md5, validate_download, read_checksum_file, _find_tsv_columns, _is_md5_checksum, _remove_if_exists). Removed the lazy 'from pridepy.download.pride import PrideProvider' imports that util needed for the moved functions. Test: the one test referencing util._get_download_url now references PrideProvider._get_download_url (PrideProvider already imported). No behaviour change. 68 passed, 4 skipped. flake8 (E9,F63,F7,F82) clean. Smoke MD5 unchanged.

The pridepy/files/files.py shim (Files = Client re-export) was load-bearing only for the test suite — the CLI already imports from pridepy.download.client. Updated the 9 test files to 'from pridepy.download.client import Client as Files' (keeps the local Files alias so test bodies are otherwise untouched) and deleted the pridepy/files/ package entirely. Breaking change for any downstream code doing 'from pridepy.files.files import Files' — use 'from pridepy.download.client import Client' instead. Pre-1.0 cleanup; the class was already renamed Files -> Client in this PR. Also updated two stale docstrings (transport.py, client.py) that referenced the old pridepy.files.files path. 68 passed, 4 skipped. flake8 (E9,F63,F7,F82) clean. Smoke MD5 unchanged.

ypriverol · 2026-05-28T07:13:54Z

Code review

Found 1 issue:

The README Python API examples import from pridepy.files.files import Files, but this PR deletes the pridepy/files/ package (the Files = Client back-compat shim, removed in 121b06b). Both examples now raise ModuleNotFoundError for anyone following the docs. Update them to from pridepy.download.client import Client (or from pridepy.download.client import Client as Files to keep the rest of the snippet unchanged).

pridepy/README.md

Lines 254 to 272 in 121b06b

    
           ```python 
        
           from pridepy.files.files import Files 
        
           files = Files() 
        
           raw_files = files.get_all_raw_file_list("PXD008644") 
        
           print(f"RAW files: {len(raw_files)}") 
        
           print(raw_files[0]["fileName"]) 
        
           ``` 
        
           For MassIVE / JPOST / iProX accessions, the same method returns the files found under the dataset's `raw/` collection: 
        
           ```python 
        
           from pridepy.files.files import Files 
        
           files = Files() 
        
           for accession in ("MSV000082297", "JPST002311", "IPX0017413000"): 
        
               raw_files = files.get_all_raw_file_list(accession) 
        
               print(f"{accession} raw files: {len(raw_files)}") 
        
           ```

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

Addresses the code-review findings on PR #105 and reorganizes the README command docs into collapsible sections. - README: fix broken Python-API imports (pridepy.files.files -> download.client) - README: restructure CLI docs into "PRIDE File Downloads", "Metadata and Search", and "Download from ProteomeXchange and other repositories"; wrap command/option subsections in <details> for an expandable layout - remove dead by_list.py (logic lives in Provider.download_by_filenames) and drop it from the download package docstring - fix stale "Files" facade references in pride.py / proteomexchange.py / util.py docstrings (renamed to Client) - correct base.py / __init__.py adapter-contract docstrings (PrideProvider also overrides get_download_url, download_files, download_by_name) - remove unwired supports_checksum ClassVar from Provider

- README: -w/--parallel-files is not available on download-file-by-name; note this in the common-options table - README + iprox.py docstring: iProX serves over plain HTTP, not HTTPS (DOWNLOAD_BASE_URL uses http://); fix the wording - README: drop the over-broad "follow the collection layout" claim (files are written by basename) and scope the post-transfer size check to FTP - fix search default sort field: submission_date -> submissionDate so the default matches the allowed click.Choice values

ypriverol · 2026-05-28T08:26:50Z

Code review

Re-reviewed at 3188719. The earlier finding (README imports) and the accuracy issues introduced during the README restructure are all resolved in the latest commits:

-w/--parallel-files is no longer shown as a shared option for download-file-by-name (the command has no such flag).
iProX is now described as plain HTTP (the DOWNLOAD_BASE_URL uses http://), in both the README and the iprox.py docstring.
Dropped the over-broad "follows the collection layout" claim (files are written by basename) and scoped the post-transfer size check to FTP.
Search default sort field fixed: submission_date → submissionDate so the default matches the allowed click.Choice values.

No remaining issues in the changed code.

Pre-existing (out of scope for this refactor, previously raised on #98/#100) — worth a dedicated follow-up in download/transport.py:

FTP downloads flatten the tree to the basename, so identically-named files in different sub-collections overwrite each other and --skip-if-downloaded-already can skip the wrong file. The relativePath on each record isn't threaded through to the transport layer.

pridepy/pridepy/download/transport.py

Lines 274 to 280 in 3188719

    
               Download paths concurrently using ``parallel_files`` workers; each 
        
               worker opens its own FTP connection so transfers don't serialize. 
        
               """ 
        
               def worker(ftp_path: str, position: int) -> None: 
        
                   local_path = os.path.join(output_folder, os.path.basename(ftp_path)) 
        
                   if skip_if_downloaded_already and os.path.exists(local_path): 
        
                       logging.info(f"Skipping download as file already exists: {local_path}")

The HTTP/globus path (_parallel_download) has no post-transfer size verification, unlike the FTP path (which checks final_size != total_size and retries). A stream truncated mid-transfer is accepted silently.

pridepy/pridepy/download/transport.py

Lines 377 to 414 in 3188719

    
           def _parallel_download(url, file_path, position=0): 
        
               """Download a file via a single-connection HTTP stream with optional resume. 
        
               If a partial file exists and the server supports Range requests, resumes 
        
               from where it left off; otherwise restarts from scratch.""" 
        
               session = Util.create_session_with_retries() 
        
               try: 
        
                   head = session.head(url, timeout=(30, 30)) 
        
                   head.raise_for_status() 
        
                   total_size = int(head.headers.get("content-length", 0)) 
        
                   accept_ranges = head.headers.get("accept-ranges", "none").strip().lower() 
        
               except (requests.RequestException, ValueError) as exc: 
        
                   logging.info(f"HEAD request failed, falling back to single connection: {exc}") 
        
                   total_size = 0 
        
                   accept_ranges = "none" 
        
               resume_size = 0 
        
               if os.path.exists(file_path) and accept_ranges == "bytes" and total_size > 0: 
        
                   resume_size = os.path.getsize(file_path) 
        
                   if resume_size >= total_size: 
        
                       logging.info(f"File already complete: {file_path}") 
        
                       return 
        
                   if resume_size > 0: 
        
                       logging.info(f"Resuming download from {resume_size} bytes: {file_path}") 
        
               headers = {"Range": f"bytes={resume_size}-"} if resume_size > 0 else {} 
        
               with session.get(url, headers=headers, stream=True, timeout=(30, 60)) as r: 
        
                   r.raise_for_status() 
        
                   if resume_size > 0 and r.status_code != 206: 
        
                       logging.warning("Server did not honor Range request (status %s), restarting download", r.status_code) 
        
                       resume_size = 0 
        
                   with tqdm(total=total_size, unit="B", unit_scale=True, desc=file_path, 
        
                             initial=resume_size, position=position, leave=True) as pbar: 
        
                       mode = "ab" if resume_size > 0 else "wb" 
        
                       with open(file_path, mode, buffering=8 * 1024 * 1024) as f: 
        
                           for chunk in r.iter_content(chunk_size=8 * 1024 * 1024): 
        
                               if chunk: 
        
                                   f.write(chunk) 
        
                                   pbar.update(len(chunk))

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

Addresses the two long-standing transport data-integrity issues flagged on #98/#100 for the direct-download providers (MassIVE/JPOST/iProX). - Thread each record's relativePath from Provider.download_files through to download_ftp_urls / download_http_urls, writing files to output_folder/<relativePath> instead of flattening to the URL basename. Identically-named files in different sub-collections no longer overwrite each other, and --skip-if-downloaded-already no longer skips the wrong file. Parent directories are created per file; a _safe_join guard prevents path traversal outside output_folder. - Add a post-transfer size check in _parallel_download (HTTP/globus) mirroring the FTP path: a stream shorter than Content-Length now raises so the retry loop re-downloads (Range-resuming when possible) instead of silently accepting a truncated file. - Remove the now-unused _local_path_for_url helper (superseded by _dest_path). - Tests: collision-free relative_paths threading (ftp+http), truncated-stream raise, _safe_join traversal guard; update massive/jpost exact-call asserts.

Some networks block FTP/FTPS outright. MassiveProvider.list_files now tries the FTPS tree walk first and, on failure, falls back to an all-HTTPS path: - list files from the GNPS2 dataset cache file index (datasetcache.gnps2.org, CSV stream) over HTTPS - build records whose download URL is the ProteoSAFe endpoint (massive.ucsd.edu/ProteoSAFe/DownloadResultFile?forceDownload=true&file=f.<acc>/<path>) Downloads then route over HTTPS automatically (scheme-based dispatch in Provider.download_files), with relativePath preserving the dataset layout. Verified live: FTPS and HTTPS copies of the same file are byte-identical (md5 match). Adds unit tests for the URL builder, the HTTPS record, and the FTPS->HTTPS listing fallback.

Consolidate the modular-download work (formerly split across PRs #103/#104/ #105) into a single release off master (0.0.15 -> 0.0.16).

ypriverol · 2026-05-28T13:34:56Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Copilot

Pull request overview

This PR refactors pridepy’s file download system into a modular provider-based architecture and adds native direct-download support for MassIVE, JPOST, and iProX accessions while bumping the package to 0.0.16.

Changes:

Replaces the monolithic Files implementation with pridepy.download.Client, provider adapters, shared transport, registry, and utility modules.
Adds direct-download listing/download paths for MassIVE, JPOST, iProX, and explicit ProteomeXchange XML downloads.
Updates CLI/tests/docs for the new download architecture and adds Codacy tooling files.

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
README.md	Documents the new modular/direct-download workflows and updated Python API import.
pyproject.toml	Bumps package version to `0.0.16`.
pridepy/pridepy.py	Switches CLI to `Client as Files` and updates search sort default/help text.
pridepy/files/files.py	Removes the old monolithic `Files` implementation.
pridepy/files/init.py	Leaves the old package shim empty.
pridepy/download/init.py	Introduces the new download package overview.
pridepy/download/base.py	Adds shared provider template-method workflow.
pridepy/download/client.py	Adds the public `Client` facade over provider dispatch.
pridepy/download/registry.py	Adds accession-to-provider registration/resolution.
pridepy/download/transport.py	Adds shared FTP/FTPS/HTTP transport, resume, parallelism, and relative-path handling.
pridepy/download/util.py	Adds checksum/progress/download validation utilities.
pridepy/download/by_url.py	Moves URL-list download command helpers into the download package.
pridepy/download/pride.py	Moves PRIDE-specific listing, private download, and multi-protocol fallback logic.
pridepy/download/massive.py	Adds MassIVE direct FTPS listing/download with HTTPS fallback.
pridepy/download/jpost.py	Adds JPOST PROXI listing with FTP fallback/download records.
pridepy/download/iprox.py	Adds iProX PX XML listing and HTTP download records.
pridepy/download/proteomexchange.py	Adds explicit ProteomeXchange XML raw-file download support.
pridepy/tests/test_authentication.py	Updates import to the new `Client` facade.
pridepy/tests/test_search.py	Updates import to the new `Client` facade.
pridepy/tests/test_raw_files.py	Updates import to the new `Client` facade.
pridepy/tests/test_massive_files.py	Updates MassIVE tests and adds HTTPS fallback/parallel/relative-path coverage.
pridepy/tests/test_jpost_files.py	Adds JPOST direct-download provider coverage.
pridepy/tests/test_iprox_files.py	Adds iProX direct-download provider coverage.
pridepy/tests/test_ftp_download_validation.py	Adds FTP size-mismatch retry validation coverage.
pridepy/tests/test_download_resilience.py	Updates patch targets and adds transport resilience/relative-path tests.
pridepy/tests/test_download_by_url.py	Updates URL-download tests to patch the new module location.
pridepy/tests/test_download_by_list.py	Updates list-download tests for provider-based dispatch.
.codacy/codacy.yaml	Adds Codacy runtime/tool configuration.
.codacy/cli.sh	Adds Codacy CLI bootstrap script.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Addresses two Codex review findings. High — direct-download commands reported success even when transfers failed: transport.download_ftp_urls / download_http_urls swallowed per-file exceptions (log + continue) and returned, so Provider.download_files could not detect failures. The FTP serial/parallel helpers now collect and return failed paths, the HTTP loop collects failed URLs, and both download_ftp_urls and download_http_urls raise RuntimeError when any file fails. This restores parity with the PRIDE path (which raises) and makes the earlier HTTP/FTP post-transfer size checks actually surface as command failures. Successful downloads return empty failure lists and do not raise (verified live for MassIVE FTPS and iProX HTTP). Medium — download-px-raw-files flattened every URI to its basename, so duplicate raw filenames in different directories overwrote each other. ProteomeXchangeProvider.list_files now derives a relativePath per file by stripping the common parent directory shared by all URIs (e.g. run1/sample.raw vs run2/sample.raw), with a basename fallback for a single file or no shared prefix. Adds tests: FTP/HTTP failure propagation, Provider.download_files propagation, and PX duplicate-basename disambiguation.

- by_url: _http_download_url / _ftp_download_url now verify the written size against Content-Length / ftp.size() and raise on truncation (validate_download only checked non-empty, so a partial file slipped through and could be wrongly skipped on the next run). _download_single_url removes a partial file when the transfer raises. Brings download-files-by-url to parity with the accession transport's post-transfer size check. - iprox: iProX serves over plain HTTP, not HTTPS — fix the _build_file_record parameter name (https_url -> file_url), the location-label comment, and the "no downloadable URIs" error message. - base: clarify get_download_url docstring re: records (MassIVE HTTPS fallback) that use a location name other than "FTP Protocol" and fall through to the first location. - pride: docstrings list s3 as a supported protocol and drop the nonexistent "Phase 3" wording (the per-file fallback is part of Phase 2). - test: by_url HTTP truncation now raises.

…aths Follow-up review of the failure-propagation/size-verification work. - Size verification (by_url._http_download_url and transport._parallel_download) skipped the check when the response carries Content-Encoding: requests decompresses gzip/deflate transparently, so on-disk size != Content-Length (compressed) and the check would have falsely raised "Incomplete download" and deleted an intact file. Also compute the size once for the message. - ProteomeXchange._relative_paths_for_urls: when raw URIs span different top-level directories the common prefix is "/", which the previous guard rejected and collapsed to a colliding basename. Strip any non-empty common prefix (including "/") so run1/x.raw and run2/x.raw stay distinct. - Docs: document that download_ftp_urls/download_http_urls raise on failure; README's download-files-by-url -w applies to any scheme (not globus-only); iProX accepts http(s); base warning says "ftp / http(s) only". - Tests: gzip response is not treated as truncated; PX root-prefix paths disambiguate.

Review follow-ups (no functional change): - download-files-by-url --help for -w said "Primarily used by globus protocol", but by_url parallelizes for any scheme; align with the README. - the three _parallel_download success-path tests left response.headers as a bare Mock, so the new Content-Encoding guard saw a truthy value and silently skipped the size check; set headers={} so the check is actually exercised.

ypriverol added 30 commits April 22, 2026 11:29

Merge pull request #89 from PRIDE-Archive/dev

309e361

Remove conda recipe packaging workflow

Merge pull request #91 from PRIDE-Archive/dev

ab6ce0d

Dev

Merge pull request #93 from PRIDE-Archive/dev

a99c73b

Improving parallelism

Merge pull request #95 from PRIDE-Archive/dev

cacf553

Minor changes in parallelization

Merge pull request #97 from PRIDE-Archive/dev

6c554b5

Minor change to have the version of the tool.

Merge pull request #99 from PRIDE-Archive/dev

c09cca5

Improve MassIVE download.

chore(release): bump version to 0.0.17

f60a69c

refactor(commands): move download_files_by_list into commands/by_list.py

1a0a9b8

Moved download_files_by_list from Files into commands/by_list.py. Files keeps a shim re-export. No behaviour change. Test suite green.

ypriverol added 3 commits May 28, 2026 07:11

chore(release): bump version to 0.0.19

4746408

ypriverol added 2 commits May 28, 2026 07:42

ypriverol added 2 commits May 28, 2026 09:12

ypriverol added 3 commits May 28, 2026 09:54

chore(release): set version to 0.0.16

9c31f15

Consolidate the modular-download work (formerly split across PRs #103/#104/ #105) into a single release off master (0.0.15 -> 0.0.16).

ypriverol changed the title ~~refactor: Provider Template-Method + download/ package (0.0.19)~~ feat: modular download architecture + direct downloads (MassIVE/JPOST/iProX) (0.0.16) May 28, 2026

ypriverol changed the base branch from refactor/files-facade-cleanup to master May 28, 2026 10:03

This was referenced May 28, 2026

refactor: collapse Files shim layer (0.0.18) #104

Closed

feat: modular providers + commands architecture, direct downloads for MassIVE/JPOST/iProX (0.0.17) #103

Closed

ypriverol requested a review from Copilot May 28, 2026 13:38

ypriverol changed the base branch from master to dev May 28, 2026 13:38

Copilot started reviewing on behalf of ypriverol May 28, 2026 13:38 View session

ypriverol requested review from chakrabandla and selvaebi May 28, 2026 13:41

Copilot AI reviewed May 28, 2026

View reviewed changes

Comment thread pridepy/download/transport.py

Comment thread pridepy/download/transport.py Outdated

Comment thread pridepy/download/transport.py

Comment thread pridepy/download/transport.py Outdated

Comment thread pridepy/download/by_url.py

Comment thread .codacy/cli.sh

ypriverol added 4 commits May 28, 2026 16:50

ypriverol merged commit cb7e58a into dev May 28, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: modular download architecture + direct downloads (MassIVE/JPOST/iProX) (0.0.16)#105

feat: modular download architecture + direct downloads (MassIVE/JPOST/iProX) (0.0.16)#105
ypriverol merged 45 commits into
devfrom
refactor/provider-template-method

ypriverol commented May 28, 2026 •

edited

Loading

Uh oh!

qodo-code-review Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Review skipped

Uh oh!

ypriverol commented May 28, 2026

Uh oh!

ypriverol commented May 28, 2026

Uh oh!

ypriverol commented May 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ypriverol commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Direct downloads (datasets without a ProteomeXchange accession)

Data-integrity fixes (transport)

Docs

Breaking change

Test plan

Uh oh!

qodo-code-review Bot commented May 28, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

ypriverol commented May 28, 2026

Code review

Uh oh!

ypriverol commented May 28, 2026

Code review

Uh oh!

ypriverol commented May 28, 2026

Code review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ypriverol commented May 28, 2026 •

edited

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading