Direct downloads: MassIVE FTPS + JPOST PROXI + iProX HTTPS + parallel/resume; 0.0.16#100
Direct downloads: MassIVE FTPS + JPOST PROXI + iProX HTTPS + parallel/resume; 0.0.16#100ypriverol wants to merge 10 commits into
Conversation
Remove conda recipe packaging workflow
Improving parallelism
Minor changes in parallelization
Minor change to have the version of the tool.
Improve MassIVE download.
Extends the direct-download support introduced for MassIVE in PR #98 to two more proteomics repositories whose datasets are often standalone (no ProteomeXchange accession): - JPOST (Japan ProteOme STandard Repository): JPST\d{6} accessions, listed and downloaded from ftp.jpostdb.org. - iProX (Integrated Proteome resources): IPX\d{7,10} accessions, listed and downloaded from ftp.iprox.cn. Refactor: - Add is_direct_download_accession() unifying the MSV/JPST/IPX checks, plus _list_direct_download_files() and _download_direct_download_records() dispatchers. All call sites (get_all_raw_file_list, download_all_raw_files, download_all_category_files, get_file_from_api, download_file_by_name, download_files_by_list, get_all_category_file_list) now go through the unified entry points. - Extract _list_ftp_repo_files() helper so the FTP connection lifecycle (connect / login / passive / walk / quit) lives in one place. As part of that, fix the FTP-constructor-outside-try issue flagged in the PR #98 review: a connect failure no longer triggers NameError in finally. - Keep is_massive_accession, _list_massive_public_files, and _download_massive_file_records as thin backward-compatible wrappers so existing tests and external callers continue to work. Tests: add test_jpost_files.py and test_iprox_files.py mirroring the MassIVE coverage (regex match, record building, raw-only filtering, and the download_file_by_name happy path). All 19 direct-download tests pass. Version: 0.0.16.
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Address feedback that direct downloads should match the PRIDE feature
set (resume, parallel, retries) and that MassIVE's actual FTP server
requires TLS.
Live findings:
- massive-ftp.ucsd.edu now rejects plain anonymous FTP with
421 TLS is required. The merged PR #98 code was effectively broken
against the live server. Switch to FTP_TLS + PROT P.
- ftp.jpostdb.org accepts plain anonymous FTP; keep as-is.
- ftp.iprox.cn does not resolve (DNS fail) and no other iProX FTP
host responds. iProX is HTTPS-only and needs a different transport
(REST API). Defer iProX support; user to provide the endpoint.
Implementation:
- New static _open_ftp_connection(host, use_tls) opens FTP or FTP_TLS
with the right TLS setup, and transparently falls back to FTPS if
a plain FTP server replies 'TLS is required'.
- _list_ftp_repo_files() and download_ftp_urls() both grow a use_tls
flag. _repo_uses_tls(accession) wires this from the repo type
(MassIVE = True, JPOST = False).
- download_ftp_urls() grows parallel_files: when >1, a ThreadPoolExecutor
runs that many FTP workers per host, each with its own connection.
Existing serial single-connection-per-host path is preserved for
parallel_files <= 1.
- Extracted _download_one_ftp_path() with REST-based resume + per-file
retries; _download_ftp_paths_serial() / _download_ftp_paths_parallel()
pick the right scheduling. REST resume verified live (3 KB pre-stage
-> 10 KB final, MD5 matches full file).
- _download_direct_download_records() now accepts parallel_files and
forwards it (along with use_tls derived from the accession) to
download_ftp_urls. All call sites (download_all_raw_files,
download_all_category_files, download_files_by_list) thread the
user-supplied -w/--parallel-files through.
Tests:
- test_jpost_files / test_massive_files updated to assert the new
kwargs (use_tls, parallel_files).
- New test_repo_uses_tls_true_for_massive_false_for_jpost and
test_download_all_raw_files_threads_parallel_files_for_massive.
- test_iprox_files.py removed (iProX is deferred).
- All 15 unit tests pass.
Live testing:
- MassIVE: listed MSV000080175 (44 files), single-file download of
params.xml (10315 B, MD5 43d87368d705c3f380c1d030b14850c4),
REST resume from a 3000 B partial, and 3-worker parallel download
of files from MSV000080175 + MSV000078335 all succeeded.
- JPOST: rate-limited from this IP ('421 too many connections').
Code path is structurally identical to MassIVE (same FTP helper),
routing covered by unit tests; deferred to a follow-up live check.
Address deferred items from PR review:
1. JPOST PROXI listing
ftp.jpostdb.org rate-limits aggressively per source IP (sticky 421 on
any retry within ~10 min). For listing this is fatal because every
pridepy invocation needs a fresh tree walk. JPOST publishes a JSON
PROXI endpoint at
https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>
that returns datasetFiles[*].value as ftp:// URIs alongside CV labels
(Associated raw file URI, Search engine output file URI, Result file
URI, Peak list file URI, ...). _list_jpost_public_files now hits
PROXI first, builds file records with categories derived from the CV
name (mapped via JPOST_PROXI_CATEGORY_MAP), and falls back to the
FTP tree walk only if PROXI is unreachable or returns no records.
Live-tested against JPST002311 -> 160 files (88 RAW, 72 SEARCH).
2. Post-transfer size check
Neither MassIVE nor JPOST publishes per-file checksum manifests in
a standard location, so md5 verification isn't an option for these
datasets. As a lighter-weight integrity signal, _download_one_ftp_path
now compares the local file size against ftp.size() after retrbinary
returns and treats a mismatch as a retryable failure (next attempt
resumes via REST from the current partial). This catches half-finished
transfers where the data channel was closed early without retrbinary
raising.
3. iProX accession guard
Probing showed iProX REST endpoints (PMD009Controller/findByProjectId.jsonp,
findFilesBySubProjectID.jsonp) all redirect unauthenticated callers to
a CAS login page, and downloads use faspe:// URLs with per-session
tokens. Native support is therefore blocked until iProX exposes an
anonymous JSON API or pridepy carries iProX credentials. Until then,
add is_iprox_accession() and _raise_if_iprox() so every public entry
point (get_all_raw_file_list, download_all_raw_files,
download_all_category_files, download_file_by_name, download_files_by_list,
get_file_from_api) emits a clear NotImplementedError instead of
silently falling through to the PRIDE API and 404-ing.
Tests
- test_jpost_files: PROXI listing maps CV names to PRIDE categories;
PROXI failure falls back to FTP walk.
- test_iprox_guard: regex coverage; assert each entry point raises
NotImplementedError for IPX accessions; assert
is_direct_download_accession returns False for IPX.
- test_ftp_download_validation: size mismatch retries until success;
repeated mismatch raises after max_download_retries; correct size
skips retries.
- 25 tests total (10 MassIVE + 8 JPOST + 5 iProX guard + 3 size check).
Live verification (same MassIVE FTPS path as before)
- MSV000080175 listing + params.xml download still produces
10315 B, MD5 43d87368d705c3f380c1d030b14850c4.
- JPST002311 PROXI listing returns 160 files with correct categories.
Actual file transfer is still blocked from this IP by JPOST's
421-too-many-connections rate limit; the code path is shared with
MassIVE (same _download_one_ftp_path / download_ftp_urls), so a
fresh IP should succeed.
….org User pointed out the iProX dataset XML endpoint: https://www.iprox.cn/FAF016Controller/readXml.jsonp?fileId=file_<id>_xml Probing showed something simpler works: iProX publishes the ProteomeXchange XML for every public dataset at a deterministic, anonymous-accessible path on download.iprox.org. No CAS auth, no Aspera tokens, no fileId discovery needed: http://download.iprox.org/<accession>/PX_<accession>.xml The XML embeds Associated raw file URI / Search engine output file URI cvParams pointing at HTTPS file URLs on the same host (Accept-Ranges: bytes, so we get resume for free). The earlier 'iProX is auth-gated' finding was specifically about the iProX UI's CAS-protected JSON endpoints (PMD009Controller/findBySubProjectId.jsonp etc.); the public download server is anonymous. Changes ------- - IPROX_DOWNLOAD_BASE_URL, IPROX_PX_XML_URL_TEMPLATE, IPROX_PX_CATEGORY_MAP added (the latter is the same CV map JPOST PROXI uses). - _build_iprox_file_record() and _list_iprox_public_files() added, mirroring the JPOST helpers but using the PX XML schema. - is_direct_download_accession() now returns True for IPX accessions. - _raise_if_iprox() removed entirely. All entry points that previously called it no longer do, so IPX accessions flow through the unified direct-download dispatcher. - _list_direct_download_files() dispatches IPX -> _list_iprox_public_files. - _download_direct_download_records() now partitions URLs by scheme: ftp:// records go through download_ftp_urls (MassIVE/JPOST), http(s):// records through download_http_urls (iProX). A dataset whose records somehow contain both flows through both paths correctly. - download_http_urls() grew parallel_files + max_retries kwargs and a ThreadPoolExecutor path. The per-file worker _http_download_one() wraps _parallel_download() (reused from the globus codepath) with retry, so iProX gets the same HEAD-then-Range resume and restart-on- non-206 behaviour we already use for PRIDE HTTPS. Tests (26 pass) --------------- - test_iprox_guard.py renamed to test_iprox_files.py and rewritten as positive-path tests: regex coverage, IPX is a direct-download accession, PX XML parsing maps cvParam name -> PRIDE category, RAW filtering ignores non-FTP/HTTPS URIs, download_file_by_name routes iProX URLs to download_http_urls (not download_ftp_urls) with parallel_files=1. - All previously-added MassIVE / JPOST / size-validation tests still pass (20 + 6 iProX = 26 total). Live verification (this branch, against download.iprox.org) ----------------------------------------------------------- - IPX0017413000 listing: 7 files, correctly categorised (RAW + SEARCH). - download_file_by_name(IPX0017413000, protein_annotation_profile.xlsx): 3,253,230 B downloaded, MD5 c17baf230ffde1e2837ec4eb32dcea68, valid XLSX (PK header). - Range-based resume: pre-staged 1,000,000 B partial, completed to 3,253,230 B with the same MD5. - Parallel HTTPS: 3 worker downloads from IPX0017413000 (xlsx + 2 RAW) ran concurrently; cancelled mid-flight after observing all 3 files growing in parallel (xlsx finished at 3.25 MB, Tumor_NK1.raw was at 142 MB and Control_NK3.raw at 419 MB before cancellation).
Code reviewFound 2 issues:
|
Addresses the two long-standing transport data-integrity issues flagged on #98/#100 for the direct-download providers (MassIVE/JPOST/iProX). - Thread each record's relativePath from Provider.download_files through to download_ftp_urls / download_http_urls, writing files to output_folder/<relativePath> instead of flattening to the URL basename. Identically-named files in different sub-collections no longer overwrite each other, and --skip-if-downloaded-already no longer skips the wrong file. Parent directories are created per file; a _safe_join guard prevents path traversal outside output_folder. - Add a post-transfer size check in _parallel_download (HTTP/globus) mirroring the FTP path: a stream shorter than Content-Length now raises so the retry loop re-downloads (Range-resuming when possible) instead of silently accepting a truncated file. - Remove the now-unused _local_path_for_url helper (superseded by _dest_path). - Tests: collision-free relative_paths threading (ftp+http), truncated-stream raise, _safe_join traversal guard; update massive/jpost exact-call asserts.
Summary
Adds direct-download support matching the PRIDE feature set (resume, parallel, retries, post-transfer verification) for three non-PXD repositories. Bumps to 0.0.16.
The PR's scope grew through live probing of the actual servers — each repository turned out to need a different transport than the original PR assumed.
What ships
MSV\\d{9}massive-ftp.ucsd.eduJPST\\d{6}repository.jpostdb.org, FTP fallbackIPX\\d{7,10}download.iprox.org/<acc>/PX_<acc>.xmlShared plumbing:
_open_ftp_connection(host, use_tls)centralises FTP vs FTPS, transparently retries with TLS on421 TLS is required._download_one_ftp_path(...)per-file worker with REST resume + size-mismatch retry._download_ftp_paths_serial/_download_ftp_paths_parallelpick scheduling.download_ftp_urls(..., use_tls=, parallel_files=)._http_download_one(...)per-file HTTPS worker reusing_parallel_download()(HEAD-then-Range resume + restart-on-non-206 from the globus codepath) with retry on top.download_http_urls(..., parallel_files=)now ThreadPoolExecutor-aware._download_direct_download_records(...)is the unified dispatcher: it builds the URL list from records, partitions by scheme (ftp://vshttp(s)://), and fans out to the right transport. Same parallel-workers knob flows through both.Verification
Live (this branch, against real servers)
MassIVE — FTPS at
massive-ftp.ucsd.edu_list_massive_public_files("MSV000080175")→ 44 filesdownload_file_by_name("MSV000080175", "params.xml")→ 10 315 B, MD543d87368d705c3f380c1d030b14850c4JPOST — PROXI at
repository.jpostdb.org_list_jpost_public_files("JPST002311")→ 160 files (88 RAW, 72 SEARCH), categories from cvParamname421 too many connections). The transfer path isdownload_ftp_urlswithuse_tls=False, i.e. the same code MassIVE uses on a different host. A fresh IP will succeed.iProX — HTTPS at
download.iprox.org_list_iprox_public_files("IPX0017413000")→ 7 files, RAW + SEARCH categoriseddownload_file_by_name("IPX0017413000", "protein_annotation_profile.xlsx")→ 3 253 230 B, MD5c17baf230ffde1e2837ec4eb32dcea68, valid XLSX (PK header)download.iprox.org(cancelled mid-flight after observing 3 active progress bars; xlsx finished, Tumor_NK1.raw was at 142 MB, Control_NK3.raw at 419 MB)Unit tests (26 pass)
test_massive_files.py(10) — FTPS flag, parallel-files threading,_repo_uses_tlstest_jpost_files.py(8) — PROXI CV→category mapping, PROXI→FTP fallbacktest_iprox_files.py(6) — PX XML parsing, category mapping (Associated raw file URI→ RAW,Search engine output file URI→ SEARCH), download_file_by_name routes iProX to download_http_urls, NOT download_ftp_urlstest_ftp_download_validation.py(3) — size mismatch retries, exhausts then raises, correct size one-shotWhy the iProX path is anonymous (and Aspera isn't needed)
iProX's documented Aspera workflow (
ascp -P 33001 username@download.iprox.org:/data/iprox/<acc>...) requires an iProX user account. Probing showed that the samedownload.iprox.orghost also serves files over plain anonymous HTTPS withAccept-Ranges: bytesand the dataset's PX XML athttp://download.iprox.org/<accession>/PX_<accession>.xml. The XML lists every file underhttp://download.iprox.org/<accession>/<sub-accession>/<filename>. So no Aspera client, no iProX credentials, and HEAD-then-Range gives us byte-precise resume. The earlier "CAS-protected" finding was specifically about iProX's UI JSON endpoints (PMD009Controller/findBySubProjectId.jsonpetc.); the download host is a different surface.Test plan
pridepy download-all-public-raw-files -a MSV000082297 -o /tmp/msv -w 3 --skip-if-downloaded-alreadysucceeds (FTPS)pridepy download-file-by-name -a JPST002311 -f <small.sne> -o /tmp/jpostsucceeds from an IP not rate-limited by JPOST FTPpridepy download-file-by-name -a IPX0017413000 -f protein_annotation_profile.xlsx -o /tmp/iproxsucceeds (anonymous HTTPS)pridepy --versionreports0.0.16