
Direct downloads: MassIVE FTPS + JPOST PROXI + iProX HTTPS + parallel/resume; 0.0.16 #100

Closed
ypriverol wants to merge 10 commits into dev from feat/jpost-iprox-direct-downloads

Conversation

@ypriverol
Contributor

@ypriverol ypriverol commented May 27, 2026

Summary

Adds direct-download support matching the PRIDE feature set (resume, parallel, retries, post-transfer verification) for three non-PXD repositories. Bumps to 0.0.16.

The PR's scope grew through live probing of the actual servers — each repository turned out to need a different transport than the original PR assumed.

What ships

| Repository | Accession | Listing | Transport | Resume | Parallel | Live-tested |
| --- | --- | --- | --- | --- | --- | --- |
| MassIVE | `MSV\d{9}` | FTPS tree walk at massive-ftp.ucsd.edu | FTPS | REST | yes | yes |
| JPOST | `JPST\d{6}` | PROXI JSON at repository.jpostdb.org, FTP fallback | plain FTP | REST | yes | listing only (FTP IP-rate-limited) |
| iProX | `IPX\d{7,10}` | PX XML at `download.iprox.org/<acc>/PX_<acc>.xml` | anonymous HTTPS | byte-Range | yes | yes (xlsx end-to-end + resume + parallel) |
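
As a rough illustration of how the accession patterns in the table could route a request, here is a minimal sketch. The names `ACCESSION_PATTERNS` and `classify_accession` are hypothetical and are not taken from pridepy; whether the real checks use full matching or tolerate surrounding whitespace is an assumption.

```python
import re
from typing import Optional

# Patterns copied from the table above; the dictionary itself is illustrative.
ACCESSION_PATTERNS = {
    "MassIVE": re.compile(r"MSV\d{9}"),
    "JPOST": re.compile(r"JPST\d{6}"),
    "iProX": re.compile(r"IPX\d{7,10}"),
}


def classify_accession(accession: str) -> Optional[str]:
    """Return the repository name for a direct-download accession, else None."""
    candidate = accession.strip()
    for repository, pattern in ACCESSION_PATTERNS.items():
        # fullmatch rejects accessions with trailing characters.
        if pattern.fullmatch(candidate):
            return repository
    return None
```

Anything that returns `None` here (for example a PXD accession) would continue to go through the regular PRIDE path.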

Shared plumbing:

  • _open_ftp_connection(host, use_tls) centralises FTP vs FTPS, transparently retries with TLS on 421 TLS is required.
  • _download_one_ftp_path(...) per-file worker with REST resume + size-mismatch retry.
  • _download_ftp_paths_serial / _download_ftp_paths_parallel pick scheduling. download_ftp_urls(..., use_tls=, parallel_files=).
  • _http_download_one(...) per-file HTTPS worker reusing _parallel_download() (HEAD-then-Range resume + restart-on-non-206 from the globus codepath) with retry on top.
  • download_http_urls(..., parallel_files=) now ThreadPoolExecutor-aware.
  • _download_direct_download_records(...) is the unified dispatcher: it builds the URL list from records, partitions by scheme (ftp:// vs http(s)://), and fans out to the right transport. Same parallel-workers knob flows through both.
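
The scheme partitioning done by the dispatcher can be sketched as follows. This is a simplified stand-in, not the actual `_download_direct_download_records` body; the function name `partition_by_scheme` and the choice to raise on unknown schemes are assumptions.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def partition_by_scheme(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Split URLs into an FTP group and an HTTP(S) group, preserving order."""
    groups: Dict[str, List[str]] = {"ftp": [], "http": []}
    for url in urls:
        scheme = urlparse(url).scheme.lower()
        if scheme == "ftp":
            groups["ftp"].append(url)
        elif scheme in ("http", "https"):
            groups["http"].append(url)
        else:
            # Assumption: unknown schemes are an error rather than silently dropped.
            raise ValueError(f"Unsupported URL scheme: {url!r}")
    return groups
```

Each group would then be handed to its transport (`download_ftp_urls` or `download_http_urls`) with the same `parallel_files` value.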

Verification

Live (this branch, against real servers)

MassIVE — FTPS at massive-ftp.ucsd.edu

  • _list_massive_public_files("MSV000080175") → 44 files
  • download_file_by_name("MSV000080175", "params.xml") → 10 315 B, MD5 43d87368d705c3f380c1d030b14850c4
  • REST resume from a 3 KB pre-stage → same MD5
  • 3-worker parallel pull of 3 small files (MSV000080175 + MSV000078335) → 3 concurrent progress bars, all complete & correct

JPOST — PROXI at repository.jpostdb.org

  • _list_jpost_public_files("JPST002311") → 160 files (88 RAW, 72 SEARCH), categories from cvParam name
  • File transfer not exercised live: `ftp.jpostdb.org` rate-limited my source IP for the entire session (421 too many connections). The transfer path is download_ftp_urls with use_tls=False, i.e. the same code MassIVE uses on a different host, so a fresh IP should succeed.

iProX — HTTPS at download.iprox.org

  • _list_iprox_public_files("IPX0017413000") → 7 files, RAW + SEARCH categorised
  • download_file_by_name("IPX0017413000", "protein_annotation_profile.xlsx") → 3 253 230 B, MD5 c17baf230ffde1e2837ec4eb32dcea68, valid XLSX (PK header)
  • Range-based resume: 1 MB pre-stage → complete file, MD5 matches
  • 3-worker parallel: xlsx + 2 RAW files downloaded concurrently against download.iprox.org (cancelled mid-flight after observing 3 active progress bars; xlsx finished, Tumor_NK1.raw was at 142 MB, Control_NK3.raw at 419 MB)

Unit tests (26 pass)

  • test_massive_files.py (10) — FTPS flag, parallel-files threading, _repo_uses_tls
  • test_jpost_files.py (8) — PROXI CV→category mapping, PROXI→FTP fallback
  • test_iprox_files.py (6) — PX XML parsing, category mapping (Associated raw file URI → RAW, Search engine output file URI → SEARCH), download_file_by_name routes iProX to download_http_urls, NOT download_ftp_urls
  • test_ftp_download_validation.py (3) — size mismatch retries, exhausts then raises, correct size one-shot
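
The three behaviours covered by test_ftp_download_validation.py (retry on mismatch, raise once retries are exhausted, single pass when the size is right) can be illustrated with a small retry wrapper. This is a sketch under assumptions: `download_with_size_check` and its `attempt` callback are hypothetical names, and the real worker compares against `ftp.size()` rather than a passed-in value.

```python
from typing import Callable


def download_with_size_check(
    attempt: Callable[[], int], expected_size: int, max_retries: int = 3
) -> int:
    """Run attempt() until the local file has expected_size bytes.

    attempt() performs (or resumes) one transfer and returns the local size.
    """
    last_size = -1
    for _ in range(max_retries):
        last_size = attempt()
        if last_size == expected_size:
            return last_size
    raise OSError(
        f"size mismatch after {max_retries} attempts: "
        f"got {last_size} B, expected {expected_size} B"
    )
```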

Why the iProX path is anonymous (and Aspera isn't needed)

iProX's documented Aspera workflow (ascp -P 33001 username@download.iprox.org:/data/iprox/<acc>...) requires an iProX user account. Probing showed that the same download.iprox.org host also serves files over plain anonymous HTTPS with Accept-Ranges: bytes and the dataset's PX XML at http://download.iprox.org/<accession>/PX_<accession>.xml. The XML lists every file under http://download.iprox.org/<accession>/<sub-accession>/<filename>. So no Aspera client, no iProX credentials, and HEAD-then-Range gives us byte-precise resume. The earlier "CAS-protected" finding was specifically about iProX's UI JSON endpoints (PMD009Controller/findBySubProjectId.jsonp etc.); the download host is a different surface.
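
A minimal sketch of the HEAD-then-Range decision, assuming the HEAD response has already supplied Content-Length and Accept-Ranges. The helper names `range_headers` and `open_mode` are hypothetical and not pridepy functions; they only show the logic of resuming on 206 and restarting otherwise.

```python
from typing import Dict


def range_headers(local_size: int, total_size: int, accept_ranges: str) -> Dict[str, str]:
    """Headers for the GET that follows the HEAD probe."""
    supports_ranges = accept_ranges.strip().lower() == "bytes"
    # Only resume when a non-empty partial exists and is shorter than the file.
    if supports_ranges and 0 < local_size < total_size:
        return {"Range": f"bytes={local_size}-"}
    return {}


def open_mode(sent_range: bool, status_code: int) -> str:
    """Append only if the server honoured the Range request with 206."""
    return "ab" if sent_range and status_code == 206 else "wb"
```

A caller would skip the request entirely when the local size already equals the total size, and would truncate and restart when a Range request comes back as a plain 200.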

Test plan

  • CI passes
  • pridepy download-all-public-raw-files -a MSV000082297 -o /tmp/msv -w 3 --skip-if-downloaded-already succeeds (FTPS)
  • pridepy download-file-by-name -a JPST002311 -f <small.sne> -o /tmp/jpost succeeds from an IP not rate-limited by JPOST FTP
  • pridepy download-file-by-name -a IPX0017413000 -f protein_annotation_profile.xlsx -o /tmp/iprox succeeds (anonymous HTTPS)
  • pridepy --version reports 0.0.16

Remove conda recipe packaging workflow
Minor changes in parallelization
Minor change to have the version of the tool.
Improve MassIVE download.
Extends the direct-download support introduced for MassIVE in PR #98 to two
more proteomics repositories whose datasets are often standalone (no
ProteomeXchange accession):

  - JPOST (Japan ProteOme STandard Repository): JPST\d{6} accessions,
    listed and downloaded from ftp.jpostdb.org.
  - iProX (Integrated Proteome resources): IPX\d{7,10} accessions, listed
    and downloaded from ftp.iprox.cn.

Refactor:

  - Add is_direct_download_accession() unifying the MSV/JPST/IPX checks,
    plus _list_direct_download_files() and _download_direct_download_records()
    dispatchers. All call sites (get_all_raw_file_list, download_all_raw_files,
    download_all_category_files, get_file_from_api, download_file_by_name,
    download_files_by_list, get_all_category_file_list) now go through the
    unified entry points.
  - Extract _list_ftp_repo_files() helper so the FTP connection lifecycle
    (connect / login / passive / walk / quit) lives in one place. As part
    of that, fix the FTP-constructor-outside-try issue flagged in the PR #98
    review: a connect failure no longer triggers NameError in finally.
  - Keep is_massive_accession, _list_massive_public_files, and
    _download_massive_file_records as thin backward-compatible wrappers
    so existing tests and external callers continue to work.

Tests: add test_jpost_files.py and test_iprox_files.py mirroring the
MassIVE coverage (regex match, record building, raw-only filtering, and
the download_file_by_name happy path). All 19 direct-download tests pass.

Version: 0.0.16.
@qodo-code-review
Contributor

Qodo reviews are paused for this user.


@coderabbitai
Contributor

coderabbitai Bot commented May 27, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 24d67e52-fea5-4127-b5a0-4a58780ce66d


Address feedback that direct downloads should match the PRIDE feature
set (resume, parallel, retries) and that MassIVE's actual FTP server
requires TLS.

Live findings:

  - massive-ftp.ucsd.edu now rejects plain anonymous FTP with
    421 TLS is required. The merged PR #98 code was effectively broken
    against the live server. Switch to FTP_TLS + PROT P.
  - ftp.jpostdb.org accepts plain anonymous FTP; keep as-is.
  - ftp.iprox.cn does not resolve (DNS fail) and no other iProX FTP
    host responds. iProX is HTTPS-only and needs a different transport
    (REST API). Defer iProX support; user to provide the endpoint.

Implementation:

  - New static _open_ftp_connection(host, use_tls) opens FTP or FTP_TLS
    with the right TLS setup, and transparently falls back to FTPS if
    a plain FTP server replies 'TLS is required'.
  - _list_ftp_repo_files() and download_ftp_urls() both grow a use_tls
    flag. _repo_uses_tls(accession) wires this from the repo type
    (MassIVE = True, JPOST = False).
  - download_ftp_urls() grows parallel_files: when >1, a ThreadPoolExecutor
    runs that many FTP workers per host, each with its own connection.
    Existing serial single-connection-per-host path is preserved for
    parallel_files <= 1.
  - Extracted _download_one_ftp_path() with REST-based resume + per-file
    retries; _download_ftp_paths_serial() / _download_ftp_paths_parallel()
    pick the right scheduling. REST resume verified live (3 KB pre-stage
    -> 10 KB final, MD5 matches full file).
  - _download_direct_download_records() now accepts parallel_files and
    forwards it (along with use_tls derived from the accession) to
    download_ftp_urls. All call sites (download_all_raw_files,
    download_all_category_files, download_files_by_list) thread the
    user-supplied -w/--parallel-files through.
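
The FTP-versus-FTPS opening logic described above could look roughly like this with the standard `ftplib` module. It is a sketch, not the actual `_open_ftp_connection`: the injectable `plain_factory` / `tls_factory` parameters and the timeout default are assumptions added so the fallback can be exercised without a network.

```python
import ftplib


def open_ftp_connection(host, use_tls, plain_factory=ftplib.FTP,
                        tls_factory=ftplib.FTP_TLS, timeout=30):
    """Open an anonymous FTP or FTPS session, upgrading to TLS on demand."""

    def connect_tls():
        ftp = tls_factory(host, timeout=timeout)
        ftp.login()   # anonymous login over the TLS-secured control channel
        ftp.prot_p()  # PROT P: protect the data channel as well
        return ftp

    if use_tls:
        return connect_tls()
    try:
        ftp = plain_factory(host, timeout=timeout)
        ftp.login()
        return ftp
    except ftplib.error_temp as exc:
        # A 4xx reply such as "421 TLS is required" triggers the FTPS retry.
        if "TLS is required" in str(exc):
            return connect_tls()
        raise
```

Other temporary errors, such as the JPOST "421 too many connections" reply, are re-raised unchanged so the caller's retry policy can handle them.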

Tests:

  - test_jpost_files / test_massive_files updated to assert the new
    kwargs (use_tls, parallel_files).
  - New test_repo_uses_tls_true_for_massive_false_for_jpost and
    test_download_all_raw_files_threads_parallel_files_for_massive.
  - test_iprox_files.py removed (iProX is deferred).
  - All 15 unit tests pass.

Live testing:

  - MassIVE: listed MSV000080175 (44 files), single-file download of
    params.xml (10315 B, MD5 43d87368d705c3f380c1d030b14850c4),
    REST resume from a 3000 B partial, and 3-worker parallel download
    of files from MSV000080175 + MSV000078335 all succeeded.
  - JPOST: rate-limited from this IP ('421 too many connections').
    Code path is structurally identical to MassIVE (same FTP helper),
    routing covered by unit tests; deferred to a follow-up live check.
@ypriverol ypriverol changed the title from "Add direct downloads for JPOST and iProX accessions; bump to 0.0.16" to "Direct downloads: FTPS for MassIVE, parallel + resume, add JPOST; bump 0.0.16" on May 27, 2026
Address deferred items from PR review:

1. JPOST PROXI listing
   ftp.jpostdb.org rate-limits aggressively per source IP (sticky 421 on
   any retry within ~10 min). For listing this is fatal because every
   pridepy invocation needs a fresh tree walk. JPOST publishes a JSON
   PROXI endpoint at
     https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>
   that returns datasetFiles[*].value as ftp:// URIs alongside CV labels
   (Associated raw file URI, Search engine output file URI, Result file
   URI, Peak list file URI, ...). _list_jpost_public_files now hits
   PROXI first, builds file records with categories derived from the CV
   name (mapped via JPOST_PROXI_CATEGORY_MAP), and falls back to the
   FTP tree walk only if PROXI is unreachable or returns no records.
   Live-tested against JPST002311 -> 160 files (88 RAW, 72 SEARCH).

2. Post-transfer size check
   Neither MassIVE nor JPOST publishes per-file checksum manifests in
   a standard location, so md5 verification isn't an option for these
   datasets. As a lighter-weight integrity signal, _download_one_ftp_path
   now compares the local file size against ftp.size() after retrbinary
   returns and treats a mismatch as a retryable failure (next attempt
   resumes via REST from the current partial). This catches half-finished
   transfers where the data channel was closed early without retrbinary
   raising.

3. iProX accession guard
   Probing showed iProX REST endpoints (PMD009Controller/findByProjectId.jsonp,
   findFilesBySubProjectID.jsonp) all redirect unauthenticated callers to
   a CAS login page, and downloads use faspe:// URLs with per-session
   tokens. Native support is therefore blocked until iProX exposes an
   anonymous JSON API or pridepy carries iProX credentials. Until then,
   add is_iprox_accession() and _raise_if_iprox() so every public entry
   point (get_all_raw_file_list, download_all_raw_files,
   download_all_category_files, download_file_by_name, download_files_by_list,
   get_file_from_api) emits a clear NotImplementedError instead of
   silently falling through to the PRIDE API and 404-ing.
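
The PROXI-first listing with FTP fallback from item 1 can be sketched as below. The record keys (`fileName`, `url`, `category`), the `OTHER` default, and the assumption that each `datasetFiles` entry carries `name` and `value` keys are illustrative; only the two CV-name mappings named in this PR are included, not the full JPOST_PROXI_CATEGORY_MAP.

```python
from typing import Callable, Dict, List

# Two mappings named in this PR; everything else falls into a hypothetical "OTHER".
CATEGORY_BY_CV_NAME = {
    "Associated raw file URI": "RAW",
    "Search engine output file URI": "SEARCH",
}


def records_from_proxi(payload: dict) -> List[Dict[str, str]]:
    """Build file records from a PROXI dataset JSON document."""
    records = []
    for entry in payload.get("datasetFiles", []):
        url = entry.get("value", "")
        if not url.startswith("ftp://"):
            continue  # keep only downloadable FTP URIs
        records.append({
            "fileName": url.rsplit("/", 1)[-1],
            "url": url,
            "category": CATEGORY_BY_CV_NAME.get(entry.get("name", ""), "OTHER"),
        })
    return records


def list_files(fetch_proxi: Callable[[], dict],
               walk_ftp: Callable[[], List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Prefer PROXI; fall back to the FTP walk if it fails or is empty."""
    try:
        records = records_from_proxi(fetch_proxi())
    except Exception:
        records = []
    return records or walk_ftp()
```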

Tests
   - test_jpost_files: PROXI listing maps CV names to PRIDE categories;
     PROXI failure falls back to FTP walk.
   - test_iprox_guard: regex coverage; assert each entry point raises
     NotImplementedError for IPX accessions; assert
     is_direct_download_accession returns False for IPX.
   - test_ftp_download_validation: size mismatch retries until success;
     repeated mismatch raises after max_download_retries; correct size
     skips retries.
   - 25 tests total (10 MassIVE + 8 JPOST + 5 iProX guard + 3 size check).

Live verification (same MassIVE FTPS path as before)
   - MSV000080175 listing + params.xml download still produces
     10315 B, MD5 43d87368d705c3f380c1d030b14850c4.
   - JPST002311 PROXI listing returns 160 files with correct categories.
     Actual file transfer is still blocked from this IP by JPOST's
     421-too-many-connections rate limit; the code path is shared with
     MassIVE (same _download_one_ftp_path / download_ftp_urls), so a
     fresh IP should succeed.
@ypriverol ypriverol changed the title from "Direct downloads: FTPS for MassIVE, parallel + resume, add JPOST; bump 0.0.16" to "Direct downloads: MassIVE FTPS + JPOST PROXI + parallel + integrity; 0.0.16" on May 27, 2026
….org

User pointed out the iProX dataset XML endpoint:
https://www.iprox.cn/FAF016Controller/readXml.jsonp?fileId=file_<id>_xml

Probing showed something simpler works: iProX publishes the
ProteomeXchange XML for every public dataset at a deterministic,
anonymous-accessible path on download.iprox.org. No CAS auth, no Aspera
tokens, no fileId discovery needed:

  http://download.iprox.org/<accession>/PX_<accession>.xml

The XML embeds Associated raw file URI / Search engine output file URI
cvParams pointing at HTTPS file URLs on the same host (Accept-Ranges:
bytes, so we get resume for free). The earlier 'iProX is auth-gated'
finding was specifically about the iProX UI's CAS-protected JSON
endpoints (PMD009Controller/findBySubProjectId.jsonp etc.); the public
download server is anonymous.

Changes
-------
- IPROX_DOWNLOAD_BASE_URL, IPROX_PX_XML_URL_TEMPLATE, IPROX_PX_CATEGORY_MAP
  added (the latter is the same CV map JPOST PROXI uses).
- _build_iprox_file_record() and _list_iprox_public_files() added,
  mirroring the JPOST helpers but using the PX XML schema.
- is_direct_download_accession() now returns True for IPX accessions.
- _raise_if_iprox() removed entirely. All entry points that previously
  called it no longer do, so IPX accessions flow through the unified
  direct-download dispatcher.
- _list_direct_download_files() dispatches IPX -> _list_iprox_public_files.
- _download_direct_download_records() now partitions URLs by scheme:
  ftp:// records go through download_ftp_urls (MassIVE/JPOST), http(s)://
  records through download_http_urls (iProX). A dataset whose records
  somehow contain both flows through both paths correctly.
- download_http_urls() grew parallel_files + max_retries kwargs and a
  ThreadPoolExecutor path. The per-file worker _http_download_one()
  wraps _parallel_download() (reused from the globus codepath) with
  retry, so iProX gets the same HEAD-then-Range resume and restart-on-
  non-206 behaviour we already use for PRIDE HTTPS.
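
As an illustration of the PX XML listing, the following sketch extracts file URLs from cvParam elements with the standard library parser. The sample document is a made-up, heavily trimmed stand-in for a real PX XML file, the accession in it is a placeholder, and the function and record key names are hypothetical; it also assumes the XML uses no default namespace.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

PX_CATEGORY_BY_CV_NAME = {
    "Associated raw file URI": "RAW",
    "Search engine output file URI": "SEARCH",
}

# Trimmed, invented sample; real PX XML documents carry many more elements.
SAMPLE_PX_XML = """<ProteomeXchangeDataset>
  <DatasetFileList>
    <DatasetFile id="1" name="a.raw">
      <cvParam name="Associated raw file URI"
               value="http://download.iprox.org/IPX0000000000/IPX0000000001/a.raw"/>
    </DatasetFile>
    <DatasetFile id="2" name="a.xlsx">
      <cvParam name="Search engine output file URI"
               value="http://download.iprox.org/IPX0000000000/IPX0000000001/a.xlsx"/>
    </DatasetFile>
  </DatasetFileList>
</ProteomeXchangeDataset>"""


def records_from_px_xml(xml_text: str) -> List[Dict[str, str]]:
    """Return one record per cvParam whose name maps to a known category."""
    records = []
    for cv in ET.fromstring(xml_text).iter("cvParam"):
        category = PX_CATEGORY_BY_CV_NAME.get(cv.get("name", ""))
        url = cv.get("value", "")
        if category and url.startswith(("http://", "https://")):
            records.append({
                "fileName": url.rsplit("/", 1)[-1],
                "url": url,
                "category": category,
            })
    return records
```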

Tests (26 pass)
---------------
- test_iprox_guard.py renamed to test_iprox_files.py and rewritten as
  positive-path tests: regex coverage, IPX is a direct-download
  accession, PX XML parsing maps cvParam name -> PRIDE category, RAW
  filtering ignores non-FTP/HTTPS URIs, download_file_by_name routes
  iProX URLs to download_http_urls (not download_ftp_urls) with
  parallel_files=1.
- All previously-added MassIVE / JPOST / size-validation tests still
  pass (20 + 6 iProX = 26 total).

Live verification (this branch, against download.iprox.org)
-----------------------------------------------------------
- IPX0017413000 listing: 7 files, correctly categorised (RAW + SEARCH).
- download_file_by_name(IPX0017413000, protein_annotation_profile.xlsx):
  3,253,230 B downloaded, MD5 c17baf230ffde1e2837ec4eb32dcea68, valid
  XLSX (PK header).
- Range-based resume: pre-staged 1,000,000 B partial, completed to
  3,253,230 B with the same MD5.
- Parallel HTTPS: 3 worker downloads from IPX0017413000 (xlsx + 2 RAW)
  ran concurrently; cancelled mid-flight after observing all 3 files
  growing in parallel (xlsx finished at 3.25 MB, Tumor_NK1.raw was at
  142 MB and Control_NK3.raw at 419 MB before cancellation).
@ypriverol ypriverol changed the title from "Direct downloads: MassIVE FTPS + JPOST PROXI + parallel + integrity; 0.0.16" to "Direct downloads: MassIVE FTPS + JPOST PROXI + iProX HTTPS + parallel/resume; 0.0.16" on May 27, 2026
@ypriverol
Contributor Author

Code review

Found 2 issues:

  1. FTP/HTTPS download flattens the dataset tree, contradicting the README and the review of the prior PR #98 (Add direct download support for MassIVE accessions). _download_ftp_paths_serial, _download_ftp_paths_parallel, and the HTTPS path (_local_path_for_url -> _http_download_one) all compute the local path with os.path.basename(ftp_path). Each file record already carries a relativePath field, but it isn't threaded through. For datasets with same-named files in different subdirectories (e.g. IPX0017413000/IPX0017413001/sample.raw and a sibling sub-accession with the same leaf name; or MassIVE/JPOST datasets with raw/run01.raw and peak/run01.raw), the second download silently overwrites the first, and skip_if_downloaded_already will falsely short-circuit. The README claim added in this PR ("Raw downloads follow each repository's own collection layout") is therefore not honoured. This is the same issue raised in the PR #98 review; it should be fixed in this PR (#100) since the code paths are new.

    pridepy/pridepy/files/files.py

    Lines 2275 to 2281 in df165ab

    logging.info(f"Connected to FTP host: {host} (tls={use_tls})")
    for ftp_path in paths:
        local_path = os.path.join(output_folder, os.path.basename(ftp_path))
        try:
            Files._download_one_ftp_path(
                ftp=ftp,
                ftp_path=ftp_path,

    pridepy/pridepy/files/files.py

    Lines 2326 to 2331 in df165ab

    """
    def worker(ftp_path: str, position: int) -> None:
        local_path = os.path.join(output_folder, os.path.basename(ftp_path))
        if skip_if_downloaded_already and os.path.exists(local_path):
            logging.info(f"Skipping download as file already exists: {local_path}")
            return

  2. iProX HTTPS downloads have no post-transfer size verification, while the FTP path does. _download_one_ftp_path explicitly compares the on-disk size to ftp.size() after retrbinary returns and retries on mismatch (covered by test_ftp_download_validation.py). The HTTPS path used for iProX runs through _parallel_download, which streams via iter_content() and returns as soon as the loop exits, with no final size check against total_size (from the HEAD Content-Length). If the server closes the data channel mid-transfer (no exception raised), the function returns success and the next call with skip_if_downloaded_already will skip the truncated file. The README documents "post-transfer size verification" as a feature of direct downloads, but the iProX HTTPS path doesn't implement it. Adding a final_size != total_size check (mirroring the FTP path) inside _parallel_download would close the gap.

    pridepy/pridepy/files/files.py

    Lines 1093 to 1101 in df165ab

    with tqdm(total=total_size, unit="B", unit_scale=True, desc=file_path,
              initial=resume_size, position=position, leave=True) as pbar:
        mode = "ab" if resume_size > 0 else "wb"
        with open(file_path, mode, buffering=8 * 1024 * 1024) as f:
            for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
                if chunk:
                    f.write(chunk)
                    pbar.update(len(chunk))
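
The final_size != total_size check suggested in issue 2 could be sketched like this. It is a stripped-down stand-in for the streaming loop above, without tqdm or requests; `write_stream_checked` is a hypothetical name, and raising OSError so an outer retry loop can react is an assumption.

```python
import os
from typing import Iterable


def write_stream_checked(chunks: Iterable[bytes], file_path: str,
                         total_size: int, resume_size: int = 0) -> int:
    """Write streamed chunks, then verify the on-disk size against total_size."""
    mode = "ab" if resume_size > 0 else "wb"
    with open(file_path, mode) as f:
        for chunk in chunks:
            if chunk:
                f.write(chunk)
    # The file is closed (and flushed) here, so getsize reflects every byte written.
    final_size = os.path.getsize(file_path)
    if final_size != total_size:
        raise OSError(
            f"Truncated download for {file_path}: "
            f"{final_size} B on disk, expected {total_size} B"
        )
    return final_size
```

Because the partial file is left in place when the check fails, a retry can resume from the current size instead of starting over.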

@ypriverol
Contributor Author

Superseded by #103, which consolidates this PR with #101 and #102 into a single branch (feat/files-py-modular-architecture) targeting dev. All 4 work commits from this PR are preserved verbatim at the head of #103. Please review and merge there instead.

@ypriverol ypriverol closed this May 27, 2026
ypriverol added a commit that referenced this pull request May 28, 2026
Addresses the two long-standing transport data-integrity issues flagged on
#98/#100 for the direct-download providers (MassIVE/JPOST/iProX).

- Thread each record's relativePath from Provider.download_files through to
  download_ftp_urls / download_http_urls, writing files to
  output_folder/<relativePath> instead of flattening to the URL basename.
  Identically-named files in different sub-collections no longer overwrite
  each other, and --skip-if-downloaded-already no longer skips the wrong file.
  Parent directories are created per file; a _safe_join guard prevents path
  traversal outside output_folder.
- Add a post-transfer size check in _parallel_download (HTTP/globus) mirroring
  the FTP path: a stream shorter than Content-Length now raises so the retry
  loop re-downloads (Range-resuming when possible) instead of silently
  accepting a truncated file.
- Remove the now-unused _local_path_for_url helper (superseded by _dest_path).
- Tests: collision-free relative_paths threading (ftp+http), truncated-stream
  raise, _safe_join traversal guard; update massive/jpost exact-call asserts.
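
A path-traversal guard of the kind this commit describes might look roughly like the following. This is a sketch, not the actual `_safe_join`: the public name `safe_join`, the use of `os.path.realpath` plus `os.path.commonpath`, and the ValueError are assumptions.

```python
import os


def safe_join(output_folder: str, relative_path: str) -> str:
    """Join relative_path onto output_folder, refusing to escape the folder."""
    base = os.path.realpath(output_folder)
    # An absolute relative_path replaces base in os.path.join and is rejected below.
    dest = os.path.realpath(os.path.join(base, relative_path))
    if os.path.commonpath([base, dest]) != base:
        raise ValueError(f"{relative_path!r} escapes {output_folder!r}")
    return dest
```

With relativePath threaded through, `raw/run01.raw` and `peak/run01.raw` resolve to two distinct destinations, while a record such as `../outside.raw` is rejected before anything is written.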