Closed
27 commits
309e361
Merge pull request #89 from PRIDE-Archive/dev
ypriverol Apr 22, 2026
ab6ce0d
Merge pull request #91 from PRIDE-Archive/dev
ypriverol Apr 27, 2026
a99c73b
Merge pull request #93 from PRIDE-Archive/dev
ypriverol May 2, 2026
cacf553
Merge pull request #95 from PRIDE-Archive/dev
ypriverol May 2, 2026
6c554b5
Merge pull request #97 from PRIDE-Archive/dev
ypriverol May 4, 2026
c09cca5
Merge pull request #99 from PRIDE-Archive/dev
ypriverol May 27, 2026
d471e0b
Add direct downloads for JPOST and iProX accessions; bump to 0.0.16
ypriverol May 27, 2026
3f8ed3f
Direct downloads: FTPS for MassIVE, parallelism, defer iProX
ypriverol May 27, 2026
a360378
JPOST PROXI listing, post-transfer size check, iProX accession guard
ypriverol May 27, 2026
df165ab
iProX direct downloads via PX XML + anonymous HTTPS at download.iprox…
ypriverol May 27, 2026
eedbef1
refactor(providers): scaffold providers/ package with Provider ABC + …
ypriverol May 27, 2026
912287d
refactor(providers): move FTP/HTTPS transport into providers/transpor…
ypriverol May 27, 2026
e16c870
refactor(providers): move cross-cutting utilities into providers/util.py
ypriverol May 27, 2026
41539e7
refactor(providers): extract MassiveProvider
ypriverol May 27, 2026
2010f63
refactor(providers): extract JpostProvider with PROXI + FTP listing
ypriverol May 27, 2026
e128a16
refactor(providers): extract IproxProvider with PX XML listing
ypriverol May 27, 2026
835d77b
test(providers): verify BaseDirectDownloadProvider URL-scheme partiti…
ypriverol May 27, 2026
3e358ae
refactor(providers): extract PrideProvider with multi-protocol fallback
ypriverol May 27, 2026
6fd743d
refactor(providers): rewire Files facade through Registry
ypriverol May 27, 2026
6d5637b
test: integration test for PRIDE multi-protocol fallback through facade
ypriverol May 27, 2026
f60a69c
chore(release): bump version to 0.0.17
ypriverol May 27, 2026
02de401
refactor(commands): scaffold commands/ package
ypriverol May 27, 2026
c165345
refactor(commands): move ProteomeXchange XML download into commands/p…
ypriverol May 27, 2026
1a0a9b8
refactor(commands): move download_files_by_list into commands/by_list.py
ypriverol May 27, 2026
ced7415
refactor(commands): move download_files_by_url into commands/by_url.py
ypriverol May 27, 2026
e3d694b
refactor(providers): move ProteomeXchange from commands/ to providers…
ypriverol May 27, 2026
6bc8e5a
fix(files): restore missing imports and hoist lazy imports to module top
ypriverol May 27, 2026
31 changes: 24 additions & 7 deletions README.md
@@ -8,7 +8,7 @@

You can:
- download public and private PRIDE files
-- download public MassIVE datasets directly from `MSV...` accessions
+- download public MassIVE (`MSV...`), JPOST (`JPST...`), and iProX (`IPX...`) datasets directly. MassIVE goes through FTPS at `massive-ftp.ucsd.edu`; JPOST uses the JSON PROXI endpoint at `repository.jpostdb.org` for listings and `ftp.jpostdb.org` for transfers; iProX fetches the dataset's ProteomeXchange XML from `download.iprox.org` and downloads files over anonymous HTTPS
- download by category (`RAW`, `SEARCH`, `RESULT`, etc.)
- stream project and file metadata
- search projects by keyword and filters
@@ -80,15 +80,31 @@ pridepy download-all-public-raw-files \
  --checksum-check
```

-### 3) Download a public MassIVE dataset directly
+### 3) Download a public MassIVE, JPOST, or iProX dataset directly

```bash
+# MassIVE
pridepy download-all-public-raw-files \
  -a MSV000082297 \
  -o ./downloads/MSV000082297
+
+# JPOST
+pridepy download-all-public-raw-files \
+  -a JPST002311 \
+  -o ./downloads/JPST002311
+
+# iProX
+pridepy download-all-public-raw-files \
+  -a IPX0017413000 \
+  -o ./downloads/IPX0017413000
```

-For direct `MSV...` downloads, `pridepy` enumerates the dataset from MassIVE's public FTP tree. Raw downloads follow MassIVE's own collection layout, so `download-all-public-raw-files` downloads the files stored under the dataset's `raw/` collection.
+For these direct downloads, `pridepy` enumerates the dataset from the repository:
+- **MassIVE** lists files by walking the FTPS tree at `massive-ftp.ucsd.edu` (TLS is required by the server).
+- **JPOST** lists files through the JSON PROXI endpoint at `https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>` and downloads them from `ftp.jpostdb.org` over plain FTP. The PROXI listing avoids the source-IP connection limit JPOST enforces on FTP.
+- **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTPS. iProX exposes Aspera (`faspe://`) with username/password for very large bulk transfers; `pridepy` uses the public HTTPS endpoint instead so no iProX credentials are required.
+
+Raw downloads follow each repository's own collection layout, so `download-all-public-raw-files` downloads the files stored under the dataset's `raw/` collection. Direct downloads support resume (REST for FTP, byte-Range for HTTPS), per-file retries, parallel workers (`-w N` up to 3), and post-transfer size verification against the server-reported size.
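
Reviewer note: the resume and size-verification behaviour described in the paragraph above can be pictured with a small sketch. It is not taken from the `pridepy` code base; `resume_offset`, `range_header`, and `size_matches` are hypothetical helper names, and the only library facts relied on are that HTTP range requests use a `Range: bytes=N-` header and that `ftplib.FTP.retrbinary` accepts a `rest` offset.

```python
import os
from typing import Dict, Optional


def resume_offset(local_path: str) -> int:
    """Return the byte offset to resume from: the partial file's size, else 0."""
    return os.path.getsize(local_path) if os.path.exists(local_path) else 0


def range_header(offset: int) -> Dict[str, str]:
    """Build the HTTP header that asks the server for bytes from `offset` onward.

    For FTP, the same offset would be passed as `rest=offset` to
    ftplib.FTP.retrbinary instead of being sent as a header.
    """
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def size_matches(local_path: str, expected_size: Optional[int]) -> bool:
    """Post-transfer check: compare the local size with the server-reported size.

    Assumption: when the server reports no size there is nothing to compare,
    so the check passes.
    """
    if expected_size is None:
        return True
    return os.path.getsize(local_path) == expected_size
```

A caller would compute the offset once per attempt, open the local file in append mode when the offset is non-zero, and treat a failed `size_matches` as a reason to retry the file.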

### 4) Download only selected categories

@@ -99,7 +115,7 @@ pridepy download-all-public-category-files \
  -c RAW,SEARCH
```

-You can also request a specific MassIVE collection through the same category interface:
+You can also request a specific MassIVE / JPOST / iProX collection through the same category interface:

```bash
pridepy download-all-public-category-files \
@@ -244,14 +260,15 @@ print(f"RAW files: {len(raw_files)}")
print(raw_files[0]["fileName"])
```

-For MassIVE accessions, the same method returns the files found under the dataset's `raw/` collection:
+For MassIVE / JPOST / iProX accessions, the same method returns the files found under the dataset's `raw/` collection:

```python
from pridepy.files.files import Files

files = Files()
-raw_files = files.get_all_raw_file_list("MSV000082297")
-print(f"MassIVE raw files: {len(raw_files)}")
+for accession in ("MSV000082297", "JPST002311", "IPX0017413000"):
+    raw_files = files.get_all_raw_file_list(accession)
+    print(f"{accession} raw files: {len(raw_files)}")
```

### Example: search projects
20 changes: 20 additions & 0 deletions pridepy/commands/__init__.py
@@ -0,0 +1,20 @@
"""Cross-cutting download commands.

Each module under this package owns one user-facing command that doesn't
fit any single provider:

- ``by_url``: download a list of explicit URLs (ftp/http/https)
- ``by_list``: download a subset of a project's files by filename

ProteomeXchange used to live here too but moved to
:class:`pridepy.providers.proteomexchange.ProteomeXchangeProvider` because
it conforms to the ``Provider`` interface (takes an accession or URL and
returns file records). It is deliberately not auto-registered with the
provider registry — PXD/PRD accessions continue to route through
:class:`pridepy.providers.pride.PrideProvider`; ProteomeXchangeProvider is
the explicit gateway for the cross-repository XML view, invoked via the
``download-px-raw-files`` CLI command and ``Files.download_px_raw_files``.

The ``pridepy.files.files.Files`` facade keeps shim methods that
delegate here, so existing test patches on ``Files.X`` keep working.
"""
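
Reviewer note: the routing rule in this docstring, combined with the accession formats named in the README (`MSV...`, `JPST...`, `IPX...`), can be pictured as a prefix lookup. The sketch below is an illustration under assumptions, not the contents of `pridepy.providers.registry`: `PREFIX_TO_PROVIDER` and `resolve_provider_name` are hypothetical names, the provider class names are taken from the commit titles, and the real registry may match accessions differently.

```python
from typing import Dict

# Hypothetical prefix table built from the accession formats named in this PR.
# ProteomeXchangeProvider is intentionally absent: the docstring says it is not
# auto-registered, so PXD/PRD accessions resolve to the PRIDE provider.
PREFIX_TO_PROVIDER: Dict[str, str] = {
    "PXD": "PrideProvider",
    "PRD": "PrideProvider",
    "MSV": "MassiveProvider",
    "JPST": "JpostProvider",
    "IPX": "IproxProvider",
}


def resolve_provider_name(accession: str) -> str:
    """Return the provider class name whose prefix matches the accession."""
    normalized = accession.strip().upper()
    for prefix, provider_name in PREFIX_TO_PROVIDER.items():
        if normalized.startswith(prefix):
            return provider_name
    raise ValueError(f"No provider registered for accession {accession!r}")
```

Keeping the ProteomeXchange gateway out of the table mirrors the docstring's point that it is reached only through its explicit command, never through automatic accession routing.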
59 changes: 59 additions & 0 deletions pridepy/commands/by_list.py
@@ -0,0 +1,59 @@
"""Download a subset of project files identified by a filename list."""
import logging
from typing import List, Optional

from pridepy.providers import registry


def download_files_by_list(
    accession: str,
    file_names: List[str],
    output_folder: str,
    skip_if_downloaded_already: bool,
    protocol: str = "ftp",
    aspera_maximum_bandwidth: str = "100M",
    checksum_check: bool = False,
    parallel_files: int = 1,
) -> None:
    """Download a subset of project files identified by a filename list.

    Resolves each requested filename via the project metadata API and
    delegates to the provider's ``download_files`` so the existing batch +
    protocol fallback engine is reused.

    :param accession: PRIDE or MassIVE project accession (public)
    :param file_names: filenames to download
    :param output_folder: directory to write downloaded files into
    :param skip_if_downloaded_already: skip files already present locally
    :param protocol: preferred protocol; falls back across others on failure
    :param aspera_maximum_bandwidth: aspera ascp bandwidth cap
    :param checksum_check: download project checksums and validate
    :param parallel_files: number of files to download simultaneously for globus
    :raises ValueError: if ``file_names`` is empty or none match the project
    """
    if not file_names:
        raise ValueError("file_names must contain at least one filename")

    provider = registry.resolve(accession)
    all_files = provider.list_files(accession)

    requested = set(file_names)
    matched = [f for f in all_files if f.get("fileName") in requested]
    missing = sorted(requested - {f.get("fileName") for f in matched})
    if missing:
        logging.warning("Files not found in project %s: %s", accession, missing)
    if not matched:
        raise ValueError(
            f"No matching files in project {accession} for: {sorted(requested)}"
        )

    provider.download_files(
        accession=accession,
        records=matched,
        output_folder=output_folder,
        skip_if_downloaded_already=skip_if_downloaded_already,
        protocol=protocol,
        parallel_files=parallel_files,
        checksum_check=checksum_check,
        aspera_maximum_bandwidth=aspera_maximum_bandwidth,
    )
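
Reviewer note: the filename-matching step in `download_files_by_list` can be exercised without any network access. The sketch below restates that step as a standalone function; `split_matched_and_missing` is a hypothetical name introduced only for illustration, and the records are made up (real ones come from `provider.list_files(accession)`).

```python
from typing import Dict, Iterable, List, Tuple


def split_matched_and_missing(
    all_files: Iterable[Dict[str, str]], file_names: Iterable[str]
) -> Tuple[List[Dict[str, str]], List[str]]:
    """Split a request into records found in the project and names not found."""
    requested = set(file_names)
    # Matched records keep the order of the project listing, not of the request.
    matched = [f for f in all_files if f.get("fileName") in requested]
    # Anything requested but absent from the listing is reported, sorted by name.
    missing = sorted(requested - {f.get("fileName") for f in matched})
    return matched, missing


# Made-up records for illustration only.
records = [{"fileName": "a.raw"}, {"fileName": "b.raw"}, {"fileName": "c.mzML"}]
matched, missing = split_matched_and_missing(records, ["a.raw", "c.mzML", "zzz.raw"])
print([f["fileName"] for f in matched])  # ['a.raw', 'c.mzML']
print(missing)  # ['zzz.raw']
```

In the function under review, a non-empty `missing` list only triggers a warning, while an empty `matched` list raises `ValueError`, so a request with some valid names still proceeds.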