pridepy usage guide

This guide covers how to download data and query metadata with pridepy. For installation, see the README.

pridepy works with PRIDE accessions and, transparently, with native MassIVE (MSV…), JPOST (JPST…), and iProX (IPX…) accessions. The downloader supports ftp, aspera, s3, and globus: by default it starts with FTP, falls back across the remaining protocols when a transfer fails, and validates downloaded files (non-empty, and checksum validation when enabled).

Command overview

pridepy --help

| Command | Purpose |
| --- | --- |
| download-all-public-raw-files | Download every public RAW file of a dataset |
| download-all-public-category-files | Download files of one or more categories (RAW, SEARCH, …) |
| download-file-by-name | Download a single file (public or private) |
| download-files-by-list | Download a named subset of files from a manifest/CSV |
| download-files-by-url | Download files from raw http/https/ftp URLs |
| download-px-raw-files | Download RAW files resolved from a ProteomeXchange accession |
| list-private-files | List files of a private project (needs credentials) |
| stream-files-metadata | Stream file metadata (one project or all) to JSON |
| stream-projects-metadata | Stream all project metadata to JSON |
| search-projects-by-keywords-and-filters | Search projects by keyword and filters |

The download commands work for PRIDE accessions and, transparently, for native MassIVE (MSV…), JPOST (JPST…), and iProX (IPX…) accessions — see Download from ProteomeXchange and other repositories.

PRIDE file downloads

PRIDE downloads start with FTP and fall back across the remaining protocols (ftp -> aspera -> s3 -> globus) when a transfer fails. They support resume, per-file retries, parallel workers, and optional checksum validation. Empty or corrupt files are retried automatically.
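The fallback chain described above can be pictured as a simple loop over protocols. The following is a minimal sketch, not pridepy's actual implementation: `download_with_fallback` and the `downloaders` mapping are hypothetical names introduced for illustration, and only the protocol order is taken from this guide.

```python
# Order taken from the guide: FTP first, then the remaining protocols.
FALLBACK_ORDER = ("ftp", "aspera", "s3", "globus")


def download_with_fallback(file_name, downloaders):
    """Try each protocol in FALLBACK_ORDER; return the one that succeeded.

    `downloaders` is a hypothetical mapping of protocol name to a callable
    that raises an exception when the transfer fails.
    """
    errors = {}
    for protocol in FALLBACK_ORDER:
        fetch = downloaders.get(protocol)
        if fetch is None:
            continue  # protocol not configured in this sketch
        try:
            fetch(file_name)
            return protocol
        except Exception as exc:  # any failure moves on to the next protocol
            errors[protocol] = exc
    raise RuntimeError(f"all protocols failed for {file_name}: {errors}")
```

A real downloader would additionally retry each file and validate the result before declaring success; this sketch only shows the ordering.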

Common download options

These options are shared by download-all-public-raw-files, download-all-public-category-files, download-file-by-name, and download-files-by-list:

| Option | Description | Default |
| --- | --- | --- |
| -a, --accession | Dataset accession (e.g. PXD008644) | required |
| -o, --output-folder | Destination directory | required |
| -p, --protocol | Transfer protocol: ftp, aspera, globus, s3 (FTP-first with fallback) | ftp |
| -w, --parallel-files | Download 1–3 files concurrently; primarily for globus; not available on download-file-by-name | 1 |
| --skip-if-downloaded-already | Resume: skip files already present locally | off |
| --checksum-check | Download PRIDE checksums and validate each file | off |
| --aspera-maximum-bandwidth | Aspera cap, e.g. 50M, 100M, 200M (Aspera only) | 100M |
| --preserve-structure | Recreate the dataset's subdirectory layout (e.g. raw/…/) under the output folder instead of downloading flat | off |

By default, files are downloaded flat into the output folder (no raw/…/ subdirectories). When two files would collapse to the same name, later ones get a numeric suffix (run.raw, run_1.raw). Pass --preserve-structure to keep the dataset's original directory layout.
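To make the flat-naming rule concrete, here is a minimal sketch of how colliding names could be given numeric suffixes. `flatten_names` is a hypothetical helper, not a pridepy function, and the handling of multi-part extensions such as `.tar.gz` is an assumption (only the last extension is kept after the suffix).

```python
from pathlib import PurePosixPath


def flatten_names(remote_paths):
    """Map each remote path to a unique flat file name.

    The first file keeps its name; later files with the same name get
    _1, _2, ... inserted before the extension (run.raw, run_1.raw).
    """
    used = set()
    result = {}
    for remote in remote_paths:
        path = PurePosixPath(remote)
        candidate = path.name
        counter = 1
        while candidate in used:
            candidate = f"{path.stem}_{counter}{path.suffix}"
            counter += 1
        used.add(candidate)
        result[remote] = candidate
    return result
```

For example, `flatten_names(["raw/a/run.raw", "raw/b/run.raw"])` maps the second path to `run_1.raw`.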

Download all raw files (robust mode)

pridepy download-all-public-raw-files \
  -a PXD008644 \
  -o ./downloads/PXD008644 \
  --checksum-check

Continue an interrupted download safely by adding --skip-if-downloaded-already:

pridepy download-all-public-raw-files \
  -a PXD008644 \
  -o ./downloads/PXD008644 \
  --skip-if-downloaded-already \
  --checksum-check
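The validation step (non-empty file, plus a checksum comparison when enabled) can be sketched as follows. This is an illustration only: `is_valid_download` is a hypothetical helper, and the default hash algorithm (`sha1`) is an assumption that should be checked against the checksum file of the dataset being downloaded.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large RAW files are never loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_download(path, expected_digest=None, algorithm="sha1"):
    """Return True if the file exists, is non-empty, and matches the digest.

    When expected_digest is None only the non-empty check is applied,
    mirroring a run without --checksum-check.
    """
    path = Path(path)
    if not path.is_file() or path.stat().st_size == 0:
        return False
    if expected_digest is None:
        return True
    return file_digest(path, algorithm) == expected_digest.strip().lower()
```

A file that fails this check would be a candidate for the automatic retry mentioned earlier.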

Download only selected categories

pridepy download-all-public-category-files \
  -a PXD022105 \
  -o ./downloads/PXD022105 \
  -c RAW,SEARCH

-c, --category takes one or more comma-separated categories. Valid values: RAW, PEAK, SEARCH, RESULT, SPECTRUM_LIBRARY, OTHER, FASTA.
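If you build the `-c` value programmatically, it can help to validate it first. The sketch below uses the category names listed above; `parse_categories` is a hypothetical helper, and its strict, case-sensitive matching is an assumption (the CLI itself may be more lenient).

```python
# Category names as listed in this guide.
VALID_CATEGORIES = (
    "RAW", "PEAK", "SEARCH", "RESULT", "SPECTRUM_LIBRARY", "OTHER", "FASTA",
)


def parse_categories(value):
    """Split a comma-separated category string and reject unknown names."""
    categories = [item.strip() for item in value.split(",") if item.strip()]
    if not categories:
        raise ValueError("at least one category is required")
    unknown = [c for c in categories if c not in VALID_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {', '.join(unknown)}")
    return categories
```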

Download one file by name

pridepy download-file-by-name \
  -a PXD022105 \
  -f checksum.txt \
  -o ./downloads/PXD022105 \
  --checksum-check

-f, --file-name is the file to download.

Download a named subset of files (manifest)

pridepy download-files-by-list \
  -a PXD001819 \
  -F files.txt \
  -o ./downloads/PXD001819 \
  --checksum-check

files.txt is one filename per line (blank lines and # comments are ignored). Each filename is resolved against the project metadata and downloaded via the same batch + protocol-fallback engine as download-all-public-raw-files. Use -f a.raw,b.raw,c.raw instead of -F for a small inline list (you can combine both).
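The manifest format is simple enough to generate or check with a few lines of Python. The following sketch mirrors the rules above (one name per line, blank lines and `#` comments ignored); `parse_manifest` and `merge_file_lists` are hypothetical helpers, and treating only whole-line `#` comments as comments, as well as de-duplicating while preserving order, are assumptions.

```python
def parse_manifest(text):
    """Return the file names in a manifest, skipping blanks and # comments."""
    names = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        names.append(entry)
    return names


def merge_file_lists(inline, manifest_text):
    """Combine an inline a.raw,b.raw list with manifest entries, no duplicates."""
    inline_names = [n.strip() for n in inline.split(",") if n.strip()]
    merged = []
    seen = set()
    for name in inline_names + parse_manifest(manifest_text):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged
```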

Download files from raw URLs

pridepy download-files-by-url \
  -F urls.txt \
  -o ./downloads/urls

urls.txt is one fully-qualified URL per line. Schemes http, https, and ftp are dispatched to the matching downloader. Use -u, --urls for one or more comma-separated URLs, e.g. --urls https://a.com/x.raw,ftp://b.com/y.raw (URLs containing literal commas must use a manifest file instead).
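The URL handling described above (comma-separated input, dispatch by scheme) can be sketched with the standard library. `split_urls` and `group_by_scheme` are hypothetical helpers, not part of pridepy; the naive comma split is exactly why URLs containing literal commas need a manifest file.

```python
from urllib.parse import urlsplit

# Schemes the guide says are dispatched to a matching downloader.
SUPPORTED_SCHEMES = {"http", "https", "ftp"}


def split_urls(value):
    """Split a --urls style string on commas, dropping empty entries."""
    return [url.strip() for url in value.split(",") if url.strip()]


def group_by_scheme(urls):
    """Group URLs by scheme and reject anything other than http/https/ftp."""
    groups = {}
    for url in urls:
        scheme = urlsplit(url).scheme.lower()
        if scheme not in SUPPORTED_SCHEMES:
            raise ValueError(f"unsupported or missing scheme in URL: {url}")
        groups.setdefault(scheme, []).append(url)
    return groups
```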

Command-specific options:

| Option | Description | Default |
| --- | --- | --- |
| -F, --url-list | Manifest file, one URL per line | |
| -u, --urls | Comma-separated URL(s) | |
| -p, --protocol | ftp (per-scheme) or globus (resume-capable http/https) | ftp |
| -w, --parallel-files | Download 1–3 files concurrently (any scheme) | 1 |
| --checksum-check | Validate against PRIDE checksums (accession inferred from PRIDE URL paths; only PRIDE archive URLs supported) | off |

Private (restricted) files

List the files of a private project with your PRIDE credentials:

pridepy list-private-files -a PXD022105 -u YOUR_USER -p YOUR_PASSWORD

Download a private file by passing --username/--password to download-file-by-name:

pridepy download-file-by-name \
  -a PXD022105 \
  -f checksum.txt \
  -o ./downloads/private \
  --username YOUR_USER \
  --password YOUR_PASSWORD

Metadata and search

Stream all project metadata to JSON

pridepy stream-projects-metadata -o all_pride_projects.json

| Option | Description | Default |
| --- | --- | --- |
| -o, --output-file | JSON file to write all project metadata to | required |

Stream file metadata

# All file metadata for one accession
pridepy stream-files-metadata -a PXD005011 -o PXD005011_files.json

| Option | Description | Default |
| --- | --- | --- |
| -o, --output-file | JSON file to write file metadata to | required |
| -a, --accession | Limit to one project (omit to stream all files) | optional |

Search projects by keywords and filters

pridepy search-projects-by-keywords-and-filters \
  -k human \
  -f projectTags==ProteomeTools,organismsPart==Pancreas \
  -sd DESC \
  -sf accession \
  -sf submissionDate

| Option | Description | Default |
| --- | --- | --- |
| -k, --keyword | Keyword searched across project fields | required |
| -f, --filters | field==value filters, comma-separated (e.g. accession==PRD000001) | |
| -ps, --page-size | Results per page (1–1000) | 100 |
| -p, --page | Page number (0-based) | 0 |
| -sd, --sort-direction | ASC or DESC | DESC |
| -sf, --sort-fields | Sort field(s), repeatable. One of: accession, submissionDate, diseases, organismsPart, organisms, instruments, softwares, avgDownloadsPerFile, downloadCount, publicationDate | submissionDate |
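The `field==value` filter syntax is easy to assemble or sanity-check in Python before passing it to `-f`. This is a minimal sketch under the assumption that values contain neither commas nor `==`; `parse_filters` and `build_filter_string` are hypothetical helpers, not pridepy functions.

```python
def parse_filters(value):
    """Parse 'a==x,b==y' into a list of (field, value) tuples."""
    filters = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        field, separator, field_value = item.partition("==")
        if not separator or not field.strip() or not field_value.strip():
            raise ValueError(f"expected field==value, got: {item!r}")
        filters.append((field.strip(), field_value.strip()))
    return filters


def build_filter_string(filters):
    """Inverse of parse_filters: join (field, value) pairs for -f."""
    return ",".join(f"{field}=={value}" for field, value in filters)
```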

Download from ProteomeXchange and other repositories

A ProteomeXchange (PXD… / PRD…) accession is a cross-repository identifier: the dataset may be hosted at PRIDE, MassIVE, JPOST, iProX, or elsewhere. pridepy lets you start from the ProteomeXchange accession, or go straight to the hosting repository using its native accession.

Start from a ProteomeXchange accession

download-px-raw-files resolves the dataset's ProteomeXchange XML and downloads the RAW files it references, regardless of which repository hosts them:

pridepy download-px-raw-files \
  -a PXD039236 \
  -o ./downloads/PXD039236

| Option | Description | Default |
| --- | --- | --- |
| -a, --accession | ProteomeXchange accession (e.g. PXD039236); --px is a deprecated alias | required |
| -o, --output-folder | Destination directory | required |
| --skip-if-downloaded-already | Skip files already present locally | off |

Go directly to the hosting repository (native MassIVE / JPOST / iProX accessions)

Datasets without a ProteomeXchange accession, or whose native accession you already know, can be downloaded directly. The standard download commands accept MassIVE, JPOST, and iProX accessions transparently:

# MassIVE (FTPS at massive-ftp.ucsd.edu)
pridepy download-all-public-raw-files \
  -a MSV000082297 \
  -o ./downloads/MSV000082297

# JPOST (PROXI listing + ftp.jpostdb.org)
pridepy download-all-public-raw-files \
  -a JPST002311 \
  -o ./downloads/JPST002311

# iProX (ProteomeXchange XML + anonymous HTTP at download.iprox.org)
pridepy download-all-public-raw-files \
  -a IPX0017413000 \
  -o ./downloads/IPX0017413000

How each repository is enumerated:

  • MassIVE walks the FTPS tree at massive-ftp.ucsd.edu (the server requires TLS). MassIVE distributes datasets across versioned root directories (/v01/vNN); pridepy discovers the correct root automatically. If FTP/FTPS is blocked by the network, pridepy falls back to HTTPS: it lists the dataset from the GNPS2 file index (datasetcache.gnps2.org) and downloads each file from the ProteoSAFe endpoint at massive.ucsd.edu (byte-identical to the FTPS copy).
  • JPOST lists files through the JSON PROXI endpoint at https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx> and downloads from ftp.jpostdb.org over plain FTP. The PROXI listing avoids the source-IP connection limit JPOST enforces on FTP.
  • iProX fetches the dataset's ProteomeXchange XML from http://download.iprox.org/<accession>/PX_<accession>.xml, then downloads each referenced file from the same host over anonymous HTTP (with Range support for resume). iProX also exposes Aspera (faspe://) with username/password for very large bulk transfers; pridepy uses the public HTTP endpoint so no iProX credentials are required.
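The routing described in the list above can be sketched as a prefix check followed by URL construction. This is an illustration, not pridepy's internal code: `classify_accession` and `listing_url` are hypothetical names, the digit-only accession patterns are an assumption, and the two URL templates are copied from the descriptions above.

```python
import re

# Accession prefixes mentioned in this guide; digit-only tails are assumed.
ACCESSION_PATTERNS = {
    "massive": re.compile(r"MSV\d+"),
    "jpost": re.compile(r"JPST\d+"),
    "iprox": re.compile(r"IPX\d+"),
    "proteomexchange": re.compile(r"P[XR]D\d+"),
}


def classify_accession(accession):
    """Return which repository family an accession belongs to."""
    for repository, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return repository
    raise ValueError(f"unrecognised accession: {accession}")


def listing_url(accession):
    """Return the HTTP listing URL for JPOST/iProX, or None for the others.

    MassIVE is enumerated by walking FTPS rather than through a single URL,
    so this sketch returns None for it.
    """
    repository = classify_accession(accession)
    if repository == "jpost":
        return f"https://repository.jpostdb.org/proxi/datasets/{accession}"
    if repository == "iprox":
        return f"http://download.iprox.org/{accession}/PX_{accession}.xml"
    return None
```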

download-all-public-raw-files retrieves the files stored under the dataset's raw/ collection. These direct downloads support resume (REST for FTP, byte-Range for HTTP), per-file retries, parallel workers (-w up to 3), and post-transfer size verification against the server-reported size. By default files are written flat into the output folder; pass --preserve-structure to keep the dataset's sub-directory layout.
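For the HTTP case, resume and size verification boil down to two small checks. The sketch below shows the idea only; `resume_headers` and `size_matches` are hypothetical helpers, and how pridepy obtains the server-reported size is not covered here.

```python
import os


def resume_headers(partial_path):
    """Build a Range header that continues after the bytes already on disk.

    Returns an empty dict when nothing has been downloaded yet, so the
    request fetches the whole file.
    """
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def size_matches(path, expected_size):
    """Post-transfer check: the local size must equal the server-reported size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size
```

A server that honours the Range header answers with status 206 and only the missing bytes, which are then appended to the partial file. The FTP equivalent is the REST command mentioned above.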

You can also request a specific collection from these repositories through the same category interface:

pridepy download-all-public-category-files \
  -a MSV000082297 \
  -o ./downloads/MSV000082297-results \
  -c RESULT

Python API examples

Breaking change (0.0.16): the legacy pridepy.files.files.Files class has been removed. Replace from pridepy.files.files import Files with from pridepy.download.client import Client; Client exposes the same public methods (get_all_raw_file_list, download_all_raw_files, get_submitted_file_path_prefix, download_file_by_name, download_all_category_files, download_px_raw_files, …).

Get raw files for a project

from pridepy.download.client import Client

client = Client()
raw_files = client.get_all_raw_file_list("PXD008644")
print(f"RAW files: {len(raw_files)}")
if raw_files:  # avoid an IndexError when the project has no RAW files
    print(raw_files[0]["fileName"])

For MassIVE / JPOST / iProX accessions, the same method returns the files found under the dataset's raw/ collection:

from pridepy.download.client import Client

client = Client()
for accession in ("MSV000082297", "JPST002311", "IPX0017413000"):
    raw_files = client.get_all_raw_file_list(accession)
    print(f"{accession} raw files: {len(raw_files)}")

Download all raw files for a project

from pridepy.download.client import Client

client = Client()
client.download_all_raw_files(
    accession="PXD008644",
    output_folder="./downloads/PXD008644",
    skip_if_downloaded_already=True,
    protocol="ftp",
    aspera_maximum_bandwidth="100M",
    checksum_check=True,
)

Search projects

from pridepy.project.project import Project

project = Project()
results = project.search_by_keywords_and_filters(
    keyword="PXD009476",
    query_filter="",
    page_size=25,
    page=0,
    sort_direction="DESC",
    sort_fields="accession",
)
print(f"Hits: {len(results)}")