This guide covers how to download data and query metadata with `pridepy`. For
installation, see the README.

`pridepy` works with PRIDE accessions and, transparently, with native MassIVE
(MSV…), JPOST (JPST…), and iProX (IPX…) accessions. The downloader
supports `ftp`, `aspera`, `s3`, and `globus`: by default it starts with FTP,
falls back across the remaining protocols when a transfer fails, and validates
downloaded files (non-empty, and checksum validation when enabled).

- Command overview
- PRIDE file downloads
- Metadata and search
- Download from ProteomeXchange and other repositories
- Python API examples

```bash
pridepy --help
```

| Command | Purpose |
|---|---|
| `download-all-public-raw-files` | Download every public RAW file of a dataset |
| `download-all-public-category-files` | Download files of one or more categories (RAW, SEARCH, …) |
| `download-file-by-name` | Download a single file (public or private) |
| `download-files-by-list` | Download a named subset of files from a manifest/CSV |
| `download-files-by-url` | Download files from raw http/https/ftp URLs |
| `download-px-raw-files` | Download RAW files resolved from a ProteomeXchange accession |
| `list-private-files` | List files of a private project (needs credentials) |
| `stream-files-metadata` | Stream file metadata (one project or all) to JSON |
| `stream-projects-metadata` | Stream all project metadata to JSON |
| `search-projects-by-keywords-and-filters` | Search projects by keyword and filters |

The download commands work for PRIDE accessions and, transparently, for native
MassIVE (MSV…), JPOST (JPST…), and iProX (IPX…) accessions — see
Download from ProteomeXchange and other repositories.
PRIDE downloads start with FTP and fall back across the remaining protocols
(ftp -> aspera -> s3 -> globus) when a transfer fails. They support resume,
per-file retries, parallel workers, and optional checksum validation. Empty or
corrupt files are retried automatically.
These options are shared by `download-all-public-raw-files`,
`download-all-public-category-files`, `download-file-by-name`, and
`download-files-by-list`:

| Option | Description | Default |
|---|---|---|
| `-a, --accession` | Dataset accession (e.g. `PXD008644`) | required |
| `-o, --output-folder` | Destination directory | required |
| `-p, --protocol` | Transfer protocol: `ftp`, `aspera`, `globus`, `s3` (FTP-first with fallback) | `ftp` |
| `-w, --parallel-files` | Download 1–3 files concurrently; primarily for `globus`. Not available on `download-file-by-name` | `1` |
| `--skip-if-downloaded-already` | Resume: skip files already present locally | off |
| `--checksum-check` | Download PRIDE checksums and validate each file | off |
| `--aspera-maximum-bandwidth` | Aspera cap, e.g. `50M`, `100M`, `200M` (Aspera only) | `100M` |
| `--preserve-structure` | Recreate the dataset's subdirectory layout (e.g. `raw/…/`) under the output folder instead of downloading flat | off |

By default, files are downloaded flat into the output folder (no
raw/…/ subdirectories). When two files would collapse to the same name,
later ones get a numeric suffix (run.raw, run_1.raw). Pass
--preserve-structure to keep the dataset's original directory layout.
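
The numeric-suffix rule for flat downloads can be illustrated as follows. This is a minimal sketch of the naming scheme described above (`run.raw`, `run_1.raw`), not the code `pridepy` runs; `flatten_names` is a hypothetical helper and the remote paths are made up.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def flatten_names(remote_paths: Iterable[str]) -> List[str]:
    """Map remote paths to flat file names, suffixing later duplicates."""
    used = set()
    result = []
    for path in remote_paths:
        parsed = PurePosixPath(path)
        candidate = parsed.name
        counter = 0
        # Later files that collapse to a taken name get _1, _2, ... before the extension.
        while candidate in used:
            counter += 1
            candidate = f"{parsed.stem}_{counter}{parsed.suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


flat = flatten_names(
    ["raw/a/run.raw", "raw/b/run.raw", "raw/c/run.raw", "raw/other.raw"]
)
print(flat)
```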

```bash
pridepy download-all-public-raw-files \
    -a PXD008644 \
    -o ./downloads/PXD008644 \
    --checksum-check
```

Continue an interrupted download safely by adding `--skip-if-downloaded-already`:

```bash
pridepy download-all-public-raw-files \
    -a PXD008644 \
    -o ./downloads/PXD008644 \
    --skip-if-downloaded-already \
    --checksum-check
```

Download only selected file categories:

```bash
pridepy download-all-public-category-files \
    -a PXD022105 \
    -o ./downloads/PXD022105 \
    -c RAW,SEARCH
```

`-c, --category` takes one or more comma-separated categories. Valid values:
`RAW`, `PEAK`, `SEARCH`, `RESULT`, `SPECTRUM_LIBRARY`, `OTHER`, `FASTA`.
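
When building the `-c` value in a script, it can help to validate it before invoking the CLI. The sketch below assumes case-sensitive matching against the values listed above; `parse_categories` is a hypothetical helper, not part of `pridepy`.

```python
from typing import List

# Category names copied from the list of valid values above.
VALID_CATEGORIES = (
    "RAW", "PEAK", "SEARCH", "RESULT", "SPECTRUM_LIBRARY", "OTHER", "FASTA",
)


def parse_categories(value: str) -> List[str]:
    """Split a comma-separated category string and reject unknown names."""
    categories = [part.strip() for part in value.split(",") if part.strip()]
    if not categories:
        raise ValueError("at least one category is required")
    unknown = [name for name in categories if name not in VALID_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return categories


selected = parse_categories("RAW,SEARCH")
print(selected)
```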

```bash
pridepy download-file-by-name \
    -a PXD022105 \
    -f checksum.txt \
    -o ./downloads/PXD022105 \
    --checksum-check
```

`-f, --file-name` is the file to download.

```bash
pridepy download-files-by-list \
    -a PXD001819 \
    -F files.txt \
    -o ./downloads/PXD001819 \
    --checksum-check
```

`files.txt` lists one filename per line (blank lines and `#` comments are
ignored). Each filename is resolved against the project metadata and downloaded
via the same batch + protocol-fallback engine as `download-all-public-raw-files`.
Use `-f a.raw,b.raw,c.raw` instead of `-F` for a small inline list (you can
combine both).
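
The manifest rules above are simple enough to sketch. The helpers below are hypothetical and only mirror what the text states: blank lines and `#` comment lines are skipped, and an inline comma-separated list can be combined with the manifest. Treating only whole-line comments as comments and dropping duplicate names are assumptions of this sketch.

```python
from typing import Iterable, List


def read_manifest(lines: Iterable[str]) -> List[str]:
    """Return file names from manifest lines, skipping blanks and # comments."""
    names = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.append(line)
    return names


def merge_file_names(inline: str, manifest_lines: Iterable[str]) -> List[str]:
    """Combine an inline a.raw,b.raw list with manifest entries, keeping order."""
    names = [part.strip() for part in inline.split(",") if part.strip()]
    names.extend(read_manifest(manifest_lines))
    seen = set()
    unique = []
    for name in names:
        if name not in seen:  # assumption: a repeated name is requested once
            seen.add(name)
            unique.append(name)
    return unique


manifest = """# replicate 1
a.raw

d.raw
"""
merged = merge_file_names("a.raw,b.raw", manifest.splitlines())
print(merged)
```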

```bash
pridepy download-files-by-url \
    -F urls.txt \
    -o ./downloads/urls
```

`urls.txt` lists one fully-qualified URL per line. Schemes `http`, `https`, and
`ftp` are dispatched to the matching downloader. Use `-u, --urls` for one or
more comma-separated URLs, e.g. `--urls https://a.com/x.raw,ftp://b.com/y.raw`
(URLs containing literal commas must use a manifest file instead).
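
To see why comma-containing URLs need a manifest, consider a sketch of scheme-based dispatch. `group_by_scheme` is a hypothetical helper built on the standard library's `urllib.parse`; it is not how `pridepy` is necessarily implemented, but it shows the splitting and per-scheme grouping the text describes.

```python
from typing import Dict, List
from urllib.parse import urlsplit

SUPPORTED_SCHEMES = {"http", "https", "ftp"}


def group_by_scheme(urls_argument: str) -> Dict[str, List[str]]:
    """Split a comma-separated --urls value and group the URLs by scheme."""
    groups: Dict[str, List[str]] = {}
    for url in urls_argument.split(","):
        url = url.strip()
        if not url:
            continue
        scheme = urlsplit(url).scheme.lower()
        if scheme not in SUPPORTED_SCHEMES:
            raise ValueError(f"unsupported or missing scheme in {url!r}")
        groups.setdefault(scheme, []).append(url)
    return groups


groups = group_by_scheme("https://a.com/x.raw,ftp://b.com/y.raw")
print(groups)

# A literal comma inside a URL is split into a fragment without a scheme.
try:
    group_by_scheme("https://a.com/x,y.raw")
except ValueError as exc:
    print(f"rejected: {exc}")
```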
Command-specific options:

| Option | Description | Default |
|---|---|---|
| `-F, --url-list` | Manifest file, one URL per line | none |
| `-u, --urls` | Comma-separated URL(s) | none |
| `-p, --protocol` | `ftp` (per-scheme) or `globus` (resume-capable http/https) | `ftp` |
| `-w, --parallel-files` | Download 1–3 files concurrently (any scheme) | `1` |
| `--checksum-check` | Validate against PRIDE checksums (accession inferred from PRIDE URL paths; only PRIDE archive URLs supported) | off |

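
Inferring an accession from a URL path, as `--checksum-check` does for PRIDE archive URLs, could look roughly like this. The regular expression, the `infer_accession` helper, and the example URLs are all assumptions made for illustration; the actual matching rules in `pridepy` may differ.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Assumption: a PRIDE accession appears as its own path segment (PXD or PRD plus digits).
_ACCESSION = re.compile(r"^(PXD|PRD)\d+$")


def infer_accession(url: str) -> Optional[str]:
    """Return the first path segment that looks like a PRIDE accession, if any."""
    for segment in urlsplit(url).path.split("/"):
        if _ACCESSION.match(segment):
            return segment
    return None


# Illustrative URLs only; the host and directory layout are made up.
found = infer_accession("https://example.org/archive/2020/01/PXD008644/run.raw")
missing = infer_accession("https://example.org/files/run.raw")
print(found, missing)
```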
List the files of a private project with your PRIDE credentials:

```bash
pridepy list-private-files -a PXD022105 -u YOUR_USER -p YOUR_PASSWORD
```

Download a private file by passing `--username`/`--password` to
`download-file-by-name`:

```bash
pridepy download-file-by-name \
    -a PXD022105 \
    -f checksum.txt \
    -o ./downloads/private \
    --username YOUR_USER \
    --password YOUR_PASSWORD
```

Stream the metadata of all projects to a JSON file:

```bash
pridepy stream-projects-metadata -o all_pride_projects.json
```

| Option | Description | Default |
|---|---|---|
| `-o, --output-file` | JSON file to write all project metadata to | required |


```bash
# All file metadata for one accession
pridepy stream-files-metadata -a PXD005011 -o PXD005011_files.json
```

| Option | Description | Default |
|---|---|---|
| `-o, --output-file` | JSON file to write file metadata to | required |
| `-a, --accession` | Limit to one project (omit to stream all files) | optional |


```bash
pridepy search-projects-by-keywords-and-filters \
    -k human \
    -f projectTags==ProteomeTools,organismsPart==Pancreas \
    -sd DESC \
    -sf accession \
    -sf submissionDate
```

| Option | Description | Default |
|---|---|---|
| `-k, --keyword` | Keyword searched across project fields | required |
| `-f, --filters` | `field==value` filters, comma-separated (e.g. `accession==PRD000001`) | none |
| `-ps, --page-size` | Results per page (1–1000) | `100` |
| `-p, --page` | Page number (0-based) | `0` |
| `-sd, --sort-direction` | `ASC` or `DESC` | `DESC` |
| `-sf, --sort-fields` | Sort field(s), repeatable. One of: `accession`, `submissionDate`, `diseases`, `organismsPart`, `organisms`, `instruments`, `softwares`, `avgDownloadsPerFile`, `downloadCount`, `publicationDate` | `submissionDate` |

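
The `field==value` filter syntax is easy to assemble programmatically. The helper below is a hypothetical convenience, not part of `pridepy`; it only joins pairs in the documented format and refuses values that would clash with the separators.

```python
from typing import Dict


def build_filter_string(filters: Dict[str, str]) -> str:
    """Join field/value pairs as field==value, separated by commas."""
    parts = []
    for field, value in filters.items():
        if "," in value or "==" in value:
            # Such values could not be told apart from the separators.
            raise ValueError(f"value for {field!r} contains a separator: {value!r}")
        parts.append(f"{field}=={value}")
    return ",".join(parts)


filter_string = build_filter_string(
    {"projectTags": "ProteomeTools", "organismsPart": "Pancreas"}
)
print(filter_string)
```

The resulting string matches the value passed to `-f` in the command above.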
A ProteomeXchange (PXD… / PRD…) accession is a cross-repository identifier:
the dataset may be hosted at PRIDE, MassIVE, JPOST, iProX, or elsewhere.
pridepy lets you start from the ProteomeXchange accession, or go straight to
the hosting repository using its native accession.
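
A script that handles mixed accession lists may want to know which kind of accession it has before choosing a command. The sketch below classifies by the prefixes named in this guide; `classify_accession` is a hypothetical helper, and treating "prefix followed by digits" as the full format is an assumption.

```python
# Prefixes taken from this guide; the digits-only check is an assumption.
PREFIXES = (
    ("PXD", "ProteomeXchange"),
    ("PRD", "ProteomeXchange"),
    ("MSV", "MassIVE"),
    ("JPST", "JPOST"),
    ("IPX", "iProX"),
)


def classify_accession(accession: str) -> str:
    """Return the accession family implied by the prefix."""
    cleaned = accession.strip().upper()
    for prefix, family in PREFIXES:
        if cleaned.startswith(prefix) and cleaned[len(prefix):].isdigit():
            return family
    raise ValueError(f"unrecognised accession: {accession!r}")


for example in ("PXD039236", "MSV000082297", "JPST002311", "IPX0017413000"):
    print(example, "->", classify_accession(example))
```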
download-px-raw-files resolves the dataset's ProteomeXchange XML and downloads
the RAW files it references, regardless of which repository hosts them:

```bash
pridepy download-px-raw-files \
    -a PXD039236 \
    -o ./downloads/PXD039236
```

| Option | Description | Default |
|---|---|---|
| `-a, --accession` | ProteomeXchange accession (e.g. `PXD039236`). `--px` is a deprecated alias | required |
| `-o, --output-folder` | Destination directory | required |
| `--skip-if-downloaded-already` | Skip files already present locally | off |

Datasets that do not have a ProteomeXchange accession — or where you already know the native accession — can be downloaded directly. The standard download commands accept MassIVE, JPOST, and iProX accessions transparently:

```bash
# MassIVE (FTPS at massive-ftp.ucsd.edu)
pridepy download-all-public-raw-files \
    -a MSV000082297 \
    -o ./downloads/MSV000082297

# JPOST (PROXI listing + ftp.jpostdb.org)
pridepy download-all-public-raw-files \
    -a JPST002311 \
    -o ./downloads/JPST002311

# iProX (ProteomeXchange XML + anonymous HTTP at download.iprox.org)
pridepy download-all-public-raw-files \
    -a IPX0017413000 \
    -o ./downloads/IPX0017413000
```

How each repository is enumerated:

- MassIVE walks the FTPS tree at `massive-ftp.ucsd.edu` (the server requires TLS). MassIVE distributes datasets across versioned root directories (`/v01`–`/vNN`); `pridepy` discovers the correct root automatically. If FTP/FTPS is blocked by the network, `pridepy` falls back to HTTPS: it lists the dataset from the GNPS2 file index (`datasetcache.gnps2.org`) and downloads each file from the ProteoSAFe endpoint at `massive.ucsd.edu` (byte-identical to the FTPS copy).
- JPOST lists files through the JSON PROXI endpoint at `https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>` and downloads from `ftp.jpostdb.org` over plain FTP. The PROXI listing avoids the source-IP connection limit JPOST enforces on FTP.
- iProX fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera (`faspe://`) with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required.

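
The listing endpoints named above follow fixed URL templates, which can be written out as a small sketch. The two helper functions are hypothetical; only the URL patterns come from the text, and nothing here contacts the network.

```python
def jpost_proxi_url(accession: str) -> str:
    """URL of the JSON PROXI listing for a JPOST dataset."""
    return f"https://repository.jpostdb.org/proxi/datasets/{accession}"


def iprox_px_xml_url(accession: str) -> str:
    """URL of the ProteomeXchange XML for an iProX dataset."""
    return f"http://download.iprox.org/{accession}/PX_{accession}.xml"


print(jpost_proxi_url("JPST002311"))
print(iprox_px_xml_url("IPX0017413000"))
```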
download-all-public-raw-files retrieves the files stored under the dataset's
raw/ collection. These direct downloads support resume (REST for FTP,
byte-Range for HTTP), per-file retries, parallel workers (-w up to 3), and
post-transfer size verification against the server-reported size. By default
files are written flat into the output folder; pass --preserve-structure to
keep the dataset's sub-directory layout.
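
Resume and size verification for HTTP transfers can be sketched with nothing more than file sizes and a `Range` header. The functions below are hypothetical and simplified: they decide whether to skip, resume, or restart a download based on the local size versus the server-reported size, and do not perform any transfer themselves.

```python
import os
import tempfile
from typing import Dict, Tuple


def plan_transfer(local_path: str, remote_size: int) -> Tuple[str, Dict[str, str]]:
    """Decide how to fetch a file given what is already on disk."""
    local_size = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    if remote_size > 0 and local_size == remote_size:
        return "skip", {}  # already complete
    if 0 < local_size < remote_size:
        # Ask the server for the missing tail only (standard HTTP byte range).
        return "resume", {"Range": f"bytes={local_size}-"}
    return "download", {}  # nothing local, or local file is larger than expected


def size_matches(local_path: str, remote_size: int) -> bool:
    """Post-transfer check against the server-reported size."""
    return os.path.exists(local_path) and os.path.getsize(local_path) == remote_size


with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "run.raw")
    with open(partial, "wb") as handle:
        handle.write(b"x" * 100)  # simulate an interrupted 250-byte download
    plan = plan_transfer(partial, 250)
    complete = size_matches(partial, 250)

print(plan, complete)
```

For FTP the same idea applies, with the `REST` command taking the place of the `Range` header.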
You can also request a specific collection from these repositories through the same category interface:

```bash
pridepy download-all-public-category-files \
    -a MSV000082297 \
    -o ./downloads/MSV000082297-results \
    -c RESULT
```

Breaking change (0.0.16): the legacy `pridepy.files.files.Files` class has been removed. Replace `from pridepy.files.files import Files` with `from pridepy.download.client import Client`; `Client` exposes the same public methods (`get_all_raw_file_list`, `download_all_raw_files`, `get_submitted_file_path_prefix`, `download_file_by_name`, `download_all_category_files`, `download_px_raw_files`, …).

```python
from pridepy.download.client import Client

client = Client()
raw_files = client.get_all_raw_file_list("PXD008644")
print(f"RAW files: {len(raw_files)}")
print(raw_files[0]["fileName"])
```

For MassIVE / JPOST / iProX accessions, the same method returns the files found under the dataset's `raw/` collection:

```python
from pridepy.download.client import Client

client = Client()
for accession in ("MSV000082297", "JPST002311", "IPX0017413000"):
    raw_files = client.get_all_raw_file_list(accession)
    print(f"{accession} raw files: {len(raw_files)}")
```

Download all RAW files of a dataset:

```python
from pridepy.download.client import Client

client = Client()
client.download_all_raw_files(
    accession="PXD008644",
    output_folder="./downloads/PXD008644",
    skip_if_downloaded_already=True,
    protocol="ftp",
    aspera_maximum_bandwidth="100M",
    checksum_check=True,
)
```

Search projects by keyword:

```python
from pridepy.project.project import Project

project = Project()
results = project.search_by_keywords_and_filters(
    keyword="PXD009476",
    query_filter="",
    page_size=25,
    page=0,
    sort_direction="DESC",
    sort_fields="accession",
)
print(f"Hits: {len(results)}")
```