Improve MassIVE download. by ypriverol · Pull Request #99 · PRIDE-Archive/pridepy

ypriverol · 2026-05-27T09:40:38Z

This pull request adds support for downloading public MassIVE datasets directly using their MSV... accessions, alongside existing PRIDE dataset functionality. The changes include both user-facing command-line improvements and substantial updates to the core file handling logic to enable MassIVE dataset discovery, file listing, and downloading via anonymous FTP. Documentation is also updated to reflect the new MassIVE capabilities.

MassIVE dataset support:

Added logic to detect MassIVE accessions, enumerate public files from the MassIVE FTP server, and map MassIVE collections to standard file categories. New helper methods and a recursive FTP directory walker were introduced in Files to enable this functionality. [1] [2]
Updated all major download and file listing methods (e.g., get_all_raw_file_list, download_all_raw_files, download_file_by_name, download_files_by_list, download_all_category_files, get_all_category_file_list) to support MassIVE accessions and route them through the new MassIVE-specific logic. [1] [2] [3] [4] [5] [6] [7] [8] [9]

Command-line interface and documentation:

Updated CLI commands and help text to clarify support for both PRIDE and MassIVE accessions, and added new usage examples for MassIVE datasets in the README.md. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]

File download and FTP handling improvements:

Improved FTP file path handling for both PRIDE and MassIVE downloads, including more robust extraction of file paths from URLs and recursive directory traversal for MassIVE datasets.
Added a category mapping for MassIVE collections to align with PRIDE file categories, ensuring consistent filtering and download behavior.

General code improvements:

Minor refactoring and docstring updates to clarify that functions now support both PRIDE and MassIVE accessions. [1] [2] [3] [4] [5]

These changes collectively allow users to seamlessly download public datasets from MassIVE in addition to PRIDE, using the same CLI commands and Python API.

Resolves conflict in pridepy/files/files.py: drop the now-obsolete GLOBUS_BASE_URL constant (renamed to PRIDE_ARCHIVE_HTTPS_URL_PREFIX on dev in 7958d9a), and keep the MASSIVE_ARCHIVE_FTP* constants alongside.

Previously download_files_by_list always called stream_all_files_by_project (PRIDE-only), which broke filename-based downloads for MassIVE accessions. Branch on is_massive_accession to list via _list_massive_public_files and download via _download_massive_file_records. Also clarify --parallel-files help text: it applies to all protocols, not only globus.

Add direct download support for MassIVE accessions

qodo-code-review · 2026-05-27T09:40:42Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-05-27T09:40:51Z

📝 Walkthrough

Walkthrough

pridepy extends file download and listing workflows to support MassIVE dataset accessions (MSV/rmsv format) via anonymous FTP enumeration. Core entry points now detect and branch on MassIVE accessions, reusing new FTP enumeration and file-record building helpers. FTP path extraction logic is also refactored to use URL parsing.

Changes

MassIVE Dataset Download Support

Layer / File(s)	Summary
MassIVE configuration and accession detection `pridepy/files/files.py`	`posixpath` is imported and MassIVE configuration constants are defined: FTP hostname, URL prefix, and a collection-to-category mapping for normalizing file records.
MassIVE FTP enumeration and file record building `pridepy/files/files.py`	Helper methods detect MassIVE accessions, build FTP root/URL paths, map collections to PRIDE-like categories, construct file records from FTP paths (handling raw/peak/wiff.scan special cases), recursively walk FTP trees, and list/download all public files.
Entry point branching and FTP integration `pridepy/files/files.py`	Raw file, category, and single-file listing/download methods branch on MassIVE accessions to use FTP enumeration instead of PRIDE API paths. FTP path extraction refactored to use `urlparse` with unquoting instead of string replacement.
Unit tests for MassIVE operations `pridepy/tests/test_massive_files.py`	New test class verifies accession detection, file record construction with category mapping and raw collection handling, raw file filtering, and download-by-name integration.
Documentation and CLI help updates `README.md`, `pridepy/pridepy.py`	README adds MassIVE download capabilities to the feature list and Quick Start examples. CLI help text across multiple commands updated to refer to "PRIDE or MassIVE dataset" accessions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

PRIDE-Archive/pridepy#67: Modifies FTP download plumbing (URL parsing and path handling) for downloading from different dataset sources, overlapping with the urlparse refactoring in this PR.
PRIDE-Archive/pridepy#93: Updates download_all_raw_files and download_files_by_list entry points with parallelism and resilience; this PR adds MassIVE branching to the same methods.
PRIDE-Archive/pridepy#61: Introduces download-all-public-raw-files and download-all-public-category-files CLI commands; this PR extends the underlying file logic to operate on MassIVE datasets.

Suggested labels

enhancement, feature, tests

Suggested reviewers

selvaebi
chakrabandla

Poem

🐰 A rabbit hops through data trees,
FTP nodes dance in the breeze,
MassIVE files now find their way,
Alongside PRIDE, hip-hip-hooray!
Tests ensure each path rings true. 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'Improve MassIVE download' is vague and generic, using non-descriptive language that doesn't convey meaningful information about the specific changes made.	Use a more specific title that describes the actual enhancement, such as 'Add MassIVE dataset download support' or 'Support downloading files directly from MassIVE FTP'.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

README.md (1)

129-129: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the Quick Start section numbering.

This should be ### 7) instead of ### 6) to maintain sequential numbering after the new MassIVE examples were inserted above.

📝 Proposed fix

-### 6) Download a named subset of files (manifest)
+### 7) Download a named subset of files (manifest)

Also update line 151:

-### 7) Download files from raw URLs
+### 8) Download files from raw URLs

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 129, The Quick Start heading "### 6) Download a named
subset of files (manifest)" is misnumbered after inserting MassIVE examples;
update the heading text in README.md to "### 7) Download a named subset of files
(manifest)" and also adjust the subsequent Quick Start numbering (the later
heading referenced in the comment) so all section numbers remain sequential;
locate and update the literal headings (e.g., the "### 6) Download a named
subset of files (manifest)" token and the later Quick Start heading token) to
the corrected numeric values.

🧹 Nitpick comments (1)

pridepy/files/files.py (1)

300-301: 💤 Low value

Add debug logging when falling back from mlsd to LIST.

Silent exception handling here hides why the fallback path is taken. Adding a debug log would help troubleshoot FTP server compatibility issues.

💡 Proposed improvement

         except (AttributeError, ftplib.error_perm):
-            pass
+            logging.debug(f"mlsd not supported for {remote_dir}, falling back to LIST")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/files/files.py` around lines 300 - 301, The except block catching
(AttributeError, ftplib.error_perm) that falls back from using MLSD to LIST
should log a debug message instead of silently passing; update the exception
handler where MLSD -> LIST fallback occurs to call the module/logger debug
method (e.g., logger.debug or processLogger.debug) and include context like
"falling back from MLSD to LIST" plus the exception details and the FTP server
address or path if available; keep behavior the same otherwise so the fallback
continues after logging.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@README.md`:
- Line 129: The Quick Start heading "### 6) Download a named subset of files
(manifest)" is misnumbered after inserting MassIVE examples; update the heading
text in README.md to "### 7) Download a named subset of files (manifest)" and
also adjust the subsequent Quick Start numbering (the later heading referenced
in the comment) so all section numbers remain sequential; locate and update the
literal headings (e.g., the "### 6) Download a named subset of files (manifest)"
token and the later Quick Start heading token) to the corrected numeric values.

---

Nitpick comments:
In `@pridepy/files/files.py`:
- Around line 300-301: The except block catching (AttributeError,
ftplib.error_perm) that falls back from using MLSD to LIST should log a debug
message instead of silently passing; update the exception handler where MLSD ->
LIST fallback occurs to call the module/logger debug method (e.g., logger.debug
or processLogger.debug) and include context like "falling back from MLSD to
LIST" plus the exception details and the FTP server address or path if
available; keep behavior the same otherwise so the fallback continues after
logging.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 90f23e9a-b3a8-4362-8f79-67667310ea1e

📥 Commits

Reviewing files that changed from the base of the PR and between 6c554b5 and 79bace3.

📒 Files selected for processing (4)

README.md
pridepy/files/files.py
pridepy/pridepy.py
pridepy/tests/test_massive_files.py

ypriverol added 4 commits May 27, 2026 09:57

Add direct downloads for MassIVE accessions

94d7563

Merge branch 'dev' into feat/massive-direct-downloads

762144c

Resolves conflict in pridepy/files/files.py: drop the now-obsolete GLOBUS_BASE_URL constant (renamed to PRIDE_ARCHIVE_HTTPS_URL_PREFIX on dev in 7958d9a), and keep the MASSIVE_ARCHIVE_FTP* constants alongside.

Merge pull request #98 from PRIDE-Archive/feat/massive-direct-downloads

79bace3

Add direct download support for MassIVE accessions

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

ypriverol merged commit c09cca5 into master May 27, 2026
10 checks passed

coderabbitai Bot mentioned this pull request May 28, 2026

Major development to include MassIVE; iPROX and PRoteomeXchange #106

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve MassIVE download. #99

Improve MassIVE download. #99
ypriverol merged 4 commits into
masterfrom
dev

ypriverol commented May 27, 2026 •

edited

Loading

Uh oh!

qodo-code-review Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ypriverol commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

qodo-code-review Bot commented May 27, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ypriverol commented May 27, 2026 •

edited

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading