Skip to content

refactor(providers): split files.py into per-provider classes (0.0.17)#101

Closed
ypriverol wants to merge 11 commits into
feat/jpost-iprox-direct-downloadsfrom
refactor/files-py-providers
Closed

refactor(providers): split files.py into per-provider classes (0.0.17)#101
ypriverol wants to merge 11 commits into
feat/jpost-iprox-direct-downloadsfrom
refactor/files-py-providers

Conversation

@ypriverol
Copy link
Copy Markdown
Contributor

Summary

Pure structural refactor of pridepy/files/files.py (was 2514 LOC, 60+ methods on one class) into a per-provider package. No behaviour change. No new features. No dependency changes. Ships as 0.0.17.

Base branch note: this PR targets feat/jpost-iprox-direct-downloads (PR #100) so the diff shows only the refactor commits. After PR #100 merges to dev, update this PR's base to dev (gh pr edit <N> --base dev).

Spec: docs/specs/2026-05-27-files-py-provider-refactor-design.md (NOTE: spec/plan docs are on a separate docs/files-py-refactor-design branch; merge that first as a docs PR, or include here via cherry-pick)
Plan: same docs branch, docs/plans/2026-05-27-files-py-provider-refactor.md

Architecture

pridepy/
├── providers/                       (NEW package, 2030 LOC total)
│   ├── __init__.py                  (7 LOC)
│   ├── base.py                      (107 LOC)   Provider ABC + BaseDirectDownloadProvider
│   ├── registry.py                  (38 LOC)    accession -> Provider resolution
│   ├── transport.py                 (504 LOC)   FTP/FTPS/HTTPS transport with resume + parallel
│   ├── util.py                      (183 LOC)   checksum/record helpers + Progress
│   ├── pride.py                     (815 LOC)   PrideProvider — multi-protocol orchestrator
│   ├── massive.py                   (97 LOC)    MassiveProvider — FTPS tree walk
│   ├── jpost.py                     (150 LOC)   JpostProvider — PROXI JSON + FTP fallback
│   └── iprox.py                     (129 LOC)   IproxProvider — PX XML over HTTPS
├── files/
│   └── files.py                     (1254 LOC, down from 2514 — 50% reduction)
└── pridepy.py                       (unchanged)

Moved-from / moved-to

Old location New location
Files.download_ftp_urls + 10 FTP helpers providers/transport.py
Files.download_http_urls + helpers providers/transport.py
Files.compute_md5, validate_download, Progress, _get_download_url, _resolve_local_path, etc. providers/util.py
Files.is_massive_accession, list_massive, build_massive, get_massive* providers/massive.py (MassiveProvider)
Files.is_jpost_accession, list_jpost*, PROXI listing providers/jpost.py (JpostProvider)
Files.is_iprox_accession, list_iprox*, PX XML listing providers/iprox.py (IproxProvider)
Files.stream_all_*, download_files, _batch_download_by_protocol, _download_with_fallback, Globus/S3/Aspera, private-dataset path providers/pride.py (PrideProvider)
Files.download_files_by_url, download_files_by_list, download_px_raw_files stay on Files facade (cross-cutting)

Provider interface

class Provider(ABC):
    name: ClassVar[str]
    @staticmethod
    @abstractmethod
    def matches(accession: str) -> bool: ...
    @abstractmethod
    def list_files(self, accession: str) -> List[Dict]: ...
    @abstractmethod
    def download_files(self, accession, records, output_folder, ...): ...

class BaseDirectDownloadProvider(Provider):
    """Shared download_files for MassIVE/JPOST/iProX — partitions records by URL scheme."""
    use_tls: ClassVar[bool] = False

Backward compatibility

Files class is preserved as a facade. Every attribute the existing test suite patches (Files.download_ftp_urls, Files._batch_download_by_protocol, Files._download_with_fallback, Files._globus_download_one, Files._download_single_url, Files._http_download_url, Files._list_massive_public_files, Files._list_jpost_public_files, Files._list_iprox_public_files, Files._repo_uses_tls) is kept as a shim that delegates to the new location. PrideProvider calls back into Files.X (lazy import) for patch-sensitive helpers so patch.object(Files, ...) still intercepts. The CLI (pridepy.py) is unchanged.

Tests

68 passed, 4 skipped — baseline was 66 passed, 4 skipped. Two new tests added:

  • test_base_direct_download_provider_partitions_urls_by_scheme — verifies BaseDirectDownloadProvider partitions ftp:// vs http(s):// records correctly and forwards use_tls from the provider.
  • test_facade_dispatches_pride_through_registry_to_fallback — verifies Files().download_all_raw_files("PXD...") flows through Registry.resolvePrideProvider.download_files_batch_download_by_protocol (patched on Files, the patch still intercepts).

Test changes (intentional, 38 lines across 3 files): Three tests had their patch.object targets updated from Files-level shims to provider classes (MassiveProvider.list_files, JpostProvider.list_files, PrideProvider.list_files, PrideProvider.download_files). This was unavoidable: once Files.download_all_raw_files calls registry.resolve(...).list_files(...) directly, the old patches on Files._list_massive_public_files etc. became orphaned. The test ASSERTIONS are unchanged — only patch targets moved to reflect the new dispatch path.

Verification

  • Live smoke test against MassIVE FTPS: pridepy download-file-by-name -a MSV000080175 -f params.xml produces a 10315-byte file with MD5 43d87368d705c3f380c1d030b14850c4 — exact match for the pre-refactor baseline.
  • pridepy --version reports 0.0.17.
  • All 8 test_download_resilience.py tests pass (the PRIDE multi-protocol fallback regression suite).
  • All 9 test_massive_files.py tests pass.
  • All 8 test_jpost_files.py tests pass.
  • All 6 test_iprox_files.py tests pass.

Acceptance criteria

Criterion (from spec §Acceptance criteria) Status
#1 files.py under ~400 LOC PARTIAL (1254 LOC — see note)
#2 providers/ has base.py, registry.py, transport.py, util.py, pride.py, massive.py, jpost.py, iprox.py DONE
#3 canonical accession-regex defs in providers/ DONE
#4 pytest fully green DONE (68 passed, 4 skipped)
#5 PRIDE multi-protocol fallback integration test DONE
#6 pyproject.toml = 0.0.17; pridepy --version = 0.0.17 DONE
#7 smoke test MD5 matches DONE
#8 PR description states "no behaviour change" + moved-from/moved-to mapping DONE (this PR description)

Note on criterion #1: files.py landed at 1254 LOC, not <400. The remaining bulk is necessary backward-compat: ~30 shim methods + the cross-cutting commands (download_files_by_url, download_files_by_list, download_px_raw_files) + the public/private split inside download_file_by_name (which couldn't move to PrideProvider without breaking test patches via Util.get_api_call). A follow-up PR can migrate tests to patch providers directly, then collapse the shims — but that's outside this PR's scope.

Test plan

ypriverol added 11 commits May 27, 2026 15:36
…Registry

Empty scaffolding for the per-provider refactor (spec:
docs/specs/2026-05-27-files-py-provider-refactor-design.md). Introduces
the Provider abstract base class, BaseDirectDownloadProvider with a
shared download_files() that partitions URLs by scheme and routes
through Files.download_ftp_urls / Files.download_http_urls (preserving
test patches), and a Registry for accession -> provider resolution.

No behaviour change. No providers registered yet.
…t.py

Pulled download_ftp_urls, download_http_urls, and their helpers
(_open_ftp_connection, _walk_ftp_tree, _list_ftp_repo_files,
_download_one_ftp_path, _download_ftp_paths_serial/_parallel,
_http_download_one, _parallel_download, _local_path_for_url) out of the
Files class verbatim into providers/transport.py.

Files keeps a shim staticmethod for each function that does a lazy
import and delegates, so existing test patches like
patch.object(Files, 'download_ftp_urls') keep intercepting calls.

No behaviour change. Test suite green.
Moved Progress, _find_tsv_columns, _is_md5_checksum, read_checksum_file,
compute_md5, validate_download, _remove_if_exists, _get_download_url,
_resolve_local_path from Files into providers/util.py. Files keeps shim
re-exports so existing references keep working.

No behaviour change. Test suite green.
Moved MassIVE listing + record building from Files into
providers/massive.py as MassiveProvider(BaseDirectDownloadProvider).
Provider is registered with the Registry. Files keeps shim methods
(is_massive_accession, _list_massive_public_files,
_build_massive_file_record, _get_massive_public_root,
_get_massive_public_ftp_url, _map_massive_collection_to_category) that
delegate to the provider. MASSIVE_CATEGORY_MAP / MASSIVE_ARCHIVE_FTP
constants remain on Files as class-attribute re-exports.

All 10 MassIVE tests pass without modification. Full suite green.
Moved JPOST listing (PROXI JSON primary + FTP tree walk fallback) and
record building from Files into providers/jpost.py as
JpostProvider(BaseDirectDownloadProvider). Provider registered with the
Registry. Files keeps shim methods (is_jpost_accession,
_list_jpost_public_files, _list_jpost_public_files_via_proxi,
_build_jpost_file_record, _get_jpost_public_root,
_get_jpost_public_ftp_url) that delegate to the provider.
JPOST_ARCHIVE_FTP / JPOST_PROXI_BASE_URL / JPOST_PROXI_CATEGORY_MAP
constants remain on Files as class-attribute re-exports.

All JPOST tests pass without modification. Full suite green.
Moved iProX listing (PX XML from download.iprox.org) and record building
from Files into providers/iprox.py as IproxProvider(BaseDirectDownloadProvider).
Provider registered with the Registry. Files keeps shim methods
(is_iprox_accession, _list_iprox_public_files, _build_iprox_file_record,
_get_iprox_public_root, _get_iprox_public_ftp_url) that delegate to the
provider. IPROX_DOWNLOAD_BASE_URL / IPROX_PX_XML_URL_TEMPLATE /
IPROX_PX_CATEGORY_MAP constants remain on Files as class-attribute
re-exports.

_download_direct_download_records on Files now dispatches via the registry
instead of branching manually. _list_direct_download_files keeps shim
dispatching so existing test patches on _list_massive_public_files and
_list_jpost_public_files continue to intercept the calls.

All iProX tests pass without modification. Full suite green.
…oning

Mixed ftp:// + http(s):// records partition correctly: ftp URLs go to
Files.download_ftp_urls with use_tls=True (the MassIVE setting), http
URLs go to Files.download_http_urls. Both calls intercepted by
patch.object(Files, ...) — proving the provider routes back through
Files rather than calling transport directly (preserves test patches).
Moved PRIDE-specific logic (V3 API listing, multi-protocol batch
downloader with aspera/s3/ftp/globus fallback, private-dataset path,
submitter helpers, Globus/S3 per-protocol downloaders, legacy
single-connection FTP) from Files into providers/pride.py as
PrideProvider(Provider). ~510 LOC moved out of files.py.

Files keeps shim methods for every patched helper
(_batch_download_by_protocol, _download_with_fallback,
_globus_download_one, download_files_from_globus, download_files_from_s3,
download_files_from_ftp, download_private_file_name,
get_ascp_binary, save_checksum_file, stream_all_files_by_project,
stream_all_files_metadata, get_submitted_file_path_prefix,
_protocol_sequence, download_files).

PRIDE class constants (V3_API_BASE_URL, API_BASE_URL, API_PRIVATE_URL,
PRIDE_ARCHIVE_FTP, *_URL_PREFIX, S3_URL, S3_BUCKET, PROTOCOL_ORDER)
remain on Files as re-exports.

PrideProvider's internal calls to patch-sensitive helpers go through
Files.X (lazy import) so existing test patches in
test_download_resilience.py keep working without modification.

is_direct_download_accession updated to exclude PRIDE (returns True
only for MSV/JPST/IPX) now that PrideProvider is registered too.

All 8 patch-sensitive tests in test_download_resilience.py pass.
Full suite green at 67 passed, 4 skipped.
Files public methods (get_all_raw_file_list, download_all_raw_files,
download_all_category_files, get_all_category_file_list,
download_file_by_name, get_file_from_api, download_files_by_list) now
dispatch via registry.resolve(accession).{list,download}_files(...).

Removed dead helpers: _list_direct_download_files,
_download_direct_download_records. Kept _repo_uses_tls as a registry
shim. Fixed _download_massive_file_records to use registry.

PrideProvider.download_files refactored: the old static method is now
_download_files_batch; a proper Provider-interface instance method
download_files(self, accession, records, ...) wraps it so PRIDE routes
uniformly through the registry like other providers.

Tests updated: patches on Files._list_massive_public_files,
Files._list_jpost_public_files, files_obj.stream_all_files_by_project,
and files_obj.download_files now target the provider classes
(MassiveProvider, JpostProvider, PrideProvider) directly.

Full suite green. files.py size: 1254 LOC (was 1352 before Task 9).
Verifies that Files().download_all_raw_files for a PXD accession flows
through Registry.resolve -> PrideProvider.download_files ->
_batch_download_by_protocol (patched via Files), and that
_download_with_fallback is only called when batch returns failed files.

Also mocks validate_download to return success so the happy-path test
correctly asserts that fallback is not invoked when all files pass
validation after the primary-protocol batch run.

Spec acceptance criterion #5.
@qodo-code-review
Copy link
Copy Markdown
Contributor

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 65ee835e-cb30-4e7a-b9a2-b0010e16224f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/files-py-providers

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ypriverol
Copy link
Copy Markdown
Contributor Author

Superseded by #103, which consolidates PR #100, #101, and #102 into a single branch (feat/files-py-modular-architecture) targeting dev. All 11 commits from this PR are preserved verbatim in #103. Please review and merge there instead.

@ypriverol ypriverol closed this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant