refactor(providers): split files.py into per-provider classes (0.0.17)#101
refactor(providers): split files.py into per-provider classes (0.0.17)#101ypriverol wants to merge 11 commits into
Conversation
…Registry Empty scaffolding for the per-provider refactor (spec: docs/specs/2026-05-27-files-py-provider-refactor-design.md). Introduces the Provider abstract base class, BaseDirectDownloadProvider with a shared download_files() that partitions URLs by scheme and routes through Files.download_ftp_urls / Files.download_http_urls (preserving test patches), and a Registry for accession -> provider resolution. No behaviour change. No providers registered yet.
…t.py Pulled download_ftp_urls, download_http_urls, and their helpers (_open_ftp_connection, _walk_ftp_tree, _list_ftp_repo_files, _download_one_ftp_path, _download_ftp_paths_serial/_parallel, _http_download_one, _parallel_download, _local_path_for_url) out of the Files class verbatim into providers/transport.py. Files keeps a shim staticmethod for each function that does a lazy import and delegates, so existing test patches like patch.object(Files, 'download_ftp_urls') keep intercepting calls. No behaviour change. Test suite green.
Moved Progress, _find_tsv_columns, _is_md5_checksum, read_checksum_file, compute_md5, validate_download, _remove_if_exists, _get_download_url, _resolve_local_path from Files into providers/util.py. Files keeps shim re-exports so existing references keep working. No behaviour change. Test suite green.
Moved MassIVE listing + record building from Files into providers/massive.py as MassiveProvider(BaseDirectDownloadProvider). Provider is registered with the Registry. Files keeps shim methods (is_massive_accession, _list_massive_public_files, _build_massive_file_record, _get_massive_public_root, _get_massive_public_ftp_url, _map_massive_collection_to_category) that delegate to the provider. MASSIVE_CATEGORY_MAP / MASSIVE_ARCHIVE_FTP constants remain on Files as class-attribute re-exports. All 10 MassIVE tests pass without modification. Full suite green.
Moved JPOST listing (PROXI JSON primary + FTP tree walk fallback) and record building from Files into providers/jpost.py as JpostProvider(BaseDirectDownloadProvider). Provider registered with the Registry. Files keeps shim methods (is_jpost_accession, _list_jpost_public_files, _list_jpost_public_files_via_proxi, _build_jpost_file_record, _get_jpost_public_root, _get_jpost_public_ftp_url) that delegate to the provider. JPOST_ARCHIVE_FTP / JPOST_PROXI_BASE_URL / JPOST_PROXI_CATEGORY_MAP constants remain on Files as class-attribute re-exports. All JPOST tests pass without modification. Full suite green.
Moved iProX listing (PX XML from download.iprox.org) and record building from Files into providers/iprox.py as IproxProvider(BaseDirectDownloadProvider). Provider registered with the Registry. Files keeps shim methods (is_iprox_accession, _list_iprox_public_files, _build_iprox_file_record, _get_iprox_public_root, _get_iprox_public_ftp_url) that delegate to the provider. IPROX_DOWNLOAD_BASE_URL / IPROX_PX_XML_URL_TEMPLATE / IPROX_PX_CATEGORY_MAP constants remain on Files as class-attribute re-exports. _download_direct_download_records on Files now dispatches via the registry instead of branching manually. _list_direct_download_files keeps shim dispatching so existing test patches on _list_massive_public_files and _list_jpost_public_files continue to intercept the calls. All iProX tests pass without modification. Full suite green.
…oning Mixed ftp:// + http(s):// records partition correctly: ftp URLs go to Files.download_ftp_urls with use_tls=True (the MassIVE setting), http URLs go to Files.download_http_urls. Both calls intercepted by patch.object(Files, ...) — proving the provider routes back through Files rather than calling transport directly (preserves test patches).
Moved PRIDE-specific logic (V3 API listing, multi-protocol batch downloader with aspera/s3/ftp/globus fallback, private-dataset path, submitter helpers, Globus/S3 per-protocol downloaders, legacy single-connection FTP) from Files into providers/pride.py as PrideProvider(Provider). ~510 LOC moved out of files.py. Files keeps shim methods for every patched helper (_batch_download_by_protocol, _download_with_fallback, _globus_download_one, download_files_from_globus, download_files_from_s3, download_files_from_ftp, download_private_file_name, get_ascp_binary, save_checksum_file, stream_all_files_by_project, stream_all_files_metadata, get_submitted_file_path_prefix, _protocol_sequence, download_files). PRIDE class constants (V3_API_BASE_URL, API_BASE_URL, API_PRIVATE_URL, PRIDE_ARCHIVE_FTP, *_URL_PREFIX, S3_URL, S3_BUCKET, PROTOCOL_ORDER) remain on Files as re-exports. PrideProvider's internal calls to patch-sensitive helpers go through Files.X (lazy import) so existing test patches in test_download_resilience.py keep working without modification. is_direct_download_accession updated to exclude PRIDE (returns True only for MSV/JPST/IPX) now that PrideProvider is registered too. All 8 patch-sensitive tests in test_download_resilience.py pass. Full suite green at 67 passed, 4 skipped.
Files public methods (get_all_raw_file_list, download_all_raw_files,
download_all_category_files, get_all_category_file_list,
download_file_by_name, get_file_from_api, download_files_by_list) now
dispatch via registry.resolve(accession).{list,download}_files(...).
Removed dead helpers: _list_direct_download_files,
_download_direct_download_records. Kept _repo_uses_tls as a registry
shim. Fixed _download_massive_file_records to use registry.
PrideProvider.download_files refactored: the old static method is now
_download_files_batch; a proper Provider-interface instance method
download_files(self, accession, records, ...) wraps it so PRIDE routes
uniformly through the registry like other providers.
Tests updated: patches on Files._list_massive_public_files,
Files._list_jpost_public_files, files_obj.stream_all_files_by_project,
and files_obj.download_files now target the provider classes
(MassiveProvider, JpostProvider, PrideProvider) directly.
Full suite green. files.py size: 1254 LOC (was 1352 before Task 9).
Verifies that Files().download_all_raw_files for a PXD accession flows through Registry.resolve -> PrideProvider.download_files -> _batch_download_by_protocol (patched via Files), and that _download_with_fallback is only called when batch returns failed files. Also mocks validate_download to return success so the happy-path test correctly asserts that fallback is not invoked when all files pass validation after the primary-protocol batch run. Spec acceptance criterion #5.
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
Pure structural refactor of
pridepy/files/files.py(was 2514 LOC, 60+ methods on one class) into a per-provider package. No behaviour change. No new features. No dependency changes. Ships as0.0.17.Base branch note: this PR targets
feat/jpost-iprox-direct-downloads(PR #100) so the diff shows only the refactor commits. After PR #100 merges todev, update this PR's base todev(gh pr edit <N> --base dev).Spec: docs/specs/2026-05-27-files-py-provider-refactor-design.md (NOTE: spec/plan docs are on a separate
docs/files-py-refactor-designbranch; merge that first as a docs PR, or include here via cherry-pick)Plan: same docs branch,
docs/plans/2026-05-27-files-py-provider-refactor.mdArchitecture
Moved-from / moved-to
Provider interface
Backward compatibility
Filesclass is preserved as a facade. Every attribute the existing test suite patches (Files.download_ftp_urls,Files._batch_download_by_protocol,Files._download_with_fallback,Files._globus_download_one,Files._download_single_url,Files._http_download_url,Files._list_massive_public_files,Files._list_jpost_public_files,Files._list_iprox_public_files,Files._repo_uses_tls) is kept as a shim that delegates to the new location. PrideProvider calls back intoFiles.X(lazy import) for patch-sensitive helpers sopatch.object(Files, ...)still intercepts. The CLI (pridepy.py) is unchanged.Tests
68 passed, 4 skipped — baseline was 66 passed, 4 skipped. Two new tests added:
test_base_direct_download_provider_partitions_urls_by_scheme— verifies BaseDirectDownloadProvider partitions ftp:// vs http(s):// records correctly and forwardsuse_tlsfrom the provider.test_facade_dispatches_pride_through_registry_to_fallback— verifiesFiles().download_all_raw_files("PXD...")flows throughRegistry.resolve→PrideProvider.download_files→_batch_download_by_protocol(patched on Files, the patch still intercepts).Test changes (intentional, 38 lines across 3 files): Three tests had their
patch.objecttargets updated from Files-level shims to provider classes (MassiveProvider.list_files,JpostProvider.list_files,PrideProvider.list_files,PrideProvider.download_files). This was unavoidable: onceFiles.download_all_raw_filescallsregistry.resolve(...).list_files(...)directly, the old patches onFiles._list_massive_public_filesetc. became orphaned. The test ASSERTIONS are unchanged — only patch targets moved to reflect the new dispatch path.Verification
pridepy download-file-by-name -a MSV000080175 -f params.xmlproduces a 10315-byte file with MD543d87368d705c3f380c1d030b14850c4— exact match for the pre-refactor baseline.pridepy --versionreports0.0.17.Acceptance criteria
Note on criterion #1: files.py landed at 1254 LOC, not <400. The remaining bulk is necessary backward-compat: ~30 shim methods + the cross-cutting commands (download_files_by_url, download_files_by_list, download_px_raw_files) + the public/private split inside download_file_by_name (which couldn't move to PrideProvider without breaking test patches via Util.get_api_call). A follow-up PR can migrate tests to patch providers directly, then collapse the shims — but that's outside this PR's scope.
Test plan
pridepy download-file-by-name -a MSV000080175 -f params.xml -o /tmp/xproduces MD543d87368d705c3f380c1d030b14850c4pridepy --versionreports0.0.17feat/jpost-iprox-direct-downloadstodev