feat(hf): support writing and reading from both http and xet#7185
feat(hf): support writing and reading from both http and xet#7185kszucs wants to merge 34 commits intoapache:mainfrom
Conversation
|
I'm still working on a couple of things, like implementing oio::Write instead of OneShotWrite which requires some additional changes in my xet-core PR. |
b403fa5 to
c8c3e8a
Compare
Xuanwo
left a comment
There was a problem hiding this comment.
Thank you very much for working on this! Great job, here is my first round of review.
core/services/hf/Cargo.toml
Outdated
| serde = { workspace = true, features = ["derive"] } | ||
| serde_json = { workspace = true } | ||
| # XET storage protocol support (optional) | ||
| async-trait = { version = "0.1", optional = true } |
| .with_context("service", HF_SCHEME)), | ||
| None => Ok(RepoType::Model), | ||
| }?; | ||
| let repo_type = self.config.repo_type; |
|
|
||
| /// Read a middle range of a known file. | ||
| #[tokio::test] | ||
| #[ignore = "requires network access"] |
There was a problem hiding this comment.
I suggest adding a tests.rs file and guiding users to run tests in an environment such as OPNEDAL_SERVICE_TEST_HF.
core/services/hf/src/core.rs
Outdated
| // When xet is disabled, preserve whatever HTTP client is already set | ||
| // on `info` (important for mock-based unit tests). | ||
| #[cfg(feature = "xet")] | ||
| let no_redirect_client = if xet_enabled { |
There was a problem hiding this comment.
Service shouldn't build their own http clients. But we can figure this out later.
There was a problem hiding this comment.
We need a client not following the redirects to retrieve the X-Xet-Hash header. If I'm not mistaken we need another http client for this, not sure how else could I handle it.
Since a new revamped Xet client is in the works (which will be publised and this PR should eventually depend on) we may be able to delegate this task to the Xet client's side. What do you think @hoytak?
5cd5a17 to
dc870a8
Compare
| } | ||
|
|
||
| /// Build the XET token API URL for this repository. | ||
| pub fn xet_token_url(&self, endpoint: &str, token_type: &str) -> String { |
There was a problem hiding this comment.
If this fits in use cases of opendal, I'd recommend to add a create_pr: bool parameter to this function, so that if
- someone uses a
HF_TOKENwith only read permission to a repo, and - he wants to submit a pull request to this repo with xet files,
he can obtain a Xet token with write permission by constructing the below xet token url:
format!("{endpoint}/api/{repo_type}/{repo_id}/xet-write-token?create_pr=1")
Replace the custom HfBucketSinkNode (1,050 lines of XET-specific code) with a standard ObjectStore implementation backed by OpenDAL's HF service. HF URLs now flow through the same FileSink path as S3/GCS/Azure, requiring only a thin build_hf() builder in a new hf.rs module. Key changes: - Add crates/polars-io/src/cloud/hf.rs: HF URL parsing, token extraction, and OpenDAL ObjectStore construction (~175 lines) - Wire CloudType::Hf in object_store_setup.rs to call build_hf(), matching the pattern used by build_aws/build_gcp/build_azure - Delete custom sink: hf_bucket/ directory (4 files, 721 lines), HfBucketSinkNode (260 lines), IR lowering special-case, PhysNodeKind::HfBucketSink variant - Rename feature flag hf_bucket_sink -> hf across 9 Cargo.toml files - Bump object_store_opendal compatibility from object_store 0.12 to 0.13 Dependencies: opendal + object_store_opendal (local path deps for now, will switch to published crate versions once apache/opendal#7185 ships). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an ObjectStore implementation for `hf://` URLs backed by OpenDAL's
HF service, enabling `sink_parquet("hf://buckets/org/name/file.parquet")`
to stream directly to Hugging Face Storage Buckets.
The implementation follows the same pattern as existing cloud backends
(S3/GCS/Azure): a `build_hf()` function in a new `hf.rs` module
constructs the ObjectStore, and `object_store_setup.rs` calls it from
the `CloudType::Hf` match arm. HF URLs flow through the standard
FileSink path with no custom sink node or IR special-casing.
New files:
- crates/polars-io/src/cloud/hf.rs — URL parsing, token resolution,
OpenDAL ObjectStore construction
Feature flag: `hf` (opt-in, propagated through the workspace)
Dependencies: opendal + object_store_opendal (local path deps for now,
will switch to published crate versions once apache/opendal#7185 ships)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rsion of xet-core
…t APIs Replace subxet dependency with direct xet-core crate references (cas_client, data, file_reconstruction, utils). Switch from XetClient to session-based FileUploadSession/FileDownloadSession with OnceCell lazy initialization.
…ntegration tests - Cache Arc<TranslatorConfig> per read/write direction; create a fresh FileUploadSession/FileDownloadSession per operation so that finalize() on one write doesn't poison the next - Factor repeated config construction into xet_config(token_type) - Mark all write/delete integration tests with #[serial] to prevent 412 "commit has happened since" races on the shared test repo - Add serial_test dev-dependency
Replace low-level xet-data/xet-client usage with the new high-level session API from xet_pkg (hf-xet crate, lib name `xet`): - Single hf-xet dependency replaces xet-client + xet-data - ONE cached XetSession per HfCore (write token, serves both reads and writes); removes XetTokenRefresher and TranslatorConfig handling - Writer: XetUploadCommit + XetStreamUpload replace FileUploadSession + SingleFileCleaner; no Mutex needed since oio::Write is &mut self - Reader: XetDownloadStream replaces DownloadStream; range converted from BytesRange to Option<Range<u64>> for new download_stream() API
Migrate to the latest xet-core API where CAS auth tokens move from session-level to per-operation (UploadCommitBuilder/DownloadStreamGroupBuilder). Use token refresh URLs so xet-core manages JWT lifecycle automatically.
Which issue does this PR close?
Work-in-progress, but generally:
I opened the PR for better visibility and early feedback.
Depends on huggingface/xet-core#642
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?
AI Usage Statement