Skip to content

feat(hf): support writing and reading from both http and xet#7185

Open
kszucs wants to merge 34 commits intoapache:mainfrom
kszucs:hf-revamp
Open

feat(hf): support writing and reading from both http and xet#7185
kszucs wants to merge 34 commits intoapache:mainfrom
kszucs:hf-revamp

Conversation

@kszucs
Copy link
Copy Markdown
Member

@kszucs kszucs commented Feb 9, 2026

Which issue does this PR close?

Work-in-progress, but generally:

  • add writing support
  • support reading using the xet protocol (opt-in)
  • support writing using the xet protocol (opt-in)

I opened the PR for better visibility and early feedback.

Depends on huggingface/xet-core#642

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

AI Usage Statement

@kszucs kszucs marked this pull request as ready for review February 10, 2026 09:12
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. releases-note/feat The PR implements a new feature or has a title that begins with "feat" labels Feb 10, 2026
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Feb 11, 2026

I'm still working on a couple of things, like implementing oio::Write instead of OneShotWrite which requires some additional changes in my xet-core PR.

Copy link
Copy Markdown
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for working on this! Great job, here is my first round of review.

serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }
# XET storage protocol support (optional)
async-trait = { version = "0.1", optional = true }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid this.

.with_context("service", HF_SCHEME)),
None => Ok(RepoType::Model),
}?;
let repo_type = self.config.repo_type;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this change.


/// Read a middle range of a known file.
#[tokio::test]
#[ignore = "requires network access"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding a tests.rs file and guiding users to run tests in an environment such as OPNEDAL_SERVICE_TEST_HF.

// When xet is disabled, preserve whatever HTTP client is already set
// on `info` (important for mock-based unit tests).
#[cfg(feature = "xet")]
let no_redirect_client = if xet_enabled {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service shouldn't build their own http clients. But we can figure this out later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a client not following the redirects to retrieve the X-Xet-Hash header. If I'm not mistaken we need another http client for this, not sure how else could I handle it.

Since a new revamped Xet client is in the works (which will be publised and this PR should eventually depend on) we may be able to delegate this task to the Xet client's side. What do you think @hoytak?

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Mar 31, 2026

@Xuanwo I keep updating the PR with the new xet changes, @seanses and @hoytak are working on the xet package. I don't have an exact estimate when can we publish it though.

}

/// Build the XET token API URL for this repository.
pub fn xet_token_url(&self, endpoint: &str, token_type: &str) -> String {
Copy link
Copy Markdown

@seanses seanses Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fits in use cases of opendal, I'd recommend to add a create_pr: bool parameter to this function, so that if

  1. someone uses a HF_TOKEN with only read permission to a repo, and
  2. he wants to submit a pull request to this repo with xet files,
    he can obtain a Xet token with write permission by constructing the below xet token url:
    format!("{endpoint}/api/{repo_type}/{repo_id}/xet-write-token?create_pr=1")

davanstrien added a commit to davanstrien/polars that referenced this pull request Apr 2, 2026
Replace the custom HfBucketSinkNode (1,050 lines of XET-specific code)
with a standard ObjectStore implementation backed by OpenDAL's HF service.

HF URLs now flow through the same FileSink path as S3/GCS/Azure,
requiring only a thin build_hf() builder in a new hf.rs module.

Key changes:
- Add crates/polars-io/src/cloud/hf.rs: HF URL parsing, token
  extraction, and OpenDAL ObjectStore construction (~175 lines)
- Wire CloudType::Hf in object_store_setup.rs to call build_hf(),
  matching the pattern used by build_aws/build_gcp/build_azure
- Delete custom sink: hf_bucket/ directory (4 files, 721 lines),
  HfBucketSinkNode (260 lines), IR lowering special-case,
  PhysNodeKind::HfBucketSink variant
- Rename feature flag hf_bucket_sink -> hf across 9 Cargo.toml files
- Bump object_store_opendal compatibility from object_store 0.12 to 0.13

Dependencies: opendal + object_store_opendal (local path deps for now,
will switch to published crate versions once apache/opendal#7185 ships).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
davanstrien added a commit to davanstrien/polars that referenced this pull request Apr 2, 2026
Add an ObjectStore implementation for `hf://` URLs backed by OpenDAL's
HF service, enabling `sink_parquet("hf://buckets/org/name/file.parquet")`
to stream directly to Hugging Face Storage Buckets.

The implementation follows the same pattern as existing cloud backends
(S3/GCS/Azure): a `build_hf()` function in a new `hf.rs` module
constructs the ObjectStore, and `object_store_setup.rs` calls it from
the `CloudType::Hf` match arm. HF URLs flow through the standard
FileSink path with no custom sink node or IR special-casing.

New files:
- crates/polars-io/src/cloud/hf.rs — URL parsing, token resolution,
  OpenDAL ObjectStore construction

Feature flag: `hf` (opt-in, propagated through the workspace)

Dependencies: opendal + object_store_opendal (local path deps for now,
will switch to published crate versions once apache/opendal#7185 ships)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kszucs added 29 commits April 4, 2026 17:53
…t APIs

Replace subxet dependency with direct xet-core crate references (cas_client,
data, file_reconstruction, utils). Switch from XetClient to session-based
FileUploadSession/FileDownloadSession with OnceCell lazy initialization.
…ntegration tests

- Cache Arc<TranslatorConfig> per read/write direction; create a fresh
  FileUploadSession/FileDownloadSession per operation so that finalize()
  on one write doesn't poison the next
- Factor repeated config construction into xet_config(token_type)
- Mark all write/delete integration tests with #[serial] to prevent
  412 "commit has happened since" races on the shared test repo
- Add serial_test dev-dependency
Replace low-level xet-data/xet-client usage with the new high-level
session API from xet_pkg (hf-xet crate, lib name `xet`):

- Single hf-xet dependency replaces xet-client + xet-data
- ONE cached XetSession per HfCore (write token, serves both reads
  and writes); removes XetTokenRefresher and TranslatorConfig handling
- Writer: XetUploadCommit + XetStreamUpload replace FileUploadSession
  + SingleFileCleaner; no Mutex needed since oio::Write is &mut self
- Reader: XetDownloadStream replaces DownloadStream; range converted
  from BytesRange to Option<Range<u64>> for new download_stream() API
Migrate to the latest xet-core API where CAS auth tokens move from
session-level to per-operation (UploadCommitBuilder/DownloadStreamGroupBuilder).
Use token refresh URLs so xet-core manages JWT lifecycle automatically.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

releases-note/feat The PR implements a new feature or has a title that begins with "feat" size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants