feat: support HDFS object store #5472

Open
hfutatzhanghb wants to merge 21 commits into lance-format:main from hfutatzhanghb:dev-hdfs-support

@hfutatzhanghb

@hfutatzhanghb hfutatzhanghb commented Dec 15, 2025

Contributor

What changed

  • add an optional hdfs feature for lance-io backed by OpenDAL
  • register hdfs:// with the object store registry
  • map HDFS URLs, storage options, and supported environment variables into OpenDAL configuration
  • add unit coverage for HDFS path and configuration handling
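
The mapping from storage options and environment variables into a configuration map can be sketched roughly as follows. This is a minimal illustration and not the PR's code: the precedence (storage option first, then environment variable), the `HDFS_USER` variable name, the `name_node` config key, and both helper names are assumptions made for the example. The environment is passed in as a map so the logic stays testable.

```rust
use std::collections::HashMap;

/// Return the first non-empty value for a setting, preferring the explicit
/// storage option over the environment variable (assumed precedence).
fn resolve_option(
    storage_options: &HashMap<String, String>,
    env: &HashMap<String, String>,
    option_key: &str,
    env_key: &str,
) -> Option<String> {
    storage_options
        .get(option_key)
        .filter(|v| !v.is_empty())
        .or_else(|| env.get(env_key).filter(|v| !v.is_empty()))
        .cloned()
}

/// Build a hypothetical backend config map for an `hdfs://<authority>/...` URL.
fn build_hdfs_config(
    url_authority: &str,
    storage_options: &HashMap<String, String>,
    env: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut config = HashMap::new();

    // An explicit NameNode setting overrides the authority taken from the URL.
    let name_node = resolve_option(storage_options, env, "hdfs_name_node", "HDFS_NAME_NODE")
        .unwrap_or_else(|| format!("hdfs://{url_authority}"));
    config.insert("name_node".to_string(), name_node);

    // The user is only set when one was actually provided.
    if let Some(user) = resolve_option(storage_options, env, "hdfs_user", "HDFS_USER") {
        config.insert("user".to_string(), user);
    }
    config
}
```

Treating empty strings as unset means a blank storage option falls through to the environment variable instead of producing an empty NameNode or user.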

Why

This allows Lance datasets to be accessed through HDFS URLs, including deployments using HDFS nameservices.

Validation

  • cargo fmt --all -- --check
  • cargo clippy -p lance-io --all-targets --features hdfs --locked -- -D warnings
  • cargo clippy --all --tests --benches -- -D warnings
  • cargo test -p lance-io --features hdfs --locked object_store::providers::hdfs::tests --no-fail-fast
  • cargo +1.91.0 check -p lance-io --features hdfs --locked

@github-actions github-actions Bot added enhancement New feature or request A-java Java bindings + JNI labels Dec 15, 2025
@hfutatzhanghb

Contributor Author

Hi @jackye1995 @wojiaodoubao @majin1102, could you please help review this PR when you have free time? Thanks a lot.

Comment thread rust/lance-io/Cargo.toml Outdated

[features]
-default = ["aws", "azure", "gcp"]
+default = ["aws", "azure", "gcp", "hdfs"]

Collaborator

I think we shouldn't enable HDFS by default. It introduces many new dependencies in Lance and also requires users to have a Java setup. Without it, Lance will fail to start.

}
} else {
// Fall back to system username
config_map.insert("user".to_string(), whoami::username());

Collaborator

I prefer having users fill this out instead of using whoami.
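
One way to read this suggestion is to require an explicit user and return an error when none is configured, instead of silently falling back to the system username. The sketch below assumes that interpretation. The error type and function name are hypothetical, and whether a missing user should be an error or simply left unset is not settled in this thread.

```rust
use std::collections::HashMap;

/// Hypothetical error for this sketch only.
#[derive(Debug, PartialEq)]
enum HdfsConfigError {
    MissingUser,
}

/// Resolve the HDFS user strictly from what the caller supplied.
/// No implicit fallback to the OS username is performed.
fn resolve_hdfs_user(
    storage_options: &HashMap<String, String>,
) -> Result<String, HdfsConfigError> {
    storage_options
        .get("hdfs_user")
        .filter(|v| !v.is_empty()) // an empty value counts as "not provided"
        .cloned()
        .ok_or(HdfsConfigError::MissingUser)
}
```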

Comment thread java/pom.xml Outdated
<copyTo>${project.build.directory}/classes/nativelib</copyTo>
<copyWithPlatformDir>true</copyWithPlatformDir>
<environmentVariables>
<CARGO_FEATURES>hdfs</CARGO_FEATURES>

Collaborator

I feel the same. It doesn’t seem like a good idea to enable hdfs by default.

Contributor Author

@Xuanwo Thanks for reviewing. Got it, I will push a new version based on your advice later.

@github-actions

Contributor

Thank you for your contribution. This PR has been inactive for a while, so we're closing it to free up bandwidth. Feel free to reopen it if you still find it useful.

@github-actions github-actions Bot closed this May 16, 2026
@BubbleCal BubbleCal reopened this May 28, 2026
@github-actions github-actions Bot removed the Stale label May 29, 2026
@github-actions github-actions Bot added the A-encoding Encoding, IO, file reader/writer label Jun 9, 2026
@hfutatzhanghb hfutatzhanghb changed the title from "feat: lance supports hdfs scheme" to "feat: support HDFS object storage" Jun 9, 2026
@github-actions github-actions Bot added A-deps Dependency updates A-ci CI / build workflows labels Jun 9, 2026
@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

@hfutatzhanghb

Contributor Author

@Xuanwo @BubbleCal @jiaoew1991 Hi, could you please review this PR when you are free? Thanks very much. We have used HDFS as our backend storage for a long time.

@github-actions github-actions Bot added A-docs Documentation A-python Python bindings labels Jun 10, 2026
@hfutatzhanghb

Contributor Author

@yanghua Hi, could you please help review this PR when you have free time? Thanks very much!

@yanghua

yanghua commented Jun 11, 2026

Collaborator

@claude review

@hfutatzhanghb

Contributor Author

@claude review
😂 No output? @claude

@yanghua

yanghua commented Jun 11, 2026

Collaborator

It seems the bot does not work. I could take a look at this PR later.

@hfutatzhanghb hfutatzhanghb requested a review from Xuanwo June 11, 2026 06:55
@hfutatzhanghb

Contributor Author

> It seems the bot does not work. I could take a look at this PR later.

Thanks!

@hfutatzhanghb hfutatzhanghb changed the title from "feat: support HDFS object storage" to "feat: support HDFS object store" Jun 13, 2026
@hfutatzhanghb

Contributor Author

> @hfutatzhanghb, fix the conflicts?

I have fixed the conflicts and pushed. Let's wait for the pipeline to pass. Thanks for reviewing!

@yanghua

yanghua commented Jun 18, 2026

Collaborator

Will take a look today.

@yanghua yanghua left a comment

Collaborator

+1 cc @Xuanwo

@Xuanwo Xuanwo left a comment

Collaborator

Thanks for adding HDFS support. I think this needs a few correctness fixes before merge.

The main blocker is that this PR registers hdfs:// as writable, but commit routing still falls through to UnsafeCommitHandler. That means concurrent HDFS writers can overwrite manifest versions without real create-if-absent / conflict detection semantics.

There are also two HDFS identity/listing issues:

  • HDFS is advertised as list_is_lexically_ordered=true, but the backend does not guarantee sorted listings before Lance uses listing order for manifest discovery.
  • store_prefix is derived from the URI authority even when hdfs_name_node / HDFS_NAME_NODE overrides the actual NameNode, so different effective clusters can be treated as the same datastore.

Please route HDFS to a safe commit handler or explicitly reject HDFS writes without one, default HDFS listings to unordered unless sorted in the provider, and derive store_prefix from the resolved effective NameNode.
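
To illustrate the listing concern, here is a minimal sketch of manifest discovery that does not depend on the order in which the backend returns entries. The `_versions/<n>.manifest` naming and both function names are assumptions chosen for the example and may not match Lance's actual manifest naming scheme.

```rust
/// Parse the numeric version from a path such as `_versions/10.manifest`.
/// Returns `None` for anything that does not match that shape.
fn parse_version(path: &str) -> Option<u64> {
    let file_name = path.rsplit('/').next()?;
    let stem = file_name.strip_suffix(".manifest")?;
    stem.parse().ok()
}

/// Pick the newest manifest by comparing parsed versions, so the result is
/// the same regardless of the order of the listing.
fn latest_manifest<'a>(
    listing: impl IntoIterator<Item = &'a str>,
) -> Option<(u64, &'a str)> {
    listing
        .into_iter()
        .filter_map(|path| parse_version(path).map(|version| (version, path)))
        .max_by_key(|(version, _)| *version)
}
```

With entries `2.manifest` and `10.manifest`, taking the last item of an unsorted or lexically sorted listing could select version 2, while comparing parsed versions selects version 10.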

@hfutatzhanghb

hfutatzhanghb commented Jun 19, 2026

Contributor Author

> Thanks for adding HDFS support. I think this needs a few correctness fixes before merge.
>
> The main blocker is that this PR registers hdfs:// as writable, but commit routing still falls through to UnsafeCommitHandler. That means concurrent HDFS writers can overwrite manifest versions without real create-if-absent / conflict detection semantics.
>
> There are also two HDFS identity/listing issues:
>
> • HDFS is advertised as list_is_lexically_ordered=true, but the backend does not guarantee sorted listings before Lance uses listing order for manifest discovery.
> • store_prefix is derived from the URI authority even when hdfs_name_node / HDFS_NAME_NODE overrides the actual NameNode, so different effective clusters can be treated as the same datastore.
>
> Please route HDFS to a safe commit handler or explicitly reject HDFS writes without one, default HDFS listings to unordered unless sorted in the provider, and derive store_prefix from the resolved effective NameNode.

@Xuanwo Thanks for reviewing.

  1. For UnsafeCommitHandler: we had already implemented an HdfsRenameCommitHandler and planned to push it in a separate PR after this one was merged. It has now been pushed in this PR instead.
  2. For the identity/listing issues: I will fix them immediately.
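
For the store_prefix point, a minimal sketch of deriving the prefix from the effective NameNode could look like the following. The `hdfs$<authority>` output format and the function name are assumptions for illustration. The idea shown is only that an explicit NameNode override, when present, takes precedence over the URI authority.

```rust
/// Derive a datastore identity from the NameNode that will actually be used.
/// `name_node_override` stands for a resolved `hdfs_name_node` /
/// `HDFS_NAME_NODE` value, if any.
fn hdfs_store_prefix(uri_authority: &str, name_node_override: Option<&str>) -> String {
    let effective = name_node_override
        .filter(|v| !v.is_empty()) // ignore blank overrides
        .map(|nn| nn.strip_prefix("hdfs://").unwrap_or(nn))
        .unwrap_or(uri_authority);
    format!("hdfs${}", effective.trim_end_matches('/'))
}
```

Two datasets opened with the same URI authority but different overrides then get different prefixes, so they are not treated as the same datastore.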

zhanghaobo@kanzhun.com added 17 commits June 19, 2026 23:52

The hdfs feature requires Hadoop native libraries (libhdfs.so) which are not
available on CI runners. Exclude it from ALL_FEATURES computation, following
the same pattern as protoc and slow_tests.

Also update Cargo.lock to include hdfs-related dependencies (hdrs, hdfs-sys,
java-locator, opendal-service-hdfs) so --locked builds don't fail when hdfs
is accidentally enabled.

Add .filter(|v| !v.is_empty()) to hdfs_name_node and hdfs_user
storage options lookups for consistency with the kerberos and
atomic_write_dir fields.

This test was a duplicate of test_hdfs_store_paths in the unit test
file, both exercise extract_path logic with the same URL patterns.

@hfutatzhanghb

hfutatzhanghb commented Jun 19, 2026

Contributor Author

@Xuanwo @yanghua To provide better rename semantics for the Lance HDFS feature, we also optimized the libhdfs rename operation in apache/hadoop#8411.

Comment thread Cargo.toml
"num-traits",
"std",
] }
hdrs = "0.3.2"

Collaborator

We should not depend on this

Contributor Author

Hi @Xuanwo, do you mean we should avoid using hdrs directly and implement the HDFS commit rename through OpenDAL instead? My concern is preserving the rename-if-destination-does-not-exist semantics required for commit conflict detection.

Contributor Author

OpenDAL’s HDFS rename currently overwrites the destination by deleting it first. The manifest commit requires an atomic rename-if-destination-does-not-exist operation for conflict detection. Would you prefer adding conditional rename support to OpenDAL first, or is there another OpenDAL API intended for this semantic?
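
The semantic in question can be illustrated on a local filesystem, where `std::fs::hard_link` fails with `AlreadyExists` if the destination is already present. This is only an analogy for the commit-if-absent behavior under discussion. It does not use the HDFS, hdrs, or OpenDAL APIs, and the type and function names are hypothetical.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::Path;

/// Result of trying to publish a staged manifest.
#[derive(Debug, PartialEq)]
enum CommitOutcome {
    Committed,
    Conflict,
}

/// Publish `staged` at `dest` only if `dest` does not exist yet.
/// Linking never replaces an existing destination, so a concurrent writer
/// that lost the race observes `Conflict` instead of overwriting.
fn commit_if_absent(staged: &Path, dest: &Path) -> io::Result<CommitOutcome> {
    match fs::hard_link(staged, dest) {
        Ok(()) => {
            // The destination now holds the data; drop the staging name.
            fs::remove_file(staged)?;
            Ok(CommitOutcome::Committed)
        }
        // Another writer already committed this version.
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(CommitOutcome::Conflict),
        Err(e) => Err(e),
    }
}
```

A delete-then-rename sequence cannot provide this guarantee, because a second writer can slip in between the delete and the rename. On `Conflict` the staged file is left in place for the caller to clean up or retry under a new version.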
