feat: support HDFS object store#5472
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Hi, @jackye1995 @wojiaodoubao @majin1102 . Could you please help review this PR when you have free time? Thanks a lot. |
|
|
||
| [features] | ||
| default = ["aws", "azure", "gcp"] | ||
| default = ["aws", "azure", "gcp", "hdfs"] |
There was a problem hiding this comment.
I think we shouldn't enable HDFS by default. It introduces many new dependencies in Lance and also requires users to have a Java setup. Without it, Lance will fail to start.
| } | ||
| } else { | ||
| // Fall back to system username | ||
| config_map.insert("user".to_string(), whoami::username()); |
There was a problem hiding this comment.
I prefer having users fill this out instead of using whoami.
| <copyTo>${project.build.directory}/classes/nativelib</copyTo> | ||
| <copyWithPlatformDir>true</copyWithPlatformDir> | ||
| <environmentVariables> | ||
| <CARGO_FEATURES>hdfs</CARGO_FEATURES> |
There was a problem hiding this comment.
I feel the same. It doesn’t seem like a good idea to enable hdfs by default.
There was a problem hiding this comment.
@Xuanwo Thanks for reviewing. Got it, will push an new version based on your advice laterly.
84a92c5 to
c6087b6
Compare
|
Thank you for your contribution. This PR has been inactive for a while, so we're closing it to free up bandwidth. Feel free to reopen it if you still find it useful. |
c6087b6 to
467c486
Compare
467c486 to
e646087
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
@Xuanwo @BubbleCal @jiaoew1991 Hi, could you please review this PR when free? Thanks very much. We have use HDFS as backend storage for a long time. |
07a46bc to
a48caed
Compare
|
@yanghua Hi, could you please help review this PR when have free time? Thanks very much! |
|
@claude review |
|
It seems the bot does not work. I could take a look at this PR later. |
Thanks! |
257563f to
6c924e5
Compare
563d129 to
8b4178b
Compare
have fixed and pushed. Let's wait pipeline ok. Thanks for your reviewing!!! |
|
Will take a look today. |
Xuanwo
left a comment
There was a problem hiding this comment.
Thanks for adding HDFS support. I think this needs a few correctness fixes before merge.
The main blocker is that this PR registers hdfs:// as writable, but commit routing still falls through to UnsafeCommitHandler. That means concurrent HDFS writers can overwrite manifest versions without real create-if-absent / conflict detection semantics.
There are also two HDFS identity/listing issues:
- HDFS is advertised as
list_is_lexically_ordered=true, but the backend does not guarantee sorted listings before Lance uses listing order for manifest discovery. store_prefixis derived from the URI authority even whenhdfs_name_node/HDFS_NAME_NODEoverrides the actual NameNode, so different effective clusters can be treated as the same datastore.
Please route HDFS to a safe commit handler or explicitly reject HDFS writes without one, default HDFS listings to unordered unless sorted in the provider, and derive store_prefix from the resolved effective NameNode.
@Xuanwo Thanks for reviewing.
|
The hdfs feature requires Hadoop native libraries (libhdfs.so) which are not available on CI runners. Exclude it from ALL_FEATURES computation, following the same pattern as protoc and slow_tests. Also update Cargo.lock to include hdfs-related dependencies (hdrs, hdfs-sys, java-locator, opendal-service-hdfs) so --locked builds don't fail when hdfs is accidentally enabled.
Add .filter(|v| !v.is_empty()) to hdfs_name_node and hdfs_user storage options lookups for consistency with the kerberos and atomic_write_dir fields.
This test was a duplicate of test_hdfs_store_paths in the unit test file, both exercise extract_path logic with the same URL patterns.
74d4f33 to
87a6af7
Compare
|
@Xuanwo @yanghua To provide better rename semantics for the Lance HDFS feature, we also optimized the libhdfs rename operation in apache/hadoop#8411. |
| "num-traits", | ||
| "std", | ||
| ] } | ||
| hdrs = "0.3.2" |
There was a problem hiding this comment.
We should not depend on this
There was a problem hiding this comment.
Hi, @Xuanwo . Do you mean we should avoid using hdrs directly and implement the HDFS commit rename through OpenDAL instead? My concern is preserving the rename-if-destination-does-not-exist semantics required for commit conflict detection.
There was a problem hiding this comment.
OpenDAL’s HDFS rename currently overwrites the destination by deleting it first. The manifest commit requires an atomic rename-if-destination-does-not-exist operation for conflict detection. Would you prefer adding conditional rename support to OpenDAL first, or is there another OpenDAL API intended for this semantic?
What changed
hdfsfeature forlance-iobacked by OpenDALhdfs://with the object store registryWhy
This allows Lance datasets to be accessed through HDFS URLs, including deployments using HDFS nameservices.
Validation
cargo fmt --all -- --checkcargo clippy -p lance-io --all-targets --features hdfs --locked -- -D warningscargo clippy --all --tests --benches -- -D warningscargo test -p lance-io --features hdfs --locked object_store::providers::hdfs::tests --no-fail-fastcargo +1.91.0 check -p lance-io --features hdfs --locked