12 changes: 6 additions & 6 deletions .github/workflows/rust.yml
@@ -68,7 +68,7 @@ jobs:
sudo apt install -y protobuf-compiler libssl-dev
- name: Get features
run: |
-ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | sort | uniq | paste -s -d "," -`
+ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests -e hdfs | sort | uniq | paste -s -d "," -`
echo "ALL_FEATURES=${ALL_FEATURES}" >> $GITHUB_ENV
- name: Clippy
run: cargo clippy --profile ci --locked --features ${{ env.ALL_FEATURES }} --all-targets -- -D warnings
@@ -110,7 +110,7 @@ jobs:
uses: taiki-e/install-action@66068bfca13dcb2ea07c3f613ca2836a37c755d5 # cargo-llvm-cov
- name: Run tests
run: |
-ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests | sort | uniq | paste -s -d "," -`
+ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests -e hdfs | sort | uniq | paste -s -d "," -`
cargo +nightly llvm-cov --profile ci --locked --workspace --codecov --output-path coverage.codecov --features ${ALL_FEATURES}
- name: Upload coverage to Codecov
uses: codecov/codecov-action@b9fd7d16f6d7d1b5d2bec1a2887e65ceed900238 # v4
@@ -137,13 +137,13 @@ jobs:
sudo apt install -y protobuf-compiler libssl-dev pkg-config
- name: Build tests
run: |
-ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests | sort | uniq | paste -s -d "," -`
+ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests -e hdfs | sort | uniq | paste -s -d "," -`
cargo test --profile ci --locked --features ${ALL_FEATURES} --no-run
- name: Start DynamodDB and S3
run: docker compose -f docker-compose.yml up -d --wait
- name: Run tests
run: |
-ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests | sort | uniq | paste -s -d "," -`
+ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests -e hdfs | sort | uniq | paste -s -d "," -`
cargo test --profile ci --locked --features ${ALL_FEATURES}
query-integration-tests:
runs-on: ubuntu-24.04-4x
@@ -195,7 +195,7 @@ jobs:
sudo apt install -y protobuf-compiler libssl-dev
- name: Build all
run: |
-ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests | sort | uniq | paste -s -d "," -`
+ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests -e hdfs | sort | uniq | paste -s -d "," -`
cargo build --profile ci --benches --features ${ALL_FEATURES} --tests
mac-build:
runs-on: warp-macos-14-arm64-6x
@@ -280,5 +280,5 @@ jobs:
env:
RUSTUP_TOOLCHAIN: ${{ matrix.msrv }}
run: |
-ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests | sort | uniq | paste -s -d "," -`
+ALL_FEATURES=`cargo metadata --format-version=1 --no-deps | jq -r '.packages[] | .features | keys | .[]' | grep -v -e protoc -e slow_tests -e hdfs | sort | uniq | paste -s -d "," -`
cargo check --profile ci --workspace --tests --benches --features ${ALL_FEATURES}
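
The feature-selection pipeline repeated in the workflow hunks above collects every feature name in the workspace, drops any name containing an excluded substring, then de-duplicates, sorts, and joins the rest with commas. A rough Python equivalent follows. It assumes the JSON shape emitted by `cargo metadata --format-version=1` (a `packages` array whose entries carry a `features` map). The function name and the sample data are illustrative and do not come from the PR.

```python
import json

# Substrings excluded by `grep -v -e protoc -e slow_tests -e hdfs`.
EXCLUDED = ("protoc", "slow_tests", "hdfs")


def all_features(metadata_json: str, excluded=EXCLUDED) -> str:
    """Mirror `jq ... | grep -v ... | sort | uniq | paste -s -d ","`."""
    metadata = json.loads(metadata_json)
    names = set()  # a set plays the role of `sort | uniq`
    for package in metadata["packages"]:
        for name in package["features"]:
            # grep -v drops any line that contains one of the patterns,
            # so this is a substring test, not an exact-name test.
            if not any(pattern in name for pattern in excluded):
                names.add(name)
    return ",".join(sorted(names))


# Hypothetical metadata with two packages (sample data only).
sample = json.dumps({"packages": [
    {"features": {"default": [], "hdfs": [], "aws": []}},
    {"features": {"default": [], "protoc": [], "slow_tests": []}},
]})
print(all_features(sample))
```

Because the filter matches substrings, any future feature whose name merely contains `hdfs` would also be excluded from these CI jobs.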
114 changes: 102 additions & 12 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -104,6 +104,7 @@ half = { "version" = "2.1", default-features = false, features = [
"num-traits",
"std",
] }
+hdrs = "0.3.2"

Collaborator

We should not depend on this

Contributor Author

Hi @Xuanwo. Do you mean we should avoid using hdrs directly and implement the HDFS commit rename through OpenDAL instead? My concern is preserving the rename-if-destination-does-not-exist semantics required for commit conflict detection.

Contributor Author

OpenDAL’s HDFS rename currently overwrites the destination by deleting it first. The manifest commit requires an atomic rename-if-destination-does-not-exist operation for conflict detection. Would you prefer adding conditional rename support to OpenDAL first, or is there another OpenDAL API intended for this semantic?
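
To make the semantic under discussion concrete, here is a minimal sketch of rename-if-destination-does-not-exist and how it surfaces a commit conflict. It uses the local filesystem as a stand-in and is not the HDFS, `hdrs`, or OpenDAL API. The helper names and the `7.manifest` file name are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def rename_if_absent(src: str, dst: str) -> bool:
    """Publish src at dst only if dst does not exist; False means conflict."""
    try:
        # os.link raises FileExistsError when dst already exists, so two
        # writers racing for the same dst cannot both succeed.
        os.link(src, dst)
    except FileExistsError:
        return False
    os.unlink(src)  # the staged name is no longer needed
    return True


def demo():
    """Two writers try to commit the same (hypothetical) manifest version."""
    with tempfile.TemporaryDirectory() as root:
        manifest = os.path.join(root, "7.manifest")
        outcomes = []
        for writer in ("writer-a", "writer-b"):
            staged = os.path.join(root, f"{writer}.tmp")
            with open(staged, "w") as f:
                f.write(writer)
            outcomes.append(rename_if_absent(staged, manifest))
        with open(manifest) as f:
            return outcomes[0], outcomes[1], f.read()


print(demo())
```

A rename that deletes the destination first would let the second writer silently replace the first writer's manifest, which is the failure mode the comment above describes.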

lance-bitpacking = { version = "=8.1.0-beta.0", path = "./rust/compression/bitpacking" }
bitpacking = "0.9"
bitvec = "1"
69 changes: 69 additions & 0 deletions docs/src/guide/object_store.md
@@ -249,6 +249,75 @@ ds = lance.dataset(
| `tos_secret_access_key` | Secret access key used for TOS authentication. Optional if credentials are provided by environment. |
| `tos_security_token` | Security token for temporary credentials. Optional. |

## HDFS Configuration

HDFS support is optional and must be enabled when building Lance. For Rust builds,
enable the `hdfs` feature on `lance`. For Java builds, see the
[Java HDFS build instructions](https://github.com/lance-format/lance/tree/main/java#hdfs-enabled-build).
Prebuilt Lance packages may not include HDFS support.

To build the Python package with HDFS support from source:

```bash
cd python
uv run maturin build --release --features hdfs
```

Use an `hdfs://` URI containing a NameNode address or an HDFS high-availability
nameservice:

```python
import lance

ds = lance.dataset("hdfs://namenode:9000/user/lance/my-dataset")
```

For high-availability clusters configured in Hadoop XML files, the URI authority
can be the nameservice:

```python
ds = lance.dataset("hdfs://mycluster/user/lance/my-dataset")
```

Explicit `storage_options` take priority over environment variables. If neither
specifies a NameNode, Lance uses the URI authority.

```python
ds = lance.dataset(
    "hdfs://namenode:9000/user/lance/my-dataset",
    storage_options={
        "hdfs_name_node": "hdfs://namenode:8020",
        "hdfs_user": "lance",
        "hdfs_kerberos_ticket_cache_path": "/tmp/krb5cc_lance",
        "hdfs_atomic_write_dir": "/tmp/lance-hdfs-atomic",
    },
)
```

| `storage_options` key | Environment variable | Description |
|-----------------------|----------------------|-------------|
| `hdfs_name_node` | `HDFS_NAME_NODE` | NameNode URI or HA nameservice. Defaults to the `hdfs://` URI authority. |
| `hdfs_user` | `HADOOP_USER_NAME`, then `HDFS_USER` | HDFS user name. The storage option takes priority, followed by the environment variables in the listed order. |
| `hdfs_kerberos_ticket_cache_path` | None | Path to the Kerberos ticket cache used to authenticate with HDFS. |
| `hdfs_atomic_write_dir` | None | HDFS directory used by OpenDAL for atomic writes. |
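
The precedence rules in the table can be modeled with a small sketch. This is not Lance's implementation. `resolve_hdfs_settings` is a hypothetical helper, the environment is passed in as a plain dict, and the exact form of the URI-authority fallback (`hdfs://` plus the authority) is an assumption.

```python
from urllib.parse import urlsplit


def resolve_hdfs_settings(uri, storage_options=None, env=None):
    """Model the documented precedence: option, then env var, then URI."""
    storage_options = storage_options or {}
    env = env or {}
    authority = urlsplit(uri).netloc  # e.g. "namenode:9000" or "mycluster"

    name_node = (
        storage_options.get("hdfs_name_node")
        or env.get("HDFS_NAME_NODE")
        or f"hdfs://{authority}"  # assumed format of the fallback value
    )
    user = (
        storage_options.get("hdfs_user")
        or env.get("HADOOP_USER_NAME")  # checked before HDFS_USER
        or env.get("HDFS_USER")
    )
    return {"name_node": name_node, "user": user}


explicit = resolve_hdfs_settings(
    "hdfs://namenode:9000/user/lance/my-dataset",
    storage_options={"hdfs_name_node": "hdfs://namenode:8020"},
    env={"HDFS_NAME_NODE": "hdfs://other:8020", "HDFS_USER": "fallback"},
)
print(explicit)
```

Here the explicit `hdfs_name_node` wins over both the environment variable and the URI authority, while the user falls back to `HDFS_USER` because neither `hdfs_user` nor `HADOOP_USER_NAME` is set.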

Lance's HDFS provider uses OpenDAL's HDFS service, which depends on `hdrs` and
`hdfs-sys`. Building and running an HDFS-enabled artifact requires a local Java
and Hadoop native environment:

```bash
export JAVA_HOME=/path/to/java
export HADOOP_HOME=/path/to/hadoop
export CLASSPATH="$(${HADOOP_HOME}/bin/hadoop classpath --glob)"
export LD_LIBRARY_PATH="${JAVA_HOME}/lib/server:${HADOOP_HOME}/lib/native:${LD_LIBRARY_PATH}"
```

`hdfs-sys` dynamically links `libjvm`. If it uses a dynamically linked
`libhdfs`, `${HADOOP_HOME}/lib/native` must also be discoverable through the
platform library path. Ensure the Hadoop configuration directory, commonly
`${HADOOP_HOME}/etc/hadoop`, is included in the Hadoop classpath for HA,
Kerberos, and other cluster-specific settings.
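
Before loading an HDFS-enabled build, a quick pre-flight check of these variables can save a confusing linker error. The sketch below only inspects the variables named in the export block above. `check_hdfs_env` is a hypothetical helper, and the check assumes a Linux-style `LD_LIBRARY_PATH` and the `lib/server` JDK layout shown there.

```python
import os


def check_hdfs_env(env=None):
    """Return human-readable problems; an empty list means the check passed."""
    env = dict(os.environ) if env is None else env
    problems = []

    for var in ("JAVA_HOME", "HADOOP_HOME", "CLASSPATH"):
        if not env.get(var):
            problems.append(f"{var} is not set")

    # Directories the export block above adds to LD_LIBRARY_PATH.
    expected = []
    if env.get("JAVA_HOME"):
        expected.append(f"{env['JAVA_HOME']}/lib/server")
    if env.get("HADOOP_HOME"):
        expected.append(f"{env['HADOOP_HOME']}/lib/native")

    library_dirs = env.get("LD_LIBRARY_PATH", "").split(":")
    for directory in expected:
        if directory not in library_dirs:
            problems.append(f"{directory} is missing from LD_LIBRARY_PATH")
    return problems


for problem in check_hdfs_env():
    print(problem)
```

The check does not verify that `libjvm` or `libhdfs` actually exist in those directories, nor that the Hadoop configuration directory is on the classpath; it only catches unset or incomplete variables.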

## Tencent Cloud COS Configuration

[COS (Cloud Object Storage)](https://cloud.tencent.com/product/cos) credentials can be set in environment variables prefixed