Background
daft-lance already has partial Lance Blob V2 support, especially around descriptor reads, byte materialization through pylance, and opt-in binary-column writes.
However, several Blob V2 workflows are either not fully supported yet or need explicit validation before we can consider the integration production-ready:
- external / reference Blob V2 writes
- adding Blob V2 columns through merge_columns / merge_columns_df
- compaction after deletes
- cleanup / vacuum behavior for Blob V2 sidecar data and old versions
This issue tracks the current state, known gaps, and proposed development direction.
Current Support
Read Descriptor Columns
daft.read_lance can read lance.blob.v2 columns and expose them as descriptor structs instead of eagerly materializing bytes.
The descriptor shape is roughly:
{
kind: uint8,
position: uint64,
size: uint64,
blob_id: uint32,
blob_uri: string,
}
This lazy behavior is useful and should remain the default.
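To make the descriptor shape concrete, here is a minimal plain-Python sketch that mirrors the five fields listed above. BlobDescriptor and total_blob_bytes are hypothetical names used only for illustration, not daft-lance API, and treating size as a byte count is an assumption based on the field name. The point is that metadata such as sizes can be inspected from descriptors alone, without materializing any blob bytes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlobDescriptor:
    """Hypothetical mirror of one Blob V2 descriptor struct value."""

    kind: int                # uint8 tag (values are not interpreted in this sketch)
    position: int            # uint64
    size: int                # uint64, assumed to be a byte count
    blob_id: int             # uint32
    blob_uri: Optional[str]  # string


def total_blob_bytes(rows):
    """Sum sizes from descriptor dicts; null descriptors are skipped."""
    return sum(BlobDescriptor(**row).size for row in rows if row is not None)
```

Here rows is assumed to be a list of descriptor dicts (or None for null blobs) carrying exactly the five keys above.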
Materialize Blob V2 Bytes
daft_lance.take_blobs(df, ds, column) can materialize Blob V2 bytes when the DataFrame has _rowid available.
Example pattern:
import lance
import daft
from daft_lance import take_blobs
ds = lance.dataset(uri)
df = daft.read_lance(uri, default_scan_options={"with_row_id": True})
df = take_blobs(df, ds, "blob")
This helper calls pylance's take_blobs(..., ids=...), which is preferable for multi-fragment datasets because it avoids known pitfalls around index-based addressing.
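One way to see why id-based addressing is safer is a toy model, sketched below with no Lance dependency. The packing of fragment id and row offset into a single 64-bit id is an assumption made for illustration, and make_row_id is a hypothetical helper. After a filter, a row's position in the result no longer matches its position in the dataset, so positional lookups return the wrong blobs while id lookups do not.

```python
def make_row_id(fragment_id: int, offset: int) -> int:
    """Pack a (fragment, offset) pair into one 64-bit id (assumed layout)."""
    return (fragment_id << 32) | offset


# Toy dataset: two fragments with two rows each, keyed by row id.
blobs_by_row_id = {
    make_row_id(0, 0): b"a",
    make_row_id(0, 1): b"b",
    make_row_id(1, 0): b"c",
    make_row_id(1, 1): b"d",
}

# A filtered scan keeps the second row of each fragment. Their positions in
# the filtered result are 0 and 1, but their dataset positions are 1 and 3.
filtered_row_ids = [make_row_id(0, 1), make_row_id(1, 1)]

# Lookup by id returns the blobs that belong to the filtered rows.
by_id = [blobs_by_row_id[rid] for rid in filtered_row_ids]

# Lookup by result position silently returns blobs from other rows.
all_blobs = list(blobs_by_row_id.values())
by_position = [all_blobs[i] for i in range(len(filtered_row_ids))]
```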
Write Binary Columns as Blob V2
df.write_lance(..., blob_columns=[...]) can promote Daft binary / large_binary columns into logical Lance Blob V2 columns.
Current behavior:
- binary / large_binary columns can be opted in as Blob V2 columns
- the written Lance schema uses lance.blob.v2
- when Blob V2 columns are requested and no storage version is provided, the sink defaults to data_storage_version="2.2"
- append mode can auto-detect existing Blob V2 columns and keep writing them as Blob V2 even if blob_columns is not repeated
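The behavior listed above can be sketched as a small pure-Python resolver. This is not the actual sink code: resolve_blob_write_options, its parameters, and the string type names are hypothetical, and raising an error for non-binary columns is an assumption, since this issue only states which column types can be opted in.

```python
from typing import Dict, Iterable, List, Optional, Tuple

BINARY_TYPES = {"binary", "large_binary"}
DEFAULT_BLOB_STORAGE_VERSION = "2.2"


def resolve_blob_write_options(
    column_types: Dict[str, str],
    blob_columns: Optional[Iterable[str]] = None,
    data_storage_version: Optional[str] = None,
    existing_blob_columns: Iterable[str] = (),
    mode: str = "create",
) -> Tuple[List[str], Optional[str]]:
    """Return (columns to write as Blob V2, storage version to use)."""
    requested = set(blob_columns or ())
    if mode == "append":
        # Keep writing columns that are already Blob V2 in the target dataset.
        requested |= set(existing_blob_columns) & set(column_types)
    for name in sorted(requested):
        if name not in column_types:
            raise ValueError(f"blob column {name!r} is not in the DataFrame")
        if column_types[name] not in BINARY_TYPES:
            # Assumption: only binary / large_binary columns may be promoted.
            raise ValueError(f"blob column {name!r} is not a binary column")
    if requested and data_storage_version is None:
        data_storage_version = DEFAULT_BLOB_STORAGE_VERSION
    return sorted(requested), data_storage_version
```

Keeping this logic in one place is also what would make it reusable from other write paths, such as the merge_columns_df case discussed under Known Gaps.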
Known Gaps
1. External / Reference Blob V2 Writes Are Not Fully Exposed
The current daft-lance write path uses lance.fragment.write_fragments.
That path does not currently expose the full set of Blob V2 write options available through pylance / lance.write_dataset, such as:
- initial_bases
- target_bases
- base_store_params
- allow_external_blob_outside_bases
- external URI / range-reference configuration
As a result, writing external / reference Blob V2 data through daft-lance is not yet a supported end-to-end workflow.
Expected future behavior:
- users should be able to write external Blob V2 references through write_lance
- registered base paths should be preserved in dataset metadata
- external full-object and byte-range references should round trip correctly
- nullable Blob V2 external references should be supported
2. merge_columns / merge_columns_df Cannot Add Blob V2 Columns Yet
merge_columns_df has a fast path that writes raw .lance files via LanceFileWriter and stitches those files into fragment metadata.
That path currently handles ordinary scalar / string / numeric columns, but it does not apply the Blob V2 write policy used by write_lance.
Specifically:
- there is no blob_columns option for merge_columns_df
- new binary columns are not promoted to lance.blob.v2
- the fast path does not write Blob V2 sidecar data or descriptor metadata
- the slow keyed-join path also does not currently expose Blob V2-specific schema handling
Expected future behavior:
daft_lance.merge_columns_df(
    df,
    uri,
    blob_columns=["new_blob_column"],
)
should add new_blob_column as a logical Blob V2 field.
Implementation options:
- teach the fast path to write Blob V2-compatible files and metadata
- disable the fast path when Blob V2 columns are requested and route to a safe Blob V2-aware path
- reuse / generalize the existing BlobV2WritePolicy from the write sink
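As a rough illustration of the second option, the routing decision could look like the sketch below. choose_merge_path, its parameters, and the returned path labels are all hypothetical and do not correspond to existing daft-lance code; the sketch only shows that a Blob V2 request would bypass the raw-file fast path.

```python
from typing import Iterable, List


def choose_merge_path(
    new_columns: Iterable[str],
    blob_columns: Iterable[str] = (),
    fast_path_eligible: bool = True,
) -> str:
    """Pick a merge strategy label for the columns being added."""
    new_columns = list(new_columns)
    blob_list: List[str] = list(blob_columns)
    unknown = [c for c in blob_list if c not in new_columns]
    if unknown:
        raise ValueError(f"blob_columns not among new columns: {unknown}")
    if blob_list:
        # The fast path does not write sidecar data or descriptor metadata,
        # so any Blob V2 request is routed to a Blob V2-aware path instead.
        return "blob_v2_aware"
    return "fast" if fast_path_eligible else "slow_keyed_join"
```

Rejecting unknown blob_columns up front avoids silently writing a plain binary column when the caller intended Blob V2.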
Acceptance criteria:
- adding a new binary column as Blob V2 produces a Lance schema field with type lance.blob.v2
- existing columns are unchanged
- row alignment is preserved
- the new Blob V2 column can be read as descriptors through read_lance
- the new Blob V2 column can be materialized through take_blobs
3. Compaction After Deletion Materialization Fails for Blob V2
A local regression test currently shows the following flow failing:
- write a multi-fragment Lance dataset with Blob V2 columns
- verify descriptor reads
- verify byte materialization through take_blobs
- delete one row
- run compact_files(..., materialize_deletions=True)
- attempt to verify remaining Blob V2 bytes
The failure occurs during compaction execution / decoding with an error like:
there were more fields in the schema than provided column indices / infos
This needs investigation to determine whether the root cause is:
- incorrect schema / field metadata passed through daft-lance
- incompatibility in the distributed compaction task execution path
- upstream Lance compaction behavior for Blob V2 columns
- a mismatch between Blob V2 sidecar metadata and compacted fragment metadata
Expected future behavior:
- compaction should preserve Blob V2 descriptors
- compaction should preserve materialized bytes exactly
- deleted rows should be materialized away when requested
- remaining rows should still be readable through both descriptor scans and take_blobs
- fragment count should decrease according to the compaction options
4. Cleanup / Vacuum Behavior Is Not Defined at the daft-lance Layer
There is currently no explicit daft-lance API for Blob V2 cleanup / vacuum workflows.
Open questions:
- Should daft-lance expose a cleanup wrapper, or should users call pylance APIs directly?
- How should old versions and orphaned Blob V2 sidecar files be cleaned?
- What is the expected behavior after compaction?
- Do object-store directory semantics require extra documentation?
- How should cleanup interact with time travel / retained versions?
Expected future behavior:
- document the supported cleanup story
- expose a wrapper if there is a stable pylance API we should delegate to
- clarify whether Blob V2 sidecar files are covered by existing cleanup APIs
- add tests where possible
Proposed Development Plan
Phase 1: Lock Down Read and Materialization Behavior
- Keep descriptor reads lazy by default.
- Prefer _rowid + take_blobs(..., ids=...) for materialization.
- Add / maintain tests for:
  - inline Blob V2
  - packed Blob V2
  - dedicated Blob V2
  - nullable Blob V2
  - multi-fragment datasets
  - filtered / non-contiguous row materialization
Phase 2: Expand Write Support
Expose the Blob V2 write options needed for external / reference workflows.
Potential API additions:
df.write_lance(
    uri,
    blob_columns=["blob"],
    data_storage_version="2.2",
    initial_bases=[...],
    target_bases=[...],
    base_store_params={...},
    allow_external_blob_outside_bases=True,
)
Tests should cover:
- inline bytes
- packed bytes
- dedicated blobs
- external full URI references
- external range references
- nullable values
- append mode
- multiple Blob V2 columns
- mixed scalar / vector / Blob V2 datasets
Phase 3: Add Blob V2-Aware Add Columns Support
Extend merge_columns / merge_columns_df so they can add Blob V2 columns intentionally.
Possible API:
daft_lance.merge_columns_df(
    df,
    uri,
    blob_columns=["new_blob"],
)
Validation should include:
- fast path behavior
- slow path behavior
- multi-fragment datasets
- existing columns unchanged
- new Blob V2 column materializes correctly
- schema metadata remains valid
Phase 4: Fix or Upstream Compaction Issue
Add a regression test for:
write Blob V2 dataset
delete rows
compact with materialize_deletions=True
verify descriptors
verify materialized bytes
If the issue is upstream in Lance, file / link an upstream issue with a minimal pylance-only reproducer.
If the issue is in daft-lance, fix the schema / metadata handling in the compaction path.
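For the "verify materialized bytes" step, a regression test could compare blobs before and after compaction with a helper along these lines. This is a sketch under assumptions: diff_blobs_after_compaction is a hypothetical name, and the blobs are keyed by a user-level key column (such as an id column) rather than _rowid, on the assumption that row ids may not remain stable across compaction.

```python
from typing import Dict, Hashable, Iterable, List


def diff_blobs_after_compaction(
    before: Dict[Hashable, bytes],
    after: Dict[Hashable, bytes],
    deleted_keys: Iterable[Hashable],
) -> Dict[str, List[Hashable]]:
    """Report rows that went missing, reappeared, or changed bytes."""
    deleted = set(deleted_keys)
    expected = {k: v for k, v in before.items() if k not in deleted}
    return {
        # Rows that should have survived compaction but are gone.
        "missing": sorted(set(expected) - set(after)),
        # Rows present after compaction that were deleted or never existed.
        "unexpected": sorted(set(after) - set(expected)),
        # Rows whose bytes differ from what was written.
        "changed": sorted(
            k for k in expected.keys() & after.keys() if expected[k] != after[k]
        ),
    }
```

A passing test would assert that all three lists are empty after compact_files(..., materialize_deletions=True).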
Phase 5: Define Cleanup Semantics
Investigate Lance cleanup APIs and decide whether daft-lance should expose them.
At minimum, document:
- what cleanup operation users should call
- whether Blob V2 sidecar files are included
- interaction with old versions / time travel
- object-store caveats
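If daft-lance does expose a wrapper, it could be as thin as the sketch below. cleanup_lance_versions is a hypothetical name; the sketch assumes the dataset object provides pylance's cleanup_old_versions(older_than=...) method, and it makes no claim about whether that call also removes Blob V2 sidecar files, which is exactly the open question above.

```python
from datetime import timedelta


def cleanup_lance_versions(ds, retain: timedelta = timedelta(days=7)):
    """Delegate cleanup to the dataset, keeping versions newer than `retain`.

    `ds` is assumed to be a pylance dataset, or any object exposing a
    compatible cleanup_old_versions(older_than=...) method.
    """
    if retain < timedelta(0):
        raise ValueError("retain must be non-negative")
    return ds.cleanup_old_versions(older_than=retain)
```

Requiring an explicit retention window keeps the interaction with time travel visible at the call site.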
Suggested Regression Tests
Blob V2 Read / Write / Delete / Compact Round Trip
def test_blob_v2_read_write_delete_and_compaction_round_trip(...):
    # write inline / packed / dedicated Blob V2 across multiple fragments
    # read descriptors through daft.read_lance
    # materialize bytes through take_blobs using _rowid
    # delete one row
    # run compact_files(..., materialize_deletions=True)
    # verify remaining row count, fragment count, descriptors, and bytes
Add Blob V2 Column Through merge_columns_df
def test_merge_columns_df_adds_blob_v2_column(...):
    # create non-blob Lance dataset
    # read with _rowaddr and fragment_id
    # add a new binary column in Daft
    # merge it back as Blob V2
    # verify Lance schema field is lance.blob.v2
    # verify bytes materialize correctly
External Blob V2 Write With Registered Base
def test_write_lance_external_blob_v2_with_registered_base(...):
    # create external object
    # write Blob.from_uri / range reference
    # verify descriptor kind == external
    # verify base path metadata
    # verify materialized bytes
Acceptance Criteria
Blob V2 support can be considered complete when:
- descriptor reads are stable
- take_blobs materialization works for multi-fragment datasets
- inline / packed / dedicated writes work through write_lance
- external / reference writes work through write_lance
- appending to existing Blob V2 datasets works
- adding new Blob V2 columns works through merge/add-column APIs
- compaction preserves Blob V2 bytes and metadata
- delete + compaction works
- cleanup / vacuum behavior is documented or exposed
- tests cover local filesystem and, where feasible, object-store configurations