Track Lance Blob V2 Support for daft-lance #30

@everySympathy

Background

daft-lance already has partial Lance Blob V2 support: descriptor reads, byte materialization through pylance, and opt-in binary-column writes.

However, several Blob V2 workflows are either not fully supported yet or need explicit validation before we can consider the integration production-ready:

  • external / reference Blob V2 writes
  • adding Blob V2 columns through merge_columns / merge_columns_df
  • compaction after deletes
  • cleanup / vacuum behavior for Blob V2 sidecar data and old versions

This issue tracks the current state, known gaps, and proposed development direction.

Current Support

Read Descriptor Columns

daft.read_lance can read lance.blob.v2 columns and expose them as descriptor structs instead of eagerly materializing bytes.

The descriptor shape is roughly:

{
  kind: uint8,
  position: uint64,
  size: uint64,
  blob_id: uint32,
  blob_uri: string,
}

This lazy behavior is useful and should remain the default.
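As a rough illustration of the descriptor shape above, the sketch below mirrors one descriptor as a plain Python object and checks that each integer fits the listed unsigned width. `BlobDescriptor` and `descriptor_from_row` are hypothetical names for this example only; they are not part of daft-lance or pylance, and the meaning of individual `kind` values is deliberately left out.

```python
from dataclasses import dataclass

# Upper bounds for the unsigned widths listed in the descriptor shape.
UINT8_MAX = 2**8 - 1
UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1


@dataclass(frozen=True)
class BlobDescriptor:
    """Hypothetical Python mirror of one Blob V2 descriptor struct."""

    kind: int       # uint8
    position: int   # uint64
    size: int       # uint64
    blob_id: int    # uint32
    blob_uri: str   # string

    def __post_init__(self) -> None:
        bounds = {
            "kind": UINT8_MAX,
            "position": UINT64_MAX,
            "size": UINT64_MAX,
            "blob_id": UINT32_MAX,
        }
        for name, upper in bounds.items():
            value = getattr(self, name)
            if not isinstance(value, int) or not 0 <= value <= upper:
                raise ValueError(f"{name}={value!r} does not fit its unsigned width")


def descriptor_from_row(row: dict) -> BlobDescriptor:
    """Build a descriptor from one struct value represented as a dict."""
    return BlobDescriptor(**row)
```

A helper like this is only useful in tests or debugging, since the descriptor stays a cheap struct and no blob bytes are touched.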

Materialize Blob V2 Bytes

daft_lance.take_blobs(df, ds, column) can materialize Blob V2 bytes when the DataFrame has _rowid available.

Example pattern:

import lance
import daft
from daft_lance import take_blobs

ds = lance.dataset(uri)
df = daft.read_lance(uri, default_scan_options={"with_row_id": True})
df = take_blobs(df, ds, "blob")

Internally this relies on pylance's take_blobs(..., ids=...) rather than positional indices, which is preferable for multi-fragment datasets because it avoids known pitfalls around index-based addressing.
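One way to see why positional indices are fragile across fragments: the sketch below assumes Lance's row-address layout, where the fragment ID occupies the upper 32 bits and the in-fragment row offset the lower 32 bits, and that default row IDs follow that layout. The helper names and the fragment sizes are illustrative only.

```python
def row_address(fragment_id: int, row_offset: int) -> int:
    """Pack a fragment ID and an in-fragment offset into a 64-bit row address."""
    if not 0 <= fragment_id < 2**32 or not 0 <= row_offset < 2**32:
        raise ValueError("fragment_id and row_offset must each fit in 32 bits")
    return (fragment_id << 32) | row_offset


def split_row_address(address: int) -> tuple[int, int]:
    """Recover (fragment_id, row_offset) from a packed row address."""
    return address >> 32, address & 0xFFFFFFFF


# Two fragments with three rows each (illustrative sizes).
addresses = [row_address(frag, off) for frag in (0, 1) for off in range(3)]

# Positional indices simply count rows in scan order.
positions = list(range(len(addresses)))
```

Under that assumption, the two numbering schemes agree inside fragment 0 and diverge from the first row of fragment 1 onward, so a value carried in _rowid cannot be reused as a positional index.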

Write Binary Columns as Blob V2

df.write_lance(..., blob_columns=[...]) can promote Daft binary / large_binary columns into logical Lance Blob V2 columns.

Current behavior:

  • binary / large_binary columns can be opted in as Blob V2 columns
  • the written Lance schema uses lance.blob.v2
  • when Blob V2 columns are requested and no storage version is provided, the sink defaults to data_storage_version="2.2"
  • append mode can auto-detect existing Blob V2 columns and keep writing them as Blob V2 even if blob_columns is not repeated
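The defaulting rules in the list above can be summarized as a small decision function. This is a sketch of the described behavior, not the actual sink code; `resolve_blob_write` and its parameters are hypothetical names.

```python
from typing import Iterable, Optional

DEFAULT_BLOB_V2_STORAGE_VERSION = "2.2"


def resolve_blob_write(
    requested_blob_columns: Optional[Iterable[str]],
    existing_blob_columns: Iterable[str] = (),
    data_storage_version: Optional[str] = None,
    mode: str = "create",
) -> tuple[list[str], Optional[str]]:
    """Return (blob columns to write as Blob V2, storage version to use)."""
    blob_columns = list(requested_blob_columns or [])
    if mode == "append":
        # Append keeps existing Blob V2 columns even when blob_columns is not repeated.
        for name in existing_blob_columns:
            if name not in blob_columns:
                blob_columns.append(name)
    if blob_columns and data_storage_version is None:
        # Only default the version when Blob V2 is actually in play.
        data_storage_version = DEFAULT_BLOB_V2_STORAGE_VERSION
    return blob_columns, data_storage_version
```

An explicitly provided storage version is passed through unchanged, and a write with no Blob V2 columns leaves the version unset.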

Known Gaps

1. External / Reference Blob V2 Writes Are Not Fully Exposed

The current daft-lance write path uses lance.fragment.write_fragments.

That path does not currently expose the full set of Blob V2 write options available through pylance / lance.write_dataset, such as:

  • initial_bases
  • target_bases
  • base_store_params
  • allow_external_blob_outside_bases
  • external URI / range-reference configuration

As a result, writing external / reference Blob V2 data through daft-lance is not yet a supported end-to-end workflow.

Expected future behavior:

  • users should be able to write external Blob V2 references through write_lance
  • registered base paths should be preserved in dataset metadata
  • external full-object and byte-range references should round trip correctly
  • nullable Blob V2 external references should be supported

2. merge_columns / merge_columns_df Cannot Add Blob V2 Columns Yet

merge_columns_df has a fast path that writes raw .lance files via LanceFileWriter and stitches those files into fragment metadata.

That path currently handles ordinary scalar / string / numeric columns, but it does not apply the Blob V2 write policy used by write_lance.

Specifically:

  • there is no blob_columns option for merge_columns_df
  • new binary columns are not promoted to lance.blob.v2
  • the fast path does not write Blob V2 sidecar data or descriptor metadata
  • the slow keyed-join path also does not currently expose Blob V2-specific schema handling

Expected future behavior:

daft_lance.merge_columns_df(
    df,
    uri,
    blob_columns=["new_blob_column"],
)

should add new_blob_column as a logical Blob V2 field.

Implementation options:

  • teach the fast path to write Blob V2-compatible files and metadata
  • disable the fast path when Blob V2 columns are requested and route to a safe Blob V2-aware path
  • reuse / generalize the existing BlobV2WritePolicy from the write sink
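The second option (bypassing the fast path when Blob V2 columns are requested) could look roughly like the routing sketch below. `choose_merge_path`, the path labels, and the string type names are assumptions for illustration; the real implementation would inspect Daft or Arrow types instead.

```python
from typing import Mapping, Sequence

# Column types that write_lance can already promote to Blob V2.
BLOB_ELIGIBLE_TYPES = {"binary", "large_binary"}


def choose_merge_path(
    new_column_types: Mapping[str, str],
    blob_columns: Sequence[str] = (),
    fast_path_eligible: bool = True,
) -> str:
    """Pick "fast", "keyed_join", or "blob_aware" for a merge request."""
    for name in blob_columns:
        if name not in new_column_types:
            raise ValueError(f"blob column {name!r} is not among the new columns")
        if new_column_types[name] not in BLOB_ELIGIBLE_TYPES:
            raise ValueError(
                f"blob column {name!r} has type {new_column_types[name]!r}; "
                "expected binary or large_binary"
            )
    if blob_columns:
        # The fast path writes raw files without Blob V2 sidecar data, so skip it.
        return "blob_aware"
    return "fast" if fast_path_eligible else "keyed_join"
```

Failing early on a non-binary or unknown column keeps a misconfigured call from silently writing an ordinary binary column.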

Acceptance criteria:

  • adding a new binary column as Blob V2 produces a Lance schema field with type lance.blob.v2
  • existing columns are unchanged
  • row alignment is preserved
  • the new Blob V2 column can be read as descriptors through read_lance
  • the new Blob V2 column can be materialized through take_blobs

3. Compaction After Deletion Materialization Fails for Blob V2

A local regression test currently shows the following flow failing:

  1. write a multi-fragment Lance dataset with Blob V2 columns
  2. verify descriptor reads
  3. verify byte materialization through take_blobs
  4. delete one row
  5. run compact_files(..., materialize_deletions=True)
  6. attempt to verify remaining Blob V2 bytes

The failure occurs during compaction execution / decoding with an error like:

there were more fields in the schema than provided column indices / infos

This needs investigation to determine whether the root cause is:

  • incorrect schema / field metadata passed through daft-lance
  • incompatibility in the distributed compaction task execution path
  • upstream Lance compaction behavior for Blob V2 columns
  • a mismatch between Blob V2 sidecar metadata and compacted fragment metadata

Expected future behavior:

  • compaction should preserve Blob V2 descriptors
  • compaction should preserve materialized bytes exactly
  • deleted rows should be materialized away when requested
  • remaining rows should still be readable through both descriptor scans and take_blobs
  • fragment count should decrease according to compaction options
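The expectations above can be expressed as a check over plain Python data that a regression test could call after compaction. This is a sketch: `check_compaction_result` is a hypothetical helper, and rows are keyed by a user-level key column rather than _rowid, on the assumption that row IDs may change when fragments are rewritten.

```python
from typing import Dict, Hashable, Set


def check_compaction_result(
    before: Dict[Hashable, bytes],
    deleted_keys: Set[Hashable],
    after: Dict[Hashable, bytes],
    fragments_before: int,
    fragments_after: int,
) -> None:
    """Raise AssertionError if compaction lost, resurrected, or altered blobs."""
    expected = {k: v for k, v in before.items() if k not in deleted_keys}

    missing = expected.keys() - after.keys()
    assert not missing, f"rows lost by compaction: {sorted(missing, key=repr)}"

    resurrected = after.keys() - expected.keys()
    assert not resurrected, f"unexpected rows: {sorted(resurrected, key=repr)}"

    for key, blob in expected.items():
        assert after[key] == blob, f"bytes changed for key {key!r}"

    # Compaction should never leave more fragments than it started with.
    assert fragments_after <= fragments_before, "fragment count increased"
```

In a real test, `before` and `after` would be built from take_blobs output on each side of compact_files(..., materialize_deletions=True).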

4. Cleanup / Vacuum Behavior Is Not Defined at the daft-lance Layer

There is currently no explicit daft-lance API for Blob V2 cleanup / vacuum workflows.

Open questions:

  • Should daft-lance expose a cleanup wrapper, or should users call pylance APIs directly?
  • How should old versions and orphaned Blob V2 sidecar files be cleaned?
  • What is the expected behavior after compaction?
  • Do object-store directory semantics require extra documentation?
  • How should cleanup interact with time travel / retained versions?

Expected future behavior:

  • document the supported cleanup story
  • expose a wrapper if there is a stable pylance API we should delegate to
  • clarify whether Blob V2 sidecar files are covered by existing cleanup APIs
  • add tests where possible
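If a wrapper is exposed, it could be as thin as the delegate sketched below. It assumes the dataset object offers pylance's cleanup_old_versions(older_than=...) method; the wrapper name and the 14-day value are illustrative, and whether Blob V2 sidecar files are covered by that call is exactly the open question above.

```python
from datetime import timedelta
from typing import Any


def cleanup_old_versions(dataset: Any, older_than: timedelta = timedelta(days=14)) -> Any:
    """Hypothetical daft-lance wrapper that delegates cleanup to the dataset object."""
    if older_than < timedelta(0):
        raise ValueError("older_than must be non-negative")
    # Delegate to the underlying dataset; no Blob V2 specific handling happens here.
    return dataset.cleanup_old_versions(older_than=older_than)
```

Keeping the wrapper free of Blob V2 logic means any sidecar handling stays in one place upstream, and the wrapper can be tested with a stand-in object.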

Proposed Development Plan

Phase 1: Lock Down Read and Materialization Behavior

  • Keep descriptor reads lazy by default.
  • Prefer _rowid + take_blobs(..., ids=...) for materialization.
  • Add / maintain tests for:
    • inline Blob V2
    • packed Blob V2
    • dedicated Blob V2
    • nullable Blob V2
    • multi-fragment datasets
    • filtered / non-contiguous row materialization

Phase 2: Expand Write Support

Expose the Blob V2 write options needed for external / reference workflows.

Potential API additions:

df.write_lance(
    uri,
    blob_columns=["blob"],
    data_storage_version="2.2",
    initial_bases=[...],
    target_bases=[...],
    base_store_params={...},
    allow_external_blob_outside_bases=True,
)

Tests should cover:

  • inline bytes
  • packed bytes
  • dedicated blobs
  • external full URI references
  • external range references
  • nullable values
  • append mode
  • multiple Blob V2 columns
  • mixed scalar / vector / Blob V2 datasets

Phase 3: Add Blob V2-Aware Add Columns Support

Extend merge_columns / merge_columns_df so they can add Blob V2 columns intentionally.

Possible API:

daft_lance.merge_columns_df(
    df,
    uri,
    blob_columns=["new_blob"],
)

Validation should include:

  • fast path behavior
  • slow path behavior
  • multi-fragment datasets
  • existing columns unchanged
  • new Blob V2 column materializes correctly
  • schema metadata remains valid

Phase 4: Fix or Upstream Compaction Issue

Add a regression test for:

write Blob V2 dataset
delete rows
compact with materialize_deletions=True
verify descriptors
verify materialized bytes

If the issue is upstream in Lance, file / link an upstream issue with a minimal pylance-only reproducer.

If the issue is in daft-lance, fix the schema / metadata handling in the compaction path.

Phase 5: Define Cleanup Semantics

Investigate Lance cleanup APIs and decide whether daft-lance should expose them.

At minimum, document:

  • what cleanup operation users should call
  • whether Blob V2 sidecar files are included
  • interaction with old versions / time travel
  • object-store caveats

Suggested Regression Tests

Blob V2 Read / Write / Delete / Compact Round Trip

def test_blob_v2_read_write_delete_and_compaction_round_trip(...):
    # write inline / packed / dedicated Blob V2 across multiple fragments
    # read descriptors through daft.read_lance
    # materialize bytes through take_blobs using _rowid
    # delete one row
    # run compact_files(..., materialize_deletions=True)
    # verify remaining row count, fragment count, descriptors, and bytes

Add Blob V2 Column Through merge_columns_df

def test_merge_columns_df_adds_blob_v2_column(...):
    # create non-blob Lance dataset
    # read with _rowaddr and fragment_id
    # add a new binary column in Daft
    # merge it back as Blob V2
    # verify Lance schema field is lance.blob.v2
    # verify bytes materialize correctly

External Blob V2 Write With Registered Base

def test_write_lance_external_blob_v2_with_registered_base(...):
    # create external object
    # write Blob.from_uri / range reference
    # verify descriptor kind == external
    # verify base path metadata
    # verify materialized bytes

Acceptance Criteria

Blob V2 support can be considered complete when:

  • descriptor reads are stable
  • take_blobs materialization works for multi-fragment datasets
  • inline / packed / dedicated writes work through write_lance
  • external / reference writes work through write_lance
  • appending to existing Blob V2 datasets works
  • adding new Blob V2 columns works through merge/add-column APIs
  • compaction preserves Blob V2 bytes and metadata
  • delete + compaction works
  • cleanup / vacuum behavior is documented or exposed
  • tests cover local filesystem and, where feasible, object-store configurations
