FastPathFragmentWriter crashes on object-store datasets: os.path.getsize() on an s3:// path #20

Description

@ohbh

merge_columns_from_df's fast path (FastPathFragmentWriter) writes the new column's .lance file to object storage correctly, then calls os.path.getsize() on that same s3:// / gs:// / R2 path to record file_size_bytes. os.path.getsize wraps os.stat and only works on the local filesystem, so it raises FileNotFoundError for any non-local dataset. The fast path is therefore broken for all object-store-backed merges.

The slow path (fragment.merge) is unaffected; it writes through Lance's object-store-aware API.

Package/version: daft-lance==0.3.3
File: daft_lance/lance_merge_column.py, FastPathFragmentWriter.__call__

Affected code

# FastPathFragmentWriter.__call__
filename = uuid.uuid4().hex + ".lance"
filepath = os.path.join(self.uri, "data", filename)        # self.uri is the dataset URI, e.g. "s3://bucket/t.lance"
with LanceFileWriter(filepath, tbl.schema,
                     version=f"{file_major}.{file_minor}",
                     storage_options=self.storage_options) as writer:   # writes to S3 fine
    for b in tbl.to_batches():
        writer.write_batch(b)
file_size = os.path.getsize(filepath)                      # ❌ os.stat("s3://...") -> FileNotFoundError
...
new_file_entry = { ..., "file_size_bytes": file_size, ... }

self.uri is passed straight through from _merge_fast_path(df, lance_ds, uri, ...) to FastPathFragmentWriter(lance_ds, str(uri), ...), where uri is the merge target: an s3://, R2, or gs:// URI for cloud datasets.

Steps to reproduce

The fast path runs only when _can_use_fast_path() is True, i.e. all of:

  1. join key is _rowaddr,
  2. df has _rowaddr + fragment_id,
  3. every new column is a primitive (leaf) Arrow type (no list/fixed_size_list/struct),
  4. df covers the whole dataset: len(df.collect()) == lance_ds.count_rows().

# Lance dataset on S3/R2 with N rows.
# df: _rowaddr + fragment_id + one primitive new column, covering all N rows (no filter).
df.write_lance("s3://bucket/table.lance", mode="merge")

Expected behavior

The merge commits successfully and the new fragment file entry records a correct file_size_bytes, regardless of whether the dataset is local or on object storage.

Actual behavior

FileNotFoundError: [Errno 2] No such file or directory: 's3://bucket/table.lance/data/<uuid>.lance'

raised at file_size = os.path.getsize(filepath).

Root cause

os.path.getsize / os.stat does not understand object-store URIs and ignores storage_options. The file is written via Lance's object-store writer, but its size is read back with a local-only API.
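The failure mode can be reproduced without any cloud credentials. The sketch below is illustrative only (the helper name local_getsize_error is hypothetical and not part of daft-lance): it shows that os.path.getsize treats an object-store URI as an ordinary local path and fails with an OSError, which on POSIX systems is typically FileNotFoundError.

```python
import os
from typing import Optional, Type


def local_getsize_error(uri: str) -> Optional[Type[BaseException]]:
    """Return the exception type os.path.getsize raises for `uri`, or None.

    Hypothetical helper for illustration. os.path.getsize calls os.stat,
    which interprets "s3://bucket/key" as a relative local path and never
    consults any storage_options.
    """
    try:
        os.path.getsize(uri)
    except OSError as exc:
        return type(exc)
    return None


if __name__ == "__main__":
    # The bucket and key are placeholders; no network access happens here.
    print(local_getsize_error("s3://bucket/table.lance/data/missing.lance"))
```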

Why it isn't caught today

Fast-path tests run against local Lance datasets, where self.uri is a filesystem path and os.path.getsize is valid. There is no object-store (or mocked-remote) coverage for the fast path.

Proposed fix

Read the size with a storage_options-aware filesystem instead of os.path.getsize, mirroring what daft/io/writer.py::FileWriterBase already does. The lookup must be built from the same storage_options the writer used (a bare pyarrow.fs.FileSystem.from_uri won't pick up custom endpoints or credentials; for example, Cloudflare R2 sets endpoint_override plus keys in storage_options):

# build fs from the same storage_options / io_config used for the write, then:
file_size = fs.get_file_info(path).size

Prefer obtaining the byte count directly from LanceFileWriter if it exposes bytes-written, which avoids the extra stat round-trip entirely.
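A minimal sketch of the filesystem-based lookup follows, assuming pyarrow is available and that storage_options uses the object_store-style keys aws_access_key_id, aws_secret_access_key, aws_endpoint, and aws_region. Those key names, and the helper names build_s3_filesystem, object_path, and remote_file_size, are assumptions for illustration; the real mapping should follow whatever daft-lance already passes to LanceFileWriter.

```python
from urllib.parse import urlparse


def build_s3_filesystem(storage_options: dict):
    """Hypothetical mapping from storage_options to pyarrow's S3FileSystem.

    The option key names below are assumptions; adjust them to the keys the
    writer actually receives.
    """
    from pyarrow import fs as pafs  # imported lazily so the helpers below stay dependency-free

    return pafs.S3FileSystem(
        access_key=storage_options.get("aws_access_key_id"),
        secret_key=storage_options.get("aws_secret_access_key"),
        endpoint_override=storage_options.get("aws_endpoint"),  # e.g. an R2 endpoint
        region=storage_options.get("aws_region"),
    )


def object_path(uri: str) -> str:
    """Turn "s3://bucket/key" into "bucket/key", the form pyarrow's S3FileSystem expects."""
    parsed = urlparse(uri)
    return f"{parsed.netloc}{parsed.path}"


def remote_file_size(fs, uri: str) -> int:
    """Look up the object's size through `fs` instead of os.path.getsize."""
    info = fs.get_file_info(object_path(uri))
    if info.size is None:
        # pyarrow reports a missing object as a FileInfo with no size
        # rather than raising, so surface it explicitly.
        raise FileNotFoundError(uri)
    return info.size
```

With these helpers, the failing line would become something like file_size = remote_file_size(build_s3_filesystem(self.storage_options), filepath). Building the filesystem once per writer, rather than once per file, would avoid repeated client setup.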

Secondary nit: os.path.join(self.uri, "data", filename) only works because no segment is absolute; for a URI, join explicitly with "/" (or a URI-aware joiner).
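One way to make the join explicit is sketched below; join_uri is a hypothetical helper name. The ntpath call shows what os.path.join does on Windows, where it inserts backslashes and would produce an invalid object key.

```python
import ntpath


def join_uri(base: str, *parts: str) -> str:
    """Join URI segments with "/" regardless of the host operating system."""
    cleaned = [base.rstrip("/")] + [p.strip("/") for p in parts]
    return "/".join(cleaned)


if __name__ == "__main__":
    print(join_uri("s3://bucket/t.lance", "data", "abc.lance"))
    # ntpath is the implementation behind os.path on Windows.
    print(ntpath.join("s3://bucket/t.lance", "data", "abc.lance"))
```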

Acceptance criteria

  • write_lance(mode="merge") with an all-primitive, full-coverage df against an s3:// or R2 dataset commits successfully with a correct file_size_bytes.
  • Size lookup honors storage_options (verified against a custom-endpoint backend, e.g. R2).
  • Regression test exercises the fast path on a non-local filesystem (mocked S3 / moto, or a storage_options-routed backend).
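As a cheap complement to a mocked-S3 test, a regression test could also assert that no local stat is ever attempted on a URI. The guard below is a sketch using only the standard library; forbid_local_stat_on_uris is a hypothetical name, and it does not replace end-to-end coverage against moto or a custom-endpoint backend.

```python
import os
from contextlib import contextmanager
from unittest import mock


@contextmanager
def forbid_local_stat_on_uris():
    """Fail loudly if code under test calls os.path.getsize with a URI."""
    real_getsize = os.path.getsize

    def guarded(path):
        if "://" in str(path):
            raise AssertionError(f"os.path.getsize called with a URI: {path}")
        # Local paths still go through the real implementation.
        return real_getsize(path)

    with mock.patch("os.path.getsize", guarded):
        yield
```

A fast-path test would wrap the merge call in "with forbid_local_stat_on_uris():" so that any reintroduction of the local-only size lookup fails the test even when the backend is mocked.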

Labels: bug, object-store · Severity: high (narrow trigger) — crashes any full-coverage, all-primitive-column merge on object storage; local and slow-path merges unaffected.
