Description
merge_columns_from_df's fast path (FastPathFragmentWriter) writes the new column's .lance file to object storage correctly, then calls os.path.getsize() on that same s3:// / gs:// / R2 path to
record file_size_bytes. os.path.getsize is os.stat — local filesystem only — so it raises FileNotFoundError for any non-local dataset. The fast path is therefore broken for all object-store-backed
merges.
The slow path (fragment.merge) is unaffected; it writes through Lance's object-store-aware API.
Package/version: daft-lance==0.3.3
File: daft_lance/lance_merge_column.py — FastPathFragmentWriter.__call__
Affected code
# FastPathFragmentWriter.__call__
filename = uuid.uuid4().hex + ".lance"
filepath = os.path.join(self.uri, "data", filename)  # self.uri is the dataset URI, e.g. "s3://bucket/t.lance"
with LanceFileWriter(filepath, tbl.schema,
                     version=f"{file_major}.{file_minor}",
                     storage_options=self.storage_options) as writer:  # writes to S3 fine
    for b in tbl.to_batches():
        writer.write_batch(b)
file_size = os.path.getsize(filepath)  # ❌ os.stat("s3://...") -> FileNotFoundError
...
new_file_entry = { ..., "file_size_bytes": file_size, ... }
self.uri is passed straight through from _merge_fast_path(df, lance_ds, uri, ...) → FastPathFragmentWriter(lance_ds, str(uri), ...), where uri is the merge target: an s3://, gs://, or R2 URI for cloud datasets.
Steps to reproduce
The fast path runs only when _can_use_fast_path() is True, i.e. all of:
- join key is _rowaddr,
- df has _rowaddr + fragment_id,
- every new column is a primitive (leaf) Arrow type (no list/fixed_size_list/struct),
- df covers the whole dataset: len(df.collect()) == lance_ds.count_rows().
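The four conditions above can be paraphrased as a single predicate. This is a dependency-free sketch, not the actual _can_use_fast_path() implementation: MergeInput, its fields, and the use of plain type-name strings in place of Arrow types are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, replace

# Nested type names taken from the list above; the real check may cover more.
NESTED_TYPE_NAMES = {"list", "fixed_size_list", "struct"}
REQUIRED_COLUMNS = {"_rowaddr", "fragment_id"}


@dataclass(frozen=True)
class MergeInput:
    """Hypothetical summary of a merge request (not a daft-lance type)."""
    join_key: str
    columns: dict      # column name -> Arrow type name, e.g. {"score": "float32"}
    df_rows: int       # rows in the incoming DataFrame
    dataset_rows: int  # rows in the target Lance dataset


def can_use_fast_path(m: MergeInput) -> bool:
    """Return True only when all four fast-path conditions hold."""
    new_columns = {n: t for n, t in m.columns.items() if n not in REQUIRED_COLUMNS}
    return (
        m.join_key == "_rowaddr"
        and REQUIRED_COLUMNS.issubset(m.columns)
        and all(t not in NESTED_TYPE_NAMES for t in new_columns.values())
        and m.df_rows == m.dataset_rows
    )


full = MergeInput(
    join_key="_rowaddr",
    columns={"_rowaddr": "uint64", "fragment_id": "uint64", "score": "float32"},
    df_rows=100,
    dataset_rows=100,
)
partial = replace(full, df_rows=99)  # a filtered df falls back to the slow path
```

Any request that fails one condition, such as the filtered `partial` input, takes the slow path and therefore does not hit this bug.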
# Lance dataset on S3/R2 with N rows.
# df: _rowaddr + fragment_id + one primitive new column, covering all N rows (no filter).
df.write_lance("s3://bucket/table.lance", mode="merge")
Expected behavior
The merge commits successfully and the new fragment file entry records a correct file_size_bytes, regardless of whether the dataset is local or on object storage.
Actual behavior
FileNotFoundError: [Errno 2] No such file or directory: 's3://bucket/table.lance/data/<uuid>.lance'
raised at file_size = os.path.getsize(filepath).
Root cause
os.path.getsize / os.stat does not understand object-store URIs and ignores storage_options. The file is written via Lance's object-store writer, but its size is read back with a local-only API.
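The failure mode can be reproduced without Lance or any cloud credentials. The snippet below is a minimal illustration using a made-up URI: os.path.getsize treats the string as a local path, so the call fails before any network access happens.

```python
import os


def local_size_or_error(path: str):
    """Return (size, None) on success, or (None, exc) if os.stat cannot resolve the path."""
    try:
        return os.path.getsize(path), None
    except OSError as exc:  # FileNotFoundError is a subclass of OSError
        return None, exc


# Hypothetical object-store URI, used only for illustration.
remote_uri = "s3://bucket/table.lance/data/example.lance"
size, err = local_size_or_error(remote_uri)
print(type(err).__name__ if err else size)
```

On a POSIX host the string is resolved as a relative local path that does not exist, which matches the FileNotFoundError in the traceback above.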
Why it isn't caught today
Fast-path tests run against local Lance datasets, where self.uri is a filesystem path and os.path.getsize is valid. There is no object-store (or mocked-remote) coverage for the fast path.
Proposed fix
Read the size with a storage_options-aware filesystem instead of os.path.getsize — mirroring what daft/io/writer.py::FileWriterBase already does. The lookup must be built from the same
storage_options the writer used (a bare pyarrow.fs.FileSystem.from_uri won't pick up custom endpoints/credentials — e.g. Cloudflare R2 sets endpoint_override + keys in storage_options):
# build fs from the same storage_options / io_config used for the write, then:
file_size = fs.get_file_info(path).size
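A slightly fuller sketch of that lookup for the S3-compatible case follows. It is an assumption-laden illustration, not the fix as it should be merged: the helper names are hypothetical, the storage_options key names it looks for are guesses at what the writer receives, and only s3:// URIs are handled. The pyarrow calls used (fs.S3FileSystem with access_key, secret_key, session_token, region, scheme, endpoint_override, and get_file_info) are part of pyarrow's public filesystem API.

```python
from urllib.parse import urlsplit


def s3_kwargs_from_storage_options(storage_options):
    """Translate storage_options into pyarrow.fs.S3FileSystem keyword arguments.

    The key names checked here are assumptions; adjust them to what the writer is given.
    """
    opts = {k.lower(): v for k, v in (storage_options or {}).items()}

    def pick(*names):
        # First non-empty value among the candidate key names, else None.
        return next((opts[n] for n in names if opts.get(n)), None)

    kwargs = {
        "access_key": pick("aws_access_key_id", "access_key_id"),
        "secret_key": pick("aws_secret_access_key", "secret_access_key"),
        "session_token": pick("aws_session_token", "session_token"),
        "region": pick("aws_region", "region"),
    }
    endpoint = pick("aws_endpoint", "aws_endpoint_url", "endpoint", "endpoint_url", "endpoint_override")
    if endpoint:
        parts = urlsplit(endpoint)
        if parts.scheme and parts.netloc:
            # Full URL: pass scheme and host[:port] separately.
            kwargs["scheme"] = parts.scheme
            kwargs["endpoint_override"] = parts.netloc
        else:
            # Already a bare "host:port" connect string.
            kwargs["endpoint_override"] = endpoint
    return {k: v for k, v in kwargs.items() if v is not None}


def remote_file_size(uri, storage_options):
    """Stat an s3:// object using credentials and endpoint from storage_options."""
    from pyarrow import fs  # imported lazily so the mapping above needs no pyarrow

    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"this sketch only handles s3:// URIs, got {uri!r}")
    filesystem = fs.S3FileSystem(**s3_kwargs_from_storage_options(storage_options))
    # pyarrow's S3 filesystem expects "bucket/key" without the scheme.
    info = filesystem.get_file_info(parts.netloc + parts.path)
    if info.type != fs.FileType.File:
        raise FileNotFoundError(uri)
    return info.size
```

In FastPathFragmentWriter.__call__ the os.path.getsize(filepath) call would then become something like remote_file_size(filepath, self.storage_options), with a local-path branch kept for non-URI datasets.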
Prefer obtaining the byte count directly from LanceFileWriter if it exposes bytes-written, which avoids the extra stat round-trip entirely.
Secondary nit: os.path.join(self.uri, "data", filename) only works because no segment is absolute; for a URI, join explicitly with "/" (or a URI-aware joiner).
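To make the nit concrete, here is a small stdlib-only comparison. join_uri is a hypothetical helper, and the URI is made up; posixpath and ntpath are the POSIX and Windows implementations behind os.path, so both behaviours can be shown on any host.

```python
import ntpath
import posixpath


def join_uri(base: str, *segments: str) -> str:
    """Join URI path segments with "/" regardless of the host OS."""
    cleaned = [s.strip("/") for s in segments if s.strip("/")]
    return "/".join([base.rstrip("/"), *cleaned])


uri = "s3://bucket/t.lance"  # hypothetical dataset URI

# The POSIX join happens to work while every segment is relative...
ok = posixpath.join(uri, "data", "f.lance")
# ...but an absolute segment silently discards the base URI,
dropped = posixpath.join(uri, "/data", "f.lance")
# and the Windows flavour of os.path.join inserts backslashes.
windows = ntpath.join(uri, "data", "f.lance")

# The explicit joiner is stable in all three situations.
safe = join_uri(uri, "/data/", "f.lance")
```

Here `ok` and `safe` are both "s3://bucket/t.lance/data/f.lance", while `dropped` loses the bucket entirely and `windows` contains backslash separators that an object store would treat as part of the key.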
Acceptance criteria
- write_lance(mode="merge") with an all-primitive, full-coverage df against an s3:// or R2 dataset commits successfully with a correct file_size_bytes.
- The size lookup honors storage_options (verified against a custom-endpoint backend, e.g. R2).
- The fast path has object-store test coverage (e.g. moto, or a storage_options-routed backend).
Labels: bug, object-store · Severity: high (narrow trigger). Crashes any full-coverage, all-primitive-column merge on object storage; local and slow-path merges are unaffected.