Fix ice-disk table scans by aheev · Pull Request #491 · LadybugDB/ladybug

aheev · 2026-05-15T11:51:54Z

Fixed the nodeID offset in node table scans by computing the global row offset from parquet metadata.
Fixed an early-break issue in rel table scans by refactoring the ice-disk internal scan to be full-table based rather than rowGroup based.
Converted STORAGE_FORMAT to an enum.

aheev · 2026-05-15T11:54:49Z

@adsharma could you PTAL?

Re: duplicate boundNodes in unordered_map

Two cases:

 1. Source mode (MATCH (a:user)-[:follows]->(b) — direct node scan child): fetchNextBoundNodeBatch generates unique sequential offsets [nextOffset, nextOffset+N). No duplicates by construction.
 2. Non-source mode (multi-hop (a)-[r1]->(b)-[r2]->(c)):
 - r1's scan processes one source node a at a time (the break when boundOffset != activeBoundOffset)
 - So each call to r1.getNextTuple produces neighbors of exactly one a
 - A single source node's neighbor list has no duplicates in a well-formed CSR file
 - IceDisk node table emits 1 node per scan call. Even if it emits in a batch they would be distinct
 - Therefore r2's bound node vector always has distinct b values in each batch
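The distinctness argument above can be illustrated with a small sketch. It assumes a plain in-memory CSR layout (`indptr`, `indices`); the helper names `neighborsOf` and `allDistinct` are hypothetical and are not taken from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Neighbors of source node `src` in a CSR layout:
// indices[indptr[src] .. indptr[src + 1]).
std::vector<uint64_t> neighborsOf(const std::vector<uint64_t>& indptr,
    const std::vector<uint64_t>& indices, std::size_t src) {
    auto first = indices.begin() + static_cast<std::ptrdiff_t>(indptr.at(src));
    auto last = indices.begin() + static_cast<std::ptrdiff_t>(indptr.at(src + 1));
    return std::vector<uint64_t>(first, last);
}

// True if no value occurs twice in `batch`.
bool allDistinct(const std::vector<uint64_t>& batch) {
    std::unordered_set<uint64_t> seen;
    for (uint64_t v : batch) {
        if (!seen.insert(v).second) {
            return false;
        }
    }
    return true;
}
```

With `indptr = {0, 2, 3, 3}` and `indices = {1, 2, 2}`, the neighbor list of node 0 is `{1, 2}` and is distinct, while the concatenation of the lists for nodes 0 and 1 contains `2` twice. That is why the one-source-node-per-call behavior matters for the argument.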

aheev · 2026-05-15T11:56:57Z

dataset PR: LadybugDB/dataset#3

aheev · 2026-05-16T03:05:45Z

@adsharma should we add a get_icebug_disk_supported_version CALL?

adsharma · 2026-05-16T15:31:21Z

We already have db_version() and storage_version(). Users can detect if icebug-disk is supported by trying ATTACH.

adsharma · 2026-05-16T15:55:03Z

    std::string primaryKeyName;
    std::string storage;
-    std::string storageFormat;
+    common::StorageFormat storageFormat = common::StorageFormat::NONE;


This is a backward-incompatible change. But since we have no known prod usage of the previous string storage format, I think we can survive without a storage version bump.

Actually, this fails to open a version 41 database:

    ./build/release/tools/shell/lbug -r /tmp/test1-save.db
    Terminate called after throwing an instance of std::bad_alloc: std::bad_alloc <while deserializing catalog entry>

Either revert or implement in-place upgrade.

Was it created with the binary from this branch?

DBs created with 0.16.0 and with a binary built from main (without this PR) will have the same outcome.

The way kuzu dealt with it was to ask users to export/import db after every release. Since we went several releases with version 40 and backward compat, this would be a good contract to maintain going forward.

It loads fine for me with both 0.16.1 and 0.15.3. I haven't removed upgradeLegacyStorageFormat from my previous PR.
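One way to picture an in-place upgrade from the old string field to the new enum is a mapping applied while reading older catalogs. The sketch below is illustrative only: the enum member `ICEBUG_DISK`, the legacy spelling `"icebug-disk"`, and the function name `legacyStringToStorageFormat` are assumptions, and this is not the `upgradeLegacyStorageFormat` code from the PR.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Assumed enum shape; only NONE is confirmed by the diff above.
enum class StorageFormat : uint8_t { NONE = 0, ICEBUG_DISK = 1 };

// Hypothetical mapping from the legacy string field to the enum.
// An empty string is treated as "no external storage format".
StorageFormat legacyStringToStorageFormat(const std::string& legacy) {
    if (legacy.empty()) {
        return StorageFormat::NONE;
    }
    if (legacy == "icebug-disk") { // assumed legacy spelling
        return StorageFormat::ICEBUG_DISK;
    }
    // Fail loudly rather than silently mapping unknown values to NONE.
    throw std::invalid_argument("unknown legacy storage format: " + legacy);
}
```

A reader for older catalogs would presumably deserialize the string as before and convert it with a function like this, while newer catalogs store the enum value directly.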

adsharma · 2026-05-16T16:09:33Z

-    }
-
-    // Load shared indptr data - thread-safe to read
-    if (!indptrFilePath.empty()) {


Was this guard significant?

indptr and indices path validation is done during the table creation phase.

adsharma · 2026-05-16T16:11:19Z

+    // calc current global row index based on assigned row group and local row index within that
+    // group
+    auto metadata = iceDiskScanState.parquetReader->getMetadata();
+    offset_t startOffset = 0;


Is startOffset constant for a given nodeGroupIdx?

startOffset for a nodeGroupIdx (rowGroup) is calculated just below. We can avoid this repeated calculation by populating startOffsets in initGlobalStateInternal. I will add it in a refactor post release. Keeping changes minimal right now.
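The precomputation suggested here amounts to an exclusive prefix sum over the per-row-group row counts. A minimal sketch, assuming the row counts have already been read from the parquet metadata into a vector; the names `computeRowGroupStartOffsets` and `globalRowOffset` are hypothetical and do not come from the PR:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// rowsPerGroup[i] is the row count of parquet row group i.
// Returns startOffsets, where startOffsets[i] is the global row index
// of the first row in row group i (exclusive prefix sum).
std::vector<uint64_t> computeRowGroupStartOffsets(const std::vector<uint64_t>& rowsPerGroup) {
    std::vector<uint64_t> startOffsets;
    startOffsets.reserve(rowsPerGroup.size());
    uint64_t running = 0;
    for (uint64_t numRows : rowsPerGroup) {
        startOffsets.push_back(running);
        running += numRows;
    }
    return startOffsets;
}

// Global row index = start offset of the assigned row group + local row index.
uint64_t globalRowOffset(const std::vector<uint64_t>& startOffsets, std::size_t rowGroupIdx,
    uint64_t localRowIdx) {
    return startOffsets.at(rowGroupIdx) + localRowIdx;
}
```

For row groups of 100, 50, and 25 rows, the start offsets are 0, 100, and 150, so local row 7 of row group 2 maps to global row 157. Computing the vector once makes each scan call a single lookup instead of a loop over the metadata.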

adsharma · 2026-05-16T16:13:35Z


 File paths can be relative or absolute and are resolved as `<path-to-dir>/nodes_{tableName}.parquet` for node tables, and `<path-to-dir>/indices_{tableName}.parquet` and `<path-to-dir>/indptr_{tableName}.parquet` for relationship tables.

-Object-store URIs (e.g. `s3://bucket/path`, `https://host/path`) are supported as `storage` values and are passed through unchanged.


Why drop this? Should still hold afterwards?

We can add it later once we support URIs.

The VFS has supported URIs for a long time. We haven't touched that code, but we haven't verified them either (because all the s3/httpfs assets were not accessible after kuzu archiving).

I'm sending a small PR to thread the shell through VFS and change path mangling to restore URIs. Fortunately most of it seems to be intact and functional.

are you saying parquet_reader code would work with URIs too?

Yes - tested via tooling in #493. Same result local file or http:// url.

s3:// should work too once we figure out ordering of lbug -i processing vs cred handling.
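The file naming convention from the documentation excerpt in this thread can be sketched with plain string concatenation, which leaves a URI prefix such as `s3://` or `https://` untouched. The helper names below (`joinPath`, `nodeTablePath`, `relTablePaths`) are hypothetical and are not the project's actual path-resolution code.

```cpp
#include <string>

// Join a directory (or URI prefix) and a file name with exactly one '/'.
std::string joinPath(const std::string& dir, const std::string& file) {
    if (dir.empty()) {
        return file;
    }
    return dir.back() == '/' ? dir + file : dir + "/" + file;
}

// <path-to-dir>/nodes_{tableName}.parquet
std::string nodeTablePath(const std::string& dir, const std::string& tableName) {
    return joinPath(dir, "nodes_" + tableName + ".parquet");
}

struct RelTablePaths {
    std::string indices; // <path-to-dir>/indices_{tableName}.parquet
    std::string indptr;  // <path-to-dir>/indptr_{tableName}.parquet
};

RelTablePaths relTablePaths(const std::string& dir, const std::string& tableName) {
    return {joinPath(dir, "indices_" + tableName + ".parquet"),
        joinPath(dir, "indptr_" + tableName + ".parquet")};
}
```

Under these assumptions, `nodeTablePath("s3://bucket/path", "user")` yields `s3://bucket/path/nodes_user.parquet`, and a relative directory such as `data` yields `data/nodes_user.parquet`.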

adsharma · 2026-05-16T16:17:08Z

+
+    // Create DataChunk matching the indices parquet file schema
+    auto numIndicesColumns = indicesReader->getNumColumns();
+    cachedBatchData = std::make_unique<DataChunk>(numIndicesColumns);


Can these allocations be done once on reset() and reused?

DataChunk doesn't offer a reset out of the box. All it offers is resetAuxiliaryBuffer. We need to manually reset state in DataChunk and other state objects in ValueVectors which requires tinkering with ParquetReader and/or ValueVector. Maybe refactor it later?

adsharma · 2026-05-16T16:17:22Z

+    for (uint32_t colIdx = 0; colIdx < numIndicesColumns; ++colIdx) {
+        const auto& columnTypeRef = indicesReader->getColumnType(colIdx);
+        auto columnType = columnTypeRef.copy();
+        auto vector = std::make_shared<ValueVector>(std::move(columnType), memoryManager);


same as above

aheev · 2026-05-17T05:43:14Z

new dataset PR: LadybugDB/dataset#4

aheev added 4 commits May 14, 2026 15:26

fix minor issues

ff6a916

add self-join tests

989846c

fix rowIndex in icebug node table scan

483c898

fix ice-disk rel table scan

c3a0d84

aheev mentioned this pull request May 15, 2026

update icebug-disk demo-db datasets LadybugDB/dataset#3

Merged

aheev added 2 commits May 15, 2026 20:03

enumize storage format

dbabed3

update dataset submodule

7cd4feb

adsharma reviewed May 16, 2026

aheev added 2 commits May 17, 2026 10:44

fix scans

8f323bc

add ice_disk complex_queries tests

ba4ade6

aheev requested a review from adsharma May 17, 2026 05:48

update dataset submodule

dc0195c

