Fix ice-disk table scans #491
@adsharma could you PTAL? Re: duplicate boundNodes in unordered_map

dataset PR: LadybugDB/dataset#3

@adsharma should we add a get_icebug_disk_supported_version CALL?

We already have |
| std::string primaryKeyName; | ||
| std::string storage; | ||
| std::string storageFormat; | ||
| common::StorageFormat storageFormat = common::StorageFormat::NONE; |
There was a problem hiding this comment.
This is a backward-incompatible change. But since there is no known production usage of the previous string storage format, I think we can get by without a storage version bump.
There was a problem hiding this comment.
Actually, this fails to open a version 41 database:
./build/release/tools/shell/lbug -r /tmp/test1-save.db
Terminate called after throwing an instance of std::bad_alloc: std::bad_alloc
<while deserializing catalog entry>
Either revert or implement in-place upgrade.
There was a problem hiding this comment.
Was it created with a binary built from this branch?
There was a problem hiding this comment.
Databases created with 0.16.0 and with a binary built from main (without this PR) produce the same outcome.
The way kuzu dealt with it was to ask users to export/import db after every release. Since we went several releases with version 40 and backward compat, this would be a good contract to maintain going forward.
There was a problem hiding this comment.
It loads fine for me with databases from both 0.16.1 and 0.15.3. I haven't removed upgradeLegacyStorageFormat from my previous PR.
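For readers following the string-to-enum discussion above, here is a minimal sketch of what mapping a legacy string storage format onto an enum could look like. It is not the PR's `upgradeLegacyStorageFormat`; the enum members other than `NONE`, the legacy string spellings, and the helper name `parseLegacyStorageFormat` are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for common::StorageFormat. Only NONE appears in the
// diff; the other members are invented for this sketch.
enum class StorageFormat : uint8_t { NONE = 0, PARQUET = 1, ICEBUG_DISK = 2 };

// Hypothetical helper: converts a legacy string value read from an old
// catalog into the enum. Returns std::nullopt for unrecognised values so the
// caller can report an error instead of guessing.
std::optional<StorageFormat> parseLegacyStorageFormat(std::string legacy) {
    // Compare case-insensitively.
    std::transform(legacy.begin(), legacy.end(), legacy.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (legacy.empty()) {
        return StorageFormat::NONE; // old entries without a format
    }
    if (legacy == "parquet") { // assumed spelling
        return StorageFormat::PARQUET;
    }
    if (legacy == "icebug-disk") { // assumed spelling
        return StorageFormat::ICEBUG_DISK;
    }
    return std::nullopt;
}
```

A deserializer would call a helper like this only when the on-disk storage version predates the enum field, and read the enum directly otherwise.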
| } | ||
|
|
||
| // Load shared indptr data - thread-safe to read | ||
| if (!indptrFilePath.empty()) { |
There was a problem hiding this comment.
Was this guard significant?
There was a problem hiding this comment.
The indptr and indices paths are validated during the table creation phase.
| // calc current global row index based on assigned row group and local row index within that | ||
| // group | ||
| auto metadata = iceDiskScanState.parquetReader->getMetadata(); | ||
| offset_t startOffset = 0; |
There was a problem hiding this comment.
Is startOffset constant for a given nodeGroupIdx?
There was a problem hiding this comment.
startOffset for a nodeGroupIdx (row group) is calculated just below. We can avoid repeating this calculation by populating startOffsets once in initGlobalStateInternal. I will add that in a refactor after the release; I'm keeping changes minimal for now.
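The precomputation suggested above amounts to an exclusive prefix sum over the per-row-group row counts. A minimal sketch follows, assuming the row counts have already been read from the Parquet metadata into a vector; `computeStartOffsets`, `globalRowIdx`, and the local `offset_t` alias are hypothetical names, not the project's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using offset_t = uint64_t; // stand-in for the project's offset type

// Exclusive prefix sum: startOffsets[i] is the total number of rows in
// row groups 0..i-1, i.e. the global index of the first row of group i.
std::vector<offset_t> computeStartOffsets(const std::vector<uint64_t>& rowGroupNumRows) {
    std::vector<offset_t> startOffsets(rowGroupNumRows.size());
    offset_t running = 0;
    for (std::size_t i = 0; i < rowGroupNumRows.size(); ++i) {
        startOffsets[i] = running;
        running += rowGroupNumRows[i];
    }
    return startOffsets;
}

// Global row index of a row, given its row group and its local index
// within that group.
offset_t globalRowIdx(const std::vector<offset_t>& startOffsets, std::size_t rowGroupIdx,
    offset_t localRowIdx) {
    assert(rowGroupIdx < startOffsets.size());
    return startOffsets[rowGroupIdx] + localRowIdx;
}
```

Computed once during global state initialization, the vector is read-only afterwards, so scan threads could share it without locking.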
|
|
||
| File paths can be relative or absolute and are resolved as `<path-to-dir>/nodes_{tableName}.parquet` for node tables, and `<path-to-dir>/indices_{tableName}.parquet` and `<path-to-dir>/indptr_{tableName}.parquet` for relationship tables. | ||
|
|
||
| Object-store URIs (e.g. `s3://bucket/path`, `https://host/path`) are supported as `storage` values and are passed through unchanged. |
There was a problem hiding this comment.
Why drop this? Should still hold afterwards?
There was a problem hiding this comment.
We can add it back later once we support URIs.
There was a problem hiding this comment.
The VFS has supported URIs for a long time. We haven't touched that code, but we haven't verified them either (because all the s3/httpfs assets were not accessible after kuzu archiving).
I'm sending a small PR to thread the shell through VFS and change path mangling to restore URIs. Fortunately most of it seems to be intact and functional.
There was a problem hiding this comment.
Are you saying the parquet_reader code would work with URIs too?
There was a problem hiding this comment.
Yes - tested via tooling in #493. Same result local file or http:// url.
s3:// should work too once we figure out ordering of lbug -i processing vs cred handling.
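To make the file-naming convention from the documentation diff concrete, here is a minimal sketch that builds the three file names from a storage directory and a table name. The helper names are hypothetical. It joins with plain string concatenation so that a URI prefix such as `s3://bucket/path` is left untouched; whether the real code joins URIs the same way is an assumption.

```cpp
#include <string>

// Hypothetical helper: joins a directory (local path or URI) and a file name
// with exactly one '/' between them, without normalising the prefix.
std::string joinDir(std::string dir, const std::string& fileName) {
    if (!dir.empty() && dir.back() != '/') {
        dir += '/';
    }
    return dir + fileName;
}

// <path-to-dir>/nodes_{tableName}.parquet
std::string nodeTablePath(const std::string& dir, const std::string& tableName) {
    return joinDir(dir, "nodes_" + tableName + ".parquet");
}

struct RelTablePaths {
    std::string indices; // <path-to-dir>/indices_{tableName}.parquet
    std::string indptr;  // <path-to-dir>/indptr_{tableName}.parquet
};

RelTablePaths relTablePaths(const std::string& dir, const std::string& tableName) {
    return RelTablePaths{joinDir(dir, "indices_" + tableName + ".parquet"),
        joinDir(dir, "indptr_" + tableName + ".parquet")};
}
```

`std::filesystem::path` is avoided here on purpose, since lexical normalisation of a path can collapse the `//` in a URI scheme.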
|
|
||
| // Create DataChunk matching the indices parquet file schema | ||
| auto numIndicesColumns = indicesReader->getNumColumns(); | ||
| cachedBatchData = std::make_unique<DataChunk>(numIndicesColumns); |
There was a problem hiding this comment.
Can these allocations be done once on reset() and reused?
There was a problem hiding this comment.
DataChunk doesn't offer a reset out of the box. All it offers is resetAuxiliaryBuffer. We need to manually reset state in DataChunk and other state objects in ValueVectors which requires tinkering with ParquetReader and/or ValueVector. Maybe refactor it later?
| for (uint32_t colIdx = 0; colIdx < numIndicesColumns; ++colIdx) { | ||
| const auto& columnTypeRef = indicesReader->getColumnType(colIdx); | ||
| auto columnType = columnTypeRef.copy(); | ||
| auto vector = std::make_shared<ValueVector>(std::move(columnType), memoryManager); |
|
New dataset PR: LadybugDB/dataset#4
context: #476 (review)