[core] introduce Placeholder for Blob File Format #7889

Open
steFaiz wants to merge 7 commits into apache:master from steFaiz:placeholder_blob

@steFaiz steFaiz commented May 18, 2026

Purpose

This is the first part of #7881, including:

  1. Bump the Blob File Format to V2, introducing a PlaceHolder Blob.
  2. Introduce a fallbackReader for blobs that skips placeholders. This is a two-level abstraction:
    a. First, all data files are divided into groups according to max_seq_num.
    b. Within each group, a sequential reader logically concatenates the files and fills the missing gaps. For example, if the full row range of the normal files is [0, 100] but a group has only one file with range [20, 80], the output is: [0, 19] -> filled with placeholders; [20, 80] -> records from the file; [81, 100] -> filled with placeholders.
    c. Create a reader for each group, and read each blob from the group with the highest max_seq_num whose value is NOT a placeholder.

The mechanism is illustrated below:
image
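The two-level read described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: FallbackBlobSketch, BlobFile, readGroup, merge and the byte[] sentinel are hypothetical names, and it assumes that the files of a group are sorted by row, do not overlap, and lie inside the full row range.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the fallback read; none of these names come from the PR. */
final class FallbackBlobSketch {

    /** Sentinel standing in for the placeholder blob; compared by identity. */
    static final byte[] PLACEHOLDER = new byte[0];

    /** One blob file whose rows start at firstRow and are contiguous. */
    static final class BlobFile {
        final long firstRow;
        final List<byte[]> blobs; // may contain null, i.e. a user-visible NULL

        BlobFile(long firstRow, List<byte[]> blobs) {
            this.firstRow = firstRow;
            this.blobs = blobs;
        }

        long lastRow() {
            return firstRow + blobs.size() - 1;
        }
    }

    /** Step b: concatenate one group's files over [start, end], filling gaps with placeholders. */
    static List<byte[]> readGroup(List<BlobFile> files, long start, long end) {
        List<byte[]> out = new ArrayList<>();
        long next = start;
        for (BlobFile file : files) {
            for (; next < file.firstRow; next++) {
                out.add(PLACEHOLDER); // rows missing before this file
            }
            out.addAll(file.blobs);
            next = file.lastRow() + 1;
        }
        for (; next <= end; next++) {
            out.add(PLACEHOLDER); // rows missing after the last file
        }
        return out;
    }

    /**
     * Step c: per row, take the value from the first group (ordered from highest to
     * lowest max_seq_num) that is not a placeholder. If every group holds a
     * placeholder for a row, the placeholder itself is returned for that row.
     */
    static List<byte[]> merge(List<List<byte[]>> groupsNewestFirst) {
        int rows = groupsNewestFirst.get(0).size();
        List<byte[]> out = new ArrayList<>(rows);
        for (int row = 0; row < rows; row++) {
            byte[] value = PLACEHOLDER;
            for (List<byte[]> group : groupsNewestFirst) {
                value = group.get(row);
                if (value != PLACEHOLDER) {
                    break;
                }
            }
            out.add(value);
        }
        return out;
    }
}
```

With the example from the description, calling readGroup on a single file covering [20, 80] with the full range [0, 100] yields 101 entries, where rows 0 to 19 and 81 to 100 are the placeholder and can fall back to an older group in merge.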

Tests

ITCase and Unit tests

@steFaiz steFaiz marked this pull request as draft May 18, 2026 11:17
@steFaiz steFaiz changed the title [core] introduce Placeholder for Blob File Format [wip][core] introduce Placeholder for Blob File Format May 18, 2026
@steFaiz steFaiz marked this pull request as ready for review May 19, 2026 06:19
@steFaiz steFaiz changed the title [wip][core] introduce Placeholder for Blob File Format [core] introduce Placeholder for Blob File Format May 19, 2026
* The placeholder blob, mainly for blob update in data-evolution. It should never be exposed to
* users.
*/
Blob PLACE_HOLDER =
Contributor

This is strange. Maybe just use NULL as the placeholder?

Contributor Author

Thanks for your advice! But in #7125 we support storing nulls in blob files, so I'm not sure how we would distinguish placeholders from native NULLs in that case.

Semantically, NULLs are exposed to users: users know that they stored some nulls. Placeholders, however, are purely internal, and users should never be aware of them. If NULL were the placeholder and users set some rows to null, we might fall back to earlier versions for those rows, which is not what our design intends.

Could you please give me some advice?

Contributor

Perhaps you could use the row number in the blob to determine how to merge? You could just return the valid blobs together with their row numbers.

Contributor

The row number is actually the primary key.

Contributor

I understand that you need this class not only for reading but also for writing. If you skipped these elements, the changes would be significant.

I think you can just introduce a BlobPlaceHolder implements Blob, Serializable for this; using instanceof is better.
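A minimal sketch of what this suggestion could look like. The Blob interface below is a simplified stand-in declared only for this example (the project's real interface is assumed to have more methods), and BlobResolver, INSTANCE and resolve are hypothetical names, not part of the PR.

```java
import java.io.Serializable;

/** Simplified stand-in for the project's Blob interface; an assumption for this sketch. */
interface Blob {
    byte[] toData();
}

/** Marker type for internal placeholders; never meant to be exposed to users. */
final class BlobPlaceHolder implements Blob, Serializable {
    private static final long serialVersionUID = 1L;

    static final BlobPlaceHolder INSTANCE = new BlobPlaceHolder();

    private BlobPlaceHolder() {}

    @Override
    public byte[] toData() {
        // A placeholder carries no payload, so reading it is treated as a bug.
        throw new UnsupportedOperationException("placeholder blob has no data");
    }

    /** Keeps the singleton unique after Java deserialization. */
    private Object readResolve() {
        return INSTANCE;
    }
}

/** Hypothetical helper showing the instanceof check. */
final class BlobResolver {
    /**
     * Returns the newer value unless it is a placeholder. A null newer value is a
     * real user-visible NULL and is returned as is, because null is never an
     * instance of BlobPlaceHolder.
     */
    static Blob resolve(Blob newer, Blob older) {
        return newer instanceof BlobPlaceHolder ? older : newer;
    }
}
```

Because the check is on the type rather than on null, a row that a user explicitly set to NULL stays NULL, and only internal placeholders fall back to the older value.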

Contributor Author

Thanks! I'll modify my code!
