[refactor](table) Refactor table and file reader#63893

Draft
Gabriel39 wants to merge 38 commits into master from refact_reader_branch

Conversation

@Gabriel39
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

suxiaogang223 and others added 30 commits May 18, 2026 15:52
Co-authored-by: Socrates <suyiteng@selectdb.com>
Problem Summary: NewParquetReaderTest only populated
FileLocalFilter::predicates for local predicate filtering. Parquet row
group pruning still uses predicates, while row filtering now uses
conjunct, so the tests need to populate both with matching semantics.
Issue Number: None

Related PR: None

Problem Summary: Add file-local nested schema metadata and projection plumbing for the new Arrow Parquet reader. Struct child projection is now pushed into the Parquet column reader factory, table scan schema is rebuilt from projected complex types, and the mapper preserves path metadata for future complex schema change handling while explicitly rejecting unsupported child schema evolution for now.

Release note: None

- Test: Unit Test
    - Added BE unit coverage for struct projection, nested schema path metadata, and table mapper complex projection generation.
    - Ran clang-format 16 dry-run on modified C++ files.
    - Ran git diff --check.
    - Attempted ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*:TableColumnMapperTest.*:NewParquetReaderTest.*:FileReaderTest.*', but local CMake compiler sanity check failed before Doris code compilation because ld could not find library 'c++'.
- Behavior changed: No
- Does this need documentation: Yes (included docs/doris-arrow-parquet-complex-types-implementation.md)
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix the non-BE_TEST build failure caused by calling the test-only set_node_type helper from the VSlotRef protected constructor.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran clang-format dry-run and git diff --check for the modified header. Fedora DEBUG BE build was run and exposed the fixed compile failure; full build will be rerun after syncing this commit.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix a DEBUG build failure in the new parquet reader by asserting the read batch size before converting it to the selection vector row count type.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran clang-format dry-run and git diff --check for the modified parquet reader file. Fedora DEBUG BE build was run and exposed the fixed compile failure; full build will be rerun after syncing this commit.
- Behavior changed: No
- Does this need documentation: No
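
The fix above asserts the batch size before narrowing it to the selection vector's row count type. A minimal sketch of that pattern follows. The 16-bit `SelRowCount` alias and both helper names are hypothetical, since the actual types are not shown in this commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical row-count type; the real selection vector type is an assumption here.
using SelRowCount = uint16_t;

// Returns true when batch_size is representable as SelRowCount without truncation.
inline bool fits_sel_row_count(size_t batch_size) {
    return batch_size <= std::numeric_limits<SelRowCount>::max();
}

// Narrow only after asserting the range, so DEBUG builds catch oversized batches
// and the explicit cast keeps conversion warnings quiet.
inline SelRowCount to_sel_row_count(size_t batch_size) {
    assert(fits_sel_row_count(batch_size));
    return static_cast<SelRowCount>(batch_size);
}
```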
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add the next step of complex type support in the new parquet reader by normalizing standard LIST schema to Array(element), allowing nested leaf RecordReader usage, and reading non-empty LIST<required primitive> columns from repetition levels.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran clang-format dry-run and git diff --check for modified files.
    - Ran BUILD_TYPE=DEBUG ./build.sh --be on Fedora successfully with the patch applied.
    - Attempted ParquetColumnReaderTest on Fedora, but stopped the ASAN_UT build because it triggered a fresh full UT build; no test binary execution result was produced.
- Behavior changed: Yes. The new parquet reader can now read a limited non-empty LIST<required primitive> shape and reports NotSupported for unsupported list shapes instead of rejecting all LIST columns.
- Does this need documentation: No
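
As an illustration of reading a non-empty LIST<required primitive> column from repetition levels, the sketch below derives array end offsets from a flat level stream. It assumes every list has at least one element, so each level entry maps to exactly one value. The function name is hypothetical and this is not the PR's reader code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds end offsets for a LIST<required primitive> column whose lists are all non-empty.
// A repetition level of 0 starts a new row; a level of 1 continues the current list.
// offsets[i] is the position one past the last element of row i.
inline std::vector<uint64_t> build_list_offsets(const std::vector<int16_t>& rep_levels) {
    std::vector<uint64_t> offsets;
    uint64_t num_values = 0;
    for (int16_t rep : rep_levels) {
        assert(rep == 0 || rep == 1);
        if (rep == 0 && num_values > 0) {
            offsets.push_back(num_values); // close the previous row
        }
        ++num_values;
    }
    if (num_values > 0) {
        offsets.push_back(num_values); // close the last row
    }
    return offsets;
}
```

For the levels `{0, 1, 1, 0, 0, 1}` the rows hold 3, 1, and 2 elements, giving offsets `{3, 4, 6}`.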
Problem Summary: TableReader could map partition columns to physical
file columns before checking split partition values, and
constant/default expression materialization used the file-local block
row count. For scans where partition values should be filled from split
metadata, especially when the file-local block row count differs from
the batch row count, this could produce incorrect materialized columns.
1. ParquetReader reads a range of a parquet file
2. ParquetReader supports virtual column reader (RowPosition)
3. IcebergReader supports virtual columns
Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add initial new Parquet MAP reader support for required scalar key/value entries and normalize MAP key_value schema metadata for future complex projection work.

Release note: None

- Test: Unit Test / Manual test
    - Added parquet column reader unit test coverage for required Map(Int32, String).
    - Ran git diff --check.
    - BE build will be verified on Fedora after push.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add shared LIST/MAP level assembly for scalar nested Parquet children, including null parent rows, empty collections, nullable element/value slots, overflow buffering, and parent-row skip/select semantics.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check. BE DEBUG build will be run on Fedora after push.
- Behavior changed: Yes. New Parquet reader can read LIST/MAP scalar children with null/empty/nullable-child cases.
- Does this need documentation: No
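
The level assembly described above can be sketched for one common shape: an optional LIST with optional elements, where the maximum definition level is 3. Under that assumption, definition level 0 means a null list, 1 an empty list, 2 a null element, and 3 a present element. The struct and function names are hypothetical, and overflow buffering and skip/select handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AssembledList {
    std::vector<uint8_t> list_is_null;    // one entry per parent row
    std::vector<uint64_t> offsets;        // end offset per parent row
    std::vector<uint8_t> element_is_null; // one entry per element slot
};

// Assumes an optional list of optional elements (max definition level 3)
// and that the level stream starts at a row boundary (first repetition level is 0).
inline AssembledList assemble_list_levels(const std::vector<int16_t>& def_levels,
                                          const std::vector<int16_t>& rep_levels) {
    assert(def_levels.size() == rep_levels.size());
    AssembledList out;
    uint64_t num_slots = 0;
    for (size_t i = 0; i < def_levels.size(); ++i) {
        const int16_t def = def_levels[i];
        if (rep_levels[i] == 0) {
            // Start a new parent row; it is empty until an element slot is seen.
            out.list_is_null.push_back(def == 0 ? 1 : 0);
            out.offsets.push_back(num_slots);
        }
        if (def >= 2) {
            // def == 2 is a null element slot, def == 3 is a present element.
            assert(!out.offsets.empty());
            out.element_is_null.push_back(def == 2 ? 1 : 0);
            ++num_slots;
            out.offsets.back() = num_slots;
        }
    }
    return out;
}
```

The rows `NULL, [], [1, NULL], [5]` correspond to definition levels `{0, 1, 3, 2, 3}` and repetition levels `{0, 0, 0, 1, 0}`.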
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix warning-as-error failures in the new parquet map reader caused by shadowed local names and aggregate initialization after adding the nested level assembler.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check locally. Fedora DEBUG BE build will be run after pushing this commit.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Support parquet STRUCT reading with scalar children through definition-level assembly, including nullable parent struct handling and projected struct child reads.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check locally.
    - Ran BUILD_TYPE=DEBUG ./build.sh --be on Fedora.
- Behavior changed: Yes
    - New parquet reader now supports nullable STRUCT columns with scalar children and projected scalar struct children.
- Does this need documentation: No
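
To illustrate definition-level assembly for a nullable STRUCT, the sketch below assumes an optional struct with a single optional scalar child, so the maximum definition level is 2. Level 0 means the struct is null, 1 means the struct is present but the child is null, and 2 means the child has a value. The names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct StructNullMaps {
    std::vector<uint8_t> struct_is_null; // one entry per row
    std::vector<uint8_t> child_is_null;  // one entry per row, aligned with the parent
};

// Derives parent and child null maps from the child's definition levels.
inline StructNullMaps assemble_struct_levels(const std::vector<int16_t>& def_levels) {
    StructNullMaps out;
    out.struct_is_null.reserve(def_levels.size());
    out.child_is_null.reserve(def_levels.size());
    for (int16_t def : def_levels) {
        assert(def >= 0 && def <= 2);
        out.struct_is_null.push_back(def == 0 ? 1 : 0);
        // A child under a null struct is treated as null as well.
        out.child_is_null.push_back(def < 2 ? 1 : 0);
    }
    return out;
}
```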
Add file-layer DeletePredicate execution for Parquet
row positions and wire IcebergTableReader v2 to convert Iceberg position
deletes and deletion vectors into file-local deleted row positions.
Equality delete files are detected and fail explicitly instead of being
silently ignored.
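
A rough sketch of applying file-local deleted row positions to one batch is shown below. It assumes the deleted positions are already sorted and that the batch covers a contiguous range of file row positions. The function name is hypothetical and does not reflect the PR's DeletePredicate interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a keep-filter for a batch covering file rows [first_row, first_row + num_rows).
// deleted_positions holds file-local row positions sorted in ascending order.
inline std::vector<uint8_t> build_keep_filter(const std::vector<int64_t>& deleted_positions,
                                              int64_t first_row, size_t num_rows) {
    assert(std::is_sorted(deleted_positions.begin(), deleted_positions.end()));
    std::vector<uint8_t> keep(num_rows, 1);
    const int64_t end_row = first_row + static_cast<int64_t>(num_rows);
    // Jump to the first deleted position that can fall inside this batch.
    auto it = std::lower_bound(deleted_positions.begin(), deleted_positions.end(), first_row);
    for (; it != deleted_positions.end() && *it < end_row; ++it) {
        keep[static_cast<size_t>(*it - first_row)] = 0;
    }
    return keep;
}
```

With deleted positions `{1, 5, 6, 12}` and a batch of 4 rows starting at row 4, rows 5 and 6 are dropped and the filter is `{1, 0, 0, 1}`.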
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Implement Iceberg equality delete filtering in the v2
Iceberg reader by materializing equality delete keys as delete predicate
expressions and applying them through the file reader filter path.

### Release note

Support reading Iceberg equality delete files in the BE Iceberg reader.

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added EqualityDeletePredicateTest for single-column, multi-column, null matching, and error handling.
    - Manual test: git diff --check.
    - Not run: run-be-ut.sh failed because this environment only has JDK 11 and requires JDK 17; clang-format script failed because llvm@16 is not installed.
- Behavior changed: Yes, Iceberg reader now filters equality-deleted rows.
- Does this need documentation: No
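
The effect of equality delete filtering can be illustrated with a plain set lookup. Note that this swaps in a simpler technique: the commit materializes delete keys as predicate expressions, whereas the sketch below only shows the matching semantics, including a null key matching a null row value. Rows are assumed to be already projected onto the equality columns, values are modeled as `std::optional<int64_t>`, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// One value per equality column; std::nullopt models SQL NULL.
using Key = std::vector<std::optional<int64_t>>;

// Returns a keep-filter: 0 for rows whose key equals any delete key, 1 otherwise.
// A null in a delete key matches a null in the same column of the row.
inline std::vector<uint8_t> apply_equality_deletes(const std::vector<Key>& rows,
                                                   const std::set<Key>& delete_keys) {
    std::vector<uint8_t> keep;
    keep.reserve(rows.size());
    for (const Key& row : rows) {
        // All keys are expected to have the same number of equality columns.
        assert(delete_keys.empty() || row.size() == delete_keys.begin()->size());
        keep.push_back(delete_keys.count(row) == 0 ? 1 : 0);
    }
    return keep;
}
```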

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Update the new parquet reader implementation document with the current complex type support status, validation results, remaining gaps, and next implementation priorities.

### Release note

None

### Check List (For Author)

- Test: No need to test
    - Documentation-only change.
- Behavior changed: No
- Does this need documentation: No
suxiaogang223 and others added 7 commits May 29, 2026 10:01
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add first-stage dictionary predicate pushdown for the new Parquet reader. It conservatively prunes fully dictionary encoded string-like row groups for EQ and IN predicates by evaluating owned dictionary values before reading data pages.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Ran build-support/clang-format.sh on modified BE files.

    - Ran git diff --check.

    - Local targeted BE UT could not run because the Mac toolchain fails CMake compiler detection with ld: library 'c++' not found.

- Behavior changed: No

- Does this need documentation: No
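
The pruning rule described above can be sketched as follows. The idea is that when every data page of a column chunk is dictionary encoded, the dictionary lists every non-null value that can appear, so an EQ or IN predicate that matches no dictionary entry cannot match any row. The function name and parameters are hypothetical and are not the PR's API.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Returns true when the row group can be skipped for `col IN (in_values)`.
// EQ is treated as IN with a single value. The check is conservative: it never
// prunes unless the column chunk is fully dictionary encoded.
inline bool can_prune_by_dictionary(bool fully_dictionary_encoded,
                                    const std::vector<std::string>& dictionary,
                                    const std::unordered_set<std::string>& in_values) {
    assert(!in_values.empty());
    if (!fully_dictionary_encoded) {
        return false; // plain-encoded pages may hold values missing from the dictionary
    }
    for (const std::string& value : dictionary) {
        if (in_values.count(value) > 0) {
            return false; // at least one row may match, so the row group must be read
        }
    }
    return true;
}
```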
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Avoid relying on an unavailable Arrow Parquet ColumnReader::ReadDictionary API by reading the dictionary page directly and decoding PLAIN byte array dictionaries for row group pruning.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Ran build-support/clang-format.sh on parquet_statistics.cpp.

    - Ran git diff --check.

    - Fedora DEBUG BE build is rerun after this fix.

- Behavior changed: No

- Does this need documentation: No
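
For reference, PLAIN-encoded BYTE_ARRAY values in Parquet are stored as a 4-byte little-endian length followed by that many bytes. A minimal decoding sketch over an already decompressed dictionary page buffer is shown below. The function name and signature are hypothetical, not the code in parquet_statistics.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes num_values PLAIN BYTE_ARRAY entries from data[0, size).
// Returns false if the buffer is truncated; *out is only meaningful on success.
inline bool decode_plain_byte_array_dict(const uint8_t* data, size_t size, size_t num_values,
                                         std::vector<std::string>* out) {
    assert(out != nullptr);
    out->clear();
    size_t pos = 0;
    for (size_t i = 0; i < num_values; ++i) {
        if (size - pos < 4) {
            return false; // not enough bytes for the length prefix
        }
        const uint32_t len = static_cast<uint32_t>(data[pos]) |
                             (static_cast<uint32_t>(data[pos + 1]) << 8) |
                             (static_cast<uint32_t>(data[pos + 2]) << 16) |
                             (static_cast<uint32_t>(data[pos + 3]) << 24);
        pos += 4;
        if (size - pos < len) {
            return false; // value bytes run past the end of the buffer
        }
        out->emplace_back(reinterpret_cast<const char*>(data + pos), len);
        pos += len;
    }
    return true;
}
```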
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Construct Arrow list arrays in ParquetColumnReaderTest with explicit element field nullability so the generated arrays match the declared table schema.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Ran build-support/clang-format.sh on parquet_column_reader_test.cpp.

    - Ran git diff --check.

    - Fedora ParquetColumnReaderTest is rerun after this fix.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Support required nested scalar leaves that Arrow RecordReader reports without level buffers, and only consume materialized values when nested definition levels reach the leaf max definition level.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check. Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Avoid stale level buffers for required nested leaves and preserve nullable nested scalar value slot mapping expected by Arrow RecordReader.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check. Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Read scalar struct children as row-aligned slots so nullable parent struct rows keep child value buffers aligned.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check. Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No
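
One way to picture row-aligned child slots: values are consumed only where the definition level reaches the leaf's maximum, and every other row keeps a default-filled slot so the child buffer stays aligned with the parent's rows. The sketch below assumes an `int32_t` child and compact (dense) input values. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands compact child values into one slot per row.
// A value is consumed only when def_levels[i] == max_def_level; other rows
// (null parent or null child) keep a zero placeholder.
inline std::vector<int32_t> expand_to_row_slots(const std::vector<int32_t>& compact_values,
                                                const std::vector<int16_t>& def_levels,
                                                int16_t max_def_level) {
    std::vector<int32_t> slots(def_levels.size(), 0);
    size_t next = 0;
    for (size_t i = 0; i < def_levels.size(); ++i) {
        if (def_levels[i] == max_def_level) {
            assert(next < compact_values.size());
            slots[i] = compact_values[next++];
        }
    }
    assert(next == compact_values.size()); // every materialized value was placed
    return slots;
}
```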
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add a metadata-backed MIN/MAX aggregate pushdown path
for external Parquet readers and gate Iceberg v2 aggregate pushdown when
delete files are present.

### Release note

Support min/max aggregate pushdown for eligible external Parquet scans.

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added AggregateReaderTest and ParquetReaderTest.minmax_pushdown_from_statistics.
    - Manual test: git diff --check and git diff --cached --check.
    - Not run: run-be-ut.sh failed because this environment only has JDK 11 and requires JDK 17; clang-format script failed because llvm@16 is not installed.
- Behavior changed: Yes, eligible Parquet scans can return min/max aggregate rows from footer statistics; unsafe Iceberg delete-file scans disable aggregate pushdown.
- Does this need documentation: No
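
A simplified sketch of the metadata-backed MIN/MAX path is shown below, assuming `int64_t` statistics per row group. It falls back (returns `std::nullopt`) when delete files are present, when any row group lacks statistics, or when there are no row groups. The fallback for an empty file is a choice made for this sketch, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

struct RowGroupStats {
    bool has_min_max; // false when the footer carries no usable statistics
    int64_t min;
    int64_t max;
};

struct MinMax {
    int64_t min;
    int64_t max;
};

// Combines per-row-group footer statistics into a file-level MIN/MAX.
// Returns std::nullopt when the result cannot be trusted and data must be scanned.
inline std::optional<MinMax> min_max_from_statistics(const std::vector<RowGroupStats>& row_groups,
                                                     bool has_delete_files) {
    if (has_delete_files || row_groups.empty()) {
        return std::nullopt; // deleted rows could change the true min/max
    }
    MinMax result{std::numeric_limits<int64_t>::max(), std::numeric_limits<int64_t>::min()};
    for (const RowGroupStats& rg : row_groups) {
        if (!rg.has_min_max) {
            return std::nullopt;
        }
        assert(rg.min <= rg.max);
        result.min = std::min(result.min, rg.min);
        result.max = std::max(result.max, rg.max);
    }
    return result;
}
```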

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Gabriel39 marked this pull request as draft May 29, 2026 06:39
Gabriel39 added a commit to Gabriel39/incubator-doris that referenced this pull request May 29, 2026
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added BE UT cases in table_reader_test and parquet_reader_test.
    - Ran git diff --check.
    - Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17.
- Behavior changed: No
- Does this need documentation: No

3 participants