[refactor](table) Refactor table and file reader #63893
Draft
Gabriel39 wants to merge 38 commits into
Co-authored-by: Socrates <suyiteng@selectdb.com>
Problem Summary: NewParquetReaderTest only populated FileLocalFilter::predicates for local predicate filtering. Parquet row group pruning still uses predicates, while row filtering now uses conjunct, so the tests need to populate both with matching semantics.
Issue Number: None
Related PR: None
Problem Summary: Add file-local nested schema metadata and projection plumbing for the new Arrow Parquet reader. Struct child projection is now pushed into the Parquet column reader factory, table scan schema is rebuilt from projected complex types, and the mapper preserves path metadata for future complex schema change handling while explicitly rejecting unsupported child schema evolution for now.
None
- Test: Unit Test
- Added BE unit coverage for struct projection, nested schema path metadata, and table mapper complex projection generation.
- Ran clang-format 16 dry-run on modified C++ files.
- Ran git diff --check.
- Attempted ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*:TableColumnMapperTest.*:NewParquetReaderTest.*:FileReaderTest.*', but local CMake compiler sanity check failed before Doris code compilation because ld could not find library 'c++'.
- Behavior changed: No
- Does this need documentation: Yes (included docs/doris-arrow-parquet-complex-types-implementation.md)
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Fix the non-BE_TEST build failure caused by calling the test-only set_node_type helper from the VSlotRef protected constructor.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran clang-format dry-run and git diff --check for the modified header. Fedora DEBUG BE build was run and exposed the fixed compile failure; full build will be rerun after syncing this commit.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Fix a DEBUG build failure in the new parquet reader by asserting the read batch size before converting it to the selection vector row count type.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran clang-format dry-run and git diff --check for the modified parquet reader file. Fedora DEBUG BE build was run and exposed the fixed compile failure; full build will be rerun after syncing this commit.
- Behavior changed: No
- Does this need documentation: No
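The kind of fix described in this commit can be illustrated with a minimal sketch. It assumes the selection vector counts rows with a 16-bit type; `SelRowCount` and `to_sel_row_count` are hypothetical names for illustration and are not taken from the Doris source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical row count type of a selection vector (an assumption for this sketch).
using SelRowCount = uint16_t;

// Convert a batch size to the narrower row count type.
// The assert states the precondition; the static_cast makes the narrowing explicit.
inline SelRowCount to_sel_row_count(size_t batch_size) {
    assert(batch_size <= std::numeric_limits<SelRowCount>::max());
    return static_cast<SelRowCount>(batch_size);
}
```

Keeping the check next to the cast means an oversized batch fails loudly in DEBUG builds instead of silently wrapping.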
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add the next step of complex type support in the new parquet reader by normalizing standard LIST schema to Array(element), allowing nested leaf RecordReader usage, and reading non-empty LIST<required primitive> columns from repetition levels.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran clang-format dry-run and git diff --check for modified files.
- Ran BUILD_TYPE=DEBUG ./build.sh --be on Fedora successfully with the patch applied.
- Attempted ParquetColumnReaderTest on Fedora, but stopped the ASAN_UT build because it triggered a fresh full UT build; no test binary execution result was produced.
- Behavior changed: Yes. The new parquet reader can now read a limited non-empty LIST<required primitive> shape and reports NotSupported for unsupported list shapes instead of rejecting all LIST columns.
- Does this need documentation: No
Problem Summary: TableReader could map partition columns to physical file columns before checking split partition values, and constant/default expression materialization used the file-local block row count. For scans where partition values should be filled from split metadata, especially when the file-local block row count differs from the batch row count, this could produce incorrect materialized columns.
1. ParquetReader reads a range of a Parquet file.
2. ParquetReader supports a virtual column reader (RowPosition).
3. IcebergReader supports virtual columns.
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add initial new Parquet MAP reader support for required scalar key/value entries and normalize MAP key_value schema metadata for future complex projection work.
None
- Test: Unit Test / Manual test
- Added parquet column reader unit test coverage for required Map(Int32, String).
- Ran git diff --check.
- BE build will be verified on Fedora after push.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add shared LIST/MAP level assembly for scalar nested Parquet children, including null parent rows, empty collections, nullable element/value slots, overflow buffering, and parent-row skip/select semantics.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check. BE DEBUG build will be run on Fedora after push.
- Behavior changed: Yes. New Parquet reader can read LIST/MAP scalar children with null/empty/nullable-child cases.
- Does this need documentation: No
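The level assembly described in this commit can be sketched for a single shape, `optional group (LIST) { repeated group list { optional int32 element } }`. This is a simplified illustration of the general repetition/definition level technique, not the PR's assembler: the type `ListColumn` and the function `assemble_list` are hypothetical, and overflow buffering and skip/select handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Output of assembling one optional LIST<optional int32> column.
struct ListColumn {
    std::vector<uint64_t> offsets{0};      // offsets[i]..offsets[i+1] delimit row i
    std::vector<uint8_t> list_is_null;     // 1 if the parent list of row i is NULL
    std::vector<uint8_t> element_is_null;  // 1 if the element slot is NULL
    std::vector<int32_t> elements;         // one slot per element, 0 for NULL slots
};

// Level meaning for the shape above:
//   def 0: list is NULL, def 1: list is empty, def 2: element is NULL, def 3: element present.
//   rep 0: first entry of a new row, rep 1: another element of the current row.
// `values` holds only the present (def == 3) values, in order.
inline ListColumn assemble_list(const std::vector<int16_t>& def_levels,
                                const std::vector<int16_t>& rep_levels,
                                const std::vector<int32_t>& values) {
    assert(def_levels.size() == rep_levels.size());
    assert(rep_levels.empty() || rep_levels[0] == 0);
    ListColumn out;
    size_t next_value = 0;
    for (size_t i = 0; i < def_levels.size(); ++i) {
        if (rep_levels[i] == 0) {
            // Start a new parent row; it stays empty until an element slot is appended.
            out.offsets.push_back(out.offsets.back());
            out.list_is_null.push_back(def_levels[i] == 0 ? 1 : 0);
        }
        if (def_levels[i] >= 2) {
            // An element slot exists; it carries a value only at the max definition level.
            bool present = def_levels[i] == 3;
            out.element_is_null.push_back(present ? 0 : 1);
            out.elements.push_back(present ? values[next_value++] : 0);
            ++out.offsets.back();
        }
    }
    assert(next_value == values.size());
    return out;
}
```

For the rows `[1, 2]`, `NULL`, `[]`, `[NULL, 3]`, the definition levels are `3,3,0,1,2,3` and the repetition levels are `0,1,0,0,0,1`. A null parent and an empty list produce the same offsets and differ only in `list_is_null`.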
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Fix warning-as-error failures in the new parquet map reader caused by shadowed local names and aggregate initialization after adding the nested level assembler.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check locally. Fedora DEBUG BE build will be run after pushing this commit.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Support parquet STRUCT reading with scalar children through definition-level assembly, including nullable parent struct handling and projected struct child reads.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check locally.
- Ran BUILD_TYPE=DEBUG ./build.sh --be on Fedora.
- Behavior changed: Yes
- New parquet reader now supports nullable STRUCT columns with scalar children and projected scalar struct children.
- Does this need documentation: No
Add file-layer DeletePredicate execution for Parquet row positions and wire IcebergTableReader v2 to convert Iceberg position deletes and deletion vectors into file-local deleted row positions. Equality delete files are detected and fail explicitly instead of being silently ignored.
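A minimal sketch of how file-local deleted row positions could be turned into a per-batch row filter is shown here. It assumes the positions are already sorted and that each batch covers a contiguous row range; `build_position_delete_filter` is a hypothetical helper, not the DeletePredicate interface from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a keep-filter for one batch covering file rows [first_row, first_row + num_rows).
// `deleted_positions` must be sorted ascending and holds file-local row positions.
inline std::vector<uint8_t> build_position_delete_filter(
        const std::vector<int64_t>& deleted_positions, int64_t first_row, size_t num_rows) {
    assert(std::is_sorted(deleted_positions.begin(), deleted_positions.end()));
    std::vector<uint8_t> keep(num_rows, 1);
    const int64_t end_row = first_row + static_cast<int64_t>(num_rows);
    // Jump to the first deleted position inside the batch, then walk until the batch ends.
    auto it = std::lower_bound(deleted_positions.begin(), deleted_positions.end(), first_row);
    for (; it != deleted_positions.end() && *it < end_row; ++it) {
        keep[static_cast<size_t>(*it - first_row)] = 0;
    }
    return keep;
}
```

With deleted positions `{1, 5, 6, 12}` and a batch covering rows 4 to 7, only rows 4 and 7 are kept.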
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Implement Iceberg equality delete filtering in the v2
Iceberg reader by materializing equality delete keys as delete predicate
expressions and applying them through the file reader filter path.
### Release note
Support reading Iceberg equality delete files in the BE Iceberg reader.
### Check List (For Author)
- Test: Unit Test / Manual test
- Added EqualityDeletePredicateTest for single-column, multi-column,
null matching, and error handling.
- Manual test: git diff --check.
- Not run: run-be-ut.sh failed because this environment only has JDK 11
and requires JDK 17; clang-format script failed because llvm@16 is not
installed.
- Behavior changed: Yes, Iceberg reader now filters equality-deleted
rows.
- Does this need documentation: No
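The equality delete filtering described in this commit can be illustrated with a small sketch that compares materialized keys directly. It is an assumption-laden simplification: keys are restricted to nullable 64-bit integers, `Key` and `apply_equality_deletes` are hypothetical names, and the PR itself builds delete predicate expressions rather than a key set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// One equality key: a value per equality column; std::nullopt stands for NULL.
using Key = std::vector<std::optional<int64_t>>;

// Return a keep-filter: a data row is dropped when its key equals any delete key.
// Two empty std::optional values compare equal, so NULL matches NULL here,
// which differs from SQL `=` semantics.
inline std::vector<uint8_t> apply_equality_deletes(const std::vector<Key>& data_keys,
                                                   const std::vector<Key>& delete_keys) {
    std::set<Key> deleted(delete_keys.begin(), delete_keys.end());
    std::vector<uint8_t> keep;
    keep.reserve(data_keys.size());
    for (const Key& key : data_keys) {
        // All keys are expected to cover the same equality columns.
        assert(delete_keys.empty() || key.size() == delete_keys.front().size());
        keep.push_back(deleted.count(key) == 0 ? 1 : 0);
    }
    return keep;
}
```

A multi-column key matches only when every column matches, so `{2, 20}` survives a delete key of `{2, NULL}`.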
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Update the new parquet reader implementation document with the current complex type support status, validation results, remaining gaps, and next implementation priorities.
### Release note
None
### Check List (For Author)
- Test: No need to test
- Documentation-only change.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add first-stage dictionary predicate pushdown for the new Parquet reader. It conservatively prunes fully dictionary encoded string-like row groups for EQ and IN predicates by evaluating owned dictionary values before reading data pages.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on modified BE files.
- Ran git diff --check.
- Local targeted BE UT could not run because the Mac toolchain fails CMake compiler detection with ld: library 'c++' not found.
- Behavior changed: No
- Does this need documentation: No
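The conservative pruning rule described in this commit can be sketched as follows. The struct `ColumnChunkDictionary` and the function `can_prune_by_dictionary` are hypothetical; the sketch only assumes that the reader knows whether every data page of the chunk is dictionary encoded and has the decoded dictionary values at hand.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary of one column chunk, as a reader might collect it.
struct ColumnChunkDictionary {
    bool fully_dictionary_encoded = false;  // every data page uses the dictionary
    std::vector<std::string> values;        // decoded dictionary entries
};

// Return true only when the row group provably contains no row matching
// `col IN (in_list)`; an EQ predicate is an IN list with one value.
// Any uncertainty keeps the row group.
inline bool can_prune_by_dictionary(const ColumnChunkDictionary& chunk,
                                    const std::vector<std::string>& in_list) {
    assert(!in_list.empty());
    if (!chunk.fully_dictionary_encoded) {
        return false;  // some pages may hold values absent from the dictionary
    }
    for (const std::string& wanted : in_list) {
        if (std::find(chunk.values.begin(), chunk.values.end(), wanted) != chunk.values.end()) {
            return false;  // at least one row may match
        }
    }
    return true;
}
```

The check errs toward reading: a chunk that fell back to plain pages is never pruned, even if its dictionary lacks the requested values.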
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Avoid relying on an unavailable Arrow Parquet ColumnReader::ReadDictionary API by reading the dictionary page directly and decoding PLAIN byte array dictionaries for row group pruning.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on parquet_statistics.cpp.
- Ran git diff --check.
- Fedora DEBUG BE build is rerun after this fix.
- Behavior changed: No
- Does this need documentation: No
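Decoding a PLAIN byte array dictionary, as mentioned in this commit, amounts to reading length-prefixed entries. The sketch below assumes an already decompressed page body and follows the Parquet PLAIN encoding for BYTE_ARRAY (a 4-byte little-endian length followed by that many bytes); `decode_plain_byte_array_dictionary` is a hypothetical name, not the function in parquet_statistics.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Decode `num_values` PLAIN-encoded BYTE_ARRAY entries from an uncompressed
// dictionary page body. Returns std::nullopt if the buffer is truncated.
inline std::optional<std::vector<std::string>> decode_plain_byte_array_dictionary(
        const uint8_t* data, size_t size, size_t num_values) {
    assert(data != nullptr || size == 0);
    std::vector<std::string> out;
    size_t pos = 0;  // invariant: pos <= size
    for (size_t i = 0; i < num_values; ++i) {
        if (size - pos < 4) return std::nullopt;
        // 4-byte little-endian length prefix.
        uint32_t len = static_cast<uint32_t>(data[pos]) |
                       (static_cast<uint32_t>(data[pos + 1]) << 8) |
                       (static_cast<uint32_t>(data[pos + 2]) << 16) |
                       (static_cast<uint32_t>(data[pos + 3]) << 24);
        pos += 4;
        if (size - pos < len) return std::nullopt;
        out.emplace_back(reinterpret_cast<const char*>(data + pos), len);
        pos += len;
    }
    return out;
}
```

Returning `std::nullopt` on a short buffer lets the caller fall back to keeping the row group instead of pruning on a partial dictionary.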
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Construct Arrow list arrays in ParquetColumnReaderTest with explicit element field nullability so the generated arrays match the declared table schema.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on parquet_column_reader_test.cpp.
- Ran git diff --check.
- Fedora ParquetColumnReaderTest is rerun after this fix.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Support required nested scalar leaves that Arrow RecordReader reports without level buffers, and only consume materialized values when nested definition levels reach the leaf max definition level.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check. Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Avoid stale level buffers for required nested leaves and preserve nullable nested scalar value slot mapping expected by Arrow RecordReader.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check. Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Read scalar struct children as row-aligned slots so nullable parent struct rows keep child value buffers aligned.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check. Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No
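The row-aligned slot idea in this commit can be sketched for the shape `optional group s { optional int32 a }`. The names `StructChildColumn` and `assemble_struct_child` are hypothetical, and the sketch assumes the decoder hands back a dense value buffer plus one definition level per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-aligned output for `optional group s { optional int32 a }`.
struct StructChildColumn {
    std::vector<uint8_t> struct_is_null;  // 1 if the parent struct of the row is NULL
    std::vector<uint8_t> child_is_null;   // 1 if the child slot of the row is NULL
    std::vector<int32_t> child_values;    // exactly one slot per row
};

// def 0: struct is NULL, def 1: struct present but child NULL, def 2: child present.
// `dense_values` holds only the present (def == 2) values, in row order.
inline StructChildColumn assemble_struct_child(const std::vector<int16_t>& def_levels,
                                               const std::vector<int32_t>& dense_values) {
    StructChildColumn out;
    size_t next_value = 0;
    for (int16_t def : def_levels) {
        bool present = def == 2;
        out.struct_is_null.push_back(def == 0 ? 1 : 0);
        out.child_is_null.push_back(present ? 0 : 1);
        // Rows without a value still get a placeholder slot so that row i of the
        // child buffer always belongs to row i of the parent.
        out.child_values.push_back(present ? dense_values[next_value++] : 0);
    }
    assert(next_value == dense_values.size());
    return out;
}
```

With definition levels `2,0,1,2` and dense values `{10, 20}`, the child buffer becomes `{10, 0, 0, 20}`, so the value 20 stays on row 3 even though rows 1 and 2 carry no value.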
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add a metadata-backed MIN/MAX aggregate pushdown path
for external Parquet readers and gate Iceberg v2 aggregate pushdown when
delete files are present.
### Release note
Support min/max aggregate pushdown for eligible external Parquet scans.
### Check List (For Author)
- Test: Unit Test / Manual test
- Added AggregateReaderTest and
ParquetReaderTest.minmax_pushdown_from_statistics.
- Manual test: git diff --check and git diff --cached --check.
- Not run: run-be-ut.sh failed because this environment only has JDK 11
and requires JDK 17; clang-format script failed because llvm@16 is not
installed.
- Behavior changed: Yes, eligible Parquet scans can return min/max
aggregate rows from footer statistics; unsafe Iceberg delete-file scans
disable aggregate pushdown.
- Does this need documentation: No
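The metadata-backed path and its delete-file gate can be illustrated with a minimal sketch for one INT64 column. `RowGroupStats` and `min_max_from_statistics` are hypothetical names; the sketch assumes per-row-group min/max values have already been read from the footer and that any missing statistic forces a normal scan.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical per-row-group statistics for one INT64 column.
struct RowGroupStats {
    bool has_min_max = false;  // false when the writer omitted statistics
    int64_t min = 0;
    int64_t max = 0;
};

// Return {min, max} computed from metadata only, or std::nullopt when the
// scan must fall back to reading data.
inline std::optional<std::pair<int64_t, int64_t>> min_max_from_statistics(
        const std::vector<RowGroupStats>& row_groups, bool has_delete_files) {
    if (has_delete_files || row_groups.empty()) {
        return std::nullopt;  // a deleted row may own the extreme value
    }
    int64_t lo = std::numeric_limits<int64_t>::max();
    int64_t hi = std::numeric_limits<int64_t>::min();
    for (const RowGroupStats& rg : row_groups) {
        if (!rg.has_min_max) return std::nullopt;
        assert(rg.min <= rg.max);
        lo = std::min(lo, rg.min);
        hi = std::max(hi, rg.max);
    }
    return std::make_pair(lo, hi);
}
```

The gate matters because footer statistics describe the file as written: once delete files apply, the stored minimum or maximum may belong to a row that is no longer visible.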
Gabriel39 added a commit to Gabriel39/incubator-doris that referenced this pull request on May 29, 2026
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63893
Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added BE UT cases in table_reader_test and parquet_reader_test.
- Ran git diff --check.
- Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17.
- Behavior changed: No
- Does this need documentation: No