[refactor](table) Refactor table and file reader by Gabriel39 · Pull Request #63893 · apache/doris

Gabriel39 · 2026-05-29T06:39:29Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Co-authored-by: Socrates <suyiteng@selectdb.com>

Problem Summary: NewParquetReaderTest only populated FileLocalFilter::predicates for local predicate filtering. Parquet row group pruning still uses predicates, while row filtering now uses conjunct, so the tests need to populate both with matching semantics.

Issue Number: None

Related PR: None

Problem Summary: Add file-local nested schema metadata and projection plumbing for the new Arrow Parquet reader. Struct child projection is now pushed into the Parquet column reader factory, table scan schema is rebuilt from projected complex types, and the mapper preserves path metadata for future complex schema change handling while explicitly rejecting unsupported child schema evolution for now.

Release note: None

- Test: Unit Test
  - Added BE unit coverage for struct projection, nested schema path metadata, and table mapper complex projection generation.
  - Ran clang-format 16 dry-run on modified C++ files.
  - Ran git diff --check.
  - Attempted ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*:TableColumnMapperTest.*:NewParquetReaderTest.*:FileReaderTest.*', but the local CMake compiler sanity check failed before Doris code compilation because ld could not find library 'c++'.
- Behavior changed: No
- Does this need documentation: Yes (included docs/doris-arrow-parquet-complex-types-implementation.md)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix the non-BE_TEST build failure caused by calling the test-only set_node_type helper from the VSlotRef protected constructor.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran clang-format dry-run and git diff --check for the modified header.
  - Fedora DEBUG BE build was run and exposed the fixed compile failure; full build will be rerun after syncing this commit.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix a DEBUG build failure in the new parquet reader by asserting the read batch size before converting it to the selection vector row count type.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran clang-format dry-run and git diff --check for the modified parquet reader file.
  - Fedora DEBUG BE build was run and exposed the fixed compile failure; full build will be rerun after syncing this commit.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add the next step of complex type support in the new parquet reader by normalizing standard LIST schema to Array(element), allowing nested leaf RecordReader usage, and reading non-empty LIST<required primitive> columns from repetition levels.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran clang-format dry-run and git diff --check for modified files.
  - Ran BUILD_TYPE=DEBUG ./build.sh --be on Fedora successfully with the patch applied.
  - Attempted ParquetColumnReaderTest on Fedora, but stopped the ASAN_UT build because it triggered a fresh full UT build; no test binary execution result was produced.
- Behavior changed: Yes. The new parquet reader can now read a limited non-empty LIST<required primitive> shape and reports NotSupported for unsupported list shapes instead of rejecting all LIST columns.
- Does this need documentation: No
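To make the "read from repetition levels" step concrete, here is a minimal sketch of how row boundaries could be recovered for the limited shape described above. It assumes every level entry carries exactly one element (no NULL or empty lists) and that the target array column stores cumulative end offsets; `build_list_offsets` is a hypothetical helper, not a function from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build array offsets for a non-empty LIST<required primitive> column.
// Assumption: each level entry corresponds to one element value, so a
// repetition level of 0 marks the first element of a new row.
std::vector<uint64_t> build_list_offsets(const std::vector<int16_t>& rep_levels) {
    // The first entry of a column chunk always starts a new row.
    assert(rep_levels.empty() || rep_levels.front() == 0);
    std::vector<uint64_t> offsets; // offsets[i] = end position of row i
    for (std::size_t i = 0; i < rep_levels.size(); ++i) {
        if (rep_levels[i] == 0 && i != 0) {
            offsets.push_back(i); // a new row starts, so close the previous one
        }
    }
    if (!rep_levels.empty()) {
        offsets.push_back(rep_levels.size()); // close the last row
    }
    return offsets;
}
```

For repetition levels `0 1 1 0 0 1` this yields three rows with 3, 1, and 2 elements. NULL or empty lists need definition levels as well, which this sketch deliberately ignores.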

Problem Summary: TableReader could map partition columns to physical file columns before checking split partition values, and constant/default expression materialization used the file-local block row count. For scans where partition values should be filled from split metadata, especially when the file-local block row count differs from the batch row count, this could produce incorrect materialized columns.
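The ordering problem described above can be illustrated with a small sketch. All names here (`ColumnSource`, `resolve_source`, `fill_constant`) are hypothetical and only mirror the idea: consult split partition values before the physical file schema, and size constant columns by the output batch rather than by the file-local block.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical names for illustration; this is not the Doris TableReader API.
enum class ColumnSource { SplitPartitionValue, FileColumn, Missing };

// Check split partition values before the physical file schema, so a partition
// column that also exists in the file is still filled from split metadata.
ColumnSource resolve_source(const std::string& name,
                            const std::map<std::string, std::string>& split_partition_values,
                            const std::vector<std::string>& file_columns) {
    assert(!name.empty());
    if (split_partition_values.count(name) != 0) {
        return ColumnSource::SplitPartitionValue;
    }
    for (const std::string& file_column : file_columns) {
        if (file_column == name) {
            return ColumnSource::FileColumn;
        }
    }
    return ColumnSource::Missing;
}

// Materialize a constant column with one value per row of the output batch.
// The caller passes the batch row count, not the file-local block row count.
std::vector<std::string> fill_constant(const std::string& value, std::size_t batch_rows) {
    return std::vector<std::string>(batch_rows, value);
}
```

With this ordering, a column named `dt` that appears both in the file and in the split's partition values resolves to the split value.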

1. ParquetReader reads a range of a parquet file
2. ParquetReader supports virtual column reader (RowPosition)
3. IcebergReader supports virtual columns

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add initial new Parquet MAP reader support for required scalar key/value entries and normalize MAP key_value schema metadata for future complex projection work.

Release note: None

- Test: Unit Test / Manual test
  - Added parquet column reader unit test coverage for required Map(Int32, String).
  - Ran git diff --check.
  - BE build will be verified on Fedora after push.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add shared LIST/MAP level assembly for scalar nested Parquet children, including null parent rows, empty collections, nullable element/value slots, overflow buffering, and parent-row skip/select semantics.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran git diff --check.
  - BE DEBUG build will be run on Fedora after push.
- Behavior changed: Yes. New Parquet reader can read LIST/MAP scalar children with null/empty/nullable-child cases.
- Does this need documentation: No
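As an illustration of what level assembly has to distinguish, the sketch below handles one concrete shape: an optional LIST with an optional scalar element in the standard three-level layout. For that shape the maximum definition level is 3 (0 = NULL list, 1 = empty list, 2 = NULL element, 3 = present element) and the maximum repetition level is 1. `AssembledList` and `assemble_optional_list` are hypothetical names, and overflow buffering and skip/select handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical output layout for one optional LIST<optional scalar> column.
struct AssembledList {
    std::vector<uint64_t> offsets;     // cumulative end offset per parent row
    std::vector<uint8_t> list_is_null; // 1 when the parent list is NULL
    std::vector<uint8_t> elem_is_null; // 1 for each element slot that is NULL
};

AssembledList assemble_optional_list(const std::vector<int16_t>& def_levels,
                                     const std::vector<int16_t>& rep_levels) {
    assert(def_levels.size() == rep_levels.size());
    AssembledList out;
    uint64_t num_elems = 0;
    for (std::size_t i = 0; i < def_levels.size(); ++i) {
        if (rep_levels[i] == 0) {
            // A new parent row starts here; close the previous row first.
            if (i != 0) {
                out.offsets.push_back(num_elems);
            }
            out.list_is_null.push_back(def_levels[i] == 0 ? 1 : 0);
        }
        if (def_levels[i] >= 2) {
            // An element slot exists; it is NULL unless the level reaches 3.
            out.elem_is_null.push_back(def_levels[i] == 2 ? 1 : 0);
            ++num_elems;
        }
        // def 0 (NULL list) and def 1 (empty list) contribute no element slot.
    }
    if (!def_levels.empty()) {
        out.offsets.push_back(num_elems); // close the last row
    }
    return out;
}
```

For the rows `[1, NULL]`, `NULL`, `[]`, `[7]` the levels are def `3 2 0 1 3` and rep `0 1 0 0 0`. The sketch produces offsets `2 2 2 3`, a NULL flag only on the second row, and three element slots of which the second is NULL.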

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix warning-as-error failures in the new parquet map reader caused by shadowed local names and aggregate initialization after adding the nested level assembler.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran git diff --check locally.
  - Fedora DEBUG BE build will be run after pushing this commit.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Support parquet STRUCT reading with scalar children through definition-level assembly, including nullable parent struct handling and projected struct child reads.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran git diff --check locally.
  - Ran BUILD_TYPE=DEBUG ./build.sh --be on Fedora.
- Behavior changed: Yes
  - New parquet reader now supports nullable STRUCT columns with scalar children and projected scalar struct children.
- Does this need documentation: No
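Definition-level assembly for a struct can be sketched for one simple shape: an optional struct with a single optional int32 child, where the maximum definition level is 2 (0 = NULL struct, 1 = NULL child, 2 = present child). The sketch assumes the decoder hands back a dense buffer holding only the present values; `AssembledStruct` and `assemble_struct_child` are hypothetical names. Each row gets a child slot even when the parent is NULL, so child buffers stay aligned with the struct's row count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row-aligned output for struct<a: optional int32>.
struct AssembledStruct {
    std::vector<uint8_t> struct_is_null; // 1 when the parent struct is NULL
    std::vector<uint8_t> child_is_null;  // 1 when the child slot holds no value
    std::vector<int32_t> child_values;   // one slot per row, 0 as a placeholder
};

AssembledStruct assemble_struct_child(const std::vector<int16_t>& def_levels,
                                      const std::vector<int32_t>& dense_values) {
    AssembledStruct out;
    std::size_t next = 0; // cursor into the dense value buffer
    for (int16_t level : def_levels) {
        out.struct_is_null.push_back(level == 0 ? 1 : 0);
        const bool present = (level == 2);
        out.child_is_null.push_back(present ? 0 : 1);
        if (present) {
            assert(next < dense_values.size());
            out.child_values.push_back(dense_values[next++]);
        } else {
            out.child_values.push_back(0); // keep the child row-aligned
        }
    }
    assert(next == dense_values.size()); // every dense value must be consumed
    return out;
}
```

With definition levels `2 0 1 2` and dense values `10 20`, the child column becomes `10, _, _, 20` with the two middle slots marked NULL, and only the second row is a NULL struct.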

Add file-layer DeletePredicate execution for Parquet row positions and wire IcebergTableReader v2 to convert Iceberg position deletes and deletion vectors into file-local deleted row positions. Equality delete files are detected and fail explicitly instead of being silently ignored.
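One way to picture the file-local deleted row positions is as a sorted list that each batch consults to build a keep-mask. The sketch below assumes the positions are already sorted and expressed relative to the start of the file; `build_keep_mask` is a hypothetical helper, not the DeletePredicate interface from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a keep-mask for file rows [first_row, first_row + num_rows).
// Assumption: deleted positions are sorted ascending and file-local.
std::vector<uint8_t> build_keep_mask(const std::vector<int64_t>& sorted_deleted_positions,
                                     int64_t first_row, std::size_t num_rows) {
    assert(std::is_sorted(sorted_deleted_positions.begin(), sorted_deleted_positions.end()));
    std::vector<uint8_t> keep(num_rows, 1);
    const int64_t end_row = first_row + static_cast<int64_t>(num_rows);
    // Jump to the first deleted position that can fall inside this batch.
    auto it = std::lower_bound(sorted_deleted_positions.begin(),
                               sorted_deleted_positions.end(), first_row);
    for (; it != sorted_deleted_positions.end() && *it < end_row; ++it) {
        keep[static_cast<std::size_t>(*it - first_row)] = 0;
    }
    return keep;
}
```

For deleted positions `1 5 6 12` and a batch covering rows 4 through 7, only rows 4 and 7 are kept. A deletion vector stored as a bitmap would serve the same role with a different lookup.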

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Implement Iceberg equality delete filtering in the v2 Iceberg reader by materializing equality delete keys as delete predicate expressions and applying them through the file reader filter path.

### Release note

Support reading Iceberg equality delete files in the BE Iceberg reader.

### Check List (For Author)

- Test: Unit Test / Manual test
  - Added EqualityDeletePredicateTest for single-column, multi-column, null matching, and error handling.
  - Manual test: git diff --check.
  - Not run: run-be-ut.sh failed because this environment only has JDK 11 and requires JDK 17; clang-format script failed because llvm@16 is not installed.
- Behavior changed: Yes, Iceberg reader now filters equality-deleted rows.
- Does this need documentation: No
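The matching rule behind equality deletes can be sketched with plain containers. In Iceberg equality deletes a NULL in a delete key matches a NULL in the data row, which `std::optional` comparison happens to model directly. This sketch uses a key set rather than the predicate expressions the commit describes, and `Key` and `equality_delete_keep_mask` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// One equality key per row: the values of the equality columns, where
// std::nullopt stands for NULL. Two NULLs compare equal, as in Iceberg.
using Key = std::vector<std::optional<int64_t>>;

// Return 1 for rows to keep and 0 for rows matched by an equality delete.
std::vector<uint8_t> equality_delete_keep_mask(const std::vector<Key>& row_keys,
                                               const std::set<Key>& deleted_keys) {
    std::vector<uint8_t> keep;
    keep.reserve(row_keys.size());
    for (const Key& key : row_keys) {
        // All keys must cover the same equality columns.
        assert(deleted_keys.empty() || key.size() == deleted_keys.begin()->size());
        keep.push_back(deleted_keys.count(key) == 0 ? 1 : 0);
    }
    return keep;
}
```

A delete key `(1, NULL)` therefore removes a row whose columns are `(1, NULL)` but leaves `(1, 7)` and `(NULL, NULL)` untouched.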

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Update the new parquet reader implementation document with the current complex type support status, validation results, remaining gaps, and next implementation priorities.

### Release note

None

### Check List (For Author)

- Test: No need to test
  - Documentation-only change.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add first-stage dictionary predicate pushdown for the new Parquet reader. It conservatively prunes fully dictionary encoded string-like row groups for EQ and IN predicates by evaluating owned dictionary values before reading data pages.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran build-support/clang-format.sh on modified BE files.
  - Ran git diff --check.
  - Local targeted BE UT could not run because the Mac toolchain fails CMake compiler detection with ld: library 'c++' not found.
- Behavior changed: No
- Does this need documentation: No
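The pruning decision itself is small once the dictionary values are available. The sketch below treats EQ as an IN list with one literal and refuses to prune whenever the column chunk is not fully dictionary encoded, since plain-encoded pages could hold values that are absent from the dictionary. `can_prune_row_group` is a hypothetical name.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Decide whether a row group can be skipped for `col IN (in_values)`.
// EQ is the special case of a single literal.
bool can_prune_row_group(bool fully_dictionary_encoded,
                         const std::vector<std::string>& dictionary,
                         const std::unordered_set<std::string>& in_values) {
    assert(!in_values.empty()); // an IN predicate carries at least one literal
    if (!fully_dictionary_encoded) {
        return false; // stay conservative: some pages may bypass the dictionary
    }
    for (const std::string& value : dictionary) {
        if (in_values.count(value) != 0) {
            return false; // at least one row may match, so the group must be read
        }
    }
    return true; // no dictionary entry can satisfy the predicate
}
```

A row group whose dictionary is `{"a", "b"}` can be skipped for `col = 'c'` but not for `col IN ('b', 'z')`.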

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Avoid relying on an unavailable Arrow Parquet ColumnReader::ReadDictionary API by reading the dictionary page directly and decoding PLAIN byte array dictionaries for row group pruning.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran build-support/clang-format.sh on parquet_statistics.cpp.
  - Ran git diff --check.
  - Fedora DEBUG BE build is rerun after this fix.
- Behavior changed: No
- Does this need documentation: No
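In the Parquet PLAIN encoding, each BYTE_ARRAY value is stored as a 4-byte little-endian length followed by that many bytes. A minimal decoder for an already decompressed dictionary page payload could look like the sketch below; `decode_plain_byte_array` is a hypothetical name, and page header parsing and decompression are assumed to have happened earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Decode `num_values` PLAIN-encoded BYTE_ARRAY entries from a page payload.
// Returns std::nullopt when the buffer is truncated.
std::optional<std::vector<std::string>> decode_plain_byte_array(const uint8_t* data,
                                                                std::size_t size,
                                                                std::size_t num_values) {
    assert(data != nullptr || size == 0);
    std::vector<std::string> out;
    out.reserve(num_values);
    std::size_t pos = 0;
    for (std::size_t i = 0; i < num_values; ++i) {
        if (size - pos < 4) {
            return std::nullopt; // not enough bytes for the length prefix
        }
        // Assemble the little-endian length byte by byte, independent of host order.
        const uint32_t len = static_cast<uint32_t>(data[pos]) |
                             (static_cast<uint32_t>(data[pos + 1]) << 8) |
                             (static_cast<uint32_t>(data[pos + 2]) << 16) |
                             (static_cast<uint32_t>(data[pos + 3]) << 24);
        pos += 4;
        if (size - pos < len) {
            return std::nullopt; // value bytes run past the end of the page
        }
        out.emplace_back(reinterpret_cast<const char*>(data + pos), len);
        pos += len;
    }
    return out;
}
```

Returning `std::nullopt` on a truncated buffer lets the caller fall back to not pruning, which keeps the pushdown conservative.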

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Construct Arrow list arrays in ParquetColumnReaderTest with explicit element field nullability so the generated arrays match the declared table schema.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran build-support/clang-format.sh on parquet_column_reader_test.cpp.
  - Ran git diff --check.
  - Fedora ParquetColumnReaderTest is rerun after this fix.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Support required nested scalar leaves that Arrow RecordReader reports without level buffers, and only consume materialized values when nested definition levels reach the leaf max definition level.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran git diff --check.
  - Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Avoid stale level buffers for required nested leaves and preserve nullable nested scalar value slot mapping expected by Arrow RecordReader.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran git diff --check.
  - Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Read scalar struct children as row-aligned slots so nullable parent struct rows keep child value buffers aligned.

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran git diff --check.
  - Fedora BE unit test validation follows with ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*'.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add a metadata-backed MIN/MAX aggregate pushdown path for external Parquet readers and gate Iceberg v2 aggregate pushdown when delete files are present.

### Release note

Support min/max aggregate pushdown for eligible external Parquet scans.

### Check List (For Author)

- Test: Unit Test / Manual test
  - Added AggregateReaderTest and ParquetReaderTest.minmax_pushdown_from_statistics.
  - Manual test: git diff --check and git diff --cached --check.
  - Not run: run-be-ut.sh failed because this environment only has JDK 11 and requires JDK 17; clang-format script failed because llvm@16 is not installed.
- Behavior changed: Yes, eligible Parquet scans can return min/max aggregate rows from footer statistics; unsafe Iceberg delete-file scans disable aggregate pushdown.
- Does this need documentation: No
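The eligibility gate and the fold over row group statistics can be sketched as follows. The sketch assumes an int64 column and represents missing footer statistics as `std::nullopt`; `RowGroupStats` and `minmax_from_statistics` are hypothetical names. Any condition that makes the statistics unsafe to use returns `std::nullopt`, meaning the caller should fall back to a normal scan.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical per-row-group footer statistics for one int64 column.
struct RowGroupStats {
    std::optional<int64_t> min; // std::nullopt when the footer has no min
    std::optional<int64_t> max; // std::nullopt when the footer has no max
};

// Return (min, max) from footer statistics, or std::nullopt when pushdown is unsafe.
std::optional<std::pair<int64_t, int64_t>> minmax_from_statistics(
        const std::vector<RowGroupStats>& row_groups, bool has_delete_files) {
    // Deleted rows may hold the extreme values, so statistics cannot be trusted.
    if (has_delete_files || row_groups.empty()) {
        return std::nullopt;
    }
    std::optional<std::pair<int64_t, int64_t>> result;
    for (const RowGroupStats& stats : row_groups) {
        if (!stats.min.has_value() || !stats.max.has_value()) {
            return std::nullopt; // one row group without statistics blocks pushdown
        }
        assert(*stats.min <= *stats.max);
        if (!result.has_value()) {
            result = std::make_pair(*stats.min, *stats.max);
        } else {
            result->first = std::min(result->first, *stats.min);
            result->second = std::max(result->second, *stats.max);
        }
    }
    return result;
}
```

Two row groups with ranges `[3, 9]` and `[-1, 4]` fold to `(-1, 9)` without reading any data pages. The same input with delete files present yields no result.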

hello-stephen · 2026-05-29T06:39:34Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema.

### Release note

None

### Check List (For Author)

- Test: Unit Test
  - Added BE UT cases in table_reader_test and parquet_reader_test.
  - Ran git diff --check.
  - Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17.
- Behavior changed: No
- Does this need documentation: No

suxiaogang223 and others added 30 commits May 18, 2026 15:52

Add Iceberg Parquet reader API skeleton

ef45c20

Refine Iceberg reader API boundaries

57178c8

fix compiling (#63368)

1676f2e

refactor table reader (#63397)

e5d17b8

Add unit tests for expr (#63415)

783e740

Framework to do delete filtering (#63442)

dae05ba

[test](be) Add DeletePredicate unit tests (#63455)

2539b0a

cast for schema change (#63477)

0fb11e4

Complete basic parquet reader (#63659)

3dbfc4c

Co-authored-by: Socrates <suyiteng@selectdb.com>

[improvement](be) Reuse table reader file block (#63704)

5dc54d8

Refactor 0527 (#63712)

4fe7254

[parquet] Clarify reader lifecycle comments

aabcbc3

[parquet] Update new reader design docs

e1ed7f2

[feature](be) Build table filters from conjuncts (#63733)

f19f78a

[feature](be) Support expression filters on file reader (#63748)

4073912

[fix](be) Cast localized filter slots for file schema types (#63754)

5ad0921

Materialize Iceberg row lineage virtual columns (#63787)

321134d

1. ParquetReader reads a range of a parquet file
2. ParquetReader supports virtual column reader (RowPosition)
3. IcebergReader supports virtual columns

suxiaogang223 and others added 7 commits May 29, 2026 10:01

Gabriel39 requested review from gavinchou, liaoxin01 and morningman as code owners May 29, 2026 06:39

Gabriel39 marked this pull request as draft May 29, 2026 06:39

Gabriel39 mentioned this pull request May 29, 2026

[test](be) Add table reader edge case unit tests #63895

Merged

16 tasks

Gabriel39 wants to merge 38 commits into master from refact_reader_branch

3 participants
