[GH-105] Add ALP dataset for floating point columns #100

Open
prtkgaur wants to merge 5 commits into apache:master from prtkgaur:alpFloatingPointDataset
@prtkgaur prtkgaur commented Dec 8, 2025

Add dataset needed for testing performance of floating point values.

Files are in CSV format, which I chose for easier testing. Let me know if you (the reviewer) think otherwise.

@prtkgaur prtkgaur marked this pull request as ready for review January 13, 2026 18:02
@emkornfield

Thanks for the PR:

  1. Let's finalize the spec changes first (I think there is still some easy room for reducing size).
  2. Let's have at least two simple examples with well-defined content that cover the following cases:
    a. Null values
    b. 0 exceptions for a block
    c. 1 exception for a block
    d. >1 exception for a block
    e. A block with all exceptions
    f. Differing exponent/value per block

(are there other data variations that are important to cover?)
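
To make cases b through e concrete, here is a minimal sketch of how a block's exception count could be computed when constructing such examples. It assumes the decimal round-trip test described in the ALP paper (scale by a power of ten, round to an integer, scale back, compare bit-exactly). The spec for this encoding was still under discussion in this thread, so `exponent`, `factor`, and both helper functions are hypothetical names for illustration, not the final Parquet definition.

```python
import math
import struct


def f64_bits(x: float) -> int:
    """Raw IEEE 754 bit pattern of a double, so 0.0 and -0.0 compare as different."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_exception(value: float, exponent: int, factor: int) -> bool:
    """True if value does not survive the decimal round trip bit-exactly.

    Assumed rule (from the ALP paper, not from a finalized Parquet spec):
    encoded = round(value * 10^exponent * 10^-factor)
    decoded = encoded * 10^factor * 10^-exponent
    """
    # NaN and infinities have no integer form, so treat them as exceptions.
    if not math.isfinite(value):
        return True
    scaled = value * 10.0**exponent * 10.0**-factor
    if not math.isfinite(scaled):
        return True
    encoded = round(scaled)
    decoded = encoded * 10.0**factor * 10.0**-exponent
    return f64_bits(decoded) != f64_bits(value)


def count_exceptions(block, exponent: int, factor: int) -> int:
    """Count exceptions in one block; None stands for a null and is skipped."""
    return sum(is_exception(v, exponent, factor) for v in block if v is not None)
```

Under these assumptions, a block such as `[1.0, 2.0, 3.0]` with `exponent=0, factor=0` has no exceptions, while `[float("nan"), float("inf"), -0.0]` is a block with all exceptions.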

Generated from floatingpoint_spotify1.csv (9 double columns, 15000 rows)
and floatingpoint_arade.csv (4 double columns, 15000 rows) using the C++
generate_alp_parquet utility with ALP encoding, no dictionary, uncompressed.

These files are used for cross-language ALP correctness testing: read the
parquet file and verify all values match the expected CSV bit-exactly.
Add C++ and Java-generated float32 ALP parquet files and expected CSVs
for cross-language interop testing of ALP encoding with FLOAT type.
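
The bit-exact check mentioned in the commit message above can be sketched with the Python standard library alone. This is an illustrative sketch, not the test harness used in the PR: it assumes the decoded Parquet columns are already available as plain lists keyed by column name, and `count_mismatches` is a hypothetical helper. Comparing bit patterns rather than using `==` matters because `0.0 == -0.0` is true and `nan == nan` is false. Note that a textual `nan` in a CSV cannot carry a NaN payload, so this sketch only distinguishes the default NaN.

```python
import csv
import io
import struct


def f64_bits(x: float) -> int:
    """Raw IEEE 754 bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def count_mismatches(expected_csv: str, decoded: dict) -> int:
    """Compare decoded columns against CSV text, value by value, on bit patterns.

    expected_csv: CSV text with a header row of column names.
    decoded: hypothetical mapping of column name to a list of floats,
             standing in for whatever a Parquet reader returned.
    """
    mismatches = 0
    reader = csv.DictReader(io.StringIO(expected_csv))
    for row_index, row in enumerate(reader):
        for name, text in row.items():
            expected = float(text)
            actual = decoded[name][row_index]
            if f64_bits(expected) != f64_bits(actual):
                mismatches += 1
    return mismatches
```

For FLOAT (float32) columns the same idea applies with the `"<f"` and `"<I"` struct formats in place of `"<d"` and `"<Q"`.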

alamb commented Apr 25, 2026

Checking it out...

alamb commented Apr 30, 2026

Sorry I didn't post this earlier. My biggest concern with this PR is that it adds many bytes of example data. Since this repo is checked out many times by many CI jobs (e.g. arrow, arrow-rs, etc.), I think it is important to keep the size down.

Furthermore, I am not sure that this covers the major parameters (I need to review the files carefully first).

I will file a ticket that records what I think is needed, and link this PR.

@alamb alamb changed the title [GH-539] Add dataset for floating point columns [GH-105] Add dataset for floating point columns Apr 30, 2026

alamb commented Apr 30, 2026

I also filed a ticket to track this issue and explain what I think is needed in the example files

@alamb alamb changed the title [GH-105] Add dataset for floating point columns [GH-105] Add ALP dataset for floating point columns Apr 30, 2026
@vinooganesh

Hey all, quick heads up, I just opened a stacked PR (prtkgaur#1) with six Java-written ALP fixtures from apache/parquet-java#3397 to broaden the cross-language test surface here.

Four of them are dataset-derived (~960 KB combined) and cover both page versions (V1/V2) × both ALP vector sizes (1024/4096) for float and double, using the existing arade and spotify1 source data. The other two are a small corner-case fixture (~60 KB) targeting the design list in #105 — vectors with no exceptions, one exception, all-NaN, NaN+Inf+(-0.0), constant (bit_width=0), differing exponents per vector, and optional columns with nulls, both f32 and f64 — plus a sidecar _expect.csv ground truth emitted directly from the construction recipe, so the corner-case file is independently verifiable without rerunning the Java generator.
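
As a rough illustration of why a constant vector corresponds to `bit_width=0`: assuming the encoded integers of a vector are bit-packed as offsets from the vector minimum (frame of reference, as in the ALP paper), the packed width is the number of bits needed for the largest offset. The function name below is hypothetical and the sketch is not taken from the Java generator.

```python
def frame_of_reference_bit_width(encoded: list) -> int:
    """Bits needed per value once the vector minimum is subtracted.

    Assumes non-empty input of Python ints. A constant vector has a
    maximum offset of 0, which needs 0 bits.
    """
    largest_offset = max(encoded) - min(encoded)
    return largest_offset.bit_length()
```

With this definition, `[7, 7, 7]` packs to 0 bits per value, `[10, 11]` to 1 bit, and `[0, 255]` to 8 bits.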

I've already run Arrow C++ ALP from apache/arrow#48345 against all six and bit-compared every value against the _expect.csv truth — 342K values, 0 mismatches. The full local matrix (17 fixtures across all V1/V2 × 1024/4096 combinations) comes out to 1.59M values verified, also 0 mismatches.

On size: @alamb raised concerns in #105 about test data bloating CI checkouts, so I deliberately kept this PR small — ~1.1 MB total. I have 11 more fixtures locally from the full matrix (~2.5 MB more) if anyone wants additional coverage, but figured the four representative variants plus the corner cases were enough to verify the axes work without growing the repo unnecessarily.

Happy to add or trim, let me know what would be most useful.

Development

Successfully merging this pull request may close these issues.

Add example ALP files

5 participants