[GH-105] Add ALP dataset for floating point columns #100
Thanks for the PR:
(are there other data variations that are important to cover?)
Generated from floatingpoint_spotify1.csv (9 double columns, 15000 rows) and floatingpoint_arade.csv (4 double columns, 15000 rows) using the C++ generate_alp_parquet utility with ALP encoding, no dictionary, uncompressed. These files are used for cross-language ALP correctness testing: read the parquet file and verify all values match the expected CSV bit-exactly.
Add C++ and Java-generated float32 ALP parquet files and expected CSVs for cross-language interop testing of ALP encoding with FLOAT type.
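The bit-exact check described for these files could be sketched roughly as follows. This is a minimal illustration, not the actual test harness: `double_bits` and `count_bit_mismatches` are hypothetical helper names, and how the values are read from the parquet file and the expected CSV is deliberately left out. The point is that comparing raw IEEE 754 bit patterns, rather than using `==`, distinguishes `0.0` from `-0.0` and lets identical NaN payloads compare equal.

```python
import struct


def double_bits(x: float) -> int:
    """Return the raw 64-bit IEEE 754 pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def count_bit_mismatches(decoded, expected):
    """Count positions where two sequences of doubles differ in any bit.

    Hypothetical helper: `decoded` stands for values read back from the
    parquet file, `expected` for values parsed from the expected CSV.
    """
    if len(decoded) != len(expected):
        raise ValueError("row counts differ")
    return sum(
        1 for a, b in zip(decoded, expected) if double_bits(a) != double_bits(b)
    )


nan = float("nan")
print(count_bit_mismatches([0.0, 1.5, nan], [0.0, 1.5, nan]))  # same bits everywhere
print(count_bit_mismatches([0.0], [-0.0]))  # sign bit differs
```

A float32 column would need the same comparison on 32-bit patterns (for example with the `"<f"` / `"<I"` struct formats) after making sure both sides really are single-precision values.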
Checking it out...
Sorry I didn't post this earlier. My biggest concern with this PR is that it adds many bytes of example data. Since this repo is checked out many times by many CI jobs (e.g. arrow, arrow-rs, etc.), I think it is important to keep the size down. Furthermore, I am not sure that this covers the major parameters (I need to review the files carefully first). I will file a ticket that records what I think is needed, and link this PR.
I also filed a ticket to track this issue and explain what I think is needed in the example files.
Hey all, quick heads up: I just opened a stacked PR (prtkgaur#1) with six Java-written ALP fixtures from apache/parquet-java#3397 to broaden the cross-language test surface here.

Four of them are dataset-derived (~960 KB combined) and cover both page versions (V1/V2) × both ALP vector sizes (1024/4096) for float and double, using the existing arade and spotify1 source data. The other two are a small corner-case fixture (~60 KB) targeting the design list in #105 (vectors with no exceptions, one exception, all-NaN, NaN+Inf+(-0.0), constant (bit_width=0), differing exponents per vector, and optional columns with nulls, both f32 and f64), plus a sidecar _expect.csv ground truth emitted directly from the construction recipe, so the corner-case file is independently verifiable without rerunning the Java generator.

I've already run Arrow C++ ALP from apache/arrow#48345 against all six and bit-compared every value against the _expect.csv truth: 342K values, 0 mismatches. The full local matrix (17 fixtures across all V1/V2 × 1024/4096 combinations) comes out to 1.59M values verified, also 0 mismatches.

On size: @alamb raised concerns in #105 about test data bloating CI checkouts, so I deliberately kept this PR small, ~1.1 MB total. I have 11 more fixtures locally from the full matrix (~2.5 MB more) if anyone wants additional coverage, but figured the four representative variants plus the corner cases were enough to verify the axes work without growing the repo unnecessarily. Happy to add or trim, let me know what would be most useful.
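For readers wondering why values such as NaN, Inf and -0.0 are singled out as corner cases, here is a rough sketch of the round-trip test that ALP-style encodings use to decide whether a value must be stored as an exception. It is a simplified illustration under stated assumptions, not the parquet-java or Arrow C++ implementation: `alp_roundtrips` is a hypothetical name, the powers of ten are computed directly instead of taken from lookup tables, and no integer range limits are enforced.

```python
import math
import struct


def bits64(x: float) -> int:
    """Raw 64-bit IEEE 754 pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def alp_roundtrips(value: float, e: int, f: int) -> bool:
    """True if `value` survives decimal encode/decode bit-exactly.

    Simplified assumption: encode as round(value * 10^e * 10^-f) and
    decode as encoded * 10^f * 10^-e. Values that fail would be kept
    as exceptions instead of being encoded.
    """
    if not math.isfinite(value):
        return False  # NaN and +/-Inf have no integer encoding
    encoded = round(value * 10.0**e * 10.0**-f)
    decoded = encoded * 10.0**f * 10.0**-e
    return bits64(decoded) == bits64(value)


for v in (2.0, -0.0, float("nan"), float("inf"), math.pi):
    print(repr(v), alp_roundtrips(v, e=0, f=0))
```

In this sketch -0.0 fails because the integer 0 decodes to +0.0, whose sign bit differs, which is why a bit-exact comparison is needed to catch a decoder that drops the exception.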
Add dataset needed for testing performance of floating point values.
Related to: [Proposal] Add ALP encoding support in parquet file format (parquet-format#533)
Closes: Add example ALP files (#105)