Commit bc74c71
feat(parquet): add content defined chunking for arrow writer (#9450)
# Which issue does this PR close?
- Closes #NNN.
# Rationale for this change
Rust implementation of apache/arrow#45360
Traditional Parquet writing splits data pages at fixed sizes, so a
single inserted or deleted row shifts all subsequent pages, and nearly
every byte must be re-uploaded to content-addressable storage (CAS)
systems. Content-defined chunking (CDC) determines page boundaries via a
rolling gearhash over column values, so unchanged data produces identical
pages across different writes, reducing storage costs and upload times.
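To make the idea concrete, here is a minimal sketch of gear-hash chunking over a plain byte slice. It is not the code from this PR: the real chunker works on column values and definition/repetition levels, while this sketch only illustrates how a rolling hash plus a mask yields boundaries that depend on content rather than position. The names `gear_table` and `chunk_lengths`, the splitmix64-generated table, and the `mask` parameter are all assumptions made for illustration, and `norm_level` is not modeled.

```rust
/// Hypothetical helper: build a 256-entry gear table from a fixed-seed
/// splitmix64 generator so the sketch is deterministic.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Hypothetical helper: split `data` into content-defined chunks and
/// return the chunk lengths.
///
/// A boundary is cut when the chunk has reached `min_size` and the low
/// bits of the rolling hash selected by `mask` are all zero, or when the
/// chunk reaches `max_size`.
fn chunk_lengths(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut lengths = Vec::new();
    let mut hash: u64 = 0;
    let mut len = 0usize;

    for &byte in data {
        // Gear hash update: shift out old influence, mix in the new byte.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        len += 1;

        let at_boundary = len >= min_size && (hash & mask) == 0;
        if at_boundary || len >= max_size {
            lengths.push(len);
            hash = 0;
            len = 0;
        }
    }

    // Emit whatever is left as the final (possibly short) chunk.
    if len > 0 {
        lengths.push(len);
    }
    lengths
}
```

Calling `chunk_lengths(&bytes, 64, 512, 0xFF)` on a buffer returns lengths that sum to the buffer size, with every chunk at most 512 bytes and every chunk except possibly the last at least 64 bytes. Because each cut depends only on recent bytes, boundaries after an insertion tend to realign with those of the original data, which is the property a CAS system can exploit.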
See more details in https://huggingface.co/blog/parquet-cdc
The original C++ implementation: apache/arrow#45360
Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator,
where I have already integrated this PR to verify that deduplication
effectiveness is on par with parquet-cpp (lower is better):
<img width="984" height="411" alt="image"
src="https://github.com/user-attachments/assets/e6e80931-ac76-4bdd-bf9c-ba7e06559411"
/>
# What changes are included in this PR?
- **Content-defined chunker** at `parquet/src/column/chunker/`
- **Arrow writer integration** in `ArrowColumnWriter`
- **Writer properties** via `CdcOptions` struct (`min_chunk_size`,
`max_chunk_size`, `norm_level`)
- **ColumnDescriptor**: added a `repeated_ancestor_def_level` field for
iterating over nested field values
# Are these changes tested?
Yes — unit tests are located in `cdc.rs` and ported from the C++
implementation.
# Are there any user-facing changes?
New **experimental** API, disabled by default — no behavior change for
existing code:
```rust
// Simple toggle (256 KiB min, 1 MiB max, norm_level 0)
let props = WriterProperties::builder()
    .set_content_defined_chunking(true)
    .build();

// Explicit CDC parameters
let props = WriterProperties::builder()
    .set_cdc_options(CdcOptions {
        min_chunk_size: 128 * 1024,
        max_chunk_size: 512 * 1024,
        norm_level: 1,
    })
    .build();
```
---------
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
1 parent 39dda22 commit bc74c71
12 files changed
Lines changed: 3447 additions & 39 deletions
File tree
- parquet
  - benches
  - src
    - arrow/arrow_writer
    - column
      - chunker
      - writer
    - file
    - schema