Commit e52ff2d
Stephen Buck
Implement write.parquet.row-group-size-bytes in the pyarrow writer
The pyiceberg writer has historically ignored
write.parquet.row-group-size-bytes (logging 'not implemented') and used
only write.parquet.row-group-limit (rows). For wide tables that means a
single row group ends up at gigabytes — e.g. 337 cols × 1,048,576 default
rows ≈ 1.7 GiB uncompressed per row group — which drives the polars /
pyarrow reader's decode peak into the tens of GiB on production reads.
Now write_file resolves row_group_size as
min(row_group_limit, row_group_size_bytes / bytes_per_row), where
bytes_per_row is approximated from the in-memory arrow_table's nbytes.
This matches Spark / parquet-mr 'whichever limit fires first' semantics
and lets the existing PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT (128 MiB)
actually take effect.1 parent 5da8186 commit e52ff2d
2 files changed
Lines changed: 99 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2604 | 2604 | | |
2605 | 2605 | | |
2606 | 2606 | | |
| 2607 | + | |
| 2608 | + | |
| 2609 | + | |
| 2610 | + | |
| 2611 | + | |
| 2612 | + | |
| 2613 | + | |
| 2614 | + | |
2607 | 2615 | | |
2608 | 2616 | | |
2609 | 2617 | | |
2610 | 2618 | | |
2611 | | - | |
| 2619 | + | |
2612 | 2620 | | |
2613 | 2621 | | |
2614 | 2622 | | |
2615 | 2623 | | |
| 2624 | + | |
| 2625 | + | |
| 2626 | + | |
| 2627 | + | |
| 2628 | + | |
2616 | 2629 | | |
2617 | 2630 | | |
2618 | 2631 | | |
| |||
2636 | 2649 | | |
2637 | 2650 | | |
2638 | 2651 | | |
| 2652 | + | |
2639 | 2653 | | |
2640 | 2654 | | |
2641 | 2655 | | |
| |||
2819 | 2833 | | |
2820 | 2834 | | |
2821 | 2835 | | |
2822 | | - | |
2823 | 2836 | | |
2824 | 2837 | | |
2825 | 2838 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
74 | 74 | | |
75 | 75 | | |
76 | 76 | | |
| 77 | + | |
77 | 78 | | |
78 | 79 | | |
79 | 80 | | |
| |||
3045 | 3046 | | |
3046 | 3047 | | |
3047 | 3048 | | |
| 3049 | + | |
| 3050 | + | |
| 3051 | + | |
| 3052 | + | |
| 3053 | + | |
| 3054 | + | |
| 3055 | + | |
| 3056 | + | |
| 3057 | + | |
| 3058 | + | |
| 3059 | + | |
| 3060 | + | |
| 3061 | + | |
| 3062 | + | |
| 3063 | + | |
| 3064 | + | |
| 3065 | + | |
| 3066 | + | |
| 3067 | + | |
| 3068 | + | |
| 3069 | + | |
| 3070 | + | |
| 3071 | + | |
| 3072 | + | |
| 3073 | + | |
| 3074 | + | |
| 3075 | + | |
| 3076 | + | |
| 3077 | + | |
| 3078 | + | |
| 3079 | + | |
| 3080 | + | |
| 3081 | + | |
| 3082 | + | |
| 3083 | + | |
| 3084 | + | |
| 3085 | + | |
| 3086 | + | |
| 3087 | + | |
| 3088 | + | |
| 3089 | + | |
| 3090 | + | |
| 3091 | + | |
| 3092 | + | |
| 3093 | + | |
| 3094 | + | |
| 3095 | + | |
| 3096 | + | |
| 3097 | + | |
| 3098 | + | |
| 3099 | + | |
| 3100 | + | |
| 3101 | + | |
| 3102 | + | |
| 3103 | + | |
| 3104 | + | |
| 3105 | + | |
| 3106 | + | |
| 3107 | + | |
| 3108 | + | |
| 3109 | + | |
| 3110 | + | |
| 3111 | + | |
| 3112 | + | |
| 3113 | + | |
| 3114 | + | |
| 3115 | + | |
| 3116 | + | |
| 3117 | + | |
| 3118 | + | |
| 3119 | + | |
| 3120 | + | |
| 3121 | + | |
| 3122 | + | |
| 3123 | + | |
| 3124 | + | |
| 3125 | + | |
| 3126 | + | |
| 3127 | + | |
| 3128 | + | |
| 3129 | + | |
| 3130 | + | |
| 3131 | + | |
3048 | 3132 | | |
3049 | 3133 | | |
3050 | 3134 | | |
| |||
0 commit comments