Reduce copies in Arrow IPC writer by Rich-T-kid · Pull Request #10044 · apache/arrow-rs

Rich-T-kid · 2026-06-01T19:27:04Z

Which issue does this PR close?

Closes Optimize arrow-ipc #10029.
A document that provides a bit of context

Rationale for this change

Compression is the most compute and memory intensive part of the arrow-ipc encoding pipeline. It runs per buffer, not per record batch. For a Flight stream of 10 batches with 5 primitive arrays each, that is 100 compression calls minimum, more for string and struct arrays. Each of those calls produced an owned compressed Vec that was then copied a second time into a flat arrow_data accumulator before being written to the output. For the uncompressed path the situation was the same: Arc-backed buffer slices that required no compression were still copied into that accumulator unnecessarily.

Separately, the original write_message() function flushed after every dictionary and every record batch, causing repeated small OS write calls per batch. ( for non vector backed writer implementations )
The goal was to eliminate both problems: stop copying buffers that do not need to be copied, and stop flushing on every message.

What changes are included in this PR?

Introduced EncodedBuffer, an enum that wraps either a raw Arc-backed Buffer for the uncompressed path or an owned Vec for the compressed path, so both can be held in a uniform collection without an extra copy into a flat accumulator
Changed write_array_data to push EncodedBuffer segments instead of copying bytes into arrow_data
FileWriter and StreamWriter both now call write_batch_direct(), eliminating the flush-per-message behavior and the intermediate copy on the hot path

Are these changes tested?

These changes are intended to be completely seamless. I didn't write new unit test for the code as nothing externally changed. all test still pass

benchmarks

[main -> cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100]
[my branch -> cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100 ]

[main -> cargo bench --bench ipc_writer -- --sample-size 1000]
[my branch -> cargo bench --bench ipc_writer -- --sample-size 1000]

Are there any user-facing changes?

no

gabotechs

This is looking pretty good. Good job! left some comments mainly directed towards exploring more reuse and bringing a bit more clarity to this file, let me know if you have other ideas.

gabotechs · 2026-06-03T14:20:29Z

        }

-        let (encoded_dictionaries, encoded_message) = self.data_gen.encode(
+        let (dict_sizes, (meta, data)) = self.data_gen.write_batch_direct(


The two last (meta, data) fields returned by write_batch_direct are named (aligned_size, body_len) in that function.

Is this correct? not sure if this is just a naming thing, but it's hard to know if this is correct given the different naming. Is meta == aligned_size and data == body_len? they sound like completely different things.

I updated the struct & variable names to hopefully make this clearer.

gabotechs · 2026-06-03T16:23:59Z

+    /// each buffer is compressed into a per-buffer scratch `Vec<u8>` and written from
+    /// there, eliminating the extra copy that `write_buffer` -> `arrow_data` ->
+    /// `write_body_buffers` would otherwise incur.
+    fn write_batch_direct<W: Write>(


I see most contents of this function are essentially copy-pastes from record_batch_to_bytes, duplication seems too much here. Is there any chance to:

Completely replace record_batch_to_bytes and keep just a single function for writing batches in IPC format

Factoring out some ergonomic helpers that could be reused in both functions?

Also, it seems like the .encode() method and the new .write_batch_direct() are both doing the same thing with slightly different ergonomics. Do you see any opportunity to collapse them into just 1 method?

This file is overall pretty bloated with complex logic and a relatively arbitrary separation of concerns between methods, the more we can do for debloating it the better it will be for future maintainers.

This makes sense to me. the main issue is that FileWriter needs metadata while both StreamWriter and arrow-flight do not. Its better to not compute metadata the caller will not use but the slowdown should be negligible.

Rich-T-kid · 2026-06-03T17:50:51Z

Benchmark results from #10031

ran cargo bench --bench flight encode -- --sample-size 100

Rich-T-kid · 2026-06-03T17:53:05Z

going to look into why fixed/9182x1 regressed. Might just be noise

Rich-T-kid · 2026-06-03T18:49:45Z

I think its worth mentioning that no dictionary optimizations were made in the PR, could make to make that a follow up ticket.

Rich-T-kid · 2026-06-03T19:03:14Z

Ran benchmarks again and the results still look good. Test are passing for arrow-flight & arrow-ipc.
@alamb could you run the benchmarks for this PR on the CI bot when you get a chance? thank you!

alamb · 2026-06-04T10:11:54Z

run benchmark flight

adriangbot · 2026-06-04T10:14:11Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4621154654-431-g2mb7 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-04T10:26:39Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.02    274.7±1.07µs   915.2 MB/sec    1.00    270.3±1.55µs   930.2 MB/sec
encode/dict/65536x8           1.30      5.5±0.04ms   365.9 MB/sec    1.00      4.2±0.06ms   474.4 MB/sec
encode/dict/8192x1            1.01     35.3±0.04µs   924.9 MB/sec    1.00     35.1±0.05µs   930.1 MB/sec
encode/dict/8192x8            1.10    315.3±1.66µs   828.7 MB/sec    1.00    286.1±2.01µs   913.3 MB/sec
encode/fixed/65536x1          1.00     10.0±0.02µs    48.8 GB/sec    1.00     10.0±0.04µs    49.1 GB/sec
encode/fixed/65536x8          1.01   1092.1±2.65µs     3.6 GB/sec    1.00   1076.9±5.83µs     3.6 GB/sec
encode/fixed/8192x1           1.03      3.2±0.01µs    19.3 GB/sec    1.00      3.1±0.02µs    19.8 GB/sec
encode/fixed/8192x8           1.07     17.5±0.04µs    27.9 GB/sec    1.00     16.4±0.07µs    29.8 GB/sec
encode/nested/65536x1         1.00     29.0±0.24µs    42.1 GB/sec    1.04     30.2±0.30µs    40.4 GB/sec
encode/nested/65536x8         1.12      2.5±0.08ms     3.9 GB/sec    1.00      2.2±0.13ms     4.4 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.5 GB/sec    1.00      5.8±0.01µs    26.6 GB/sec
encode/nested/8192x8          1.01     46.3±0.10µs    26.4 GB/sec    1.00     45.9±0.56µs    26.6 GB/sec
encode/variable/65536x1       1.04     51.5±0.55µs    42.7 GB/sec    1.00     49.3±0.36µs    44.6 GB/sec
encode/variable/65536x8       1.24      5.7±0.09ms     3.1 GB/sec    1.00      4.6±0.06ms     3.8 GB/sec
encode/variable/8192x1        1.18      7.1±0.01µs    38.8 GB/sec    1.00      6.0±0.01µs    45.8 GB/sec
encode/variable/8192x8        1.32     83.3±1.96µs    26.4 GB/sec    1.00     63.3±0.20µs    34.7 GB/sec
roundtrip/dict/65536x1        1.01  1288.9±47.00µs   195.1 MB/sec    1.00  1279.9±51.20µs   196.4 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.55ms   139.8 MB/sec    1.01     14.5±0.53ms   138.4 MB/sec
roundtrip/dict/8192x1         1.00    206.6±5.60µs   158.1 MB/sec    1.00    206.0±5.73µs   158.6 MB/sec
roundtrip/dict/8192x8         1.00  1322.8±44.85µs   197.5 MB/sec    1.00  1324.0±44.79µs   197.3 MB/sec
roundtrip/fixed/65536x1       1.02    314.2±4.02µs  1591.6 MB/sec    1.00    307.7±4.21µs  1625.2 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1856.3 MB/sec    1.15      2.5±0.13ms  1608.9 MB/sec
roundtrip/fixed/8192x1        1.00     89.9±0.94µs   696.0 MB/sec    1.00     89.6±1.04µs   698.6 MB/sec
roundtrip/fixed/8192x8        1.01    333.8±3.84µs  1499.9 MB/sec    1.00    330.7±3.14µs  1514.2 MB/sec
roundtrip/nested/65536x1      1.00   859.6±40.22µs  1454.4 MB/sec    1.00   862.4±44.79µs  1449.6 MB/sec
roundtrip/nested/65536x8      1.01      8.6±0.36ms  1156.8 MB/sec    1.00      8.6±0.36ms  1168.2 MB/sec
roundtrip/nested/8192x1       1.01    159.3±5.82µs   981.9 MB/sec    1.00    157.6±5.86µs   993.0 MB/sec
roundtrip/nested/8192x8       1.02   925.9±41.73µs  1351.8 MB/sec    1.00   910.3±44.09µs  1374.9 MB/sec
roundtrip/variable/65536x1    1.00  1246.2±34.07µs  1805.6 MB/sec    1.54  1922.3±132.51µs  1170.5 MB/sec
roundtrip/variable/65536x8    1.17     16.1±0.50ms  1119.3 MB/sec    1.00     13.7±0.50ms  1313.5 MB/sec
roundtrip/variable/8192x1     1.00    204.0±5.40µs  1379.8 MB/sec    1.00    203.1±5.65µs  1385.4 MB/sec
roundtrip/variable/8192x8     1.00  1211.3±26.93µs  1858.7 MB/sec    1.57  1897.2±120.59µs  1186.7 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	345.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	350.2s
CPU sys	74.8s
Peak spill	0 B

branch

Metric	Value
Wall time	335.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	335.9s
CPU sys	78.4s
Peak spill	0 B

File an issue against this benchmark runner

gabotechs · 2026-06-04T11:41:52Z

run benchmarks ipc_writer

adriangbot · 2026-06-04T11:45:38Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4621805130-432-xw4gv 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=ipc_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench ipc_writer
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-04T11:46:59Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                                                 main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                                                 ----                                   ------------------------------------
arrow_ipc_stream_writer/FileWriter/write_10           1.90    185.2±2.17µs        ? ?/sec    1.00     97.3±4.50µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10         1.94    185.1±1.96µs        ? ?/sec    1.00     95.4±4.87µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10/zstd    1.01      7.3±0.02ms        ? ?/sec    1.00      7.2±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	30.0s
Peak memory	2.7 GiB
Avg memory	2.6 GiB
CPU user	27.5s
CPU sys	0.6s
Peak spill	0 B

branch

Metric	Value
Wall time	30.0s
Peak memory	2.7 GiB
Avg memory	2.6 GiB
CPU user	29.7s
CPU sys	0.1s
Peak spill	0 B

File an issue against this benchmark runner

Rich-T-kid · 2026-06-04T13:43:36Z

🤔 results look good, Im curious as to why two of the roundtrip benchmarks were slightly slower even thought encode() is faster across the board.

gabotechs · 2026-06-04T13:53:21Z

run benchmark flight

adriangbot · 2026-06-04T13:56:57Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4622787465-441-9sh6d 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-04T14:09:30Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.01    271.4±1.78µs   926.3 MB/sec    1.00    267.9±0.66µs   938.6 MB/sec
encode/dict/65536x8           1.00      5.1±0.12ms   395.2 MB/sec    1.11      5.6±0.21ms   357.0 MB/sec
encode/dict/8192x1            1.04     36.4±0.04µs   896.4 MB/sec    1.00     35.0±0.03µs   932.9 MB/sec
encode/dict/8192x8            1.06    305.3±1.34µs   855.9 MB/sec    1.00    287.6±1.80µs   908.7 MB/sec
encode/fixed/65536x1          1.04     10.3±0.02µs    47.6 GB/sec    1.00      9.9±0.02µs    49.3 GB/sec
encode/fixed/65536x8          1.00   1084.9±6.70µs     3.6 GB/sec    1.01   1096.8±4.18µs     3.6 GB/sec
encode/fixed/8192x1           1.00      3.0±0.01µs    20.3 GB/sec    1.03      3.1±0.01µs    19.7 GB/sec
encode/fixed/8192x8           1.06     17.4±0.02µs    28.0 GB/sec    1.00     16.4±0.06µs    29.8 GB/sec
encode/nested/65536x1         1.29     38.5±0.19µs    31.7 GB/sec    1.00     29.8±0.58µs    41.0 GB/sec
encode/nested/65536x8         1.32      2.7±0.04ms     3.6 GB/sec    1.00      2.1±0.20ms     4.8 GB/sec
encode/nested/8192x1          1.01      5.8±0.01µs    26.2 GB/sec    1.00      5.8±0.02µs    26.4 GB/sec
encode/nested/8192x8          1.04     47.3±0.08µs    25.8 GB/sec    1.00     45.5±0.14µs    26.9 GB/sec
encode/variable/65536x1       1.63     80.5±0.47µs    27.3 GB/sec    1.00     49.3±0.34µs    44.6 GB/sec
encode/variable/65536x8       1.10      5.4±0.19ms     3.2 GB/sec    1.00      4.9±0.11ms     3.6 GB/sec
encode/variable/8192x1        1.81     10.7±0.01µs    25.7 GB/sec    1.00      5.9±0.01µs    46.6 GB/sec
encode/variable/8192x8        1.37     87.5±0.23µs    25.1 GB/sec    1.00     63.9±0.26µs    34.4 GB/sec
roundtrip/dict/65536x1        1.00  1319.9±43.70µs   190.5 MB/sec    1.00  1316.4±41.25µs   191.0 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.59ms   139.9 MB/sec    1.02     14.6±0.57ms   137.9 MB/sec
roundtrip/dict/8192x1         1.01    213.3±5.87µs   153.1 MB/sec    1.00    211.4±6.11µs   154.5 MB/sec
roundtrip/dict/8192x8         1.01  1346.6±44.16µs   194.0 MB/sec    1.00  1338.9±42.85µs   195.1 MB/sec
roundtrip/fixed/65536x1       1.00    318.9±4.39µs  1568.0 MB/sec    1.00    319.6±4.24µs  1564.8 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1821.4 MB/sec    1.19      2.6±0.16ms  1532.6 MB/sec
roundtrip/fixed/8192x1        1.00     93.5±1.03µs   669.7 MB/sec    1.01     94.6±1.56µs   661.8 MB/sec
roundtrip/fixed/8192x8        1.01    342.0±6.09µs  1464.3 MB/sec    1.00    338.7±4.37µs  1478.2 MB/sec
roundtrip/nested/65536x1      1.01   885.1±37.56µs  1412.4 MB/sec    1.00   879.6±39.35µs  1421.4 MB/sec
roundtrip/nested/65536x8      1.00     10.0±0.39ms  1002.2 MB/sec    1.02     10.2±0.43ms   979.4 MB/sec
roundtrip/nested/8192x1       1.01    163.0±5.56µs   960.0 MB/sec    1.00    161.1±5.13µs   971.1 MB/sec
roundtrip/nested/8192x8       1.00   927.1±39.57µs  1350.1 MB/sec    1.02   941.8±46.61µs  1329.0 MB/sec
roundtrip/variable/65536x1    1.00  1285.3±51.72µs  1750.7 MB/sec    1.49  1910.5±128.29µs  1177.8 MB/sec
roundtrip/variable/65536x8    1.00     14.8±0.74ms  1217.3 MB/sec    1.01     15.0±0.53ms  1199.3 MB/sec
roundtrip/variable/8192x1     1.00    210.4±5.49µs  1337.6 MB/sec    1.00    209.4±6.81µs  1344.0 MB/sec
roundtrip/variable/8192x8     1.00  1251.7±29.66µs  1798.6 MB/sec    1.53  1914.9±122.75µs  1175.7 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	345.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	352.2s
CPU sys	71.6s
Peak spill	0 B

branch

Metric	Value
Wall time	340.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	333.1s
CPU sys	84.7s
Peak spill	0 B

File an issue against this benchmark runner

alamb · 2026-06-04T14:58:59Z

🤔 results look good, Im curious as to why two of the roundtrip benchmarks were slightly slower even thought encode() is faster across the board.

Looks like the results are reproducable -- next step would be to profile it to see if you can find the answer

I wonder if we are missing a Vec::with_capacity or Vec::reserve to avoid extra allocations / copies 🤔

Rich-T-kid · 2026-06-04T15:41:05Z

i'm suspecting the issue has to do with
let (client, server) = tokio::io::duplex(1024 * 1024);
per the docs "The max_buf_size argument is the maximum amount of bytes that can be written to a side before the write returns Poll::Pending."

The two regression cases both involve large variable-length data where the encoded payload can be huge:
roundtrip/variable/8192x8 — 8 columns × 8192 rows
roundtrip/variable/65536x1 — 65536 rows, large values buffer

This also shows up in the regression cases,
roundtrip/variable/8192x8 1.00 1251.7±29.66µs 1798.6 MB/sec 1.53 1914.9±122.75µs 1175.7 MB/sec
roundtrip/variable/65536x1 1.00 1285.3±51.72µs 1750.7 MB/sec 1.49 1910.5±128.29µs 1177.8 MB/sec
throughput falls flat.

taking a look at the other benchmark results this seems consistant,
roundtrip/fixed/65536x8 1.00 2.2±0.03ms 1821.4 MB/sec 1.19 2.6±0.16ms 1532.6 MB/sec throughput shrinks and as such causes more blocking to happen.

Even in the event where this isn't the reason for the slow down I think 1MB is still to small for realistic max throughput.

Rich-T-kid · 2026-06-04T15:54:17Z

I wonder if we are missing a Vec::with_capacity or Vec::reserve to avoid extra allocations / copies

I think this is a strong possibility after looking at the profile, from my understanding this is mostly in arrow-flight itself and not arrow-ipc. Since arrow-flight is very dependent on arrow-ipc it make sense to start from the ground up with these optimizations.

# Which issue does this PR close?  - Closes #10029. # Rationale for this change Increase the duplex buffer from 1 MB to 64 MB to eliminate artificial back-pressure in the roundtrip benchmarks. See rational in this [comment](#10044 (comment))  # What changes are included in this PR? bumps `max_buf_size` to 64**MB**  # Are these changes tested? n/a  # Are there any user-facing changes? n/a

gabotechs · 2026-06-08T06:37:08Z

-            ipc_message: finished_data.to_vec(),
-            arrow_data,
-        })
+        if let Some(w) = writer {


We need to find another way of modeling this without falling into "if-driven-development" patterns.

Whenever you find yourself coding something switching execution branches with completely different bodies over booleans, it's an indicator that you are falling into an "if-driven-development" pattern, and this is will greatly hurt maintenance in the future.

One way of trying to improve this situation can be:

Refactor things preferring code duplication, and split into different functions with different responsibilities, even if that implies copy-pasting big quantities of code.

Once you have the different functions with a fair amount of copy-pasted code, try to see what are the common bits, and factor them out little by little, trying to progressively reduce the LOC count in each one.

If you still see functions that need to accept bool or Option parameters that have the capability of switching how the function behaves, that means the function was the wrong abstraction on the first place, so don't be afraid of tearing existing functions apart if that allows you to reuse smaller bits in different places without switching on if statements.

gabotechs · 2026-06-08T06:39:38Z

    arrow_data: &mut Vec<u8>,
+    encoded_buffers: &mut Vec<EncodedBuffer>,
    nodes: &mut Vec<crate::FieldNode>,
    offset: i64,
    num_rows: usize,
    null_count: usize,
    compression_codec: Option<CompressionCodec>,
    compression_context: &mut CompressionContext,
    write_options: &IpcWriteOptions,
+    is_direct: bool,
 ) -> Result<i64, ArrowError> {
    let mut offset = offset;


This is another example of my comment above, but this time reinforced by a big Clippy bypass at the top.

Ideally, we should not be contributing towards stronger Clippy violations. If you try to follow the steps above, there are good chances that we can either maintain the width of this function's signature, or even reduce it.

Rich-T-kid · 2026-06-08T18:07:33Z

Replaced the (arrow_data(), encoded_buffers(), is_direct()) triple threaded through write_array_data() with a single BufferSink passed by the caller, eliminating every if is_direct { collect_encoded_buffers(...) } else { write_buffer(...) } branch that was repeated at each buffer write site. All path-specific logic is now encapsulated once in encode_sink_buffer(), which write_array_data() calls uniformly regardless of whether the output is headed for an in-memory EncodedData or a direct stream write. About -400 LOC @gabotechs

gabotechs · 2026-06-08T19:22:29Z

run benchmarks flight

adriangbot · 2026-06-08T19:27:15Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4652613611-485-hmsf5 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (70a7567) to d7ef673 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-08T19:39:08Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.00    285.2±1.80µs   881.4 MB/sec    1.00    285.3±1.50µs   881.1 MB/sec
encode/dict/65536x8           1.34      9.6±0.14ms   209.0 MB/sec    1.00      7.2±0.15ms   280.1 MB/sec
encode/dict/8192x1            1.05     37.1±0.06µs   881.5 MB/sec    1.00     35.2±0.11µs   926.7 MB/sec
encode/dict/8192x8            1.00    305.6±1.28µs   854.9 MB/sec    1.03    315.1±1.20µs   829.1 MB/sec
encode/fixed/65536x1          1.01     10.4±0.04µs    47.1 GB/sec    1.00     10.2±0.01µs    47.8 GB/sec
encode/fixed/65536x8          1.00   1085.1±7.18µs     3.6 GB/sec    1.03   1112.7±9.29µs     3.5 GB/sec
encode/fixed/8192x1           1.00      3.0±0.01µs    20.1 GB/sec    1.06      3.2±0.01µs    18.9 GB/sec
encode/fixed/8192x8           1.02     17.8±0.04µs    27.5 GB/sec    1.00     17.5±0.03µs    28.0 GB/sec
encode/nested/65536x1         1.35     38.4±0.26µs    31.8 GB/sec    1.00     28.5±0.18µs    42.8 GB/sec
encode/nested/65536x8         1.00      3.1±0.09ms     3.1 GB/sec    1.03      3.2±0.08ms     3.0 GB/sec
encode/nested/8192x1          1.01      5.8±0.01µs    26.2 GB/sec    1.00      5.8±0.01µs    26.5 GB/sec
encode/nested/8192x8          1.00     45.8±0.19µs    26.7 GB/sec    1.05     48.1±0.14µs    25.4 GB/sec
encode/variable/65536x1       1.00     62.9±0.37µs    35.0 GB/sec    1.11     69.6±0.43µs    31.6 GB/sec
encode/variable/65536x8       1.00      5.7±0.24ms     3.1 GB/sec    1.03      5.9±0.16ms     3.0 GB/sec
encode/variable/8192x1        1.00      7.1±0.01µs    38.4 GB/sec    1.00      7.1±0.02µs    38.5 GB/sec
encode/variable/8192x8        1.00     79.7±0.58µs    27.6 GB/sec    1.00     79.6±0.26µs    27.6 GB/sec
roundtrip/dict/65536x1        1.01  1319.3±49.97µs   190.6 MB/sec    1.00  1311.8±42.32µs   191.7 MB/sec
roundtrip/dict/65536x8        1.00     15.0±0.58ms   134.3 MB/sec    1.07     16.0±0.67ms   125.9 MB/sec
roundtrip/dict/8192x1         1.00    207.5±5.56µs   157.4 MB/sec    1.00    208.5±5.51µs   156.7 MB/sec
roundtrip/dict/8192x8         1.00  1346.9±43.49µs   194.0 MB/sec    1.00  1345.0±45.04µs   194.3 MB/sec
roundtrip/fixed/65536x1       1.00    305.0±4.83µs  1639.8 MB/sec    1.03    314.3±3.73µs  1591.2 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.04ms  1791.3 MB/sec    1.00      2.2±0.05ms  1799.0 MB/sec
roundtrip/fixed/8192x1        1.00     90.4±1.16µs   692.6 MB/sec    1.01     91.7±1.52µs   682.7 MB/sec
roundtrip/fixed/8192x8        1.01    329.1±3.56µs  1521.5 MB/sec    1.00    327.4±4.78µs  1529.3 MB/sec
roundtrip/nested/65536x1      1.00   871.2±51.26µs  1435.0 MB/sec    1.00   868.4±50.87µs  1439.7 MB/sec
roundtrip/nested/65536x8      1.00     10.1±0.67ms   992.0 MB/sec    1.05     10.6±0.49ms   946.4 MB/sec
roundtrip/nested/8192x1       1.00    158.4±5.60µs   987.7 MB/sec    1.00    158.5±5.00µs   987.2 MB/sec
roundtrip/nested/8192x8       1.00   916.8±42.83µs  1365.2 MB/sec    1.00   915.9±45.38µs  1366.6 MB/sec
roundtrip/variable/65536x1    1.00  1247.8±36.58µs  1803.3 MB/sec    1.05  1304.0±100.89µs  1725.6 MB/sec
roundtrip/variable/65536x8    1.00     16.5±0.56ms  1092.2 MB/sec    1.04     17.1±0.62ms  1051.2 MB/sec
roundtrip/variable/8192x1     1.00    206.9±6.52µs  1360.3 MB/sec    1.01    209.6±6.14µs  1342.6 MB/sec
roundtrip/variable/8192x8     1.00  1237.0±25.68µs  1820.1 MB/sec    1.00  1241.4±39.06µs  1813.6 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	345.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	348.1s
CPU sys	76.0s
Peak spill	0 B

branch

Metric	Value
Wall time	335.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	339.6s
CPU sys	73.9s
Peak spill	0 B

File an issue against this benchmark runner

alamb · 2026-06-08T20:26:37Z

I am excited by this one -- I am starting to check it out

gabotechs · 2026-06-09T06:51:56Z

run benchmarks ipc_writer

adriangbot · 2026-06-09T06:55:16Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4656945442-501-9447r 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (70a7567) to d7ef673 (merge-base) diff
BENCH_NAME=ipc_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench ipc_writer
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-09T06:57:08Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                                                 main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                                                 ----                                   ------------------------------------
arrow_ipc_stream_writer/FileWriter/write_10           2.08    190.0±3.19µs        ? ?/sec    1.00     91.4±0.30µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10         2.14    190.1±3.80µs        ? ?/sec    1.00     89.0±0.41µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10/zstd    1.00      7.3±0.05ms        ? ?/sec    1.00      7.2±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	30.0s
Peak memory	2.0 GiB
Avg memory	2.0 GiB
CPU user	27.8s
CPU sys	0.7s
Peak spill	0 B

branch

Metric	Value
Wall time	35.0s
Peak memory	2.0 GiB
Avg memory	2.0 GiB
CPU user	32.0s
CPU sys	0.1s
Peak spill	0 B

File an issue against this benchmark runner

gabotechs · 2026-06-09T07:11:54Z

    buffers: &mut Vec<crate::Buffer>,
-    arrow_data: &mut Vec<u8>,
+    sink: &mut BufferSink<'_>,
    nodes: &mut Vec<crate::FieldNode>,


This is looking very good!

There's one thing that might confuse readers though:

Here, we are passing these two bits:

buffers: &mut Vec<crate::Buffer>, sink: &mut BufferSink<'_>,

Which at first sight, it raises some questions:

Why do we need to pass two "buffer sinks"? isn't buffers already a sink for crate::Buffer objects?

Why do we need to dump the buffers in two places, one for buffer and other for sink, are we unnecessarily copying data because of that?

After reading a bit more carefully, one notices that there are two kinds of buffers:

crate::Buffer: an autogenerated struct from the Arrow IPC spec

arrow_buffer::Buffer: The classical in-memory Arrow buffer

Maybe there's a way of making this distinction a bit clearer with some more descriptive names for BufferSink? like naming it ArrowDataSink maybe?

this makes sense to me, I've renamed it to IpcBodySink to help new readers understand the distinction better.

gabotechs

Just left some comments mostly about naming and some other nits, but overall, it looks good to me!

It was pretty hard navigating through this file, and you did a very good job at shipping this performance optimization without making it worst, nice job @Rich-T-kid!

alamb

Thank you so much @Rich-T-kid -- this looks great

It was pretty hard navigating through this file, and you did a very good job at shipping this performance optimization without making it worst, nice job @Rich-T-kid!

I actually think it is slightly better now (though I have ideas on how to make it even better :) -- thank you @gabotechs for the reviews

I think it would be good to try and avoid the extra copy (comments inline) but we could also do that as a follow on if you prefer

While reviewing this, claude and I found some missing coverage, which I have proposed adding here:

#10097

alamb · 2026-06-09T15:12:53Z

-    buffers: &mut Vec<crate::Buffer>, // output buffer descriptors
-    arrow_data: &mut Vec<u8>,         // output stream
-    offset: i64,                      // current output stream offset
+fn encode_sink_buffer(


similarly to @gabotechs above I found the relationship between buffer, buffers and sink confusing.

Could you perhaps (can be a follow on PR) to add documentation here explaining what this is doing -- specifically that the IpcBodySink is used for the actual arrow data and the buffers is an in-progresss list what will eventually become the IPC metadata

As a follow on, we might even be able to make this clearer by encapsulating buffers in some sort of other struct

struct IpcMetadataBuilder { buffers: Vec<crate::Buffer>, nodes: Vec<crate::FieldNode>, }

And then pass that through 🤔

We could probably also make the code less verbose by making buffers, sink, and field, fields on the IPC writer 🤔

Added some doc comments above both write_array_data() & encode_sink_buffer(). I think encapsulating buffers into a struct would be worth placing in a separate small PR. that PR would/should also include some other minor tweaks like removing the allow(clippy::too_many_arguments) on write_array_data()

Rich-T-kid · 2026-06-09T18:21:55Z

@alamb introduced the changes you mentioned. I'll open up a small PR tomorrow that refactors this file to be easier to understand. Thank you!

alamb · 2026-06-09T20:39:35Z

run benchmarks ipc_writer

adriangbot · 2026-06-09T20:42:27Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4663815058-525-bgdlw 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (70d5802) to d7ef673 (merge-base) diff
BENCH_NAME=ipc_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench ipc_writer
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-09T20:44:03Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                                                 main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                                                 ----                                   ------------------------------------
arrow_ipc_stream_writer/FileWriter/write_10           2.01    186.5±1.56µs        ? ?/sec    1.00     92.9±0.41µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10         2.06    187.0±1.58µs        ? ?/sec    1.00     90.8±0.28µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10/zstd    1.00      7.2±0.03ms        ? ?/sec    1.01      7.3±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	30.0s
Peak memory	11.6 MiB
Avg memory	10.0 MiB
CPU user	26.5s
CPU sys	0.0s
Peak spill	0 B

branch

Metric	Value
Wall time	30.0s
Peak memory	10.1 MiB
Avg memory	9.0 MiB
CPU user	28.0s
CPU sys	0.0s
Peak spill	0 B

File an issue against this benchmark runner

github-actions Bot added the arrow Changes to the arrow crate label Jun 1, 2026

Rich-T-kid commented Jun 1, 2026

View reviewed changes

Comment thread arrow-ipc/src/writer.rs Outdated

Rich-T-kid mentioned this pull request Jun 2, 2026

[#10029][benchmarks] arrow-flight roundtrip as well as encode/decode #10031

Merged

gabotechs reviewed Jun 3, 2026

View reviewed changes

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from fc1dbb8 to ecb37f2 Compare June 3, 2026 18:46

Rich-T-kid mentioned this pull request Jun 4, 2026

Bump max throughput in flight benchmark before blocking #10070

Merged

Rich-T-kid added 6 commits June 4, 2026 14:05

remove repeated buffer copies in encoding pipeline

c3c87a8

fixed some linting issues

fd3a475

minor clean up. good to push

213c8e7

fix CI

6df331c

refactored PR to be easier to read/ dedupe logic

781c75c

linting fix

7acd810

gabotechs reviewed Jun 8, 2026

View reviewed changes

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from 9ec873d to 83102e2 Compare June 8, 2026 18:07

removed un-needed code duplication in favor of an enum

70a7567

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from 83102e2 to 70a7567 Compare June 8, 2026 18:14

alamb changed the title ~~Avoid Arrow IPC Copies~~ Avoid some copies in Arrow IPC Jun 8, 2026

gabotechs reviewed Jun 9, 2026

View reviewed changes

gabotechs approved these changes Jun 9, 2026

View reviewed changes

PR revision & make file easier to understand

ae03023

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from ca764ea to ae03023 Compare June 9, 2026 13:53

alamb mentioned this pull request Jun 9, 2026

Add arrow-flight test coverage for IPC compression #10097

Open

alamb changed the title ~~Avoid some copies in Arrow IPC~~ Reduce copies in Arrow IPC writer Jun 9, 2026

alamb approved these changes Jun 9, 2026

View reviewed changes

Rich-T-kid added 2 commits June 9, 2026 13:47

add doc comments and add standard interface for IpcBodySink

fdc6ba0

fix clippy

a674bed

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from 3bfccfd to a674bed Compare June 9, 2026 17:51

Rich-T-kid added 2 commits June 9, 2026 14:00

run ci

6d178ce

forgot to add padding

70d5802

alamb approved these changes Jun 9, 2026

View reviewed changes

Conversation

Rich-T-kid commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

benchmarks

Are there any user-facing changes?

Uh oh!

Uh oh!

gabotechs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Rich-T-kid commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark results from #10031

Uh oh!

Rich-T-kid commented Jun 3, 2026

Uh oh!

Rich-T-kid commented Jun 3, 2026

Uh oh!

Rich-T-kid commented Jun 3, 2026

Uh oh!

alamb commented Jun 4, 2026

Uh oh!

adriangbot commented Jun 4, 2026

Uh oh!

adriangbot commented Jun 4, 2026

Uh oh!

gabotechs commented Jun 4, 2026

Uh oh!

adriangbot commented Jun 4, 2026

Uh oh!

adriangbot commented Jun 4, 2026

Uh oh!

Rich-T-kid commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gabotechs commented Jun 4, 2026

Uh oh!

adriangbot commented Jun 4, 2026

Uh oh!

adriangbot commented Jun 4, 2026

Uh oh!

alamb commented Jun 4, 2026

Uh oh!

Rich-T-kid commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Rich-T-kid commented Jun 4, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Rich-T-kid commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gabotechs commented Jun 8, 2026

Uh oh!

adriangbot commented Jun 8, 2026

Uh oh!

adriangbot commented Jun 8, 2026

Uh oh!

alamb commented Jun 8, 2026

Uh oh!

gabotechs commented Jun 9, 2026

Uh oh!

adriangbot commented Jun 9, 2026

Rich-T-kid commented Jun 1, 2026 •

edited

Loading

Rich-T-kid commented Jun 3, 2026 •

edited

Loading

Rich-T-kid commented Jun 4, 2026 •

edited

Loading

Rich-T-kid commented Jun 4, 2026 •

edited

Loading

Rich-T-kid commented Jun 8, 2026 •

edited

Loading