feat: apache-arrow read_arrow() function via bundled nanoarrow extension by deem0n · Pull Request #3 · relytcloud/pg_duckdb

deem0n · 2026-06-03T14:17:19Z

Adds first-class Apache Arrow IPC reading to pg_duckdb, mirroring the read_parquet() pattern. The read_arrow() function is the natural complement to read_parquet() for pipelines whose producer is JavaScript/Node, for Arrow Flight handoff, for streaming columnar data, and for notebook-style PyArrow/R-Arrow exchange. Its audience is smaller than Parquet's today, but it is growing as the Arrow IPC wire ecosystem (Flight, ADBC, JS dashboards) matures.

Changes in this PR:

* Statically links paleolimbot/duckdb-nanoarrow PR #47 (commit 42e4199, DuckDB-1.5 compatible)
* Registers read_arrow(text) / read_arrow(text[]) as duckdb_only_function stubs
* Adds read_arrow to the planner-routing strstr filter and to the metadata cache's known-function list
* Adds a round-trip regression test (arrow.sql) covering both the Arrow IPC stream and file formats, plus the array variant
* Bumps the version 1.1.0 → 1.2.0 with a matching upgrade script

Tested against this fork's pinned DuckDB v1.5.3 on Postgres 15. The regression test runs cleanly.

Note: nanoarrow rejects Arrow files with Dictionary-encoded columns. A common JS producer pitfall is apache-arrow's tableFromArrays(), which auto-builds Dictionary<Int32, Utf8> for string columns. Producers should emit plain Utf8 vectors via vectorFromArray() and stream-format IPC.

Adds first-class Apache Arrow IPC reading to pg_duckdb, mirroring the read_parquet pattern:

* third_party/pg_duckdb_extensions.cmake: link paleolimbot/duckdb-nanoarrow PR duckdb#47 (commit 42e4199), which targets DuckDB 1.5.
* sql/pg_duckdb--1.1.0--1.2.0.sql: register read_arrow(text) and read_arrow(text[]) as duckdb_only_function stubs. The planner reroutes any query containing these calls to DuckDB, which executes the function natively via the statically-linked nanoarrow extension.
* pg_duckdb.control: bump default_version 1.1.0 -> 1.2.0.
* src/pgduckdb_hooks.cpp:ContainsDuckdbRowReturningFunction(): add read_arrow to the strstr filter so the planner-routing pre-check recognises it without requiring duckdb.force_execution = true.
* src/pgduckdb_metadata_cache.cpp:BuildDuckdbOnlyFunctions(): add read_arrow to the known-function list so duckdb_only_function lookups cache its OID.
* test/regression/sql/arrow.sql + expected/arrow.out: round-trip both Arrow IPC stream (.arrows) and Arrow IPC file (.arrow) formats through COPY ... TO + read_arrow(). Array-variant test included.
* test/regression/schedule: register the arrow test.

Note: nanoarrow rejects Arrow files with Dictionary-encoded columns. A common JS producer pitfall is apache-arrow's tableFromArrays(), which auto-builds Dictionary<Int32, Utf8> for string columns. Producers should emit plain Utf8 vectors via vectorFromArray() and stream-format IPC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
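
For readers unfamiliar with the planner-routing pre-check, the idea behind the strstr filter can be sketched roughly as follows. This is a minimal illustration, not the actual body of ContainsDuckdbRowReturningFunction(): the function name MentionsRowReturningFunction and the table kRowReturningFunctions are hypothetical, and the list holds only the two functions named in this PR. The sketch assumes that a later, exact check (for example against the cached function OIDs) makes the final routing decision.

```cpp
#include <array>
#include <cstring>

// Hypothetical stand-in for the pre-check's name list; the real list in
// pg_duckdb may contain other entries. Only names mentioned in this PR appear here.
static constexpr std::array<const char *, 2> kRowReturningFunctions = {
    "read_parquet",
    "read_arrow",
};

// Returns true when the raw query text contains any listed function name.
// The match is a plain, case-sensitive substring search, so it is cheap but
// imprecise: a name inside a string literal or comment also matches. The
// sketch assumes that false positives are resolved by a later exact check.
bool MentionsRowReturningFunction(const char *query_string) {
    if (query_string == nullptr) {
        return false;
    }
    for (const char *name : kRowReturningFunctions) {
        if (std::strstr(query_string, name) != nullptr) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, a query such as SELECT * FROM read_arrow('/tmp/t.arrows') passes the pre-check, while a query that mentions none of the listed names is skipped without any further inspection.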

deem0n wants to merge 1 commit into relytcloud:main from deem0n:feat-apache-arrow