feat: apache-arrow read_arrow() function via bundled nanoarrow extension #3

Open: deem0n wants to merge 1 commit into relytcloud:main from deem0n:feat-apache-arrow

deem0n commented Jun 3, 2026

Adds first-class Apache Arrow IPC reading to pg_duckdb, mirroring the read_parquet() pattern. The read_arrow() function is a natural complement to read_parquet() for pipelines whose producer is JavaScript/Node, for Arrow Flight handoff, for streaming columnar data, and for notebook-style PyArrow/R-Arrow exchange. Its audience is smaller than Parquet's today, but it is growing as the Arrow IPC wire ecosystem (Flight, ADBC, JS dashboards) matures.

Summary of changes:

  • Statically links paleolimbot/duckdb-nanoarrow PR #47 (commit 42e4199, DuckDB-1.5 compatible)
  • Registers read_arrow(text) / read_arrow(text[]) as duckdb_only_function stubs
  • Adds read_arrow to the planner-routing strstr filter and the metadata cache's known-function list
  • Round-trip regression test (arrow.sql) covers both Arrow IPC stream + file formats and the array variant
  • Version bump 1.1.0 → 1.2.0 with matching upgrade script

Tested against this fork's pinned DuckDB v1.5.3 on Postgres 15. The regression test runs cleanly.

Note: nanoarrow rejects Arrow files with Dictionary-encoded columns. A common JS producer pitfall is apache-arrow's tableFromArrays(), which auto-builds Dictionary<Int32, Utf8> for string columns. Producers should emit plain Utf8 vectors via vectorFromArray() and stream-format IPC.

Adds first-class Apache Arrow IPC reading to pg_duckdb, mirroring the
read_parquet pattern:

* third_party/pg_duckdb_extensions.cmake: link paleolimbot/duckdb-nanoarrow
  PR #47 (commit 42e4199), which targets DuckDB 1.5.

* sql/pg_duckdb--1.1.0--1.2.0.sql: register read_arrow(text) and
  read_arrow(text[]) as duckdb_only_function stubs. The planner reroutes
  any query containing these calls to DuckDB, which executes the
  function natively via the statically-linked nanoarrow extension.

* pg_duckdb.control: bump default_version 1.1.0 -> 1.2.0.

* src/pgduckdb_hooks.cpp:ContainsDuckdbRowReturningFunction(): add
  read_arrow to the strstr filter so the planner-routing pre-check
  recognises it without requiring duckdb.force_execution = true.

* src/pgduckdb_metadata_cache.cpp:BuildDuckdbOnlyFunctions(): add
  read_arrow to the known-function list so duckdb_only_function lookups
  cache its OID.

* test/regression/sql/arrow.sql + expected/arrow.out: round-trip both
  Arrow IPC stream (.arrows) and Arrow IPC file (.arrow) formats
  through COPY ... TO + read_arrow(). Array-variant test included.

* test/regression/schedule: register the arrow test.
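
The planner-routing pre-check described above can be sketched roughly as follows. This is a minimal illustration of a strstr-based filter, not the actual body of ContainsDuckdbRowReturningFunction(): the helper name MentionsRowReturningFunction, the kRowReturningFunctions array, and its two entries are assumptions made for the example, and the real function may take different arguments and check more names.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical list of DuckDB-only function names; the real list lives in
// src/pgduckdb_hooks.cpp and is likely longer.
static const char *const kRowReturningFunctions[] = {"read_parquet", "read_arrow"};

// Returns true if the raw query text mentions any listed function name.
// This is a plain, case-sensitive substring search, so it can report false
// positives (for example, a name inside a string literal). It is only useful
// as a cheap pre-check ahead of a more precise lookup.
bool MentionsRowReturningFunction(const char *query_string) {
    if (query_string == nullptr) {
        return false;
    }
    for (const char *name : kRowReturningFunctions) {
        if (std::strstr(query_string, name) != nullptr) {
            return true;
        }
    }
    return false;
}
```

With a filter of this shape, a query such as SELECT * FROM read_arrow('data.arrows') is flagged for DuckDB routing, while a plain SELECT 1 is not. Adding a new reader then amounts to adding one entry to the list.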

Note: nanoarrow rejects Arrow files with Dictionary-encoded columns.
A common JS producer pitfall is apache-arrow's tableFromArrays(),
which auto-builds Dictionary<Int32, Utf8> for string columns. Producers
should emit plain Utf8 vectors via vectorFromArray() and stream-format
IPC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>