
refactor: unify ResultSet implementations on Arrow-backed path #175

Draft

mkaufmann wants to merge 15 commits into main from moritz/unify-result-sets

Conversation

@mkaufmann mkaufmann (Member) commented Apr 24, 2026

NOTE: I'm not happy with the PR or the approach yet; this is for me to review and steer the agent.

Summary

Collapse the two ResultSet families (streaming Arrow + row-based metadata) into a single Arrow-backed implementation so there is one accessor pipeline, one set of type semantics, and one place to fix bugs. Also tightens root-allocator hygiene.

Built on top of moritz/centralize-types-via-hypertype.

What changed

Unified result set. DataCloudMetadataResultSet, SimpleResultSet, and ColumnAccessor are removed. JDBC metadata calls (getTables, getColumns, getSchemas, getTypeInfo, and all empty-metadata helpers) now funnel through StreamingResultSet via a new MetadataArrowBuilder that materialises List<List<Object>> metadata rows into a populated Arrow VectorSchemaRoot. MetadataResultSets is the factory callers use.
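
For illustration, a minimal sketch of the materialisation step this describes, assuming an all-VARCHAR schema; the class and method names are hypothetical, not the actual MetadataArrowBuilder API:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.Schema;

final class MetadataRootSketch {
    // Turn row-major metadata (List<List<Object>>) into a populated
    // column-major VectorSchemaRoot. Assumes every column is VARCHAR;
    // the real builder dispatches per Arrow type.
    static VectorSchemaRoot buildRoot(Schema schema, List<List<Object>> rows, BufferAllocator allocator) {
        VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
        root.allocateNew();
        for (int row = 0; row < rows.size(); row++) {
            List<Object> cells = rows.get(row);
            for (int col = 0; col < cells.size(); col++) {
                VarCharVector vector = (VarCharVector) root.getVector(col);
                Object cell = cells.get(col);
                if (cell == null) {
                    vector.setNull(row);
                } else {
                    vector.setSafe(row, cell.toString().getBytes(StandardCharsets.UTF_8));
                }
            }
        }
        root.setRowCount(rows.size()); // also sets each vector's value count
        return root;
    }
}
```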

Source-agnostic cursor. ArrowStreamReaderCursor now accepts either a streaming ArrowStreamReader or a pre-populated in-memory VectorSchemaRoot, driven by a pluggable BatchLoader. The cursor owns an AutoCloseable holding the backing resources and closes it on cursor close.
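
The PR names a pluggable BatchLoader but does not show its interface, so the following is an assumed minimal shape of that seam, for orientation only:

```java
import java.io.IOException;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

// Assumed minimal loader interface; the actual shape in the PR may differ.
interface BatchLoader {
    /** Advance to the next batch; return false once the source is exhausted. */
    boolean loadNextBatch() throws IOException;
}

final class BatchLoaders {
    // Streaming source: delegate straight to the reader.
    static BatchLoader streaming(ArrowStreamReader reader) {
        return reader::loadNextBatch;
    }

    // In-memory source: the pre-populated VectorSchemaRoot is the single batch.
    static BatchLoader inMemory() {
        return new BatchLoader() {
            private boolean consumed = false;

            @Override
            public boolean loadNextBatch() {
                boolean first = !consumed;
                consumed = true;
                return first;
            }
        };
    }
}
```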

Root allocator hygiene.

  • QueryResultArrowStream.toArrowStreamReader previously leaked a 100 MB RootAllocator: ArrowStreamReader.close() only tears down vectors, not the allocator. It now returns a Result holder that pairs the reader with the allocator and closes both in the correct order (reader first, so ArrowBuf accounting clears before the allocator's budget check).
  • StreamingResultSet.ofInMemory(root, owned, queryId, zone, cols) similarly takes ownership of the allocator + VSR through an AutoCloseable, so every code path closes its allocator.
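
For clarity, a sketch of the reader-first close order described in the first bullet; the class and field names follow the prose above, not necessarily the actual code:

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

// Closing the reader first releases its ArrowBufs, so the allocator's
// outstanding-buffer check passes when it is closed afterwards.
final class Result implements AutoCloseable {
    private final ArrowStreamReader reader;
    private final BufferAllocator allocator;

    Result(ArrowStreamReader reader, BufferAllocator allocator) {
        this.reader = reader;
        this.allocator = allocator;
    }

    @Override
    public void close() throws Exception {
        reader.close();    // tear down vectors and release buffers first...
        allocator.close(); // ...so Arrow's accounting invariant holds here
    }
}
```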

typeName preservation. ofInMemory accepts an optional columns override so JDBC-spec labels like "TEXT" / "INTEGER" / "SHORT" survive a round-trip through Arrow (the derived HyperType names would otherwise be "VARCHAR" etc.).

StreamingResultSet.getObject(int, Class) gains an isInstance fallback so getObject(col, String.class) on a VARCHAR works without each accessor implementing typed getObject.

Behavior changes worth calling out

Accessor semantics on metadata rows are now the same as on query results, which is stricter than the old row-based SimpleResultSet:

  • getBoolean / getDate / getTime / getTimestamp on an integer column throw SQLException instead of loose-coercing.
  • getByte on an integer column is now supported (previously threw in the metadata path).
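
For concreteness, a hedged example of how this difference surfaces through DatabaseMetaData (table name illustrative; ORDINAL_POSITION is an integer column per the JDBC getColumns contract):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

final class StrictnessDemo {
    static void demo(Connection connection) throws SQLException {
        try (ResultSet cols = connection.getMetaData().getColumns(null, null, "my_table", null)) {
            while (cols.next()) {
                byte pos = cols.getByte("ORDINAL_POSITION"); // now supported on the metadata path
                System.out.println(pos);
                // cols.getBoolean("ORDINAL_POSITION");      // now throws SQLException instead of coercing
            }
        }
    }
}
```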

DataCloudDatabaseMetadataTest assertions were updated accordingly.

Test plan

  • ./gradlew clean build passes (includes spotlessCheck, all tests, JaCoCo coverage, verification).
  • ./gradlew :jdbc-core:test — 1222 tests, 0 failed.
  • Spot-check downstream: Spark datasource still compiles (covered by full build).

mkaufmann added a commit that referenced this pull request Apr 24, 2026
…ArrowStreamReader

Rework the ResultSet unification to address two reviewer requests on #175:

1. Share the vector-building code with the parameter-encoding path instead
   of having a dedicated MetadataArrowBuilder. VectorPopulator now exposes a
   row-indexed primitive (setCell) used by both callers. The existing
   single-row parameter-binding overload and a new many-row metadata
   overload both funnel through it, and all the individual vector setters
   are parameterised by row index.

2. Keep ArrowStreamReaderCursor on its original ArrowStreamReader-only
   interface. The metadata path now serialises a populated VSR to Arrow
   IPC bytes and wraps the result in a ByteArrayInputStream-backed
   ArrowStreamReader, so both streaming and metadata result sets travel
   through exactly the same reader/cursor plumbing.

Supporting changes:

- typeName overrides (e.g. "TEXT" for JDBC-spec metadata columns) now
  round-trip through Arrow via a jdbc:type_name field-metadata key rather
  than a columns-override parameter on StreamingResultSet. HyperTypeToArrow
  stamps the key on write; ArrowToHyperTypeMapper.toColumnMetadata reads it
  back.
- StreamingResultSet drops the ofInMemory(...) factory and the columns
  override; callers construct an ArrowStreamReader + BufferAllocator pair
  and hand them to of(reader, allocator, queryId, zone). The cursor owns
  both and closes reader-then-allocator on close.
- QueryResultArrowStream.toArrowStreamReader returns a simple Result
  holder (reader + allocator) instead of an AutoCloseable bundle.
- MetadataResultSets is the single entry point for Arrow-backed metadata
  result sets; MetadataArrowBuilder is deleted.
- Empty metadata results skip writeBatch() entirely so ArrowStreamReaderCursor
  doesn't interpret a zero-row batch as "at least one row available".
- Tests updated to the new API; StreamingResultSetMethodTest builds its
  in-memory ResultSet the same way as the metadata path (IPC round-trip).
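
The second point describes a VSR-to-IPC round-trip; a hedged sketch of that mechanism using stock Arrow APIs follows (the helper name is an assumption):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

final class IpcRoundTripSketch {
    // Serialise a populated root to Arrow IPC bytes, then re-read it through
    // the same ArrowStreamReader plumbing the streaming path uses.
    static ArrowStreamReader toIpcBackedReader(VectorSchemaRoot root, BufferAllocator allocator)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
            writer.start();
            if (root.getRowCount() > 0) { // empty results skip writeBatch(), per the commit message
                writer.writeBatch();
            }
            writer.end();
        }
        return new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator);
    }
}
```
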
@mkaufmann mkaufmann force-pushed the moritz/centralize-types-via-hypertype branch 2 times, most recently from 9ecba8a to 08d62bc on May 7, 2026 19:10
Base automatically changed from moritz/centralize-types-via-hypertype to main May 7, 2026 19:25
@mkaufmann mkaufmann force-pushed the moritz/unify-result-sets branch from b648940 to 07125b1 on May 11, 2026 13:38
@codecov codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 76.53631% with 84 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.64%. Comparing base (1bda7d4) to head (b83e8b7).

Files with missing lines | Patch % | Lines
...sforce/datacloud/jdbc/core/DataCloudResultSet.java | 76.28% | 45 Missing and 1 partial ⚠️
.../datacloud/jdbc/protocol/data/VectorPopulator.java | 72.05% | 12 Missing and 7 partials ⚠️
...tacloud/jdbc/core/metadata/MetadataResultSets.java | 72.91% | 9 Missing and 4 partials ⚠️
...atacloud/jdbc/core/accessor/QueryJDBCAccessor.java | 75.00% | 2 Missing and 1 partial ⚠️
...datacloud/jdbc/protocol/data/HyperTypeToArrow.java | 81.81% | 1 Missing and 1 partial ⚠️
...oud/jdbc/protocol/data/ArrowToHyperTypeMapper.java | 75.00% | 0 Missing and 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (76.53%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #175      +/-   ##
============================================
- Coverage     82.37%   80.64%   -1.73%     
+ Complexity     1867     1706     -161     
============================================
  Files           125      123       -2     
  Lines          5009     4939      -70     
  Branches        537      521      -16     
============================================
- Hits           4126     3983     -143     
- Misses          641      725      +84     
+ Partials        242      231      -11     
Components | Coverage Δ
JDBC Core | 81.02% <76.53%> (-2.13%) ⬇️
JDBC Main | 40.69% <ø> (ø)
JDBC HTTP | 90.30% <ø> (ø)
JDBC Utilities | 65.25% <ø> (ø)
Spark Datasource | ∅ <ø> (∅)
Files with missing lines | Coverage Δ
...e/datacloud/jdbc/core/ArrowStreamReaderCursor.java | 91.17% <100.00%> (+0.55%) ⬆️
...force/datacloud/jdbc/core/DataCloudConnection.java | 57.04% <100.00%> (ø)
...datacloud/jdbc/core/DataCloudDatabaseMetadata.java | 98.34% <100.00%> (-0.01%) ⬇️
...sforce/datacloud/jdbc/core/DataCloudStatement.java | 81.09% <100.00%> (ø)
...esforce/datacloud/jdbc/core/QueryMetadataUtil.java | 95.40% <100.00%> (ø)
...oud/jdbc/core/SQLExceptionQueryResultIterator.java | 75.00% <ø> (ø)
...atacloud/jdbc/protocol/QueryResultArrowStream.java | 87.50% <100.00%> (+0.83%) ⬆️
...oud/jdbc/protocol/data/ArrowToHyperTypeMapper.java | 67.79% <75.00%> (+5.97%) ⬆️
...datacloud/jdbc/protocol/data/HyperTypeToArrow.java | 76.92% <81.81%> (-1.34%) ⬇️
...atacloud/jdbc/core/accessor/QueryJDBCAccessor.java | 93.87% <75.00%> (-3.50%) ⬇️
... and 3 more

... and 5 files with indirect coverage changes



Comment on lines +374 to +388
} catch (SQLException ex) {
    // Accessor does not implement typed getObject — fall back to raw + isInstance check.
    val raw = accessor.getObject();
    updateWasNull(accessor);
    if (raw == null) {
        return null;
    }
    if (type.isInstance(raw)) {
        return type.cast(raw);
    }
    throw new SQLException(
            "Cannot convert column value to " + type.getName() + "; actual type is "
                    + raw.getClass().getName(),
            ex);
}
Member Author

Isn't this putting a bandaid over bad behavior of the accessors? Shouldn't the accessor already implement this behavior so that we don't need the bandaid here? If true, check whether this proper fix can be factored into a separate commit (before the others on this branch) so that I can maybe ship it independently.

Member Author

Agreed — fixed in 322663e by giving QueryJDBCAccessor.getObject(Class) a default raw + isInstance implementation, so accessors that don't need typed conversion just inherit the right behavior. Then a6da614 collapses StreamingResultSet.getObject(int, Class) to a direct dispatch (no try/catch).

Per your "factor into a separate commit at the start of the branch" ask: 322663e is now the first commit on the branch (touches only QueryJDBCAccessor.java, applies cleanly on main), so it can be cherry-picked and shipped independently of the rest of this PR. The follow-up that drops the bandaid (a6da614) sits at the end since it depends on the rest of the unified StreamingResultSet.
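
For reference, a minimal sketch of what such a base-class default can look like; the real QueryJDBCAccessor also tracks wasNull and adds a String fast-path, and the names here are illustrative:

```java
import java.sql.SQLException;

// Shows only the raw + isInstance core of the default conversion.
abstract class AccessorSketch {
    abstract Object getObject() throws SQLException; // untyped read

    <T> T getObject(Class<T> type) throws SQLException {
        Object raw = getObject();
        if (raw == null) {
            return null;
        }
        if (type.isInstance(raw)) {
            return type.cast(raw); // identity / supertype case, e.g. VARCHAR -> String
        }
        throw new SQLException("Cannot convert column value to " + type.getName()
                + "; actual type is " + raw.getClass().getName());
    }
}
```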

Comment on lines +90 to +92
* <p>The column metadata (including any {@link ColumnMetadata#getTypeName()} override
* stamped under {@link com.salesforce.datacloud.jdbc.protocol.data.HyperTypeToArrow#JDBC_TYPE_NAME_METADATA_KEY})
* is derived from the Arrow schema via {@link ArrowToHyperTypeMapper#toColumnMetadata(org.apache.arrow.vector.types.pojo.Field)}.
Member Author

This comment is too low-level; remove it.

Member Author

Removed in e329860 — the paragraph that mentioned JDBC_TYPE_NAME_METADATA_KEY is gone; the docstring now stays at the level of "this is what of(reader, allocator, queryId, sessionZone) does" and leaves the field-metadata details documented at HyperTypeToArrow / ColumnMetadata where they live.

Comment on lines +120 to +141
/**
 * Hand the reader + allocator pair from {@link QueryResultArrowStream.Result} to {@link
 * #of(ArrowStreamReader, BufferAllocator, String, ZoneId)} and close both on construction
 * failure. Without this, an {@code of} call that throws (for example {@code SQLException}
 * wrapping an unsupported Arrow type) would leak the 100 MB
 * {@link org.apache.arrow.memory.RootAllocator} held by {@code Result}.
 */
public static StreamingResultSet ofClosingOnFailure(
        QueryResultArrowStream.Result arrowStream, String queryId, ZoneId sessionZone) throws SQLException {
    try {
        return of(arrowStream.getReader(), arrowStream.getAllocator(), queryId, sessionZone);
    } catch (SQLException | RuntimeException ex) {
        try {
            arrowStream.getReader().close();
        } catch (Exception suppressed) {
            ex.addSuppressed(suppressed);
        }
        arrowStream.getAllocator().close();
        throw ex;
    }
}

Member Author

This looks like an independent fix; if so, factor it (and the tests, of course) into a separate commit at the start of the branch so that I can potentially ship it independently.

Member Author

Looked at this carefully — the fix is already in its own commit (26bdc37 fix: close allocator on StreamingResultSet.of failure in query path with its regression test in StreamingResultSetMethodTest), but I left it where it is on the branch rather than moving it to the very front. The reason: as coded, this fix is not independent of the unify refactor. Specifically, it adds StreamingResultSet.ofClosingOnFailure(QueryResultArrowStream.Result, ...) and switches the four call sites to it — and on main:

  • QueryResultArrowStream doesn't return a Result wrapping (reader, allocator); toArrowStreamReader just returns an ArrowStreamReader.
  • StreamingResultSet.of(...) only takes (reader, queryId) — there's no allocator to leak on construction failure, because the allocator isn't owned by the result set on main at all.

So "pull this fix to the front of the branch" would mean rewriting the fix as a different fix against the pre-unify surface, which feels like it deserves its own decision rather than a sneaky reorder here. Two options if you'd like to ship the leak fix independently of this PR:

  1. I can carve out a minimal "pre-unify" leak fix against main directly (different shape — likely closes the allocator that today lives inside QueryResultArrowStream rather than in StreamingResultSet) and open it as its own PR. Happy to do this.
  2. Land it as part of this PR and treat the unify+leak-fix as a unit.

Lean toward (1) if you want to ship it ahead of this PR, otherwise (2) is fine. Let me know.

Member Author

Try to move it to a separate commit (along with test coverage). Would it make sense to move the QueryResultArrowStream as well then?

Member Author

Honest answer: yes, that would make the leak fix sit on a clean base — but it's a non-trivial split because the unify refactor is what introduced the allocator-ownership concept (Result holder, cursor's BufferAllocator field, StreamingResultSet.of taking an allocator). To pull "QueryResultArrowStream introduces Result" + "StreamingResultSet/cursor own the allocator" + "ofClosingOnFailure leak fix" forward as a single commit before the unify refactor, I'd be effectively breaking the unify commit into two halves and rebuilding the unify on top of the pre-extracted half. Doable, but it means rewriting a ~800-line refactor.

Given (a) you have one substantive comment left to evaluate (whether commit 2088116 lands as a separate PR), and (b) the leak fix is already in its own commit (4d04fbb, regression test included), I'd propose:

  1. Now: I land 2088116 (cursor zero-row-skip fix + tests against main's signature) as the cherry-pickable first commit. Same for 093d692 (getObject(Class) base-class fallback).
  2. If you want the leak fix shipped independently: I cherry-pick 4d04fbb's intent into a new PR against main, reshaping the fix to whatever the pre-unify code's leak surface actually is (likely a fix in QueryResultArrowStream.toArrowStreamReader to wrap the ArrowStreamReader so its close also closes the RootAllocator — cleaner than threading the allocator through StreamingResultSet.of).
  3. This PR stays as-is for the leak fix: it lands as part of the unified refactor.

Lean toward not splitting the unify commit unless there's a strong reason. Want me to open the separate "main-targeted leak fix" PR per (2)?

Comment on lines +69 to +75
/**
* Load the next batch that has at least one row, skipping any zero-row batches in between.
* Empty IPC batches are valid per the Arrow spec and can be emitted by upstream producers
* (long-running queries returning empty initial chunks, schema-only metadata streams), so
* "successfully loaded" must not be conflated with "row available."
*/
private boolean loadNextNonEmptyBatch() throws IOException {
Member Author

Why do we need to skip empty IPC batches if they are valid? Shouldn't callers handle them correctly?

Member Author

Fair question. The cursor is the caller for a JDBC ResultSet — there's no layer above it that has a sensible "I just consumed an empty batch, please advance" hook to expose. Sharpened the docstring on this method (e329860) to explain it directly:

Empty IPC batches are valid per the Arrow spec, and producers in the wild emit them — Hyper / query-federator can send a schema-only initial chunk, and async chunked-query paths use empty batches as a keep-alive between data chunks. JDBC has no notion of a "batch": ResultSet.next() only knows "is there another row?". This cursor is the single place that translates batch-level signals from ArrowStreamReader.loadNextBatch() into row-level advances, so it must consume the empty batches itself rather than push that responsibility outward.

Same fix is also independently shipped as commit 4d2e47f (fix: skip zero-row batches in ArrowStreamReaderCursor). I left that commit where it is on the branch rather than moving it to the start, because as coded it also removes the now-redundant empty-batch workaround in MetadataResultSets.writeArrowStream — which doesn't exist on main, so the pre-unify version of the fix would have a different shape (just the cursor side, no metadata-side cleanup). Happy to carve that out if you want a main-targeted preview.
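
A minimal sketch of the skip loop that docstring describes (field names assumed; the real cursor also tracks a row index within the loaded batch):

```java
import java.io.IOException;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

final class CursorSketch {
    private final ArrowStreamReader reader;

    CursorSketch(ArrowStreamReader reader) {
        this.reader = reader;
    }

    boolean loadNextNonEmptyBatch() throws IOException {
        while (reader.loadNextBatch()) { // a batch loaded successfully...
            if (reader.getVectorSchemaRoot().getRowCount() > 0) {
                return true;             // ...and it actually contains rows
            }
            // zero-row batch (keep-alive / schema-only chunk): keep consuming
        }
        return false; // stream exhausted without another non-empty batch
    }
}
```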

Member Author (@mkaufmann, May 11, 2026)

Can you move this to an independent commit along with test coverage?

Member Author

Done — carved out as commit 2088116 fix: skip zero-row batches in ArrowStreamReaderCursor, now the first commit on the branch (touches only ArrowStreamReaderCursor.java + ArrowStreamReaderCursorTest.java, applies cleanly on main's pre-unify cursor signature (reader, sessionZone)). It can be cherry-picked and shipped independently.

The downstream commit 049e82d (formerly the same-named "skip zero-row batches" commit) now contains only the residual: removing the now-redundant MetadataResultSets.writeArrowStream "skip writeBatch when rowCount==0" workaround, since the cursor handles the empty-only case at the seam.

Test coverage that ships with the pre-unify commit:

  • skipsZeroRowBatchAndYieldsSubsequentNonEmptyRows — real Arrow IPC stream of {0-row, 1-row} → next() reports the one real row, then false.
  • zeroRowOnlyBatchYieldsNoRows — real Arrow IPC stream of {0-row} → next() returns false.
  • firstNextReturnsTrueWhenInitialBatchHasRows / firstNextReturnsFalseWhenStreamHasNoBatches — pure-mock pinning of the first-batch logic, replacing the old forwardsLoadNextBatch(true|false) parameterised test that would loop forever under the new control flow.
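
As a rough illustration, this is how a {0-row, 1-row} Arrow IPC stream like the one these tests consume can be constructed; the fixture shape is an assumption, not the actual test code:

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

final class ZeroRowStreamFixture {
    static byte[] zeroRowThenOneRowStream() throws Exception {
        try (RootAllocator allocator = new RootAllocator();
                VarCharVector vector = new VarCharVector("col", allocator);
                VectorSchemaRoot root = VectorSchemaRoot.of(vector)) {
            vector.allocateNew();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
                writer.start();
                root.setRowCount(0); // first batch: valid IPC, but zero rows
                writer.writeBatch();
                vector.setSafe(0, "row".getBytes(StandardCharsets.UTF_8));
                root.setRowCount(1); // second batch: one real row
                writer.writeBatch();
                writer.end();
            }
            return out.toByteArray();
        }
    }
}
```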

mkaufmann added a commit that referenced this pull request May 11, 2026
Now that QueryJDBCAccessor.getObject(Class) provides the raw + isInstance
fallback as its base-class default, StreamingResultSet no longer needs
the catch-and-retry path that worked around accessors which threw
"Operation not supported." Collapse getObject(int, Class) to direct
dispatch and update the regression test's WHY comment to point at the
accessor base class as the load-bearing layer.

Addresses: review comment on PR #175 line 388.
mkaufmann added a commit that referenced this pull request May 11, 2026
Three small follow-ups from PR #175 review:

- StreamingResultSet.of: drop the paragraph that pointed at the
  HyperTypeToArrow.JDBC_TYPE_NAME_METADATA_KEY field-metadata key. The
  docstring spilled implementation detail of the metadata-stamping path
  into a generic "create a result set from a reader" entry-point; the
  type-name override is documented at HyperTypeToArrow / ColumnMetadata
  where it's relevant.

- ArrowStreamReaderCursor.loadNextNonEmptyBatch: rewrite the rationale
  to answer "why does the cursor consume empty batches instead of the
  caller?" directly. Empty IPC batches are valid Arrow and producers
  emit them; JDBC's next() only knows rows, so this cursor is the seam
  that translates batch-level signals into row-level advances.

- MetadataResultSetsTest: drop the JDBC ResultSet-shape slice (next /
  isClosed / getStatement / unwrap / isWrapperFor / getHoldability /
  getFetchSize / setFetchSize / getWarnings / getConcurrency / getType
  / getFetchDirection). Those test the StreamingResultSet plumbing
  shared by every result set on this branch and are already covered by
  StreamingResultSetMethodTest. Keep the arity-contract slice
  (short/long/right/null/empty rows) — that is the
  metadata-result-set-specific behavior.

Addresses: review comments on PR #175.
@mkaufmann mkaufmann force-pushed the moritz/unify-result-sets branch from 07125b1 to e329860 on May 11, 2026 15:02
@mkaufmann mkaufmann force-pushed the moritz/unify-result-sets branch from e329860 to d17f1b0 on May 11, 2026 16:20
@mkaufmann mkaufmann (Member Author) commented

Per the two review threads, split out the cherry-pickable fixes as their own PRs against main: #185 and #186.

This PR (#175) keeps the same fixes as the first two commits — when #185 / #186 land, those commits will collapse to no-ops at rebase time.

For the remaining "should QueryResultArrowStream allocator-ownership move pre-unify too?" thread (#175 review): waiting on your call before I do that split. As I noted there, it's a non-trivial surgery on the unify commit and I'd rather get your sign-off before rewriting ~800 lines of refactor.

@mkaufmann mkaufmann force-pushed the moritz/unify-result-sets branch from d17f1b0 to f4cad29 on May 11, 2026 17:16
@mkaufmann mkaufmann force-pushed the moritz/unify-result-sets branch 2 times, most recently from 67ddd24 to b97abe2 on May 11, 2026 18:37
@mkaufmann mkaufmann force-pushed the moritz/unify-result-sets branch from b97abe2 to 8c8254f on May 11, 2026 19:08
@mkaufmann mkaufmann force-pushed the moritz/unify-result-sets branch from e40a2a8 to 84391cd on May 12, 2026 14:28
@mkaufmann mkaufmann force-pushed the moritz/unify-result-sets branch from 84391cd to d90bb9a on May 12, 2026 14:57
mkaufmann added 15 commits May 12, 2026 21:11
QueryJDBCAccessor.getObject(Class) threw "Operation not supported" no
matter what. That's a problem: JDBC says
ResultSet.getObject(int, Class<T>) is supposed to return the value as
T when the conversion is trivial, but every accessor that didn't
override it would throw, even on the identity case like
getObject(col, String.class) against a VARCHAR. Callers had to either
know which accessors implement typed conversion or wrap calls in
catch-and-retry.

The new default covers the trivial cases:

1. null type -> SQLException with SQLState 22023.
2. String.class -> delegate to getString(). JDBC 4.2 Table B-5
   mandates this conversion for every column type.
3. Otherwise raw + isInstance, accepting any supertype/interface
   match. wasNull is set defensively when the raw value is null so
   stale state from an earlier non-null read does not leak.
4. Anything left throws SQLFeatureNotSupportedException (matching the
   JDBC spec's expectation for unsupported conversions, and consistent
   with the existing TimeStampVectorAccessor override).

Numeric narrowing (e.g. getObject(col, Integer.class) on a BIGINT
column) is intentionally NOT handled here. The accessor-level
primitive getters (getInt, getShort, getByte) use unchecked Java
casts -- delegating to them would silently truncate Long.MAX_VALUE
to -1 instead of refusing. Accessors that need lossless cross-type
conversion override getObject(Class) (TimeStampVectorAccessor for
the Instant / OffsetDateTime / LocalDateTime / etc. paths).

Tests in StreamingResultSetMethodTest cover:
- getObjectWithClassUsesAccessorBaseFallback: identity case
  (VARCHAR -> String) goes through the inherited String fast-path.
- getObjectWithSupertypeOrInterfaceReturnsValue: Object.class and
  CharSequence.class both work via isInstance -- proves polymorphic
  callers (raw Object holders, generic tooling) aren't blocked.
- getObjectWithNullClassThrows: null type parameter raises
  SQLException with the expected "must not be null" message.
- getObjectWithIncompatibleClassThrows: requesting an unrelated type
  (String column, StringBuilder asked) raises the typed conversion
  error.
- getObjectWithClassReturnsNullForNullValue: a null column value
  short-circuits and returns null regardless of the requested type;
  wasNull() is true afterwards.
Collapse the two ResultSet families (streaming Arrow + row-based
metadata) into a single Arrow-backed implementation so there is one
accessor pipeline, one set of type semantics, and one place to fix bugs.

Changes:

- ArrowStreamReaderCursor becomes source-agnostic: a BatchLoader drives
  a VectorSchemaRoot, whether sourced from an ArrowStreamReader or a
  pre-populated in-memory batch. The cursor also owns an AutoCloseable
  so it is responsible for releasing the allocator + reader on close —
  the old ArrowStreamReader.close() would only tear down vectors and
  leak the 100 MB RootAllocator.
- QueryResultArrowStream.toArrowStreamReader returns a Result holder
  that pairs the reader with the allocator and closes both in the
  right order so Arrow's accounting invariants hold.
- StreamingResultSet gains ofInMemory(root, owned, queryId, zone, cols)
  so metadata results funnel through the same result set. A columns
  override preserves the JDBC-spec typeName labels (e.g. TEXT) that
  would otherwise be lost when deriving from the Arrow schema.
- MetadataArrowBuilder materialises List<List<Object>> metadata rows
  into a populated VectorSchemaRoot using the existing HyperTypeToArrow
  mapping; MetadataResultSets is the factory callers use.
- QueryMetadataUtil and DataCloudDatabaseMetadata route getTables,
  getColumns, getSchemas, getTypeInfo and empty metadata results
  through the Arrow-backed StreamingResultSet.
- DataCloudMetadataResultSet, SimpleResultSet, and ColumnAccessor are
  removed now that no caller depends on them.
- StreamingResultSet.getObject(int, Class) gains an isInstance-based
  fallback so callers can retrieve String-typed VARCHAR columns without
  each accessor having to implement typed getObject.
- Tests moved to the unified path; integer-accessor-only assertions in
  DataCloudDatabaseMetadataTest updated to reflect stricter Arrow
  accessor semantics.
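
A rough sketch of the seam named in the first bullet (the BatchLoader
name comes from this commit; the exact shape is an assumption, and a
later commit on this branch replaces it with an IPC round-trip):

```java
import java.io.IOException;

// Pluggable batch source: one implementation wraps an ArrowStreamReader,
// another serves a single pre-populated in-memory VectorSchemaRoot.
interface BatchLoader extends AutoCloseable {
    /** Load the next batch into the cursor's VectorSchemaRoot; false at end. */
    boolean loadNextBatch() throws IOException;
}
```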

…ArrowStreamReader

Rework the ResultSet unification to address two reviewer requests on #175:

1. Share the vector-building code with the parameter-encoding path instead
   of having a dedicated MetadataArrowBuilder. VectorPopulator now exposes a
   row-indexed primitive (setCell) used by both callers. The existing
   single-row parameter-binding overload and a new many-row metadata
   overload both funnel through it, and all the individual vector setters
   are parameterised by row index.

2. Keep ArrowStreamReaderCursor on its original ArrowStreamReader-only
   interface. The metadata path now serialises a populated VSR to Arrow
   IPC bytes and wraps the result in a ByteArrayInputStream-backed
   ArrowStreamReader, so both streaming and metadata result sets travel
   through exactly the same reader/cursor plumbing.
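
A hedged sketch of that round-trip (helper and variable names are
illustrative; only the Arrow IPC calls are the real API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

final class IpcRoundTripSketch {
    /** Serialise a populated root to IPC bytes, then re-read it as a stream. */
    static ArrowStreamReader toInMemoryReader(VectorSchemaRoot populated, BufferAllocator allocator)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ArrowStreamWriter writer =
                new ArrowStreamWriter(populated, /* dictionaries */ null, Channels.newChannel(bytes))) {
            writer.start();       // schema message
            writer.writeBatch();  // the single in-memory record batch
            writer.end();         // end-of-stream marker
        }
        return new ArrowStreamReader(new ByteArrayInputStream(bytes.toByteArray()), allocator);
    }
}
```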

Supporting changes:

- typeName overrides (e.g. "TEXT" for JDBC-spec metadata columns) now
  round-trip through Arrow via a jdbc:type_name field-metadata key rather
  than a columns-override parameter on StreamingResultSet. HyperTypeToArrow
  stamps the key on write; ArrowToHyperTypeMapper.toColumnMetadata reads it
  back.
- StreamingResultSet drops the ofInMemory(...) factory and the columns
  override; callers construct an ArrowStreamReader + BufferAllocator pair
  and hand them to of(reader, allocator, queryId, zone). The cursor owns
  both and closes reader-then-allocator on close.
- QueryResultArrowStream.toArrowStreamReader returns a simple Result
  holder (reader + allocator) instead of an AutoCloseable bundle.
- MetadataResultSets is the single entry point for Arrow-backed metadata
  result sets; MetadataArrowBuilder is deleted.
- Empty metadata results skip writeBatch() entirely so ArrowStreamReaderCursor
  doesn't interpret a zero-row batch as "at least one row available".
- Tests updated to the new API; StreamingResultSetMethodTest builds its
  in-memory ResultSet the same way as the metadata path (IPC round-trip).

StreamingResultSet.of catches IOException and IllegalArgumentException
from the Arrow schema decode and rewraps as SQLException. At all four
query-path call sites (DataCloudConnection.getRowBasedResultSet,
getChunkBasedResultSet, DataCloudStatement.executeQuery, getResultSet)
the surrounding try-catch only catches StatusRuntimeException, so a
SQLException thrown from of() bypasses it and leaks the 100 MB
RootAllocator returned by QueryResultArrowStream.toArrowStreamReader.

Introduce StreamingResultSet.ofClosingOnFailure(Result, queryId,
sessionZone) that takes the reader+allocator pair and closes both on
construction failure (reader first so its buffers release before the
allocator's budget check). Switch all four call sites to it.

The metadata path in MetadataResultSets.of already had this shape; this
fixes the matching gap on the query side.
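
The shape of the helper, as a minimal sketch (Result is assumed to be a
plain reader+allocator holder; create stands in for the private
construction body):

```java
import java.sql.SQLException;
import java.time.ZoneId;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

abstract class CloseOnFailureSketch {
    /** Hypothetical (reader, allocator) holder mirroring the Result described above. */
    static final class Result {
        final ArrowStreamReader reader;
        final BufferAllocator allocator;

        Result(ArrowStreamReader reader, BufferAllocator allocator) {
            this.reader = reader;
            this.allocator = allocator;
        }
    }

    /** Stand-in for the private construction body the factory wraps. */
    abstract Object create(ArrowStreamReader reader, BufferAllocator allocator,
            String queryId, ZoneId zone) throws SQLException;

    Object ofClosingOnFailure(Result result, String queryId, ZoneId zone) throws SQLException {
        try {
            return create(result.reader, result.allocator, queryId, zone);
        } catch (SQLException e) {
            // try-with-resources closes in reverse declaration order: the
            // reader first, so ArrowBuf accounting clears before the
            // allocator's budget check, then the allocator itself.
            try (BufferAllocator a = result.allocator; ArrowStreamReader r = result.reader) {
                // nothing to do; the close order in the header is the point
            } catch (Exception closeFailure) {
                e.addSuppressed(closeFailure);
            }
            throw e;
        }
    }
}
```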

Add a regression test that builds an Arrow IPC stream with an
unsupported field type (LargeUtf8) and asserts the helper closes both
the reader and the allocator on the resulting SQLException.

The Int/SmallInt/TinyInt setters widened from concrete boxed types
(Integer/Short/Byte) to Number so metadata rows could pass long values,
but lost the implicit "right boxed type" check at the call sites that
went through DataCloudPreparedStatement.setObject for parameter
binding. A user binding Long.MAX_VALUE to an INT32 parameter would
silently get (int) Long.MAX_VALUE = -1 written to the vector.

Add an explicit range check on Int/SmallInt/TinyInt setters before
narrowing. Both the metadata path and the parameter-binding path go
through these setters, so strict checks here mean strict on both
paths. BigInt accepts the full long range and is unchanged.
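
A focused sketch of the guard; the setter shape (vector, row index,
Number) follows this commit's description and is not the exact source:

```java
import java.sql.SQLException;

import org.apache.arrow.vector.IntVector;

final class IntSetterRangeCheckSketch {
    /** Widen to long, range-check, then narrow; refuse instead of truncating. */
    static void setInt(IntVector vector, int row, Number value) throws SQLException {
        long widened = value.longValue();
        if (widened < Integer.MIN_VALUE || widened > Integer.MAX_VALUE) {
            // SQLState 22003: numeric value out of range
            throw new SQLException("value " + widened + " out of range for INTEGER", "22003");
        }
        vector.setSafe(row, (int) widened); // narrowing is now provably lossless
    }
}
```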

Pin the behavior with a focused unit test (IntegerVectorSetterRangeCheckTest).

The driver round-trips JDBC-spec type-name overrides (e.g. "TEXT" for
metadata columns) through Arrow field metadata under a custom key. The
previous key "jdbc:type_name" sat in a generic namespace that nothing
reserves for this driver: Hyper, query-federator, or another Arrow
producer could emit a same-named key in a future protocol version, in
which case ArrowToHyperTypeMapper would silently override its own
derived type name with whatever upstream stamped.

Rename to "datacloud-jdbc:type_name" so the namespace is unambiguous,
and expand the field's javadoc to document the namespace rationale.
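
A sketch of the stamp/read pair under the new key (the key string is
from this commit; the helper methods are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

final class TypeNameMetadataSketch {
    /** Namespaced so no other Arrow producer can collide with it. */
    static final String TYPE_NAME_KEY = "datacloud-jdbc:type_name";

    /** Return a copy of the field with the JDBC type-name override stamped on. */
    static Field withTypeName(Field field, String jdbcTypeName) {
        Map<String, String> metadata = new HashMap<>(field.getMetadata());
        metadata.put(TYPE_NAME_KEY, jdbcTypeName);
        return new Field(
                field.getName(),
                new FieldType(field.isNullable(), field.getType(), field.getDictionary(), metadata),
                field.getChildren());
    }

    /** Null means "no override: derive the type name from the Arrow type". */
    static String readTypeName(Field field) {
        return field.getMetadata().get(TYPE_NAME_KEY);
    }
}
```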

The fallback in ArrowToHyperTypeMapper.toColumnMetadata — when a field
has no datacloud-jdbc:type_name override, ColumnMetadata.typeName is
null and the JDBC layer derives the column type-name from the
HyperType — was load-bearing but unasserted. Real Hyper Arrow streams
never stamp the override, so every functional query test exercised the
fallback implicitly; if a future refactor broke it, the regression
would not surface in the existing suite.

Two new pin tests:
- ArrowToHyperTypeMapperTest at the unit boundary: field with override
  -> typeName matches; field without override (null metadata, empty
  metadata) -> typeName is null.
- StreamingResultSetTest.getColumnTypeNameFallsBackToDerivedNameOnRealHyperStream
  end-to-end against local Hyper: executeQuery on a select with INT,
  VARCHAR, DECIMAL columns asserts ResultSetMetaData.getColumnTypeName
  returns the derived names ("INTEGER", "VARCHAR", "DECIMAL").

Drive-by pin test: StreamingResultSet.getObject(int, Map<String,Class<?>>)
with a null or empty type map should behave like plain getObject(int)
per the JDBC spec. Previously not asserted anywhere.
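
The spec shape being pinned, as a sketch (the non-empty-map branch is an
assumption about this driver, since custom type-map support is not
mentioned anywhere on this branch):

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Map;

abstract class TypeMapSketch implements ResultSet {
    @Override
    public Object getObject(int columnIndex, Map<String, Class<?>> map) throws SQLException {
        if (map == null || map.isEmpty()) {
            return getObject(columnIndex); // JDBC: no custom mappings to apply
        }
        throw new SQLFeatureNotSupportedException("custom type maps");
    }
}
```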

The companion getObject(Class) fallback test landed earlier on this
branch, bundled into the QueryJDBCAccessor base-class fix commit so
the fix and its end-to-end coverage ship as a single cherry-pickable
unit.

Previously a row with the wrong number of elements would silently leave
the trailing columns as Arrow null (interpreted as missing values).
Today every caller routes through MetadataSchemas so the sizes match by
construction, but a future caller bug would surface only inside vector
population, far from the boundary.

Add an explicit arity check at the of(...) entrypoint: each non-null
row must have exactly columns.size() elements. Null rows are accepted
as the all-nulls row (matching the legacy coerceRows convention of
turning null into emptyList). Empty rows are accepted only when the
schema is also empty.
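
A sketch of the check at the boundary (only the column count matters
for arity, so the schema is reduced to an int here):

```java
import java.sql.SQLException;
import java.util.List;

final class RowAritySketch {
    static void checkRowArity(int columnCount, List<List<Object>> rows) throws SQLException {
        for (int i = 0; i < rows.size(); i++) {
            List<Object> row = rows.get(i);
            if (row == null) {
                continue; // null row == the all-nulls row (legacy coerceRows convention)
            }
            if (row.size() != columnCount) {
                // also covers empty rows: size 0 only passes when the schema is empty
                throw new SQLException("metadata row " + i + " has " + row.size()
                        + " values, expected " + columnCount);
            }
        }
    }
}
```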

Pin behavior with MetadataResultSetsTest covering short, long,
correct-arity, null-row, and empty-rows cases.

Now that ArrowStreamReaderCursor.loadNextNonEmptyBatch (introduced
earlier on this branch as a pre-unify cursor fix) consumes empty
batches at the cursor seam, MetadataResultSets.writeArrowStream no
longer needs its own "skip writeBatch when rowCount==0" workaround:
the cursor handles the empty-only case correctly. Remove the special
case and always emit a batch.
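
The seam in sketch form (the method name is from this branch; the loop
body is an assumption consistent with the description):

```java
import java.io.IOException;

import org.apache.arrow.vector.ipc.ArrowStreamReader;

final class CursorSeamSketch {
    /** Advance past zero-row batches; true only when a batch has rows. */
    static boolean loadNextNonEmptyBatch(ArrowStreamReader reader) throws IOException {
        while (reader.loadNextBatch()) {                        // false at end of stream
            if (reader.getVectorSchemaRoot().getRowCount() > 0) {
                return true;                                    // rows available to serve
            }
            // zero-row batches are valid Arrow IPC: consume and keep reading
        }
        return false;
    }
}
```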

Tightens the zeroRowOnlyBatchYieldsNoRows test docstring to match.

DataCloudMetadataResultSet was deleted in this PR, but the test file
retained the old name and lived in the wrong package. Merge its empty-
result-set JDBC-shape smoke tests into the new MetadataResultSetsTest
under the .core.metadata package and delete the legacy file. No
behavior change.

Now that QueryJDBCAccessor.getObject(Class) provides the raw + isInstance
fallback as its base-class default, StreamingResultSet no longer needs
the catch-and-retry path that worked around accessors which threw
"Operation not supported." Collapse getObject(int, Class) to direct
dispatch and update the regression test's WHY comment to point at the
accessor base class as the load-bearing layer.

Addresses: review comment on PR #175 line 388.

Three small follow-ups from PR #175 review:

- StreamingResultSet.of: drop the paragraph that pointed at the
  HyperTypeToArrow.JDBC_TYPE_NAME_METADATA_KEY field-metadata key. The
  docstring spilled implementation detail of the metadata-stamping path
  into a generic "create a result set from a reader" entry-point; the
  type-name override is documented at HyperTypeToArrow / ColumnMetadata
  where it's relevant.

- ArrowStreamReaderCursor.loadNextNonEmptyBatch: rewrite the rationale
  to answer "why does the cursor consume empty batches instead of the
  caller?" directly. Empty IPC batches are valid Arrow and producers
  emit them; JDBC's next() only knows rows, so this cursor is the seam
  that translates batch-level signals into row-level advances.

- MetadataResultSetsTest: drop the JDBC ResultSet-shape slice (next /
  isClosed / getStatement / unwrap / isWrapperFor / getHoldability /
  getFetchSize / setFetchSize / getWarnings / getConcurrency / getType
  / getFetchDirection). Those test the StreamingResultSet plumbing
  shared by every result set on this branch and are already covered by
  StreamingResultSetMethodTest. Keep the arity-contract slice
  (short/long/right/null/empty rows) — that is the
  metadata-result-set-specific behavior.

Addresses: review comments on PR #175.

StreamingResultSet had two public factories — of(reader, allocator,
queryId[, zone]) (4 callers) and ofClosingOnFailure(Result, queryId,
zone) (5 callers). Every production caller wanted the close-on-failure
behavior; only tests and the metadata helper used the bare of(). Two
factories with overlapping responsibilities are one too many: a caller
hitting the bare of() without knowing about ofClosingOnFailure would
silently leak the 100 MB RootAllocator on construction failure.

Collapse to one public factory:
- of(QueryResultArrowStream.Result, queryId, sessionZone) — the only
  factory callers see, always closes both reader and allocator on
  failure. Name is the unambiguous "of" because there is no other.
- create(reader, allocator, queryId, sessionZone) — private; just the
  construction body the factory wraps.

Production call sites (DataCloudConnection, DataCloudStatement) and
MetadataResultSets were already passing a (reader, allocator) pair,
so the call shape collapses to passing the Result holder. Tests that
were building the pair locally now wrap it in a Result the same way.

…r interface

Pre-unify there were three result-set implementations: StreamingResultSet
(streaming Arrow query results), DataCloudMetadataResultSet (metadata),
SimpleResultSet (in-memory rows). The DataCloudResultSet interface, a
one-method (getQueryId) extension over java.sql.ResultSet, was the
common supertype the public API exposed; StreamingResultSet was the
only non-trivial implementation it ever had.

The unify refactor collapsed all three implementations into
StreamingResultSet, but kept the interface and the "Streaming" name. Two
problems fall out:

- The "Streaming" name now lies. Metadata results flow through the same
  class but they're a one-shot in-memory IPC blob — nothing streaming
  about them. MetadataResultSets.of even passes /*queryId=*/ null
  because there is no query.
- The DataCloudResultSet interface has one implementer and one method.
  Layering an interface for one impl is just a reader trap: callers
  instinctively look for "what other implementations exist" and find
  none.

Collapse the two:
- Rename the class StreamingResultSet -> DataCloudResultSet.
- Delete the old DataCloudResultSet interface (the public method
  getQueryId() now lives directly on the class via @Getter).
- Update all production and test references; rename the affected test
  files to match (StreamingResultSet*Test -> DataCloudResultSet*Test).

The public API surface is source-compatible for the common cases:
DataCloudConnection.getRowBasedResultSet / getChunkBasedResultSet still
return DataCloudResultSet, just as a class instead of an interface.
The change is binary-incompatible for any caller that ever cast to or
implemented the old interface; in practice StreamingResultSet was its
only implementation inside the driver, and no code outside the driver
implemented it at all.

@mkaufmann force-pushed the moritz/unify-result-sets branch from d90bb9a to b83e8b7 on May 12, 2026 at 19:13