[SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types by MaxGekk · Pull Request #56198 · apache/spark

MaxGekk · 2026-05-29T09:03:26Z

What changes were proposed in this pull request?

Implement columnar storage support for TimestampNTZNanosType and TimestampLTZNanosType across the column-vector stack. The layout mirrors CalendarInterval: each column gets two child vectors — a Long child for epochMicros and a Short child for nanosWithinMicro (range [0, 999]).
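To make the two-component layout concrete, here is a minimal sketch of how a nanosecond-precision instant could be split into the two stored parts and recombined. The class and method names (`NanosSplit`, `epochMicros`, `nanosWithinMicro`, `toEpochNanos`) are hypothetical and are not Spark APIs. Floor-based division is an assumption made here so that the remainder stays in [0, 999] for pre-epoch (negative) values as well.

```java
final class NanosSplit {
    static final long NANOS_PER_MICRO = 1_000L;

    // Whole microseconds since the epoch, rounded toward negative infinity.
    static long epochMicros(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_MICRO);
    }

    // Remaining nanoseconds inside that microsecond; always in [0, 999], so it fits in a short.
    static short nanosWithinMicro(long epochNanos) {
        return (short) Math.floorMod(epochNanos, NANOS_PER_MICRO);
    }

    // Inverse of the split; throws ArithmeticException on long overflow.
    static long toEpochNanos(long epochMicros, short nanosWithinMicro) {
        return Math.addExact(Math.multiplyExact(epochMicros, NANOS_PER_MICRO), nanosWithinMicro);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1_234_567L, -1L, -1_001L};
        for (long n : samples) {
            System.out.println(n + " -> (" + epochMicros(n) + ", " + nanosWithinMicro(n) + ")");
        }
    }
}
```

With truncating division (`/` and `%`), an input of -1 would produce a remainder of -1, which falls outside the documented range; that is why the sketch uses the floor variants.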

Concretely:

ColumnVector — getTimestampNTZNanos / getTimestampLTZNanos now read from child vectors instead of throwing SparkUnsupportedOperationException.
WritableColumnVector — allocates the two child columns in the constructor; adds putTimestampNTZNanos / putTimestampLTZNanos write methods.
ConstantColumnVector — same child-column allocation; adds setTimestampNanosVal for the constant-value (partition-column) path.
RowToColumnConverter (Columnar.scala) — adds TimestampNTZNanosConverter / TimestampLTZNanosConverter objects (append epochMicros + nanosWithinMicro to children via appendStruct); routes nullable columns through StructNullableTypeConverter.
ColumnVectorUtils — handles both types in populate (constant-column path) and in appendValue (null and non-null branches).

Why are the changes needed?

SPARK-56981 added row-level physical representation for nanosecond timestamps, but columnar execution could not hold or move these values — any attempt to build a ColumnarBatch from rows containing nanosecond timestamps threw an unsupported-operation exception. This PR closes that gap.

Does this PR introduce any user-facing change?

Yes. ColumnarBatch can now be built from InternalRows containing TimestampNTZNanosType / TimestampLTZNanosType values. Previously this threw SparkUnsupportedOperationException.

How was this patch tested?

Added four unit tests to RowToColumnConverterSuite:

TimestampNTZNanosType column roundtrip — non-null values survive the row→column→read cycle.
TimestampNTZNanosType column with nulls — null slots are preserved correctly.
TimestampLTZNanosType column roundtrip — same for the LTZ variant.
TimestampLTZNanosType column with nulls — same for the LTZ variant.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6 (claude.ai/code)

… timestamp types

Implement read/write/append support for TimestampNTZNanosType and TimestampLTZNanosType in column vectors, following the CalendarInterval two-child-vector pattern (Long for epochMicros, Short for nanosWithinMicro).

Co-authored-by: Max Gekk <max.gekk@gmail.com>

…-vector support

Four issues found in code review:

1. appendStruct(true) null-propagation: extend the StructType|VariantType guard in WritableColumnVector to also recurse for CalendarIntervalType, TimestampNTZNanosType, and TimestampLTZNanosType children, so that a nullable struct field of these types correctly propagates nulls into their own child sub-columns, preventing index divergence.
2. MutableColumnarRow: add copy(), get(), and update() branches for TimestampNTZNanosType and TimestampLTZNanosType, plus setTimestampNTZNanos and setTimestampLTZNanos setters.
3. ColumnVector Javadoc: fix "int vector" -> "short vector" for child 1 of the nanosecond timestamp layout.
4. Test coverage: add testVectors (OnHeap + OffHeap) for both nanos types to ColumnVectorSuite; add populate tests to ColumnVectorUtilsSuite; add nanos columns to the ColumnarBatchSuite RowToColumnConverter end-to-end test.

Co-authored-by: Max Gekk <max.gekk@gmail.com>

Co-authored-by: Max Gekk <max.gekk@gmail.com>

MaxGekk · 2026-05-30T05:25:03Z

@dongjoon-hyun @peter-toth May I ask you to review this PR? We need this to support timestamps with nanosecond precision in the Parquet vectorized reader and in ORC.

peter-toth

Summary

Closes the SPARK-56981 follow-up gap so that ColumnarBatch can hold nanosecond timestamps. Each parent column gets two children — Long epochMicros + Short nanosWithinMicro — mirroring the existing CalendarInterval pattern.

Prior state and problem. SPARK-56981 added the row-layer plumbing (SpecializedGetters, UnsafeRow, UnsafeArrayData) for TimestampNTZNanosType/TimestampLTZNanosType, but the column-vector stack was untouched: the default ColumnVector.getTimestamp{NTZ,LTZ}Nanos threw SparkUnsupportedOperationException, WritableColumnVector had no putters or constructor child allocation, and every type-dispatch site (MutableColumnarRow.update, ColumnVectorUtils.populate/appendValue, RowToColumnConverter.getConverterForType) lacked a branch. Any path materializing a ColumnarBatch from rows containing nanos values blew up. Structurally this is because each dispatch site predates the nanos types and fans out independently — adding the type means walking each one.

Design approach. Treat the new types as structurally identical to CalendarInterval (struct-shaped parent with two children) and follow the existing dispatch shape rather than introduce a new pattern. The ColumnVector defaults read via getChild(0).getLong + getChild(1).getShort; child columns are auto-allocated in the WritableColumnVector and ConstantColumnVector constructors; nullable wrappers are routed to StructNullableTypeConverter. NTZ and LTZ remain as parallel methods even with identical bodies, matching the row-layer convention from SPARK-56981.
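The struct-shaped layout described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the Spark implementation: `ToyNanosVector` and all of its methods are hypothetical, and plain arrays stand in for the two child vectors that the real code reaches through getChild(0) and getChild(1).

```java
final class ToyNanosVector {
    private final boolean[] isNull;
    private final long[] microsChild;   // plays the role of child 0 (Long epochMicros)
    private final short[] nanosChild;   // plays the role of child 1 (Short nanosWithinMicro)

    ToyNanosVector(int capacity) {
        // Both children are allocated together with the parent, as in the constructors described above.
        isNull = new boolean[capacity];
        microsChild = new long[capacity];
        nanosChild = new short[capacity];
    }

    void put(int rowId, long epochMicros, short nanosWithinMicro) {
        if (nanosWithinMicro < 0 || nanosWithinMicro > 999) {
            throw new IllegalArgumentException("nanosWithinMicro out of [0, 999]: " + nanosWithinMicro);
        }
        isNull[rowId] = false;
        microsChild[rowId] = epochMicros;
        nanosChild[rowId] = nanosWithinMicro;
    }

    void putNull(int rowId) { isNull[rowId] = true; }

    boolean isNullAt(int rowId) { return isNull[rowId]; }

    // A read is just two primitive reads at the same row index.
    long epochMicrosAt(int rowId) { return microsChild[rowId]; }

    short nanosWithinMicroAt(int rowId) { return nanosChild[rowId]; }

    public static void main(String[] args) {
        ToyNanosVector v = new ToyNanosVector(3);
        v.put(0, 1_700_000_000_000_000L, (short) 123);
        v.putNull(1);
        v.put(2, -1L, (short) 999);
        for (int i = 0; i < 3; i++) {
            System.out.println(v.isNullAt(i)
                ? "null"
                : v.epochMicrosAt(i) + "us + " + v.nanosWithinMicroAt(i) + "ns");
        }
    }
}
```

The point of the model is that parent and children share one row index, so any operation that advances the parent without advancing the children leaves them misaligned.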

Key design decisions.

The ColumnVector default getters now read the layout instead of throwing, so concrete subclasses that don't override but happen to have child vectors of the right primitive types could silently return values rather than fail loudly. Same risk profile as getInterval/getVariant; callers are expected to dispatch on dataType() first.
ConstantColumnVector exposes a single type-agnostic setTimestampNanosVal, while WritableColumnVector and MutableColumnarRow keep parallel NTZ/LTZ setters. Each surface follows its neighbour's existing convention.
The appendStruct(true) recursion in WritableColumnVector now treats CalendarIntervalType, TimestampNTZNanosType, TimestampLTZNanosType as structurally-childed alongside StructType/VariantType, so a null parent struct cascades to grandchild cursors.

Implementation sketch. Java side: ColumnVector default getters; WritableColumnVector putters + constructor + appendStruct recursion; ConstantColumnVector constructor + setter; MutableColumnarRow setters + update/get/copy dispatch; ColumnVectorUtils populate + appendValue. Scala side: two new RowToColumnConverter cases and the routing branch in getConverterForType for nullable wrappers.

Behavioral changes worth calling out.

ColumnarBatch can now hold nanosecond timestamp columns where it previously threw SparkUnsupportedOperationException.
WritableColumnVector.appendStruct(true) now recurses into CalendarIntervalType child columns, fixing a previously-latent grandchild-cursor skew (see #1 below). Not flagged in the PR description.

General

The PR description doesn't mention the CalendarIntervalType change in WritableColumnVector.appendStruct. It's a real fix for nested struct-of-interval scenarios, narrowly latent before this PR. A one-line note in the description would help reviewers focused on nanos not miss the interval-semantics shift.

Suggested improvements

appendStruct CalendarIntervalType is a separate latent fix. The recursion now treats CalendarInterval child columns as structurally-childed; previously they took the primitive path and would have skewed grandchild cursors for null parent rows. Worth either splitting out or adding a struct-of-interval test. [sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java:766]
Two identical dispatch branches in populate. appendValue already collapses both nanos types into one branch; populate does not. [sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java:110]
Missing MutableColumnarRow test for TimestampLTZNanosType. The new setTimestampLTZNanos/update/get/copy paths for LTZ aren't exercised; the NTZ test mirrors them straightforwardly. [sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala:424]

peter-toth · 2026-05-30T09:18:10Z

      for (WritableColumnVector c: childColumns) {
-        if (c.type instanceof StructType || c.type instanceof VariantType) {
+        if (c.type instanceof StructType || c.type instanceof VariantType
+            || c.type instanceof CalendarIntervalType


Adding CalendarIntervalType here isn't just supporting the new nanos types — it also fixes a previously-latent bug for nested struct-of-interval. Pre-PR, when an outer struct column was appended as null and one of its child columns was a CalendarInterval, the interval child took the else branch (c.appendNull()), advancing only the interval's own cursor and leaving its three grandchild primitive columns (months/days/microseconds) un-advanced. Subsequent rows would then write into the wrong grandchild slots — silent skew.

The fix is correct, but:

The PR description doesn't mention this. Worth one line so reviewers don't miss the interval-semantics change.

There's no test exercising the new recursion. The minimum case is a struct-of-interval column with at least one null parent row, then read back the next non-null row's children to verify they aren't shifted. Same shape extends to struct-of-TimestampNanos.

Up to you whether to split out into a separate commit or keep bundled.
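The cursor skew described in this comment can be reproduced with a toy append-only model. This is only a sketch: `CursorSkewDemo`, `Col`, and `landingRow` are hypothetical names, and the model keeps just enough of the append semantics (one slot per appended row, optional recursion into children on null) to show the effect for an interval-shaped column with months/days/microseconds grandchildren.

```java
import java.util.ArrayList;
import java.util.List;

final class CursorSkewDemo {
    /** Toy append-only column: one entry in `slots` per appended row (null = null slot). */
    static final class Col {
        final List<Long> slots = new ArrayList<>();
        final List<Col> children = new ArrayList<>();

        // Appends a null slot; optionally advances every descendant as well.
        void appendNull(boolean recurse) {
            slots.add(null);
            if (recurse) {
                for (Col c : children) c.appendNull(true);
            }
        }

        void appendLong(long v) { slots.add(v); }
    }

    /**
     * Appends one null row and then the interval (1, 2, 3), and returns the row index
     * at which the months value actually landed. The correct answer is 1.
     */
    static int landingRow(boolean recurseIntoInterval) {
        Col months = new Col(), days = new Col(), micros = new Col();
        Col interval = new Col();
        interval.children.add(months);
        interval.children.add(days);
        interval.children.add(micros);

        // Row 0: the outer struct is null, so its interval child receives a null.
        interval.appendNull(recurseIntoInterval);

        // Row 1: a non-null interval; the parent slot is a marker, the data goes to the grandchildren.
        interval.appendLong(0L);
        months.appendLong(1L);
        days.appendLong(2L);
        micros.appendLong(3L);

        return months.slots.indexOf(1L);
    }

    public static void main(String[] args) {
        System.out.println("without recursion, months landed at row " + landingRow(false));
        System.out.println("with recursion, months landed at row " + landingRow(true));
    }
}
```

Without recursion the months value lands at row 0 instead of row 1, which is the silent shift the review describes; with recursion it lands at row 1. A real regression test would follow the same shape against a struct-of-interval or struct-of-TimestampNanos column.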

peter-toth · 2026-05-30T09:18:10Z

      } else if (pdt instanceof PhysicalCalendarIntervalType) {
        // The value of `numRows` is irrelevant.
        col.setCalendarInterval((CalendarInterval) row.get(fieldIdx, t));
+      } else if (pdt instanceof PhysicalTimestampNTZNanosType) {


The two else if branches are identical. appendValue below at line 178 already collapses both nanos types into one condition with ||. Suggest the same here:

      } else if (pdt instanceof PhysicalTimestampNTZNanosType
          || pdt instanceof PhysicalTimestampLTZNanosType) {
        col.setTimestampNanosVal((TimestampNanosVal) row.get(fieldIdx, t));
      }

peter-toth · 2026-05-30T09:18:11Z

+      assert(mutableRow.get(0, TimestampNTZNanosType(9)) === v)
+      assert(mutableRow.copy().get(0, TimestampNTZNanosType(9)) === v)
+    }
+  }


The PR adds a MutableColumnarRow test for TimestampNTZNanosType but not for TimestampLTZNanosType. The LTZ paths (setTimestampLTZNanos at MutableColumnarRow.java:348, the update(TimestampLTZNanosType) and get(TimestampLTZNanosType) dispatches at MutableColumnarRow.java:240,272, and the copy() branch at MutableColumnarRow.java:104) aren't exercised. Adding a parallel "mutable ColumnarRow with TimestampLTZNanosType" test block right after this one would close the gap.

peter-toth · 2026-05-30T09:23:15Z

@MaxGekk, probably we should extract the CalendarIntervalType related fix to a separate PR, but the other 2 are just minor nits.

MaxGekk changed the title ~~[SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types~~ [WIP][SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types May 29, 2026

MaxGekk added 3 commits May 29, 2026 11:27

[SPARK-57100][SQL] Fix Scalastyle import order in ColumnarBatchSuite

e46c1f2

Co-authored-by: Max Gekk <max.gekk@gmail.com>

Remove unused import

7a10e70

MaxGekk changed the title ~~[WIP][SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types~~ [SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types May 29, 2026

peter-toth reviewed May 30, 2026

MaxGekk wants to merge 4 commits into apache:master from MaxGekk:nanos-column-vector
