Skip to content

[SPARK-57103][SQL] Add hashing for nanosecond timestamp types#56203

Open
stevomitric wants to merge 1 commit into
apache:masterfrom
stevomitric:stevomitric/SPARK-57103-hash
Open

[SPARK-57103][SQL] Add hashing for nanosecond timestamp types#56203
stevomitric wants to merge 1 commit into
apache:masterfrom
stevomitric:stevomitric/SPARK-57103-hash

Conversation

@stevomitric
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Add hashing support for the nanosecond timestamp types TimestampNTZNanosType(p) and TimestampLTZNanosType(p), in both the interpreted and codegen paths of hash.scala:

  • Murmur3Hash / XxHash64: mix both fields, following the existing CalendarInterval pattern - hashInt(nanosWithinMicro, hashLong(epochMicros, seed)).
  • HiveHash: a dedicated hashTimestampNanos that extends the existing hashTimestamp with the sub-microsecond nanoseconds using the same * 37 + field idiom as hashCalendarInterval.

Why are the changes needed?

hash-based GROUP BY / DISTINCT / joins - failed on nanosecond timestamp columns.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

Add Murmur3/XxHash64/HiveHash support (interpreted + codegen) for
TimestampNTZNanosType and TimestampLTZNanosType, whose physical value is
TimestampNanosVal (epochMicros + nanosWithinMicro). Murmur3/XxHash64 mix
both fields following the CalendarInterval precedent; HiveHash adds a
dedicated hashTimestampNanos extending hashTimestamp. New HashExpressionsSuite
tests cover interpreted/codegen/unsafe agreement and equality consistency.

Co-authored-by: Isaac
@stevomitric stevomitric force-pushed the stevomitric/SPARK-57103-hash branch from 6fc3185 to ff23dd1 Compare May 29, 2026 13:14
@stevomitric
Copy link
Copy Markdown
Contributor Author

cc @MaxGekk please take a look at this PR.

Copy link
Copy Markdown
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! The implementation looks correct and the new suite passes locally (interpreted / codegen / optimized paths agree for Murmur3, XxHash64 and HiveHash across both nanos types at precisions 7 and 9). My main feedback is about test coverage of the real code path that motivates this change (hash-based GROUP BY / shuffle / joins), plus a couple of smaller notes. Inline comments below.


Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
(values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit =>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All hash inputs here are Literals, so the generated code embeds the TimestampNanosVal as a reference object and the ordinal-read codegen path (CodeGenerator.getValue -> getTimestampNTZNanos/getTimestampLTZNanos) is never exercised, and the value never round-trips through an UnsafeRow as a hash input. checkEvaluationWithUnsafeProjection here only projects the resulting int/long hash, not the nanos input.

Since the motivation is hash-based GROUP BY / shuffle / joins (where the input is a BoundReference reading from a possibly-unsafe row), could we add a BoundReference-over-row case, e.g.:

val row = InternalRow(TimestampNanosVal.fromParts(1234567890L, 999.toShort))
val ref = BoundReference(0, dt, nullable = true)
checkEvaluation(Murmur3Hash(Seq(ref), 42), Murmur3Hash(Seq(ref), 42).eval(row), row)

This drives the row read + unsafe round-trip that the literal-based tests skip.

Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
(values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit =>
// checkEvaluation asserts the interpreted, codegen, and unsafe paths all agree.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: with a Literal child, the unsafe projection only stores the scalar hash result, not the TimestampNanosVal input, so the unsafe input path isn't actually covered here. Either reword this comment or add the BoundReference-over-row case suggested above so the comment becomes accurate.

TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
(values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit =>
// checkEvaluation asserts the interpreted, codegen, and unsafe paths all agree.
checkEvaluation(Murmur3Hash(Seq(lit), 42), Murmur3Hash(Seq(lit), 42).eval())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected value is Murmur3Hash(Seq(lit), 42).eval(), i.e. computed by the same expression under test, so this only proves the eval paths agree with each other (and, via the second test, that both fields contribute). A bug shared across all paths (e.g. a wrong constant, or a symmetric field swap) wouldn't be caught. The existing tests in this suite pin literals (e.g. checkEvaluation(HiveHash(Seq(time)), -1567775210)). Could we pin at least one golden constant per algorithm for a fixed (epochMicros, nanosWithinMicro) pair?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants