fix(manifest): encode day-transform partition fields with Avro date logical type #915

Open
artbcf wants to merge 3 commits into apache:main from artbcf:fix/day-transform-avro-date-encoding

Conversation

@artbcf

@artbcf artbcf commented Apr 17, 2026


DayTransform.Apply() returns an int32 value representing days since the epoch. For manifest partition data, Iceberg implementations commonly write day(...) partition fields as Avro int with the date logical type, while readers should be able to accept both plain Avro int and Avro date.

Previously, DayTransform.ResultType() returned Int32, which caused PartitionSpec.PartitionType() to model day-transform partition fields as plain Int32. As a result, Iceberg Go wrote day-partition fields as plain Avro int, instead of Avro date, diverging from the writer behavior of other Iceberg implementations.

This PR changes DayTransform.ResultType() to return DateType. That lets the existing partition type and Avro schema conversion paths naturally emit Avro date for day-transform partition fields.

Tests added:

  • Verify that day(ts) partition fields are encoded with the Avro date logical type.
  • Verify that non-day Int32 partition fields still use plain Avro int encoding.
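
To illustrate the change described above, here is a minimal, self-contained sketch. It does not use iceberg-go's real types: `Type`, `Int32Type`, `DateType`, `dayResultType`, and `avroFieldSchema` are hypothetical stand-ins. It only shows why reporting a date result type is enough for a type-driven schema mapping to emit the Avro `date` logical type, which annotates the Avro `int` physical type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Type is a hypothetical stand-in for a partition field type.
type Type interface{ String() string }

type Int32Type struct{}
type DateType struct{}

func (Int32Type) String() string { return "int" }
func (DateType) String() string  { return "date" }

// dayResultType models the behavior described in this PR: the day transform
// reports a date result type instead of a plain 32-bit integer.
func dayResultType() Type { return DateType{} }

// avroFieldSchema maps a partition field type to an Avro schema fragment.
// Both types share the Avro "int" physical type; only DateType adds the
// "date" logical type annotation.
func avroFieldSchema(t Type) map[string]any {
	switch t.(type) {
	case DateType:
		return map[string]any{"type": "int", "logicalType": "date"}
	default:
		return map[string]any{"type": "int"}
	}
}

func main() {
	for _, t := range []Type{dayResultType(), Int32Type{}} {
		b, _ := json.Marshal(avroFieldSchema(t))
		fmt.Printf("%s -> %s\n", t, b)
	}
}
```

The mapping itself never inspects the transform; the type alone carries the information.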

@artbcf artbcf requested a review from zeroshade as a code owner April 17, 2026 02:06

@laskoviymishka laskoviymishka left a comment

Contributor

The fix works, it's safe, and it unblocks Trino/Spark readers; I think it's fine to merge as-is.

That said, the real divergence is in DayTransform.ResultType returning Int32 instead of DateType (Java and PyIceberg both return Date). This PR patches the symptom in partitionTypeToAvroSchema, but the same ResultType divergence also leaves table/internal/utils.go building a wrong fieldIDToLogicalType for day partitions — masked today only because ManifestWriter.addEntry overrides it. Any consumer using DataFileBuilder directly sees the mismatch.

So: fine to land as-is, but worth a follow-up issue (or a second PR) for the root fix — either at ResultType or inside PartitionSpec.PartitionType(). That also drops the new spec parameter here and clears utils.go in one move.

The convertDateValue no-op branch and the empty-spec tests are worth tightening regardless.

Comment thread manifest.go Outdated
Comment thread schema_conversions.go Outdated
Comment on lines 28 to 44
// partitionTypeToAvroSchema converts the partition struct type to an Avro
// schema. spec is used to resolve the transform for each partition field so
// that Int32 fields produced by DayTransform are encoded with the "date"
// logical type, matching what other engines (Trino, Spark, etc.) write.
func partitionTypeToAvroSchema(t *StructType, spec PartitionSpec) (*avro.Schema, error) {
	// Build a field-ID → Transform map so we can detect DayTransform below.
	transformByFieldID := make(map[int]Transform, spec.NumFields())
	for i := range spec.NumFields() {
		pf := spec.Field(i)
		transformByFieldID[pf.FieldID] = pf.Transform
	}

	fields := make([]avro.SchemaField, len(t.FieldList))
	for i, f := range t.FieldList {
		var node avro.SchemaNode
		switch typ := f.Type.(type) {
		case Int32Type:

Contributor

partitionTypeToAvroSchema(t StructType, spec PartitionSpec) was a pure type-to-schema mapping. It now takes the partition spec and builds a transform lookup to decide one field's Avro type. That coupling exists only because Int32Type is hiding what should be a DateType (see context above).

A few alternatives, roughly from cleanest to most surgical:

  1. Fix DayTransform.ResultType to return DateType. Matches Java and PyIceberg. PartitionSpec.PartitionType() then produces a DateType field for day partitions, the existing case DateType: branch fires, the new parameter goes away, and table/internal/utils.go fixes itself. Wider blast (anyone reading ResultType or BoundTransform.Type() sees Date now), but that's the correct answer per spec.

  2. Fix inside PartitionSpec.PartitionType(schema). Same effect on the Avro mapping without touching ResultType: when building the partition struct, override the field type to DateType for day transforms. partitionTypeToAvroSchema stays pure, utils.go sees the right type, blast radius stays narrow to the partition-struct construction path.

  3. Narrow the parameter. If option 1 or 2 isn't on the table, pass map[int]Transform (or a func(fieldID int) Transform closure) instead of the full PartitionSpec. Keeps the coupling minimal and makes the dependency explicit at the call site.

Option 2 is probably the best trade-off — fixes the root symptom without the wider implications of changing ResultType.

What's your take on this? I don't have a strong opinion, but the current approach seems a bit too hacky to me.
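
A rough sketch of what option 2 could look like, using hypothetical stand-in types rather than the library's real `PartitionSpec`, `Transform`, or `Type` definitions: `ResultType` keeps returning the integer type, and only the partition-struct construction path substitutes the date type for day transforms.

```go
package main

import "fmt"

// Type is a hypothetical, simplified stand-in for a field type.
type Type string

const (
	Int32 Type = "int"
	Date  Type = "date"
)

// Transform is a hypothetical, simplified transform interface.
type Transform interface {
	ResultType(source Type) Type
}

// DayTransform still reports Int32 from ResultType, as in option 2.
type DayTransform struct{}

func (DayTransform) ResultType(Type) Type { return Int32 }

// BucketTransform is a non-day transform that also yields Int32.
type BucketTransform struct{}

func (BucketTransform) ResultType(Type) Type { return Int32 }

type PartitionField struct {
	Name      string
	Transform Transform
}

// partitionType builds the partition struct's field types, overriding the
// type to Date only for day transforms and leaving every other field alone.
func partitionType(fields []PartitionField, source Type) map[string]Type {
	out := make(map[string]Type, len(fields))
	for _, f := range fields {
		t := f.Transform.ResultType(source)
		if _, isDay := f.Transform.(DayTransform); isDay {
			t = Date
		}
		out[f.Name] = t
	}
	return out
}

func main() {
	pt := partitionType([]PartitionField{
		{Name: "ts_day", Transform: DayTransform{}},
		{Name: "id_bucket", Transform: BucketTransform{}},
	}, "timestamp")
	fmt.Println(pt["ts_day"], pt["id_bucket"])
}
```

With the override confined to one function, a type-to-schema mapping downstream needs no knowledge of transforms.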

Author

went with option 2

Comment thread manifest_test.go Outdated

got, ok := decoded["bucket_id_identity"]
require.True(t, ok)
assert.IsType(t, int32(0), got, "plain Int32 partition must decode as int32, not time.Time")

Contributor

"date" is a common substring: a field named update_date or dated_id would make this pass accidentally. The assert.IsType(time.Time{}, got) check a few lines above is a stronger signal (it only decodes as time.Time when the logical type is applied).

Drop the Contains/NotContains and rely on the decode-type assertion?
Or marshal the schema to JSON and assert on the structured logicalType field?
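
The second suggestion could be sketched roughly as follows, using only `encoding/json`. The helper `logicalTypeOf` and the schema in `exampleSchema` are made up for illustration and are not part of the PR. The helper only handles fields whose type is a bare name or a single object, so a union such as `["null", {...}]` would need extra handling.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exampleSchema is a made-up record schema. The first field has "date" in its
// name but no logical type, which a substring check would misreport.
const exampleSchema = `{"type":"record","name":"r102","fields":[
  {"name":"update_date_bucket","type":"int"},
  {"name":"ts_day","type":{"type":"int","logicalType":"date"}}]}`

type avroField struct {
	Name string          `json:"name"`
	Type json.RawMessage `json:"type"`
}

type avroRecord struct {
	Fields []avroField `json:"fields"`
}

// logicalTypeOf returns the logicalType of the named field, or "" when the
// field's type is a bare name such as "int" and so carries no logical type.
func logicalTypeOf(schemaJSON, field string) (string, error) {
	var rec avroRecord
	if err := json.Unmarshal([]byte(schemaJSON), &rec); err != nil {
		return "", err
	}
	for _, f := range rec.Fields {
		if f.Name != field {
			continue
		}
		var obj struct {
			LogicalType string `json:"logicalType"`
		}
		// Non-object types (strings, unions) fail to decode into obj.
		if err := json.Unmarshal(f.Type, &obj); err != nil {
			return "", nil
		}
		return obj.LogicalType, nil
	}
	return "", fmt.Errorf("field %q not found", field)
}

func main() {
	for _, name := range []string{"ts_day", "update_date_bucket"} {
		lt, err := logicalTypeOf(exampleSchema, name)
		fmt.Printf("%s: logicalType=%q err=%v\n", name, lt, err)
	}
}
```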

Author

fixed

@artbcf artbcf force-pushed the fix/day-transform-avro-date-encoding branch from bf8e4d0 to ac8bc4a Compare April 17, 2026 14:33

@laskoviymishka laskoviymishka left a comment

Contributor

:shipit:

Nice work!

@zeroshade

Member

> DayTransform.Apply() returns an int32 (days since epoch) but partitionTypeToAvroSchema was encoding all Int32 partition fields as plain Avro int, without the 'date' logical type. This caused manifest files to be unreadable by Trino, Spark, and other Iceberg engines that require the date logical type on day-partition columns.

The actual spec says that the result type should be int, not date. If both iceberg-java and pyiceberg are using date, and the existing engines expect it to be a date, then we should probably have a discussion about the spec. @Fokko @kevinjqliu would either of you be up for commenting on whether or not this is something I should bring up on the mailing list? I'd prefer to get clarity here before having iceberg-go implement something that's contrary to the spec.

@kevinjqliu

Contributor

ok i had a few paragraphs i was going to write here but ended up deleting it because i think this warrants a bigger discussion 😄

i think every implementation has been tripped up by this haha

java

python

rust

and likely many more threads.

@kevinjqliu

kevinjqliu commented May 19, 2026

Contributor

Ok i created an umbrella issue here apache/iceberg#16414 (and devlist thread https://lists.apache.org/thread/qqw5oog5swmswxqqmp693vz1rw132xb6)

For iceberg-go, i would suggest aligning with the other implementations:

  • write as Avro date (with logical annotation)
  • accept both Avro int and date

This follows Postel’s Law, "Be liberal in what you accept, and conservative in what you send."
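
On the read side, "accept both" might look roughly like the sketch below. It assumes the Avro decoder hands back an `int32` (or `int`) for a plain Avro int and a `time.Time` for the date logical type, as the test discussion earlier in this thread implies. `daysSinceEpoch` and `exampleDate` are hypothetical helpers, not iceberg-go functions.

```go
package main

import (
	"fmt"
	"time"
)

const secondsPerDay = 86400

// daysSinceEpoch normalizes a decoded day-partition value to days since
// 1970-01-01, whichever of the two Avro encodings produced it.
func daysSinceEpoch(v any) (int32, error) {
	switch x := v.(type) {
	case int32: // plain Avro int
		return x, nil
	case int:
		return int32(x), nil
	case time.Time: // Avro int annotated with the date logical type
		secs := x.Unix()
		days := secs / secondsPerDay
		if secs%secondsPerDay < 0 {
			days-- // floor division for instants before the epoch
		}
		return int32(days), nil
	default:
		return 0, fmt.Errorf("unsupported day-partition value type %T", v)
	}
}

// exampleDate returns midnight UTC on 1970-01-11, which is day 10.
func exampleDate() time.Time {
	return time.Date(1970, time.January, 11, 0, 0, 0, 0, time.UTC)
}

func main() {
	fromInt, _ := daysSinceEpoch(int32(10))
	fromDate, _ := daysSinceEpoch(exampleDate())
	fmt.Println(fromInt == fromDate)
}
```

Normalizing at the read boundary keeps the rest of the code working with a single representation, while the writer emits only the annotated form.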

@kevinjqliu

Contributor

> This caused manifest files to be unreadable by Trino, Spark, and other Iceberg engines that require the date logical type on day-partition columns.

btw based on my research, it seems that Java should be able to read the Avro int. Curious what was the error you saw.

Comment thread partitions.go Outdated
continue
}
resultType := field.Transform.ResultType(sourceType)
// DayTransform.ResultType returns Int32 (days since epoch), but the

Contributor

i would suggest changing DayTransform's ResultType to return PrimitiveTypes.Date instead of the current PrimitiveTypes.Int32:

func (DayTransform) ResultType(Type) Type { return PrimitiveTypes.Int32 }

so iceberg-go is better aligned with other implementations (see apache/iceberg#16414)

Member

agreed, let's get that updated first before we merge this!

@mkuznets

Contributor

> btw based on my research, it seems that Java should be able to read the Avro int. Curious what was the error you saw.

I can confirm, Trino can read both "int" and "int + date logical type"

The issue was on the ingestion side. A compaction (based on iceberg-rust) normalized everything to "int + date logical type", which caused iceberg-go to refuse to append new files with the partition column encoded as "int".

@kevinjqliu

Contributor

> The issue was on the ingestion side. A compaction (based on iceberg-rust) normalized everything to "int + date logical type", which caused iceberg-go to refuse to append new files with the partition column encoded as "int".

thanks for the additional context! This makes sense. All the other implementations can accept Avro int and Avro date (per apache/iceberg#16414)
Iceberg-go only accepts Avro int, and this PR should allow it to also accept Avro date.

Orthogonally, Iceberg-go should change to write as Avro date instead of Avro int, to align with the other implementations.
And we will clarify the spec regarding this, i raised a devlist thread already (https://lists.apache.org/thread/qqw5oog5swmswxqqmp693vz1rw132xb6)

@zeroshade

Member

okay cool, thanks @kevinjqliu! It's funny because we originally did use the date type, then switched it to int when it was pointed out that the spec said it should be int and not date, lol. So if we can get the spec fixed, that would be awesome. In the meantime, I'm good with letting this get merged.

@artbcf

artbcf commented May 20, 2026

Author

updated

artbcf added 2 commits May 20, 2026 11:31
…ogical type

DayTransform.Apply() returns int32 (days since epoch) but the Avro
encoding of a day-partition column must carry the 'date' logical type
so that Trino, Spark, and other Iceberg engines can read manifests.

Root fix: PartitionSpec.PartitionType() now overrides DayTransform's
ResultType (Int32) to DateType for the partition struct it produces.
The existing DateType branch in partitionTypeToAvroSchema then emits
DateNode (int + date logical type) automatically — no coupling between
the Avro layer and transform internals.

This also fixes DataFileStatistics.ToDataFile() in table/internal/utils.go,
which builds fieldIDToLogicalType by switching on ResultType: the
case DateType branch now fires correctly for day-partition fields.

Tests verify that day-partition fields decode as time.Time (confirming
the date logical type is applied) while plain Int32 fields decode as
int32.
@artbcf artbcf force-pushed the fix/day-transform-avro-date-encoding branch from b7d3dee to 4ec520e Compare May 20, 2026 15:32
Comment on lines +201 to +208
// TestDayTransformPartitionAvroDateEncoding verifies that a day(ts) partition
// field is encoded with the Avro "date" logical type, not as a plain integer.
// This is required for interoperability with Trino, Spark, and other Iceberg
// engines that reject manifests where day-partition columns lack the date type.
//
// The fix lives in PartitionSpec.PartitionType: it overrides DayTransform's
// ResultType (Int32) to DateType so the existing DateType branch in
// partitionTypeToAvroSchema emits DateNode automatically.

@kevinjqliu kevinjqliu May 20, 2026

Contributor

slight nit here, i think the problem is the iceberg-go implementation not being able to read Avro date (written by iceberg-rust during compaction)
Other engines (Trino/Spark) and implementations (java/python/rust) can read both Avro int and Avro date, but choose to only write Avro date.

I think it's a bit confusing to say that Avro date is needed for interoperability.

Contributor

we should check that iceberg-go can read both Avro int and Avro date. Might be out of scope for this PR if we're dealing with the writing default here

Member

> I think it's a bit confusing to say that Avro date is needed for interoperability.

If we're not saying that Avro date is needed for interoperability, then the spec needs to state that int is allowed but date should be written 😄

If we're aligned on that, then I'm okay with us being able to read both but only write date.

Member

@artbcf can you ensure that we can read both int and date, but only write date?

Author

will do!

Member

I'll be fine to merge this once you update with ensuring we can read both but only write date. We don't have to wait for the spec change.

@laskoviymishka

Contributor

Filed a related PR to disambiguate spec: apache/iceberg#16446

@laskoviymishka

Contributor

@zeroshade - can we merge this? see #1176, people still catch this in prod.

@kevinjqliu kevinjqliu left a comment

Contributor

LGTM

let's merge this and add the tests as a follow-up.

  • test day transform writes as avro date
  • test can read partition value encoded as either avro int or avro date

we also don't necessarily need TestNonDayInt32PartitionAvroIntEncoding; bucket transform encoding is likely tested elsewhere

Comment thread schema_conversions_test.go Outdated
@kevinjqliu

Contributor

i updated the pr description as well
@laskoviymishka i think we can move this forward

5 participants