[HSTACK] Fix MAP schema evolution: allow additive value-struct changes #5

Open
cochescu wants to merge 4 commits into main from hstack/map-schema-evolution-fix

cochescu commented Apr 6, 2026

Problem

When a Delta table's logical schema has evolved additively — e.g. identityMap: Map<String, List<Struct<id>>> in older Parquet files vs Map<String, List<Struct<id, primary, authenticatedState>>> in the Delta log — DataFusion failed with:

Execution error: Cannot cast column 'identityMap' from
  'Map(key_value: Struct(key: Utf8, value: List(Struct(id: Utf8))))' (physical)
to
  'Map(key_value: Struct(key: Utf8, value: List(Struct(id: Utf8, primary: Boolean, authenticatedState: Utf8))))' (logical)

The schema rewriter rejected the cast and the physical plan builder had no handler for MAP-to-MAP evolution.

Fix

datafusion/common/src/nested_struct.rs

  • Adds a DataType::Map arm in cast_column that delegates to a new cast_map_column() function.
  • cast_map_column casts the inner key_value StructArray through the existing cast_column logic, so missing nullable fields in the value struct are filled with nulls and extra source fields are ignored — the same additive evolution semantics already supported for plain structs.
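The additive semantics described above (missing nullable fields filled with nulls, extra source fields ignored) can be illustrated with a minimal std-only sketch. This is not the code in nested_struct.rs: StructColumns, TargetField and adapt_struct are hypothetical names, and a struct array is modelled as a plain map from field name to a column of optional strings.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a struct array: field name -> one optional value per row.
/// (Hypothetical model, not an Arrow type.)
pub type StructColumns = BTreeMap<String, Vec<Option<String>>>;

/// One field of the target (logical) struct.
pub struct TargetField {
    pub name: &'static str,
    pub nullable: bool,
}

/// Reshape `source` to the target field list:
/// - a field present in both is carried over unchanged,
/// - a field missing from the source is filled with nulls if it is nullable,
/// - a field missing from the source and not nullable is an error,
/// - a source field that the target does not mention is dropped.
pub fn adapt_struct(
    source: &StructColumns,
    num_rows: usize,
    target: &[TargetField],
) -> Result<Vec<(String, Vec<Option<String>>)>, String> {
    let mut out = Vec::with_capacity(target.len());
    for field in target {
        match source.get(field.name) {
            Some(column) => out.push((field.name.to_string(), column.clone())),
            None if field.nullable => {
                out.push((field.name.to_string(), vec![None; num_rows]))
            }
            None => {
                return Err(format!("missing non-nullable field '{}'", field.name))
            }
        }
    }
    Ok(out)
}
```

Iterating over the target fields rather than the source fields is what makes extra source fields disappear without any explicit handling.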

datafusion/physical-expr-adapter/src/schema_rewriter.rs

  • Adds a (DataType::Map, DataType::Map) arm to the compatibility check in try_adapt_physical_expr that validates the inner key_value struct fields via validate_struct_compatibility, allowing additive changes to pass instead of being rejected by can_cast_types.
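As a rough illustration of this compatibility rule, the sketch below models types with a hypothetical Ty enum instead of Arrow's DataType. It assumes the check recurses through List and Map into the value struct and that a target field absent from the source is acceptable only when nullable. The real validate_struct_compatibility may differ in detail, and the real fallback consults can_cast_types rather than plain equality.

```rust
/// Toy type model (hypothetical; stands in for Arrow's DataType).
#[derive(Debug, Clone, PartialEq)]
pub enum Ty {
    Utf8,
    Boolean,
    List(Box<Ty>),
    Struct(Vec<Field>),
    /// Holds the type of the inner `key_value` struct.
    Map(Box<Ty>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub ty: Ty,
    pub nullable: bool,
}

pub fn field(name: &str, ty: Ty, nullable: bool) -> Field {
    Field { name: name.to_string(), ty, nullable }
}

/// True when `source` can be adapted to `target` by additive evolution only.
pub fn is_additive_compatible(source: &Ty, target: &Ty) -> bool {
    match (source, target) {
        // Every target field must either exist in the source with a
        // compatible type, or be nullable so it can be null-filled.
        (Ty::Struct(src), Ty::Struct(tgt)) => tgt.iter().all(|t| {
            match src.iter().find(|s| s.name == t.name) {
                Some(s) => is_additive_compatible(&s.ty, &t.ty),
                None => t.nullable,
            }
        }),
        (Ty::List(src), Ty::List(tgt)) => is_additive_compatible(src, tgt),
        // The Map arm: compare the inner key_value structs.
        (Ty::Map(src), Ty::Map(tgt)) => is_additive_compatible(src, tgt),
        // Simplified fallback: exact equality.
        _ => source == target,
    }
}

/// Build Map<Utf8, List<Struct<value_fields>>> in the toy model.
pub fn identity_map(value_fields: Vec<Field>) -> Ty {
    Ty::Map(Box::new(Ty::Struct(vec![
        field("key", Ty::Utf8, false),
        field("value", Ty::List(Box::new(Ty::Struct(value_fields))), true),
    ])))
}
```

With this model, the physical identityMap type (value struct with only id) is compatible with the logical one (id, primary, authenticatedState) as long as the added fields are nullable.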

Tests

  • test_cast_map_with_evolved_value_struct — end-to-end cast of a real-world identityMap: Map<String, List<Struct>> where older files have fewer value-struct fields.
  • test_validate_map_value_struct_compatibility — unit test for the schema_rewriter compatibility check on Map types.

Notes

The change is purely additive: the two new match arms intercept MAP types before the existing _ => fallback; all other types are unaffected.

DataFusion correctly handles Struct→Struct schema evolution (missing
fields filled with nulls, extra fields ignored) but falls back to
Arrow's generic cast for Map→Map. Since Arrow's generic cast cannot
add missing struct fields, reading Delta tables where the MAP value
struct has gained optional fields (additive evolution) fails with:

  Cannot cast column 'identityMap' from '...(physical data type)'
  to '...(logical data type)'

Fix:

1. schema_rewriter.rs — validate_compat match: add Map arm that
   recursively calls validate_struct_compatibility on the internal
   key_value struct, so additive value-struct changes pass the
   compatibility check.

2. nested_struct.rs — cast_column match: add Map arm via
   cast_map_column() which casts the key_value StructArray through
   cast_column (preserving struct evolution semantics) and rebuilds
   the MapArray with the same offsets and validity bitmap.
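The rebuild step in item 2 can be sketched without Arrow as follows. ToyMap, OldValue, NewValue, evolve_value and cast_map_values are all hypothetical names, and the value is simplified to a plain struct rather than a List<Struct>. The point of the sketch is that only the flattened entries are transformed, while offsets and validity are carried over untouched, so row boundaries and null rows stay identical.

```rust
/// Toy map column (hypothetical model of a map array).
/// Row i owns entries offsets[i]..offsets[i + 1] of `keys` / `values`.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyMap<V> {
    pub offsets: Vec<i32>,
    /// One flag per row; `None` means every row is valid.
    pub validity: Option<Vec<bool>>,
    pub keys: Vec<String>,
    pub values: Vec<V>,
}

/// Value struct as written in older files.
#[derive(Debug, Clone, PartialEq)]
pub struct OldValue {
    pub id: String,
}

/// Value struct in the evolved (logical) schema.
#[derive(Debug, Clone, PartialEq)]
pub struct NewValue {
    pub id: String,
    pub primary: Option<bool>,
    pub authenticated_state: Option<String>,
}

/// Additive evolution of a single value: new fields become null.
pub fn evolve_value(old: &OldValue) -> NewValue {
    NewValue {
        id: old.id.clone(),
        primary: None,
        authenticated_state: None,
    }
}

/// Cast only the entries; reuse the original offsets and validity.
pub fn cast_map_values<S, T>(
    source: &ToyMap<S>,
    cast_value: impl Fn(&S) -> T,
) -> ToyMap<T> {
    ToyMap {
        offsets: source.offsets.clone(),
        validity: source.validity.clone(),
        keys: source.keys.clone(),
        values: source.values.iter().map(cast_value).collect(),
    }
}
```

A call such as cast_map_values(&old_map, evolve_value) returns a map with the same number of rows and entries as the input, which is why the cast never needs to touch the row structure.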

Tests added:
- test_cast_map_with_evolved_value_struct: end-to-end cast of a MAP
  where the physical value struct (only "id") is evolved to the
  logical schema (id + primary + authenticatedState).
- test_validate_map_value_struct_compatibility: confirms the
  compatibility check passes for additive value-struct changes.

Real-world trigger: AEP identityMap is Map<String, List<Struct<…>>>.
Older Parquet files have fewer fields in the Struct; newer files have
more. DataFusion previously failed to read the older files.