feat(table): WKB encoding + GeoArrow conversion for geometry/geography #1138

Open
happydave1 wants to merge 5 commits into apache:main from happydave1:feat/wkb

@happydave1 happydave1 commented May 28, 2026

Contributor

Fixes #991.

Added GeoArrow metadata support in arrow_utils.go, added a WKTToWKB helper in table/internal/geo_codec.go that uses go-geom to convert WKT to WKB, and added an Arrow-to-Parquet round-trip test that checks whether the geoarrow extension metadata survives. The tests also cover Iceberg schema to Arrow schema round trips.

@happydave1 happydave1 marked this pull request as ready for review May 28, 2026 19:26
@happydave1 happydave1 requested a review from zeroshade as a code owner May 28, 2026 19:26

@zeroshade zeroshade left a comment

Member

In general this looks good to me, though I'm not AS familiar with the geo stuff here.

@dwilson1988 or @paleolimbot would either of you be able to take a quick look here for verification?

@paleolimbot paleolimbot left a comment

Member

Cool!

The test cases from apache/arrow-rs#10065 are my most recent attempt at a succinct list of the special cases to take care of, although I've tried to note them inline here, too. The difference between Parquet and Iceberg for CRSes is that Iceberg requires authority:code (or PROJJSON that is offloaded into a table property). You can error for that case (I left inline suggestions about how to convert from PROJJSON to authority:code).

Comment thread table/arrow_utils.go Outdated
Comment on lines +1845 to +1848
func icebergCRSToGeoArrowMetadata(crs string) geoarrow.Metadata {
	if crs == "OGC:CRS84" {
		return geoarrow.NewMetadata()
	}

Member

This is reversed I think...if the iceberg CRS is "", the GeoArrow CRS is "OGC:CRS84" (unless NewMetadata().

Comment thread table/arrow_utils.go
Comment on lines +1850 to +1858
	if strings.HasPrefix(strings.ToLower(crs), "srid:") {
		id := crs[len("srid:"):]
		raw, _ := json.Marshal(id)

		return geoarrow.Metadata{
			CRS:     raw,
			CRSType: geoarrow.CRSTypeSRID,
		}
	}

Member

There is one special case here: "srid:0" maps to an omitted GeoArrow CRS.

Comment thread table/arrow_utils.go
Comment on lines +1859 to +1864
	raw, _ := json.Marshal(crs)

	return geoarrow.Metadata{
		CRS:     raw,
		CRSType: geoarrow.CRSTypeAuthorityCode,
	}

Member

I believe there is one special case here: if the Iceberg CRS is projjson:<table property>, the CRS is the JSON object in that table property field. You should probably either support that or error with a note saying that it's not supported.

Comment thread table/arrow_utils.go
Comment on lines +1808 to +1810
	if len(meta.CRS) == 0 && meta.CRSType == "" {
		return "OGC:CRS84", nil
	}

Member

If the GeoArrow CRS is omitted from the extension metadata, the Parquet equivalent is "srid:0". If the GeoArrow CRS is "OGC:CRS84" or "EPSG:4326", the canonical Parquet CRS is "omitted" (i.e., default). I believe the Parquet and Iceberg CRS definitions are in sync now but it's worth double checking.

Comment thread table/arrow_utils.go
Comment on lines +1819 to +1826
	switch meta.CRSType {
	case geoarrow.CRSTypeSRID:
		return "srid:" + crs, nil
	case geoarrow.CRSTypeAuthorityCode:
		return crs, nil
	default:
		return "", fmt.Errorf("unsupported geoarrow CRS type %q", meta.CRSType)
	}

Member

If the CRSType is PROJJSON, this is an important case to handle (because most GeoArrow producers produce that). You can convert most of these to authority:code by looking for the id member, which looks like this: "crs": {..., "id": {"authority": "OGC", "code": "CRS84"}} or this: "crs": {..., "id": {"authority": "EPSG", "code": 3857}}. It would also be a good idea to convert the two "lonlat" CRSes ("EPSG:4326" and "OGC:CRS84") to the Iceberg "default" canonically.

When you can't extract an authority:code, this would need to be written to a table property and written to the CRS field as (projjson:<the table property name>). Probably easier to error for that case for now.
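A minimal sketch of the id lookup described above, using only the standard library. The helper name authorityCodeFromPROJJSON is hypothetical and not part of this PR. The sketch assumes the id member sits at the top level of the PROJJSON object and that the code may be written either as a number or as a string. It does not perform the lon/lat collapse.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// authorityCodeFromPROJJSON is a hypothetical helper, not code from this PR.
// It reads only the top-level "id" member of a PROJJSON object and joins
// authority and code with a colon.
func authorityCodeFromPROJJSON(raw []byte) (string, error) {
	var doc struct {
		ID struct {
			Authority string `json:"authority"`
			Code      any    `json:"code"`
		} `json:"id"`
	}

	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep numeric codes exactly as written in the JSON
	if err := dec.Decode(&doc); err != nil {
		return "", fmt.Errorf("invalid PROJJSON: %w", err)
	}

	// The code is accepted as a JSON string ("CRS84") or a JSON number (3857).
	var code string
	switch c := doc.ID.Code.(type) {
	case string:
		code = c
	case json.Number:
		code = c.String()
	}
	if doc.ID.Authority == "" || code == "" {
		return "", errors.New("PROJJSON has no usable id.authority/id.code")
	}

	return doc.ID.Authority + ":" + code, nil
}

func main() {
	for _, in := range []string{
		`{"name": "WGS 84 / Pseudo-Mercator", "id": {"authority": "EPSG", "code": 3857}}`,
		`{"name": "WGS 84 (CRS84)", "id": {"authority": "OGC", "code": "CRS84"}}`,
		`{"name": "custom CRS without an id"}`,
	} {
		crs, err := authorityCodeFromPROJJSON([]byte(in))
		fmt.Println(crs, err)
	}
}
```

The last input has no id member, so the helper returns an error, which matches the "easier to error for that case for now" suggestion.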

Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
@dwilson1988

Contributor

In general this looks good to me, though I'm not AS familiar with the geo stuff here.

@dwilson1988 or @paleolimbot would either of you be able to take a quick look here for verification?

Happy to re-review if needed, but looks like @paleolimbot is already on it! Ping me if you want a second set of eyes.

Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
@happydave1

happydave1 commented Jun 9, 2026

Contributor Author

Thank you @paleolimbot and @zeroshade for reviewing.

I believe I have addressed each comment by:

  1. updating icebergCRSToGeoArrowMetadata to handle the default CRS case ("OGC:CRS84") as expected.
  2. updating icebergCRSToGeoArrowMetadata to handle case "srid:0"
  3. erroring out the projjson case in both icebergCRSToGeoArrowMetadata and geoArrowCRSToIcebergCRS - this can be addressed in a separate PR
  4. updating TestIcebergGeoTypesToArrowSchema to mirror the tests in apache/arrow-rs#10065 ("Add tests and fix corner cases for Parquet/GeoArrow extension type conversion"); the test cases with partial projjson are omitted
  5. mapping geoarrow crs of "OGC:CRS84" and "EPSG:4326" to the default CRS "OGC:CRS84" so that iceberg.GeographyTypeOf and iceberg.GeometryTypeOf both create geo types with omitted ("") crs.

I would appreciate a second pass whenever you guys have a chance, thanks!

@laskoviymishka laskoviymishka left a comment

Contributor

Nice work threading the WKB / GeoArrow metadata through the schema conversion. The SRID encoding, the planar vs spherical edge handling, and the round-trip tests are all looking solid.

There are two things I’d want fixed before merge.

First, the EPSG:4326 handling looks asymmetric. geoArrowCRSToIcebergCRS collapses EPSG:4326 to OGC:CRS84 on read, but icebergCRSToGeoArrowMetadata writes EPSG:4326 back out as-is. Those two are not interchangeable because the axis order is different. That means a geometry("EPSG:4326") field — which is what PyIceberg emits — silently comes back as geometry("OGC:CRS84"). So the round trip is not actually equal. The existing CRS test uses EPSG:4267, so it misses this branch. I think we should either make the conversion symmetric or stop collapsing it, and add a regression test specifically for EPSG:4326.

Second, the projjson behavior is also asymmetric. The read path returns a clean error, but the write path panics. That panic propagates through VisitGeometry / VisitGeography with no recovery, so converting a table with a projjson CRS can crash the reader. This should return an error instead.

I left a few smaller comments inline as well: the authority_code / wkt2:2019 pass-through, discarded json.Marshal errors, and a couple of nlreturn gaps that will likely fail CI lint.

Once the two main issues above are fixed, I’m happy to take another pass and approve.

Comment thread table/arrow_utils.go
}
}

if crs == "OGC:CRS84" || crs == "EPSG:4326" {

Contributor

EPSG:4326 and OGC:CRS84 aren't the same CRS — they have opposite axis order (CRS84 is lon/lat, EPSG:4326 is lat/lon). Collapsing them here means a geometry("EPSG:4326") field — which is exactly what PyIceberg emits via ga.wkb().with_crs("EPSG:4326") — reads back as geometry("OGC:CRS84"), so the schema silently changes on the read side.

The write path doesn't share the collapse either: icebergCRSToGeoArrowMetadata emits EPSG:4326 verbatim, so GeometryType{crs:"EPSG:4326"} → Arrow → GeometryType{crs:"OGC:CRS84"} and the two aren't Equals.

I'd drop the EPSG:4326 case here and treat it as a distinct authority code. If we genuinely want them unified, it has to be symmetric — normalize on both the write and read sides so the round trip is stable — and it shouldn't live silently at the schema layer. Either way a round-trip regression test for EPSG:4326 specifically would lock it down. wdyt?

Member

EPSG:4326 and OGC:CRS84 aren't the same CRS

They are for the purposes of the GeoArrow and Parquet specifications, which explicitly define the axis order for these cases

Contributor Author

I believe that a round trip regression test is definitely in order for this behavior.

@paleolimbot, is the asymmetric behavior here expected or should we collapse "EPSG:4326" at the write level too? (i.e. add a conditional in icebergCRSToGeoArrowMetadata)

Member

The ideal behaviour is that all of these become the Iceberg default CRS:

  • "crs": "EPSG:4326" and "crs": "OGC:CRS84" (case insensitive)
  • "crs": {..., "id": {"authority": "OGC", "code": "CRS84"}}
  • "crs": {..., "id": {"authority": "EPSG", "code": 4326}}
    • "crs": {..., "id": {"authority": "EPSG", "code": "4326"}}

When reading an iceberg default CRS to GeoArrow, emit "crs": "OGC:CRS84". I think your PR is currently missing this case.

This asymmetry is helpful to improve the compatibility of iceberg tables (e.g., for readers that can't or don't want to understand CRSes and only handle the default case).
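The intended asymmetry for the string forms can be sketched as two small functions. Both names (toIcebergCRS, toGeoArrowCRS) are hypothetical, and the empty string stands in for the Iceberg default CRS as an assumption of this sketch. The PROJJSON forms and the omitted GeoArrow CRS ("srid:0") are out of scope here.

```go
package main

import (
	"fmt"
	"strings"
)

// icebergDefaultCRS is how this sketch represents Iceberg's default CRS
// (an omitted CRS on the geometry/geography type).
const icebergDefaultCRS = ""

// toIcebergCRS collapses the two lon/lat spellings, case-insensitively, to
// the Iceberg default and passes every other string through unchanged.
func toIcebergCRS(geoArrowCRS string) string {
	if strings.EqualFold(geoArrowCRS, "OGC:CRS84") || strings.EqualFold(geoArrowCRS, "EPSG:4326") {
		return icebergDefaultCRS
	}

	return geoArrowCRS
}

// toGeoArrowCRS spells the Iceberg default out as "OGC:CRS84" and passes
// every other string through unchanged.
func toGeoArrowCRS(icebergCRS string) string {
	if icebergCRS == icebergDefaultCRS {
		return "OGC:CRS84"
	}

	return icebergCRS
}

func main() {
	for _, crs := range []string{"EPSG:4326", "ogc:crs84", "EPSG:3857"} {
		ice := toIcebergCRS(crs)
		fmt.Printf("%q -> iceberg %q -> geoarrow %q\n", crs, ice, toGeoArrowCRS(ice))
	}
}
```

Under these assumptions "EPSG:4326" does not survive a GeoArrow → Iceberg → GeoArrow trip as the same string, but the Iceberg side is stable: default in, default out.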

Comment thread table/arrow_utils.go
}
}

if strings.HasPrefix(strings.ToLower(crs), "projjson:") {

Contributor

This panics on projjson, but the symmetric read function geoArrowCRSToIcebergCRS returns a clean error for the same case. The asymmetry is the problem: VisitGeometry/VisitGeography have no error return and nothing recovers, so a GeometryType with a projjson: CRS detonates the whole conversion at TypeToArrowType rather than surfacing an error.

I'd return an error here and thread it through the same way the read path does, so a projjson CRS fails gracefully instead of crashing the process.
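A sketch of what an error-returning write path could look like. To stay self-contained it uses a hypothetical geoMeta struct in place of geoarrow.Metadata and a hypothetical errUnsupportedCRS sentinel; the "srid" and "authority_code" strings are the GeoArrow crs_type values, and the default and srid:0 mappings follow the earlier review comments.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// geoMeta is a simplified, hypothetical stand-in for geoarrow.Metadata.
type geoMeta struct {
	CRS     string // empty means the "crs" key is omitted
	CRSType string // empty means the "crs_type" key is omitted
}

// errUnsupportedCRS is a hypothetical sentinel for callers to match on.
var errUnsupportedCRS = errors.New("unsupported CRS")

// icebergCRSToGeoMeta reports unsupported input through its error return
// instead of panicking, so callers decide how to surface the failure.
func icebergCRSToGeoMeta(crs string) (geoMeta, error) {
	lower := strings.ToLower(crs)

	switch {
	case crs == "":
		// Iceberg default CRS.
		return geoMeta{CRS: "OGC:CRS84"}, nil
	case lower == "srid:0":
		// Maps to an omitted GeoArrow CRS.
		return geoMeta{}, nil
	case strings.HasPrefix(lower, "projjson:"):
		return geoMeta{}, fmt.Errorf("%w: %q requires a table property lookup", errUnsupportedCRS, crs)
	case strings.HasPrefix(lower, "srid:"):
		return geoMeta{CRS: lower[len("srid:"):], CRSType: "srid"}, nil
	default:
		return geoMeta{CRS: crs, CRSType: "authority_code"}, nil
	}
}

func main() {
	for _, crs := range []string{"", "srid:0", "EPSG:3857", "projjson:my.crs.property"} {
		meta, err := icebergCRSToGeoMeta(crs)
		if err != nil {
			fmt.Println("error:", err)

			continue
		}
		fmt.Printf("%q -> %+v\n", crs, meta)
	}
}
```

How the error is threaded through VisitGeometry / VisitGeography depends on the visitor interface, which this sketch does not model.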

Comment thread table/arrow_utils.go

	iceType, err := geoArrowMetadataToIcebergType(wkb.Metadata())
	if err != nil {
		panic(fmt.Errorf("%w: %v", iceberg.ErrInvalidSchema, err))

Contributor

The %v on err drops it out of the wrapped chain, so errors.Is/errors.As against the underlying error won't work downstream. I'd use %w for both:

panic(fmt.Errorf("%w: converting geoarrow metadata: %w", iceberg.ErrInvalidSchema, err))
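
To illustrate the difference, here is a small standalone comparison. Both sentinel errors are hypothetical stand-ins (errInvalidSchema for iceberg.ErrInvalidSchema, errBadCRS for the underlying conversion error), and wrapping two errors with %w in one fmt.Errorf call assumes Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for iceberg.ErrInvalidSchema and the underlying
// conversion error.
var (
	errInvalidSchema = errors.New("invalid schema")
	errBadCRS        = errors.New("unsupported CRS")
)

// wrapWithV formats the inner error with %v, which only copies its text.
func wrapWithV() error {
	return fmt.Errorf("%w: %v", errInvalidSchema, errBadCRS)
}

// wrapWithW wraps both errors, so both stay reachable via errors.Is.
func wrapWithW() error {
	return fmt.Errorf("%w: converting geoarrow metadata: %w", errInvalidSchema, errBadCRS)
}

// matches reports whether target is reachable through err's wrap chain.
func matches(err, target error) bool {
	return errors.Is(err, target)
}

func main() {
	fmt.Println(matches(wrapWithV(), errBadCRS)) // false
	fmt.Println(matches(wrapWithW(), errBadCRS)) // true
}
```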

Comment thread table/arrow_utils.go
		return "srid:" + crs, nil
	case geoarrow.CRSTypePROJJSON:
		return "", errors.New("geoarrow CRS type projjson not supported yet")
	default:

Contributor

CRSTypeAuthorityCode and CRSTypeWKT22019 both fall into default and get returned as-is. For authority_code that's almost right but loses the type tag; for WKT2:2019 it hands a multi-KB WKT2 blob straight back as the Iceberg CRS string, which isn't a valid CRS identifier.

I'd add an explicit CRSTypeAuthorityCode case, and for CRSTypeWKT22019 either error like projjson does or reduce it to the authority code — whichever we pick, a test for each so the behavior is pinned.

Member

The CRSType, if it's coming from JSON, is just a hint, is not required, and is often absent. Here you probably want to:

  • Check if crs is a JSON object. If it is, check the id member and paste together crs["id"]["authority"], :, and crs["id"]["code"]. If any of those are missing, error for an unsupported CRS.
  • Check if crs is a string. If it's shorter than 32 characters, let it through verbatim. There's no official restriction on allowed characters in authorities or codes but the length check should reject anything questionable.
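
The two checks above could be sketched as a dispatch on the JSON type of the crs value, ignoring crs_type entirely. The function name, the errUnsupportedCRS sentinel, and the maxCRSStringLen constant are hypothetical; the 32 comes from the suggestion above and is compared against the byte length. The lon/lat collapse and the omitted-CRS case are left out to keep the sketch short.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// maxCRSStringLen is the cutoff suggested in the review (hypothetical name).
const maxCRSStringLen = 32

// errUnsupportedCRS is a hypothetical sentinel error.
var errUnsupportedCRS = errors.New("unsupported CRS")

// icebergCRSFromGeoArrowValue takes the raw JSON value stored under the
// "crs" key and looks only at its JSON type, never at "crs_type".
func icebergCRSFromGeoArrowValue(raw []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // numeric id codes keep their original text
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", fmt.Errorf("invalid crs value: %w", err)
	}

	switch crs := v.(type) {
	case string:
		// Short strings pass through verbatim; long ones (WKT2, for
		// example) are rejected.
		if len(crs) >= maxCRSStringLen {
			return "", fmt.Errorf("%w: string of %d bytes", errUnsupportedCRS, len(crs))
		}

		return crs, nil
	case map[string]any:
		// Reading from a nil map is safe, so a missing "id" simply
		// leaves authority and code empty.
		id, _ := crs["id"].(map[string]any)
		authority, _ := id["authority"].(string)
		var code string
		switch c := id["code"].(type) {
		case string:
			code = c
		case json.Number:
			code = c.String()
		}
		if authority == "" || code == "" {
			return "", fmt.Errorf("%w: PROJJSON without id.authority and id.code", errUnsupportedCRS)
		}

		return authority + ":" + code, nil
	default:
		return "", fmt.Errorf("%w: crs is neither a string nor an object", errUnsupportedCRS)
	}
}

func main() {
	for _, in := range []string{
		`"EPSG:3857"`,
		`{"id": {"authority": "EPSG", "code": 4326}}`,
		`{"name": "no id here"}`,
	} {
		crs, err := icebergCRSFromGeoArrowValue([]byte(in))
		fmt.Println(crs, err)
	}
}
```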

Comment thread table/arrow_utils.go

raw, _ := json.Marshal(crs)

return geoarrow.Metadata{

Contributor

For an authority-code CRS like EPSG:4326 this emits {"crs":"EPSG:4326"} with no crs_type, but the GeoArrow spec wants crs_type: "authority_code" for AUTHORITY:CODE strings. Without it the encoding is ambiguous, and combined with the collapse in geoArrowCRSToIcebergCRS it's the other half of the EPSG:4326 round-trip break.

I'd set CRSType: geoarrow.CRSTypeAuthorityCode when the CRS matches the authority:code shape, so the write side is unambiguous and pairs cleanly with an explicit authority_code read case.

Member

"crs_type": "authority_code" is optional and no consumer actually requires this, but it is polite to include it (I put this in the Arrow C++ export path).

Comment thread table/arrow_utils.go
return geoarrow.NewMetadata() // srid:0 maps to omitted GeoArrow CRS
}

raw, _ := json.Marshal(id)

Contributor

errcheck is enabled outside _test.go, so the discarded errors here and at the json.Marshal(crs) below will fail CI. Marshaling a string can't actually fail, but the linter doesn't know that — I'd either build the raw JSON without the error return (e.g. json.RawMessage(strconv.AppendQuote(nil, id))) or add a //nolint with a one-line why.
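
A minimal sketch of the strconv.AppendQuote option. The helper name quoteJSONString is hypothetical. One caveat worth stating: strconv produces Go string-literal escapes, which coincide with JSON for plain ASCII identifiers such as "4326" or "EPSG:4326" but not for every control or non-printable character, so the sketch assumes CRS identifiers of that simple shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// quoteJSONString is a hypothetical helper. It builds a quoted JSON string
// without an error return, assuming s is a plain ASCII identifier.
func quoteJSONString(s string) json.RawMessage {
	return json.RawMessage(strconv.AppendQuote(nil, s))
}

func main() {
	raw := quoteJSONString("4326")
	fmt.Println(string(raw), json.Valid(raw)) // "4326" true
}
```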

Comment thread table/arrow_utils.go

func geoArrowCRSToIcebergCRS(meta geoarrow.Metadata) (string, error) {
	if len(meta.CRS) == 0 && meta.CRSType == "" {
		return "srid:0", nil

Contributor

A bare ga.wkb() field with no CRS — the PyIceberg default and what older clients emit — reads back as geometry("srid:0") rather than geometry(), which isn't Equals to a default-CRS geometry.

Separately, an empty meta.CRS with a non-empty CRSType skips this early return and falls into the switch, so CRSTypeSRID yields "srid:" with an empty id and GeometryTypeOf accepts it silently. I'd return the OGC:CRS84 default for the bare case and guard the empty-CRS-with-CRSType combination, with a test for bare geoarrow.wkb.

Member

bare ga.wkb() field with no CRS — the PyIceberg default and what older clients emit — reads back as geometry("srid:0") rather than geometry(), which isn't Equals to a default-CRS geometry.

This is the correct behaviour: the GeoArrow default does not equal the Iceberg default. PyIceberg is probably wrong here.

Separately, an empty meta.CRS with a non-empty CRSType skips this early return

This should be fixed...the CRSType can actually just be ignored for the purposes of this function (it's purely a hint)

Comment thread table/arrow_utils.go
}
}

func icebergCRSToGeoArrowMetadata(crs string) geoarrow.Metadata {

Contributor

Small thing while we're here: strings.ToLower(crs) runs twice (here and for the projjson check) and the slice crs[len("srid:"):] indexes the original-case string after a lowercased prefix check. I'd hoist lower := strings.ToLower(crs) once and reuse it. Also worth noting the EPSG:4326 match in the read function is case-sensitive, so epsg:4326 falls through — strings.EqualFold would close that.
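
One variation on the hoisting idea, sketched with a hypothetical cutPrefixFold helper: it compares and slices the same original string, so the prefix check and the slice can never disagree about byte offsets, and strings.ToLower is not needed at all.

```go
package main

import (
	"fmt"
	"strings"
)

// cutPrefixFold is a hypothetical helper. It reports whether s starts with
// prefix, ignoring case, and returns the remainder of the original string.
func cutPrefixFold(s, prefix string) (string, bool) {
	if len(s) < len(prefix) || !strings.EqualFold(s[:len(prefix)], prefix) {
		return s, false
	}

	return s[len(prefix):], true
}

func main() {
	for _, crs := range []string{"srid:4326", "SRID:4326", "projjson:my.prop", "EPSG:4326"} {
		if id, ok := cutPrefixFold(crs, "srid:"); ok {
			fmt.Printf("%s: SRID %s\n", crs, id)

			continue
		}
		_, isProjjson := cutPrefixFold(crs, "projjson:")
		fmt.Printf("%s: projjson=%v\n", crs, isProjjson)
	}
}
```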

Comment thread table/arrow_utils.go
-	return arrow.Field{Type: geoarrow.NewWKBType(geoarrow.WKBWithBinaryStorage())}
+	return arrow.Field{Type: geoarrow.NewWKBType(geoarrow.WKBWithBinaryStorage(), geoarrow.WKBWithMetadata(meta))}
 }

Contributor

The "always add an edge to differentiate geography from geometry" convention is Go-only. PyIceberg emits edges only for spherical algorithms and nothing for planar, so a geography(OGC:CRS84, planar) field written by PyIceberg arrives as {"crs":"OGC:CRS84"} with no edges and geoArrowMetadataToIcebergType reads it back as GeometryType — a genuine round-trip break for that combination.

This is inherent to the GeoArrow↔Iceberg mapping: the Arrow extension metadata can't always distinguish the two, so the canonical source of the geo type is the Iceberg schema JSON, not the Arrow metadata. I don't think it blocks the PR, but I'd document it right at this comment — that the edge convention is a best-effort hint and planar geography from other clients won't round-trip through Arrow alone. wdyt?

Contributor Author

Good catch, I'll document it

@paleolimbot paleolimbot left a comment

Member

A few more details but this is looking good!

Comment thread table/arrow_utils_test.go
Comment on lines +485 to +495
typeCases := []struct {
	name         string
	ice          iceberg.Type
	wantMetaJSON string
}{
	// Geometry with default CRS (defaults to OGC:CRS84 per Parquet spec)
	{
		name:         "geometry_default_crs",
		ice:          iceberg.GeometryType{},
		wantMetaJSON: `{"crs":"OGC:CRS84"}`,
	},

Member

These test cases look good to me. Thanks!

There should be a separate set of cases for the opposite direction (like in the PR that I linked), where you start with a metadata string and expect a specific iceberg type.
