feat(table): WKB encoding + GeoArrow conversion for geometry/geography#1138
feat(table): WKB encoding + GeoArrow conversion for geometry/geography#1138happydave1 wants to merge 5 commits into
Conversation
zeroshade
left a comment
There was a problem hiding this comment.
In general this looks good to me, though I'm not AS familiar with the geo stuff here.
@dwilson1988 or @paleolimbot would either of you be able to take a quick look here for verification?
paleolimbot
left a comment
There was a problem hiding this comment.
Cool!
The test cases from apache/arrow-rs#10065 are my most recent attempt at a succinct list of the special cases to take care of, although I've tried to note them inline here, too. The difference between Parquet and Iceberg for CRSes is that Iceberg requires authority:code (or PROJJSON that is offloaded into a table property). You can error for that case (I left inline suggestions about how to convert from PROJJSON to authority:code).
| func icebergCRSToGeoArrowMetadata(crs string) geoarrow.Metadata { | ||
| if crs == "OGC:CRS84" { | ||
| return geoarrow.NewMetadata() | ||
| } |
There was a problem hiding this comment.
This is reversed I think...if the iceberg CRS is "", the GeoArrow CRS is "OGC:CRS84" (unless NewMetadata().
| if strings.HasPrefix(strings.ToLower(crs), "srid:") { | ||
| id := crs[len("srid:"):] | ||
| raw, _ := json.Marshal(id) | ||
|
|
||
| return geoarrow.Metadata{ | ||
| CRS: raw, | ||
| CRSType: geoarrow.CRSTypeSRID, | ||
| } | ||
| } |
There was a problem hiding this comment.
There is one special case here: "srid:0" maps to an omitted GeoArrow CRS.
| raw, _ := json.Marshal(crs) | ||
|
|
||
| return geoarrow.Metadata{ | ||
| CRS: raw, | ||
| CRSType: geoarrow.CRSTypeAuthorityCode, | ||
| } |
There was a problem hiding this comment.
I believe there is one special case here: if the Iceberg CRS is projjson:<table property>, the CRS is the JSON object in that table property field. You should probably either support that or error with a note saying that it's not supported.
| if len(meta.CRS) == 0 && meta.CRSType == "" { | ||
| return "OGC:CRS84", nil | ||
| } |
There was a problem hiding this comment.
If the GeoArrow CRS is omitted from the extension metadata, the Parquet equivalent is "srid:0". If the GeoArrow CRS is "OGC:CRS84" or "EPSG:4326", the canonical Parquet CRS is "omitted" (i.e., default). I believe the Parquet and Iceberg CRS definitions are in sync now but it's worth double checking.
| switch meta.CRSType { | ||
| case geoarrow.CRSTypeSRID: | ||
| return "srid:" + crs, nil | ||
| case geoarrow.CRSTypeAuthorityCode: | ||
| return crs, nil | ||
| default: | ||
| return "", fmt.Errorf("unsupported geoarrow CRS type %q", meta.CRSType) | ||
| } |
There was a problem hiding this comment.
If the CRSType is PROJJSON, this is a important case to handle (because most GeoArrow producers produce that). You can convert most of these to authority:code by looking for the id member that looks like this "crs": {..., "id":{"authority": "OGC", "code": "CRS84"} or this "crs": {..., "id":{"authority": "EPSG", "code": 3857}. It would also be a good idea to convert the two "lonlat" CRSes ("EPSG:4326" and "OGC:CRS84") to the Iceberg "default" canonically.
When you can't extract an authority:code, this would need to be written to a table property and written to the CRS field as (projjson:<the table property name>). Probably easier to error for that case for now.
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Happy to re-review if needed, but looks like @paleolimbot is already on it! Ping me if you want a second set of eyes. |
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
|
Thank you @paleolimbot and @zeroshade for reviewing. I believe I have addressed each comment by:
I would appreciate a second pass whenever you guys have a chance, thanks! |
laskoviymishka
left a comment
There was a problem hiding this comment.
Nice work threading the WKB / GeoArrow metadata through the schema conversion. The SRID encoding, the planar vs spherical edge handling, and the round-trip tests are all looking solid.
There are two things I’d want fixed before merge.
First, the EPSG:4326 handling looks asymmetric. geoArrowCRSToIcebergCRS collapses EPSG:4326 to OGC:CRS84 on read, but icebergCRSToGeoArrowMetadata writes EPSG:4326 back out as-is. Those two are not interchangeable because the axis order is different. That means a geometry("EPSG:4326") field — which is what PyIceberg emits — silently comes back as geometry("OGC:CRS84"). So the round trip is not actually equal. The existing CRS test uses EPSG:4267, so it misses this branch. I think we should either make the conversion symmetric or stop collapsing it, and add a regression test specifically for EPSG:4326.
Second, the projjson behavior is also asymmetric. The read path returns a clean error, but the write path panics. That panic propagates through VisitGeometry / VisitGeography with no recovery, so converting a table with a projjson CRS can crash the reader. This should return an error instead.
I left a few smaller comments inline as well: the authority_code / wkt2:2019 pass-through, discarded json.Marshal errors, and a couple of nlreturn gaps that will likely fail CI lint.
Once the two main issues above are fixed, I’m happy to take another pass and approve.
| } | ||
| } | ||
|
|
||
| if crs == "OGC:CRS84" || crs == "EPSG:4326" { |
There was a problem hiding this comment.
EPSG:4326 and OGC:CRS84 aren't the same CRS — they have opposite axis order (CRS84 is lon/lat, EPSG:4326 is lat/lon). Collapsing them here means a geometry("EPSG:4326") field — which is exactly what PyIceberg emits via ga.wkb().with_crs("EPSG:4326") — reads back as geometry("OGC:CRS84"), so the schema silently changes on the read side.
The write path doesn't share the collapse either: icebergCRSToGeoArrowMetadata emits EPSG:4326 verbatim, so GeometryType{crs:"EPSG:4326"} → Arrow → GeometryType{crs:"OGC:CRS84"} and the two aren't Equals.
I'd drop the EPSG:4326 case here and treat it as a distinct authority code. If we genuinely want them unified, it has to be symmetric — normalize on both the write and read sides so the round trip is stable — and it shouldn't live silently at the schema layer. Either way a round-trip regression test for EPSG:4326 specifically would lock it down. wdyt?
There was a problem hiding this comment.
EPSG:4326 and OGC:CRS84 aren't the same CRS
They are for the purposes of the GeoArrow and Parquet specifications, which explicitly define the axis order for these cases
There was a problem hiding this comment.
I believe that a round trip regression test is definitely in order for this behavior.
@paleolimbot, is the asymmetric behavior here expected or should we collapse "EPSG:4326" at the write level too? (i.e. add a conditional in icebergCRSToGeoArrowMetadata)
There was a problem hiding this comment.
The ideal behaviour is that all of these become the Iceberg default CRS:
"crs": "EPSG:4326"and"crs": "OGC:CRS84"(case insensitive)"crs": {..., "id":{"authority": "OGC", "code": "CRS84"}"crs": {..., "id":{"authority": "EPSG", "code": 4326}-
"crs": {..., "id":{"authority": "EPSG", "code": "4326"}
When reading an iceberg default CRS to GeoArrow, emit "crs": "OGC:CRS84". I think your PR is currently missing this case.
This asymmetry is helpful to improve the compatibility of iceberg tables (e.g., for readers that can't or don't want to understand CRSes and only handle the default case).
| } | ||
| } | ||
|
|
||
| if strings.HasPrefix(strings.ToLower(crs), "projjson:") { |
There was a problem hiding this comment.
This panics on projjson, but the symmetric read function geoArrowCRSToIcebergCRS returns a clean error for the same case. The asymmetry is the problem: VisitGeometry/VisitGeography have no error return and nothing recovers, so a GeometryType with a projjson: CRS detonates the whole conversion at TypeToArrowType rather than surfacing an error.
I'd return an error here and thread it through the same way the read path does, so a projjson CRS fails gracefully instead of crashing the process.
|
|
||
| iceType, err := geoArrowMetadataToIcebergType(wkb.Metadata()) | ||
| if err != nil { | ||
| panic(fmt.Errorf("%w: %v", iceberg.ErrInvalidSchema, err)) |
There was a problem hiding this comment.
The %v on err drops it out of the wrapped chain, so errors.Is/errors.As against the underlying error won't work downstream. I'd use %w for both:
panic(fmt.Errorf("%w: converting geoarrow metadata: %w", iceberg.ErrInvalidSchema, err))| return "srid:" + crs, nil | ||
| case geoarrow.CRSTypePROJJSON: | ||
| return "", errors.New("geoarrow CRS type projjson not supported yet") | ||
| default: |
There was a problem hiding this comment.
CRSTypeAuthorityCode and CRSTypeWKT22019 both fall into default and get returned as-is. For authority_code that's almost right but loses the type tag; for WKT2:2019 it hands a multi-KB WKT2 blob straight back as the Iceberg CRS string, which isn't a valid CRS identifier.
I'd add an explicit CRSTypeAuthorityCode case, and for CRSTypeWKT22019 either error like projjson does or reduce it to the authority code — whichever we pick, a test for each so the behavior is pinned.
There was a problem hiding this comment.
The CRSType, if it's coming from JSON, is just a hint, is not required, and is often absent. Here you probably want to:
- Check if crs is a JSON object. If it is, check the
idmember and paste togethercrs["id"]["authority"],:, andcrs["id"]["code"]. If any of those are missing, error for an unsupported CRS> - Check if crs is a string. If it's shorter than 32 characters, let it through verbatim. There's no official restriction on allowed characters in authorities or codes but the length check should reject anything questionable.
|
|
||
| raw, _ := json.Marshal(crs) | ||
|
|
||
| return geoarrow.Metadata{ |
There was a problem hiding this comment.
For an authority-code CRS like EPSG:4326 this emits {"crs":"EPSG:4326"} with no crs_type, but the GeoArrow spec wants crs_type: "authority_code" for AUTHORITY:CODE strings. Without it the encoding is ambiguous, and combined with the collapse in geoArrowCRSToIcebergCRS it's the other half of the EPSG:4326 round-trip break.
I'd set CRSType: geoarrow.CRSTypeAuthorityCode when the CRS matches the authority:code shape, so the write side is unambiguous and pairs cleanly with an explicit authority_code read case.
There was a problem hiding this comment.
"crs_type": "authority_code" is optional and no consumer actually requires this, but it is polite to include it (I put this in the Arrow C++ export path).
| return geoarrow.NewMetadata() // srid:0 maps to omitted GeoArrow CRS | ||
| } | ||
|
|
||
| raw, _ := json.Marshal(id) |
There was a problem hiding this comment.
errcheck is enabled outside _test.go, so the discarded errors here and at the json.Marshal(crs) below will fail CI. Marshaling a string can't actually fail, but the linter doesn't know that — I'd either build the raw JSON without the error return (e.g. json.RawMessage(strconv.AppendQuote(nil, id))) or add a //nolint with a one-line why.
|
|
||
| func geoArrowCRSToIcebergCRS(meta geoarrow.Metadata) (string, error) { | ||
| if len(meta.CRS) == 0 && meta.CRSType == "" { | ||
| return "srid:0", nil |
There was a problem hiding this comment.
A bare ga.wkb() field with no CRS — the PyIceberg default and what older clients emit — reads back as geometry("srid:0") rather than geometry(), which isn't Equals to a default-CRS geometry.
Separately, an empty meta.CRS with a non-empty CRSType skips this early return and falls into the switch, so CRSTypeSRID yields "srid:" with an empty id and GeometryTypeOf accepts it silently. I'd return the OGC:CRS84 default for the bare case and guard the empty-CRS-with-CRSType combination, with a test for bare geoarrow.wkb.
There was a problem hiding this comment.
bare ga.wkb() field with no CRS — the PyIceberg default and what older clients emit — reads back as geometry("srid:0") rather than geometry(), which isn't Equals to a default-CRS geometry.
This is the correct behaviour: the GeoArrow default does not equal the Iceberg default. PyIceberg is probably wrong here.
Separately, an empty meta.CRS with a non-empty CRSType skips this early return
This should be fixed...the CRSType can actually just be ignored for the purposes of this function (it's purely a hint)
| } | ||
| } | ||
|
|
||
| func icebergCRSToGeoArrowMetadata(crs string) geoarrow.Metadata { |
There was a problem hiding this comment.
Small thing while we're here: strings.ToLower(crs) runs twice (here and for the projjson check) and the slice crs[len("srid:"):] indexes the original-case string after a lowercased prefix check. I'd hoist lower := strings.ToLower(crs) once and reuse it. Also worth noting the EPSG:4326 match in the read function is case-sensitive, so epsg:4326 falls through — strings.EqualFold would close that.
| return arrow.Field{Type: geoarrow.NewWKBType(geoarrow.WKBWithBinaryStorage())} | ||
| return arrow.Field{Type: geoarrow.NewWKBType(geoarrow.WKBWithBinaryStorage(), geoarrow.WKBWithMetadata(meta))} | ||
| } | ||
|
|
There was a problem hiding this comment.
The "always add an edge to differentiate geography from geometry" convention is Go-only. PyIceberg emits edges only for spherical algorithms and nothing for planar, so a geography(OGC:CRS84, planar) field written by PyIceberg arrives as {"crs":"OGC:CRS84"} with no edges and geoArrowMetadataToIcebergType reads it back as GeometryType — a genuine round-trip break for that combination.
This is inherent to the GeoArrow↔Iceberg mapping: the Arrow extension metadata can't always distinguish the two, so the canonical source of the geo type is the Iceberg schema JSON, not the Arrow metadata. I don't think it blocks the PR, but I'd document it right at this comment — that the edge convention is a best-effort hint and planar geography from other clients won't round-trip through Arrow alone. wdyt?
There was a problem hiding this comment.
Good catch, I'll document it
paleolimbot
left a comment
There was a problem hiding this comment.
A few more details but this is looking good!
| } | ||
| } | ||
|
|
||
| if crs == "OGC:CRS84" || crs == "EPSG:4326" { |
There was a problem hiding this comment.
EPSG:4326 and OGC:CRS84 aren't the same CRS
They are for the purposes of the GeoArrow and Parquet specifications, which explicitly define the axis order for these cases
|
|
||
| func geoArrowCRSToIcebergCRS(meta geoarrow.Metadata) (string, error) { | ||
| if len(meta.CRS) == 0 && meta.CRSType == "" { | ||
| return "srid:0", nil |
There was a problem hiding this comment.
bare ga.wkb() field with no CRS — the PyIceberg default and what older clients emit — reads back as geometry("srid:0") rather than geometry(), which isn't Equals to a default-CRS geometry.
This is the correct behaviour: the GeoArrow default does not equal the Iceberg default. PyIceberg is probably wrong here.
Separately, an empty meta.CRS with a non-empty CRSType skips this early return
This should be fixed...the CRSType can actually just be ignored for the purposes of this function (it's purely a hint)
| return "srid:" + crs, nil | ||
| case geoarrow.CRSTypePROJJSON: | ||
| return "", errors.New("geoarrow CRS type projjson not supported yet") | ||
| default: |
There was a problem hiding this comment.
The CRSType, if it's coming from JSON, is just a hint, is not required, and is often absent. Here you probably want to:
- Check if crs is a JSON object. If it is, check the
idmember and paste togethercrs["id"]["authority"],:, andcrs["id"]["code"]. If any of those are missing, error for an unsupported CRS> - Check if crs is a string. If it's shorter than 32 characters, let it through verbatim. There's no official restriction on allowed characters in authorities or codes but the length check should reject anything questionable.
|
|
||
| raw, _ := json.Marshal(crs) | ||
|
|
||
| return geoarrow.Metadata{ |
There was a problem hiding this comment.
"crs_type": "authority_code" is optional and no consumer actually requires this, but it is polite to include it (I put this in the Arrow C++ export path).
| typeCases := []struct { | ||
| name string | ||
| ice iceberg.Type | ||
| wantMetaJSON string | ||
| }{ | ||
| // Geometry with default CRS (defaults to OGC:CRS84 per Parquet spec) | ||
| { | ||
| name: "geometry_default_crs", | ||
| ice: iceberg.GeometryType{}, | ||
| wantMetaJSON: `{"crs":"OGC:CRS84"}`, | ||
| }, |
There was a problem hiding this comment.
These test cases look good to me. Thanks!
There should be a separate set of cases for the opposite direction (like in the PR that I linked), where you start with a metadata string and expect a specific iceberg type.
Fixes #991.
Added metadata support in arrow_utils.go, created a
WKTToWKBhelper intable/internal/geo_codec.gowhich uses go-geom to convert WKT to WKB, and added a round trip Arrow to Parquet test which tests if geoarrow extension metadata survives. Also tests Iceberg schema to Arrow schema round trips.