Merged
7 changes: 6 additions & 1 deletion docs/docs/beta-release-notes.mdx
@@ -15,13 +15,18 @@ This page tracks **API and naming changes** since the GeoBrix project started. A

## What's new in v0.3.0

Released 2026-05-19. Per-version highlights; full migration tables are in the per-component sections below.
Released 2026-05-26. Per-version highlights; full migration tables are in the per-component sections below.

- **`rst_clip` CRS axis-order fix (all-black clips).** GDAL 3+ defaults EPSG-imported `SpatialReference`s to authority-compliant axis order (lat/lon for EPSG:4326), which silently swapped axes against JTS/Databricks WKT/WKB cutlines so the clip missed the raster entirely. The reprojection now clones the source/destination `SpatialReference`s and forces `OAMS_TRADITIONAL_GIS_ORDER` before the OGR transform; caller-owned `SpatialReference`s are not mutated.
- **EWKT / EWKB support for `rst_clip`.** `JTS.fromWKT` / `JTS.fromWKB` auto-detect EWKT/EWKB; new `JTS.toEWKT` / `JTS.toEWKB` helpers emit SRID-preserving forms. `rst_clip` reprojects the cutline when its SRID differs from the raster CRS, and falls back to the raster's CRS (Mosaic-compatible) when the SRID is `0` / unresolvable.
- **`rst_transform` rejects invalid SRIDs.** `targetSrid <= 0` and unresolvable EPSG codes now surface a clear error via tile metadata `error_message` instead of returning a raster with an uninitialized CRS.
- **`/vsimem/` path-handling hardening.** `rst_memsize` / `rst_unlink` / GDAL writer in-memory byte fetch now use `startsWith("/vsimem/")` (not `contains`) and null-check `GetMemFileBuffer`, so datasets whose description embeds the substring (e.g. NetCDF subdataset selectors) aren't mis-routed through the in-memory branch.
- **`tile.raster` bytes are always self-contained (no VRT payloads).** Three RasterX operations — `MergeRasters` (`gbx_rst_merge`, `gbx_rst_merge_agg`), `MergeBands` (`gbx_rst_frombands`), and `PixelCombineRasters` (`gbx_rst_derivedband`, `gbx_rst_derivedband_agg`, `gbx_rst_combineavg`, `gbx_rst_combineavg_agg`) — used to return tiles whose `metadata("driver")` claimed `VRT` even though the on-disk file was a materialized GTiff. That mis-tag propagated through `RasterDriver.writeToBytes` (which keys both the tempfile extension AND the `-of` flag in the inner `gdal_translate` call off `metadata.driver`), causing the serialized `tile.raster` payload to be VRT XML referencing a `/vsimem/` tempfile only reachable on the producing executor. Single-node testing passed by accident; multi-executor clusters hit `file not found` when the VRT was opened elsewhere. Fix: `GDALTranslate.executeTranslate` now records the **output** dataset's driver in its returned metadata (not the input's), and `RasterDriver.writeToBytes` defensively coerces VRT to GTiff on serialization + sniffs the result to refuse shipping VRT bytes. Regression coverage in [`RST_NoVrtPayloadTest`](https://github.com/databrickslabs/geobrix/blob/main/src/test/scala/com/databricks/labs/gbx/rasterx/expressions/RST_NoVrtPayloadTest.scala).
- **`PixelCombineRasters` pixel function now actually fires (`combineavg` / `derivedband` were silently returning one of the inputs).** `gbx_rst_combineavg`, `gbx_rst_combineavg_agg`, `gbx_rst_derivedband`, and `gbx_rst_derivedband_agg` build a multi-source VRT, inject a `<PixelFunctionLanguage>Python</...>` band, and re-open it for `gdal_translate`. The previous implementation re-opened the VRT **before** mutating the XML file, so the in-memory `Dataset` handle never saw the pixel function; `gdal.Translate` then fell back to a default multi-source mosaic (last-source-wins per pixel). On co-extensive inputs (e.g. a monthly EO time-series), the output silently equaled one of the inputs, and which input won varied per partition in a distributed setting, producing a visible patchwork of tiles from different years on multi-executor clusters. Fix: `PixelCombineRasters.combine` now injects the pixel function **before** the VRT is re-opened, and pre-creates the per-JVM `NodeFilePathUtil.rootPath` staging dir itself (previously only `ClipToGeom` did, so `combineavg` would fail with `file not found` if it was the first op to hit a fresh JVM). Regression coverage: `RST_AggregationsTest` "CombineAvg actually averages pixel values" (two constant rasters 50 + 100 → output 75).
- **Friendly error on `ARRAY<tile>`-function misuse.** Calling `gbx_rst_combineavg`, `gbx_rst_merge`, `gbx_rst_frombands`, or `gbx_rst_mapalgebra` on a single tile column (instead of an `ARRAY<tile>` like `collect_list(tile)`) used to surface as a raw `ClassCastException: StructType cannot be cast to ArrayType` from inside Catalyst analysis — untraceable from a notebook. The four expressions now route through `RST_ExpressionUtil.arrayOfTileRasterType`, which raises a clean `IllegalArgumentException` naming the function, the actual type received, and (where applicable) the aggregator companion the user likely wanted, e.g. `gbx_rst_combineavg expects ARRAY<tile> (e.g. collect_list(tile) or array(t1, t2, ...)), but received STRUCT<...>. To aggregate the column across rows, use gbx_rst_combineavg_agg(tile).`
- **Docs: `GDAL_VRT_ENABLE_PYTHON` for custom GDAL code paths.** Built-in `combineavg` / `derivedband` calls auto-enable VRT Python via the in-process `GDALManager.withVrtPython` bracket — no cluster config needed. The new [RasterX § VRT Python pixel functions](./packages/rasterx#vrt-python-pixel-functions) section documents how to enable the same evaluation in your own GDAL calls (Python `gdal.SetConfigOption`, cluster `spark.executorEnv`, or the JVM `withVrtPython` helper) and points to the `TRUSTED_MODULES` variant for less-trusted VRT sources. A cross-reference is added in [Security § 6](./security#6-vrt-python-pixel-functions-off-by-default-by-design) explaining why GeoBrix ships the option `NO` by default.
- **`gbx_rst_derivedband` / `gbx_rst_derivedband_agg` numerical-correctness regression coverage.** These functions share the `PixelCombineRasters` code path with `combineavg`, so they were silently no-opping in the same way (returning one of the inputs unchanged on co-extensive stacks). The ordering fix above repairs both call sites, but the existing tests only checked that the result wasn't null — they would have passed either way. This release adds explicit pixel-value assertions: `RST_AggregationsTest` covers the in-process `RST_DerivedBand` path with a doubling pyfunc and a 3-input numpy-mean pyfunc, and `RST_AggEvalTest` covers the Spark-aggregation `rst_derivedband_agg` path end-to-end (three constant-Byte tiles 10/20/30 with a "mean × 2" pyfunc must yield 40 across the result tile). Two previously-passing tests used `def myfunc(x): return x * 2` — an invalid VRT pixel-function signature — and were updated to the canonical `(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize, raster_ysize, buf_radius, gt, **kwargs)` shape; they only "passed" before because the pyfunc never actually ran.
- **`gbx_rst_combineavg` / `gbx_rst_combineavg_agg` math corrected (NoData, valid zeros, rounding).** With the pixel function now firing (see the `PixelCombineRasters` bullet above), several latent bugs in the average kernel surfaced and are fixed in this release. The pyfunc used to sum every source value blindly, including each band's NoData sentinel (e.g. 255 on Byte EO products), and counted only strictly positive cells in the divisor (`np.sum(stacked > 0, axis=0)`), which (a) inflated the numerator with NoData and (b) wrongly excluded valid `0` measurements from the divisor. It also used `np.divide(..., casting='unsafe')`, which **truncates** rather than rounds when casting back to an integer output dtype (Byte / UInt16), producing a systematic downward bias on integer EO stacks. Now the kernel reads each source band's declared NoData (via `BandAccessors.getNoDataValue`, baked into the pyfunc source as a literal list at VRT-write time), masks NoData cells out of both sum and divisor, includes valid `0`s, uses float64 internally, and rounds to nearest even before the unsafe cast when the output dtype is integer. The bogus `np.clip(out_ar, stacked.min(), stacked.max(), ...)` (whose bounds were contaminated by NoData sentinels) is removed. When at least one input declares NoData, that value is also stamped on the output band so downstream consumers calling `GetNoDataValue` can identify all-NoData pixels. Regression coverage in `RST_AggregationsTest`: "excludes declared NoData from both sum and divisor", "counts valid 0 cells in the divisor", "rounds (not truncates) when casting to integer output".
- **Scalar args without `f.lit(...)`.** Python wrappers auto-wrap `bool` / `int` / `float` / `bytes`; Scala adds typed overloads. SQL was already natively-typed. String literals still wrap in `f.lit(...)` per pyspark's column-ref convention. Details and migration examples in [Scalar values vs `lit(...)` wrapping](#scalar-values-vs-lit-wrapping).
- **Example notebooks — EO Series, xView, and enablement diagrams.** New end-to-end walkthroughs under `docs/examples/` covering EO time-series, xView object-detection rasters, and RasterX architecture diagrams.
- **Supply-chain hardening (lockdown).** Jobs pinned to the Databricks-hardened runner group (org-level allowlist, ephemeral VMs, constrained secret access); every Maven dependency, transitive dep, plugin, and plugin dependency is PGP-verified against `.maven-keys.list` before any compile or test execution; pip and Maven routed through JFrog with OIDC; init script + pinned package versions vetted; new [Security](./security.mdx) page in the docs.
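
The corrected `combineavg` rules listed above (mask each source's NoData, keep valid zeros, average in floating point, round instead of truncating) can be sketched for a single pixel as follows. This is a plain-Python illustration only: the function name `combine_avg_pixel` and its parameters are hypothetical, and the real kernel is described as a NumPy pixel function operating on whole arrays.

```python
from typing import Optional, Sequence


def combine_avg_pixel(
    values: Sequence[float],
    nodata: Sequence[Optional[float]],
    integer_output: bool,
    out_nodata: Optional[float] = None,
) -> Optional[float]:
    """Average one pixel across sources, skipping each source's NoData value.

    Hypothetical helper for illustration; not part of GeoBrix.
    """
    # Keep every value that is not its own source's NoData sentinel.
    # Valid zeros are kept: only the sentinel excludes a cell.
    valid = [float(v) for v, nd in zip(values, nodata) if nd is None or v != nd]
    if not valid:
        # Every source is NoData at this pixel, so the output is NoData too.
        return out_nodata
    mean = sum(valid) / len(valid)  # float arithmetic, no early truncation
    # Python's built-in round() rounds halves to the nearest even integer.
    return float(round(mean)) if integer_output else mean
```

With two constant inputs of 50 and 100 and no NoData declared, the sketch returns 75, the same figure used by the "CombineAvg actually averages pixel values" regression case.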
40 changes: 40 additions & 0 deletions docs/docs/packages/rasterx.mdx
@@ -112,6 +112,46 @@ Every RasterX function returns a tile whose `raster` field is a **self-contained

Functions that internally build via an intermediate VRT — `gbx_rst_merge`, `gbx_rst_merge_agg`, `gbx_rst_frombands`, `gbx_rst_combineavg`, `gbx_rst_combineavg_agg`, `gbx_rst_derivedband`, `gbx_rst_derivedband_agg` — materialize the result to GTiff before returning, so downstream stages on different executors see real raster bytes. Inspect a tile's payload format from `tile.metadata.driver`; for any of the functions above, it will read `GTiff` (not `VRT`). See [Beta Release Notes](../beta-release-notes#whats-new-in-v030) for the v0.3.0 correctness fix that introduced this invariant.

## VRT Python pixel functions

`gbx_rst_combineavg`, `gbx_rst_combineavg_agg`, `gbx_rst_derivedband`, and `gbx_rst_derivedband_agg` evaluate a Python pixel function via GDAL's [VRT Python pixel-function API](https://gdal.org/en/stable/drivers/raster/vrt.html#using-derived-bands-with-pixel-functions-in-python). That API is gated behind the GDAL config option `GDAL_VRT_ENABLE_PYTHON`, which **GeoBrix sets to `NO` at executor startup** (see [Security § 6](../security#6-vrt-python-pixel-functions-off-by-default-by-design)). When you call one of the four functions above, GeoBrix flips the option to `YES` for the duration of that call only (via the internal `GDALManager.withVrtPython` bracket) and restores `NO` immediately on return. You don't need to set anything on the cluster or in your notebook to use the built-in functions.

### When you need to enable it yourself

If you're invoking the GDAL Python bindings (`from osgeo import gdal`) **directly** — outside the built-in RasterX functions — and you read a VRT that declares a `<PixelFunctionLanguage>Python</...>` band, you'll get an empty/null read unless you enable the option in the same process. Pick one of:

**Python — programmatic, scoped to your read.** Recommended in all cases. Mirrors what GeoBrix does internally, works for both driver-side `pyspark.sql` calls and inside `mapPartitions` / `mapInPandas` UDFs that load VRT-with-pyfunc via `osgeo.gdal`, and survives interleaving with GeoBrix built-in calls (each GeoBrix call resets the option to `NO` on exit, so re-set it on every read):

```python
from osgeo import gdal

gdal.SetConfigOption("GDAL_VRT_ENABLE_PYTHON", "YES")
try:
    ds = gdal.Open("/path/to/your/vrt-with-pixel-function.vrt")
    arr = ds.GetRasterBand(1).ReadAsArray()
    ds = None
finally:
    gdal.SetConfigOption("GDAL_VRT_ENABLE_PYTHON", "NO")
```
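
If you repeat this pattern in several places, the set/restore pair can be wrapped in a context manager. The sketch below is an assumption-laden illustration, not a GeoBrix API: `vrt_python_enabled` is a hypothetical name, and it takes the `gdal` module as an argument so it can be exercised without a GDAL install. Unlike the snippet above, it restores whatever value was set before entry rather than always writing `NO`.

```python
from contextlib import contextmanager

OPTION = "GDAL_VRT_ENABLE_PYTHON"


@contextmanager
def vrt_python_enabled(gdal_module):
    """Enable VRT Python pixel functions for the duration of a with-block.

    Hypothetical helper; pass in `osgeo.gdal` (or any object exposing
    GetConfigOption / SetConfigOption).
    """
    previous = gdal_module.GetConfigOption(OPTION)  # None when unset
    gdal_module.SetConfigOption(OPTION, "YES")
    try:
        yield
    finally:
        # Restore the earlier value even if the read raised.
        gdal_module.SetConfigOption(OPTION, previous)


# Intended usage (requires GDAL, so it is left commented out here):
# from osgeo import gdal
# with vrt_python_enabled(gdal):
#     ds = gdal.Open("/path/to/your/vrt-with-pixel-function.vrt")
#     arr = ds.GetRasterBand(1).ReadAsArray()
#     ds = None
```

Because each GeoBrix built-in call resets the option to `NO` on exit, enter the block around every read rather than once per session.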

**Cluster env var — for Python-worker processes only.** Setting `spark.executorEnv.GDAL_VRT_ENABLE_PYTHON YES` on the cluster works for Python UDF workers (a separate process from the JVM, where GDAL initializes from env vars). It does **not** help JVM-side reads — GeoBrix calls `gdal.SetConfigOption("GDAL_VRT_ENABLE_PYTHON", "NO")` at executor JVM startup, and `SetConfigOption` takes precedence over the env var. Prefer the programmatic form above unless you have a strong reason to globally enable.

**Scala / JVM code.** If you're writing custom Spark expressions that consume Python-pixel VRTs, wrap the read/translate in the same helper GeoBrix uses internally — it refcounts the option so concurrent tasks on the same executor JVM compose safely:

```scala
import com.databricks.labs.gbx.rasterx.gdal.GDALManager

val result = GDALManager.withVrtPython {
  val ds = org.gdal.gdal.gdal.Open(vrtPath)
  // ... GDAL reads / translates here see the Python pixel function ...
  ds
}
```
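
To see why refcounting lets concurrent tasks compose, here is a language-neutral sketch of the idea, written in Python for brevity. It is not the `GDALManager.withVrtPython` implementation, which is not shown in this document; `RefCountedOption` and its `set_option` callback are hypothetical names. The option is switched on by the first caller to enter and switched off only when the last caller leaves.

```python
import threading
from contextlib import contextmanager


class RefCountedOption:
    """Keep an option enabled while at least one caller is inside the bracket.

    Hypothetical illustration of a refcounted toggle.
    """

    def __init__(self, set_option, key="GDAL_VRT_ENABLE_PYTHON"):
        self._set_option = set_option  # e.g. gdal.SetConfigOption
        self._key = key
        self._lock = threading.Lock()
        self._users = 0

    @contextmanager
    def enabled(self):
        with self._lock:
            self._users += 1
            if self._users == 1:
                # First caller in: switch the option on.
                self._set_option(self._key, "YES")
        try:
            yield
        finally:
            with self._lock:
                self._users -= 1
                if self._users == 0:
                    # Last caller out: switch it back off.
                    self._set_option(self._key, "NO")
```

Without the counter, a task that finishes early would reset the option to `NO` while another task on the same process is still reading its VRT.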

### Trusted-modules variant

GDAL also accepts `GDAL_VRT_ENABLE_PYTHON=TRUSTED_MODULES` plus a `GDAL_VRT_PYTHON_TRUSTED_MODULES` allowlist if you want pixel-function code restricted to specific Python module prefixes. GeoBrix uses the plain `YES` form because the pixel-function source is constructed in-process from trusted (GeoBrix-generated) strings, never from user-supplied VRT XML on disk. If your custom code path reads VRTs whose `<PixelFunctionCode>` originates from less-trusted sources, switch to the `TRUSTED_MODULES` form and allowlist only what you intend to load.

## Usage Examples

### Python/PySpark
18 changes: 18 additions & 0 deletions docs/docs/security.mdx
@@ -206,6 +206,24 @@ publishing details. See
[SECURITY.md](https://github.com/databrickslabs/geobrix/blob/main/SECURITY.md)
for what to include in the report.

### 6. VRT Python pixel functions: off by default by design

GDAL's [VRT Python pixel function API](https://gdal.org/en/stable/drivers/raster/vrt.html#using-derived-bands-with-pixel-functions-in-python)
lets a `<PixelFunctionCode>` element in a VRT XML file execute arbitrary
Python in-process at band-read time. GeoBrix sets `GDAL_VRT_ENABLE_PYTHON=NO`
at executor startup and only flips it to `YES` for the duration of an
individual `combineavg` / `derivedband` call (via the internal
`GDALManager.withVrtPython` bracket). The four built-in functions inject
pyfunc source generated by GeoBrix itself, never by user input.

If your own code consumes Python-pixel VRTs from less-trusted sources
(e.g. you pull VRT XML from object storage that other principals can
write to), either keep the option `NO` and pre-translate to GTiff, or
switch to `GDAL_VRT_ENABLE_PYTHON=TRUSTED_MODULES` with a narrow
`GDAL_VRT_PYTHON_TRUSTED_MODULES` allowlist. See
[RasterX § VRT Python pixel functions](./packages/rasterx#vrt-python-pixel-functions)
for the full how-to.

## Next steps

- [Installation Guide](./installation) — apply the init script as part of
@@ -18,7 +18,9 @@ case class RST_CombineAvg(
) extends InvokedExpression {

/** Raster DataType from the tile array element struct. */
private def rasterType = tileExpr.dataType.asInstanceOf[ArrayType].elementType.asInstanceOf[StructType].fields(1).dataType
private def rasterType = RST_ExpressionUtil.arrayOfTileRasterType(
RST_CombineAvg.name, tileExpr, aggHint = Some("gbx_rst_combineavg_agg")
)
override def children: Seq[Expression] = Seq(tileExpr, ExpressionConfigExpr())
override def dataType: DataType = RST_ExpressionUtil.tileDataType(rasterType)
override def nullable: Boolean = true
@@ -23,7 +23,9 @@ case class RST_MapAlgebra(
jsonSpecExpr: Expression
) extends InvokedExpression {

private def rasterType = tileExpr.dataType.asInstanceOf[ArrayType].elementType.asInstanceOf[StructType].fields(1).dataType
private def rasterType = RST_ExpressionUtil.arrayOfTileRasterType(
RST_MapAlgebra.name, tileExpr, aggHint = None
)
override def children: Seq[Expression] = Seq(tileExpr, jsonSpecExpr, ExpressionConfigExpr())
override def dataType: DataType = RST_ExpressionUtil.tileDataType(rasterType)
override def nullable: Boolean = true
@@ -18,7 +18,9 @@ case class RST_Merge(
) extends InvokedExpression {

/** Raster DataType from the tile array element struct. */
private def rasterType = tileExpr.dataType.asInstanceOf[ArrayType].elementType.asInstanceOf[StructType].fields(1).dataType
private def rasterType = RST_ExpressionUtil.arrayOfTileRasterType(
RST_Merge.name, tileExpr, aggHint = Some("gbx_rst_merge_agg")
)
override def children: Seq[Expression] = Seq(tileExpr, ExpressionConfigExpr())
override def dataType: DataType = RST_ExpressionUtil.tileDataType(rasterType)
override def nullable: Boolean = true
@@ -18,7 +18,9 @@ case class RST_FromBands(
) extends InvokedExpression {

/** Raster DataType from the bands array element struct. */
private def rasterType = bandsExpr.dataType.asInstanceOf[ArrayType].elementType.asInstanceOf[StructType].fields(1).dataType
private def rasterType = RST_ExpressionUtil.arrayOfTileRasterType(
RST_FromBands.name, bandsExpr, aggHint = None
)
override def children: Seq[Expression] = Seq(bandsExpr, ExpressionConfigExpr())
override def dataType: DataType = RST_ExpressionUtil.tileDataType(rasterType)
override def nullable: Boolean = true
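
The four hunks above replace an unchecked cast chain with a shared validation helper. The body of `RST_ExpressionUtil.arrayOfTileRasterType` is not part of this diff, so the following is only a Python sketch of the check it is described as performing, using stand-in type classes. All names here (`StructType`, `ArrayType`, `require_array_of_tiles`) are hypothetical, and the message wording follows the example quoted in the release notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StructType:
    """Stand-in for a tile struct type (hypothetical)."""
    name: str = "STRUCT<...>"


@dataclass(frozen=True)
class ArrayType:
    """Stand-in for an array type wrapping an element type (hypothetical)."""
    element: object


def require_array_of_tiles(
    fn_name: str, data_type: object, agg_hint: Optional[str] = None
) -> StructType:
    """Return the tile struct type, or raise a readable error naming the function."""
    if isinstance(data_type, ArrayType) and isinstance(data_type.element, StructType):
        return data_type.element
    # Describe what was actually received instead of failing on a cast.
    received = getattr(data_type, "name", type(data_type).__name__)
    message = (
        f"{fn_name} expects ARRAY<tile> (e.g. collect_list(tile) or "
        f"array(t1, t2, ...)), but received {received}."
    )
    if agg_hint is not None:
        # Point at the aggregator companion when one exists.
        message += f" To aggregate the column across rows, use {agg_hint}(tile)."
    raise ValueError(message)
```

Passing `agg_hint=None`, as the `RST_MapAlgebra` and `RST_FromBands` call sites do, simply omits the final sentence of the message.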