Skip to content

fix(rasterx): combineavg / derivedband correctness — pixel function actually fires + math repaired#32

Merged
mjohns-databricks merged 6 commits into
mainfrom
beta/0.3.0
May 26, 2026
Merged

fix(rasterx): combineavg / derivedband correctness — pixel function actually fires + math repaired#32
mjohns-databricks merged 6 commits into
mainfrom
beta/0.3.0

Conversation

@mjohns-databricks
Copy link
Copy Markdown
Collaborator

Summary

gbx_rst_combineavg, gbx_rst_combineavg_agg, gbx_rst_derivedband, and gbx_rst_derivedband_agg silently returned one of the input tiles instead of computing the per-pixel combination, with no error surfaced. On a co-extensive temporal mosaic (e.g. monthly EO time-series), the result appeared as a non-deterministic patchwork of different years on multi-executor clusters. Surfaced by a user reproducer over the AL-233 Australian grass TSDM stack (3× 12 monthly tiles, 2022/2023/2024); the team's bug report attributed the symptom to GDAL_VRT_ENABLE_PYTHON not being set, but the real cause turned out to be a stale-Dataset ordering bug in PixelCombineRasters.combine layered with four math bugs in the CombineAVG Python pixel function. Verified locally on the user's data:

Probe 1 (synthetic 50 + 100, expected 75) Probe 2 (real TSDM data vs numpy reference)
main (pre-#29) uniform 100 ⛔ (last input wins) 100% exact match with input 202401
main post-#29 uniform 100 ⛔ (#29 fixed VRT bytes, not the pyfunc) same as above
this PR commit 1 uniform 75 ✓ 100% within-1 of numpy naive mean; max|diff|=0.7 (truncation)
this PR commit 2 uniform 75 ✓ 100% within-1 of numpy mean; max|diff|=0.3 (rounding artifacts only)

What's in this PR

Four commits, each with its own test + release-note bullet. Stacked on top of main after rebasing the original four off the previous beta/0.3.0 tip (the two pre-squash duplicates of #30 / #31 got dropped during rebase as already-applied).

1. fix(rasterx): combineavg/derivedband pixel function actually fires

Two root causes in PixelCombineRasters.combine:

  • Ordering bug. The implementation built a multi-source VRT, re-opened it from disk into a Dataset handle, then mutated the on-disk XML to inject <PixelFunctionLanguage>Python</...>. gdal.Open parses VRT XML at Open time, so the in-memory Dataset never saw the pyfunc; gdal.Translate then ran a default multi-source mosaic (last-source-wins per pixel). On co-extensive inputs that produced output identical to one of the inputs. Fix: inject the pyfunc first, then RasterDriver.read, then executeTranslate. The withVrtPython bracket still wraps read + translate so the embedded Python interpreter is enabled when bands are evaluated.
  • Missing staging-dir creation. PixelCombineRasters wrote its VRT to NodeFilePathUtil.rootPath/combine_rasters_vrt_<uuid>.vrt, but only ClipToGeom called Files.createDirectories(rootPath). If combineavg/derivedband was the first op to hit a fresh JVM, gdalbuildvrt silently failed to write the VRT and RasterDriver.read threw No such file or directory — masked by the existing RST_NoVrtPayloadTest with an rst_clip warmup. PixelCombineRasters.combine now creates the dir itself; the warmup is removed from the test.

Side-effect: RST_ExpressionExecuteTest "RST_DerivedBand should compute derived band from raster" had been passing accidentally with a malformed def compute(pixel): return pixel[0] * 2 pyfunc that never executed. With the fix the pyfunc actually runs; GDAL rejects the bad signature; the test now correctly uses the canonical (in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize, raster_ysize, buf_radius, gt, **kwargs) signature.

Regression: RST_AggregationsTest "CombineAvg actually averages pixel values" — two constant Byte rasters (50, 100) → output must be 75 (not 50 or 100).

2. fix(rasterx): combineavg pixel function math (NoData, valid 0, rounding)

With the pyfunc now firing, four latent bugs in the CombineAVG Python kernel surfaced and are fixed:

Bug Before After
A np.sum(stacked, axis=0) blindly summed NoData sentinel (e.g. 255) mask stacked != nodata per source; baked-in NODATA list read via BandAccessors.getNoDataValue at VRT-write time
B div = np.sum(stacked > 0, axis=0) excluded valid 0 from divisor divisor = valid_mask.sum(axis=0); valid zeros count
C np.divide(..., casting='unsafe') truncated to integer output dtype float64 arithmetic + np.rint(...) round-to-even before unsafe cast, only when output dtype is integer
D np.clip(out_ar, stacked.min(), stacked.max(), ...) — bounds polluted by NoData removed entirely

When at least one input declares NoData, that value is stamped on the output band via SetNoDataValue so downstream GetNoDataValue can detect all-NoData pixels.

Regression: three new tests in RST_AggregationsTest ("excludes declared NoData from both sum and divisor", "counts valid 0 cells in the divisor", "rounds (not truncates) when casting to integer output").

3. test(rasterx): derivedband / derivedband_agg numerical parity coverage

RST_DerivedBand and RST_DerivedBandAgg share the PixelCombineRasters code path with combineavg, so commit 1 fixes them too. The existing derivedband tests only checked that the result was non-null — they would have passed even when the pyfunc silently no-opped. This commit adds explicit pixel-value assertions on both paths:

  • RST_AggregationsTest: in-process RST_DerivedBand doubling pyfunc (input 50 → output 100) + 3-input numpy-mean pyfunc (50 / 100 / 150 → 100).
  • RST_AggEvalTest: end-to-end Spark aggregation via rst_derivedband_agg on three constant tiles 10 / 20 / 30 with a "mean × 2" pyfunc → 40 everywhere. Same file had two noException-only tests using the malformed def myfunc(x): return x * 2 shape; replaced with a canonical VRT pyfunc signature so the existing tests now exercise real Python execution end-to-end.

4. docs(rasterx): GDAL_VRT_ENABLE_PYTHON for custom GDAL code paths

The user team's original bug report blamed GDAL_VRT_ENABLE_PYTHON not being set on the cluster (it's auto-managed in-process by GDALManager.withVrtPython, so that hypothesis was a red herring for the built-ins). But it IS relevant for users invoking osgeo.gdal directly outside the built-ins. Three new sections cover that:

  • docs/docs/packages/rasterx.mdx § "VRT Python pixel functions" — three enable patterns (Python gdal.SetConfigOption, cluster spark.executorEnv env var with caveats, JVM GDALManager.withVrtPython helper) + the TRUSTED_MODULES variant for less-trusted VRT sources.
  • docs/docs/security.mdx § 6 — why the option is off by default, with cross-reference to the rasterx page.
  • docs/docs/beta-release-notes.mdx — bullet pointing at both.

Test plan

  • gbx:test:scala --suite 'com.databricks.labs.gbx.rasterx.*' — 38 suites, 421 tests, 0 failed (2m 58s locally).
  • gbx:lint:scalastyle — 0 errors.
  • User-notebook reproducer (Python pyspark verification script over the AL-233 TSDM data, multi-input combineavg_agg): output now matches numpy reference within 0.3 across 100% of pixels; previously 100% exact match with one of the inputs.
  • Docs static build — clean, no broken-link warnings (anchors verified for both new sections).
  • CI green on this PR (Maven build + scalatest suite + python tests + lint + LFS / lockfile preflights from ci(package-geobrix-artifacts): fail-fast preflights for LFS + lockfile #30).

Release status

v0.3.0 has already been re-cut from the beta/0.3.0 tip with this branch's content (per the soft-launch policy of replacing 0.3.0 rather than minting 0.3.1):

  • Tag re-pointed from the previous 96b12dd to the new tip.
  • package-geobrix-artifacts.yml run 26463934891 republished all six release assets.
  • Release notes body refreshed with a recut warning callout + four new "Highlights" bullets.

Merging this PR brings main in line with the shipped v0.3.0 artifacts.

Context for the user bug report: gbx_rst_combineavg_agg_issue.ipynb over the AL-233 Australian grass TSDM stack (kept locally in the gitignored input/climate_friendly/ dir). The verification script verify_combineavg.py discriminated between (a) the multi-executor VRT-payload bug fixed in #29 (b) the pixel-function ordering bug fixed here (c) the math bugs fixed here.

This pull request and its description were written by Isaac.

Michael Johns added 4 commits May 26, 2026 13:20
Before this commit, gbx_rst_combineavg, gbx_rst_combineavg_agg,
gbx_rst_derivedband and gbx_rst_derivedband_agg silently returned one
of their input tiles instead of the per-pixel combination, with no
error surfaced. On a co-extensive temporal mosaic the patchwork is
non-deterministic per partition — different Spark tasks pick
different "input years" — so the symptom looks like random
year-of-the-stack artifacts on multi-executor clusters.

Two root causes in PixelCombineRasters.combine:

1. Ordering bug. The implementation built a multi-source VRT via
   gdalbuildvrt, RE-OPENED it from disk into a Dataset handle, and
   THEN mutated the on-disk XML to inject
   <PixelFunctionLanguage>Python</...>. gdal.Open parses VRT XML at
   Open time, so the Dataset never saw the pixel function; the
   subsequent gdal.Translate ran a default multi-source mosaic
   (last-source-wins per pixel). On co-extensive inputs that
   produced output identical to one input. Fix: inject the pixel
   function FIRST, then RasterDriver.read, then executeTranslate.
   The withVrtPython bracket still wraps the read+translate so the
   embedded Python interpreter is enabled when bands are evaluated.

2. Missing staging-dir creation. PixelCombineRasters writes its VRT
   to NodeFilePathUtil.rootPath/combine_rasters_vrt_<uuid>.vrt, but
   only ClipToGeom previously called Files.createDirectories on
   that path. If combineavg or derivedband was the first op to hit
   a fresh JVM (or the first op in a test environment), gdalbuildvrt
   silently failed to write the VRT and RasterDriver.read threw
   "No such file or directory" — masked by the existing
   RST_NoVrtPayloadTest with an rst_clip warmup. Fix:
   PixelCombineRasters.combine now Files.createDirectories(rootPath)
   itself; the rst_clip warmup is removed from the test.

Tests:

- src/test/.../RST_AggregationsTest.scala new
  "CombineAvg actually averages pixel values (regression: pixel
  function must fire)" — two constant Byte rasters (50, 100) →
  output must be 75 (not 50 or 100). Before this fix the test
  output was uniform 100 (last input wins); after, it's uniform 75.

- src/test/.../RST_ExpressionExecuteTest.scala
  "RST_DerivedBand should compute derived band from raster" was
  passing accidentally with a malformed `def compute(pixel)` pyfunc
  that never actually ran. After this fix the function does run,
  GDAL rejects the bad signature, and the test fails. Replaced with
  a valid VRT pixel-function signature (in_ar, out_ar, xoff, ...).

- src/test/.../RST_NoVrtPayloadTest.scala dropped the rst_clip
  warmup it used to need; PixelCombineRasters now creates the
  staging dir itself.

Co-authored-by: Isaac
With the ordering fix in place (commit f38c064), the latent math bugs
in the CombineAVG pixel function are now observable and matter on real
EO mosaics. This commit rewrites the average kernel.

Bugs in the previous kernel:

A. NoData included in the numerator. `np.sum(stacked, axis=0)` summed
   every source value blindly, including each band's NoData sentinel
   (commonly 255 on Byte EO products), inflating the mean wherever any
   input cell was NoData.

B. Valid 0 excluded from the divisor. `div = np.sum(stacked > 0, axis=0)`
   counted only strictly-positive cells, so a real measurement of 0
   (e.g. coastal water in a vegetation product) was wrongly omitted
   from the denominator. With (A), errors compound — sum is inflated
   by NoData while count is deflated by valid zeros.

C. Truncation on integer output dtypes. `np.divide(..., casting='unsafe')`
   floors rather than rounds when the destination is Byte / UInt16,
   producing systematic underbias on integer EO stacks.

D. Spurious `np.clip(out_ar, stacked.min(), stacked.max(), ...)`. The
   bounds were polluted by the NoData sentinel (often the global max
   for Byte products), so the clip's upper bound was a no-op while the
   lower bound silently floored real measurements when NoData was 0.

The new kernel:

- Reads each source band's declared NoData via
  BandAccessors.getNoDataValue, baked into the pyfunc source as a
  literal list (one entry per input, None where no NoData is declared)
  at VRT-write time. The Python code masks those cells out of BOTH the
  sum and the divisor.
- Counts valid zeros in the divisor (mask is `value != nodata`, not
  `value > 0`).
- Computes the mean in float64 and rounds to nearest-even before the
  unsafe cast back to the output dtype — only when that dtype is
  integer; float outputs preserve the full mean.
- Drops the bogus stack-min/max clip entirely.

When at least one input declares NoData, that value (the first declared)
is now also stamped on the output band via SetNoDataValue, so callers
that consult GetNoDataValue can identify all-NoData pixels in the result.

Verified on the AL-233_aussie_grass_tsdm reproducer (Byte percentile
data, no actual 255 cells but the rounding fix matters): geobrix output
mean was drifting ~0.16 from the numpy reference under truncation; now
the max per-pixel diff vs numpy's reference mean is 0.3 (round-to-even
artifacts), down from 0.7 in the previous build.

Tests added in RST_AggregationsTest.scala:
- "CombineAvg excludes declared NoData from both sum and divisor":
  input A = half 100, half 255 with NoData=255; input B = all 50;
  output must be 75 on the 100-cells and 50 on the 255-cells.
- "CombineAvg counts valid 0 cells in the divisor": input A all 0,
  input B all 100, neither declares NoData; mean must be 50 (was 100
  before the fix because A was filtered out by the `>0` mask).
- "CombineAvg rounds (not truncates) when casting to integer output":
  mean(51, 100) = 75.5; expect 76 (banker's rounding to even); the
  previous truncation produced 75.

Co-authored-by: Isaac
The PixelCombineRasters ordering fix (f38c064) repaired both
combineavg and derivedband, but the existing derivedband tests only
checked that the result was non-null — they would have passed even
when the pyfunc silently no-opped and gdal.Translate returned a
multi-source mosaic of the inputs. This commit adds explicit
pixel-value assertions on both the in-process and the Spark-
aggregation derivedband paths.

RST_AggregationsTest (two new tests, in-process .execute() path):
- "DerivedBand actually transforms pixel values (regression: pixel
  function must fire)": single input value=50 with a doubling pyfunc
  must produce 100 across the output. Would have failed before fix
  #1 (output would have been 50 — the input passed through unchanged
  via default mosaic).
- "DerivedBand averages multi-input pixel values": three constant
  Byte inputs 50 / 100 / 150 with a numpy-mean pyfunc must yield 100
  everywhere. Exercises multi-source averaging through the
  user-pyfunc branch.

RST_AggEvalTest (one new test, end-to-end Spark aggregation):
- "rst_derivedband_agg actually transforms pixel values (parity with
  combineavg_agg fix)": three constant tiles 10 / 20 / 30 collected
  via groupBy + rst_derivedband_agg with a "mean × 2" pyfunc must
  yield a single result tile of 40 across all pixels. Confirms the
  aggregator buffer + PixelCombineRasters.combine wiring works for
  derivedband identically to combineavg.

Two previously-passing tests in this file used the malformed
`def myfunc(x): return x * 2` shape — not a valid VRT Python
pixel-function signature. They passed because the surrounding
`noException` assertion only checked that nothing threw, and the
underlying ordering bug ensured the pyfunc never actually ran. The
old pyfunc is replaced with a canonical
`(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize, raster_ysize,
buf_radius, gt, **kwargs)` signature; the existing `noException`
tests now hit real Python execution end-to-end, and the new
parity test pins the numerical contract.

Co-authored-by: Isaac
Built-in combineavg / derivedband calls auto-enable VRT Python via the
in-process GDALManager.withVrtPython bracket — no cluster config is
required to use them. But users running their own osgeo.gdal /
org.gdal.gdal code outside the built-in RasterX functions had no
documentation explaining (a) that GeoBrix defaults the option to NO
at executor startup and (b) how to enable it safely for their own
VRT reads.

docs/docs/packages/rasterx.mdx (new "VRT Python pixel functions"
section):

- States the invariant: built-ins handle the toggle internally; only
  custom GDAL code paths need to enable it.
- Three enable patterns with code examples:
  * Python — programmatic gdal.SetConfigOption("YES") around the read
    (recommended; works for driver-side and Python-worker UDFs, and
    survives interleaving with GeoBrix calls since each call resets
    to NO on exit).
  * Cluster env var (spark.executorEnv.GDAL_VRT_ENABLE_PYTHON YES) —
    documented with the caveat that it only takes effect in
    Python-worker processes, NOT in the executor JVM (where
    GeoBrix's SetConfigOption shadows the env var by precedence).
  * Scala / JVM — GDALManager.withVrtPython refcount helper for
    custom Spark expressions that compose with built-ins.
- TRUSTED_MODULES variant noted for users who read VRTs whose
  PixelFunctionCode originates from less-trusted sources.

docs/docs/security.mdx — new section 6 under "How you can build on
this foundation" briefly explaining why GeoBrix ships the option NO
by default (arbitrary in-process Python execution from VRT XML) and
linking to the rasterx page for the how-to. No new security
guarantee, just surfacing the existing design choice.

docs/docs/beta-release-notes.mdx — one bullet for v0.3.0 pointing
at both new sections.

Co-authored-by: Isaac
The "Released 2026-05-19" line dated the original v0.3.0 release.
After the 2026-05-26 recut (combineavg / derivedband correctness
fixes; tag re-pointed at the new branch tip; all six release assets
re-built), the user-facing release date now matches the recut date.

Co-authored-by: Isaac
Calling gbx_rst_combineavg, gbx_rst_merge, gbx_rst_frombands, or
gbx_rst_mapalgebra on a single tile column (instead of an
ARRAY<tile> like collect_list(tile)) used to surface as a raw
ClassCastException from inside Catalyst's CheckAnalysis:

    java.lang.ClassCastException: class
    org.apache.spark.sql.types.StructType cannot be cast to class
    org.apache.spark.sql.types.ArrayType
        at com.databricks.labs.gbx.rasterx.expressions
                .RST_CombineAvg.rasterType(RST_CombineAvg.scala:21)

— a hostile, untraceable error from a notebook. The root cause is
four unguarded `tileExpr.dataType.asInstanceOf[ArrayType]` calls in
the expression files (one for each ARRAY<tile>-accepting public
function).

New helper RST_ExpressionUtil.arrayOfTileRasterType takes the
function name, the tile-expr, and an optional pointer to the
aggregator companion; it pattern-matches on ArrayType(StructType(_))
and on mismatch raises IllegalArgumentException with a message of
the form:

    gbx_rst_combineavg expects ARRAY<tile> (e.g. collect_list(tile)
    or array(t1, t2, ...)), but received STRUCT<...>. To aggregate
    the column across rows, use gbx_rst_combineavg_agg(tile).

The four call sites now route through this helper:

- RST_CombineAvg   → aggHint = gbx_rst_combineavg_agg
- RST_Merge        → aggHint = gbx_rst_merge_agg
- RST_FromBands    → no aggregator companion (aggHint = None)
- RST_MapAlgebra   → no aggregator companion (aggHint = None)

Spark 4.0's AnalysisException dropped its (String) constructor in
favor of the error-class form; throwing
IllegalArgumentException(msg) surfaces during analysis with the
full message intact and avoids depending on Spark-internal error
catalogs that may shift between releases.

Regression coverage in RST_AggEvalTest: invokes
gbx_rst_combineavg(tile) (single-tile column) via SQL selectExpr
and walks the cause-chain to assert both the function name and the
aggregator-companion hint appear in the surfaced message.

Surfaced by the user reproducer notebook
input/climate_friendly/gbx_rst_combineavg_agg_issue.ipynb after the
v0.3.0 recut: they had written .selectExpr("gbx_rst_combineavg(tile)
AS tile") (non-agg) when they wanted the row-aggregate
combineavg_agg form. The non-agg-on-single-tile case is now caught
at analysis with a one-line fix suggestion.

Co-authored-by: Isaac
@mjohns-databricks mjohns-databricks merged commit a3eb2af into main May 26, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant