beta(0.4.0): 45 new functions + PMTiles DataSource #33

Draft
mjohns-databricks wants to merge 178 commits into main from beta/0.4.0

@mjohns-databricks
Collaborator

Summary

GeoBrix v0.4.0 — 45 new functions + 1 new DataSource across 12 implementation waves merged between 2026-05-27 and 2026-05-28. This PR opens as DRAFT for review while the wave-by-wave commit history is fresh; mark ready when the diff has been audited.

What's new

See docs/docs/beta-release-notes.mdx § What's new in v0.4.0 for the full per-feature changelog (12 bullets). High-level groupings:

VectorX (new expression surface) — first expression-level functions in VectorX:

  • gbx_st_asmvt — UDAF aggregating features into Mapbox Vector Tile (MVT) protobuf
  • gbx_st_asmvt_pyramid — generator: feature → many (z, x, y, mvt_bytes) rows

GridX (new quadbin subpackage):

  • 9 quadbin grid-math functions in gridx/quadbin/ (pointascell, aswkb, centroid, resolution, polyfill, kring, tessellate, cellunion, distance) — CARTO quadbin v0 spec, 64-bit Long cell IDs aligned with the web-mercator XYZ tile grid.

RasterX (29 new functions):

  • Vector↔raster bridge: gbx_rst_rasterize, gbx_rst_polygonize
  • 5 raster→quadbin aggregators (parallel to existing H3 family)
  • Web-mercator XYZ tile output: to_webmercator, tilexyz, xyzpyramid
  • 7 terrain analysis: slope, aspect, hillshade, tri, tpi, roughness, color_relief
  • 5 spectral indices: evi, savi, ndwi, nbr, plus generic index dispatcher
  • 5 resample + IDW: resample / _to_size / _to_res, gridfrompoints + _agg
  • 7 pixel ops + extraction: fillnodata, sample, setsrid, histogram, threshold, buildoverviews, band
  • 4 analysis: cog_convert, proximity, contour, viewshed

PMTiles (new top-level package):

  • gbx_pmtiles_agg UDAF — returns BINARY PMTile v3 blob
  • .write.format("pmtiles") DataSource — streams larger pyramids to file via partitioned commit protocol
  • Native Scala v3 encoder, no GDAL/OGR dependency for the container

Test plan

  • All 12 wave plans completed with passing tests (Scala / Python / SQL-docs / function-info parity)
  • Streamlined-tests guideline applied from Wave 5 onward — no suite over 2-min wall-clock
  • CI (build main) green on beta/0.4.0 (latest run)
  • Hash-pinned Python deps unchanged (no new transitive deps in any wave)
  • No new Maven deps in pom.xml — .maven-keys.list unchanged
  • No new third-party GitHub Actions — pinned-SHA policy intact
  • Pre-merge wave-references scrub on every wave's docs (per new .cursor/rules/user-facing-docs-voice.mdc rule)
  • Function-info.json regenerated (139 entries)
  • Diagram pill PNGs regenerated to v0.4.0
  • Reviewer: walk the wave-by-wave merge commits — each is a self-contained reviewable unit
  • Reviewer: validate the new PMTiles DataSource interaction with existing META-INF/services/...DataSourceRegister registrations

This pull request and its description were written by Isaac.

Michael Johns added 30 commits May 27, 2026 17:00
Bumps the canonical version (pom.xml, python __init__.py) and all
load-bearing references: JAR paths in 4 conftests (geobrix-0.4.0-jar-with-dependencies.jar),
docs site (docs/package.json), notebook bundle fallback, beta-release-notes
"Current version" banner, issue-template version line in support.mdx, and the
diagram-pill strings in rasterx-{function-categories,tile-structure}.py.

The committed PNGs still display v0.3.0 until resources/images/*.py is re-run;
that step is tracked on the 0.4.0 release checklist (see auto-memory
release_pill_regeneration.md).

Historical narrative references to v0.3.0 (release-notes "What's new" section,
"As of v0.3.0..." prose in api/overview, notebook READMEs, security.mdx) are
intentionally NOT changed.

Co-authored-by: Isaac
Initial TDD step for Wave 6: tests for the native Scala PMTiles v3
encoder. Asserts header magic+version, addressed_tiles_count field
layout, Hilbert id determinism+uniqueness, monotonicity across zooms,
tile-data round-trip, and RLE deduplication. Fails to compile until
PMTilesV3Encoder lands in Task 2.
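
For orientation, the header checks listed above can be approximated outside the Scala suite. The following is a minimal Python sketch, not the project's test code: the function name `read_header_fields` is hypothetical, and it assumes the PMTiles v3 layout of a 7-byte `PMTiles` magic, a version byte at offset 7, and little-endian uint64 fields at offsets 56 (tile data offset) and 72 (addressed tiles count).

```python
import struct

HEADER_LEN = 127       # fixed PMTiles v3 header size
MAGIC = b"PMTiles"     # 7-byte magic, followed by a 1-byte version


def read_header_fields(blob):
    """Return a few fixed-offset header fields (hypothetical helper)."""
    if len(blob) < HEADER_LEN:
        raise ValueError("blob is shorter than the 127-byte header")
    if blob[:7] != MAGIC:
        raise ValueError("missing PMTiles magic")
    # uint64 fields are little-endian; offsets assumed from the v3 layout table.
    (tile_data_offset,) = struct.unpack_from("<Q", blob, 56)
    (addressed_tiles_count,) = struct.unpack_from("<Q", blob, 72)
    return {
        "version": blob[7],
        "tile_data_offset": tile_data_offset,
        "addressed_tiles_count": addressed_tiles_count,
    }
```

A test would typically call this on the encoder output and compare `addressed_tiles_count` with the number of input tiles.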

Co-authored-by: Isaac
Pure-Scala CARTO quadbin v0 implementation in
com.databricks.labs.gbx.gridx.grid.Quadbin, matching the reference
quadbin-py bit layout (HEADER bit 62, mode bits 59..61, resolution
bits 52..58, 52-bit Morton-interleaved x/y in bits 0..51, FOOTER
tail-padding).
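
The bit layout described above can be illustrated with a short Python sketch. This is not the Scala implementation: the function names are hypothetical, the cell mode value of 1 is an assumption taken from the quadbin-py reference, and the bit-by-bit interleave loop stands in for whatever faster interleave the real code uses.

```python
HEADER = 1 << 62   # header bit 62
MODE_CELL = 1      # assumed mode value for cells, stored in bits 59..61
MAX_RESOLUTION = 26


def tile_to_quadbin(z, x, y):
    """Encode an XYZ tile as a quadbin cell id (illustrative sketch)."""
    if not 0 <= z <= MAX_RESOLUTION:
        raise ValueError("resolution must be in 0..26")
    if not (0 <= x < (1 << z) and 0 <= y < (1 << z)):
        raise ValueError("tile is out of range for this resolution")
    morton = 0
    for i in range(z):
        morton |= ((x >> i) & 1) << (2 * i)        # x bits in even positions
        morton |= ((y >> i) & 1) << (2 * i + 1)    # y bits in odd positions
    unused = 52 - 2 * z                 # low bits not needed at this resolution
    footer = (1 << unused) - 1          # tail padding: unused bits set to 1
    return HEADER | (MODE_CELL << 59) | (z << 52) | (morton << unused) | footer


def quadbin_resolution(cell):
    """Read the resolution back from bits 52..58."""
    return (cell >> 52) & 0x7F
```

The resolution-0 cell comes out as `0x480FFFFFFFFFFFFF`, which makes the header, mode, and footer contributions easy to see in hex.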

Co-authored-by: Isaac
Native Scala PMTiles v3 binary encoder for Wave 6. Produces a single
PMTile blob with the spec's five-section layout (header + root
directory + JSON metadata + leaf directories + tile data); no GDAL/OGR
dependency for the container itself — tile bytes are passed through
verbatim.

- PMTilesEntry: (tile_id, offset, length, run_length) tuple per spec § 4.1.
- PMTilesV3Encoder:
  - hilbertId(z, x, y): cumulative TileID = (4^z - 1)/3 + xy2d Hilbert
    index; matches spec table for z=0..2 exactly.
  - encode(): sorts tiles by Hilbert id, dedups identical content
    (SHA-256-keyed), RLE-merges consecutive identical-content runs.
  - encodeDirectory(): five-part varint encoding per spec § 4.2
    (count, delta tile_ids, run_lengths, lengths, offsets-with-0-for-
    contiguous).
  - 127-byte header, all uint64s little-endian, addressed_tiles_count
    at offset 72, tile_data_offset at 56 (per spec § 3.1 layout table).
  - Errors out clearly if root directory would exceed 16,257 bytes —
    points caller to the DataSource writer path.
  - Internal & tile compression default to none (0x01) for v0.4.0;
    callers may override the tile-compression byte to advertise
    already-applied gzip/brotli/zstd.
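
The `hilbertId` formula in the list above can be sketched in Python as follows. This is an illustration rather than the Scala code: the function name is hypothetical and the loop is the textbook xy2d Hilbert conversion, which is assumed to match what the encoder does.

```python
def hilbert_tile_id(z, x, y):
    """Cumulative PMTiles tile id: tiles below zoom z plus the Hilbert index."""
    n = 1 << z
    if z < 0 or not (0 <= x < n and 0 <= y < n):
        raise ValueError("tile is out of range for this zoom")
    base = (4 ** z - 1) // 3   # number of tiles in zooms 0..z-1
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:            # rotate/flip the quadrant (xy2d step)
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return base + d
```

Because each zoom occupies a contiguous id range starting at `(4^z - 1)/3`, ids are unique within a zoom and increase from one zoom to the next, which is what the determinism and monotonicity tests rely on.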

All 7 unit tests green (PMTilesV3EncoderTest).
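
The five-part directory encoding described in the list above can be sketched in Python. The names `varint` and `encode_directory` are hypothetical, and the sketch leaves out internal compression, which defaults to none in v0.4.0. Following the PMTiles v3 spec, an offset is written as 0 when the entry is contiguous with the previous one and as `offset + 1` otherwise.

```python
def varint(n):
    """Unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_directory(entries):
    """entries: (tile_id, offset, length, run_length) tuples sorted by tile_id."""
    out = bytearray(varint(len(entries)))          # 1. entry count
    last_id = 0
    for tile_id, _off, _len, _run in entries:      # 2. delta-encoded tile ids
        out += varint(tile_id - last_id)
        last_id = tile_id
    for _id, _off, _len, run in entries:           # 3. run lengths
        out += varint(run)
    for _id, _off, length, _run in entries:        # 4. byte lengths
        out += varint(length)
    for i, (_id, off, _len, _run) in enumerate(entries):   # 5. offsets
        prev = entries[i - 1] if i > 0 else None
        if prev is not None and off == prev[1] + prev[2]:
            out += varint(0)          # contiguous with the previous entry
        else:
            out += varint(off + 1)
    return bytes(out)
```

Writing the columns one after another, rather than entry by entry, keeps similar values adjacent, which helps when a directory is later compressed.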

Co-authored-by: Isaac
Mutable aggregation buffer for the gbx_pmtiles_agg TypedImperative
aggregator. Holds an ArrayBuffer[(z, x, y, bytes)] plus optional
metadata JSON; serialize/deserialize use length-prefixed payloads so
the buffer ships cleanly between executors during the merge phase.

A 100 MiB per-buffer payload cap guards against runaway pipelines —
the UDAF path is limited by Spark's 2 GiB cell size, so the error
explicitly points users to the .write.format("pmtiles") DataSource
for larger pyramids.
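
As a rough illustration of the length-prefixed approach, here is a Python sketch of a buffer round trip. The wire format shown (a flag byte, a metadata length, a tile count, then per-tile `z, x, y, length` headers) is an assumption made for the example and is not the layout used by the Scala buffer. The `limit` parameter is also hypothetical and exists so the cap can be exercised with small inputs.

```python
import struct

MAX_PAYLOAD_BYTES = 100 * 1024 * 1024   # the 100 MiB cap described above


def serialize(tiles, metadata_json=None, limit=MAX_PAYLOAD_BYTES):
    """tiles: list of (z, x, y, data) tuples; returns one bytes payload."""
    total = sum(len(data) for _z, _x, _y, data in tiles)
    if total > limit:
        raise ValueError(
            "tile payload of %d bytes exceeds the %d-byte cap; "
            "use the pmtiles DataSource writer for larger pyramids" % (total, limit)
        )
    meta = b"" if metadata_json is None else metadata_json.encode("utf-8")
    out = bytearray(struct.pack("<BI", int(metadata_json is not None), len(meta)))
    out += meta
    out += struct.pack("<I", len(tiles))
    for z, x, y, data in tiles:
        out += struct.pack("<iiiI", z, x, y, len(data))   # length prefix
        out += data
    return bytes(out)


def deserialize(buf):
    """Inverse of serialize; returns (tiles, metadata_json)."""
    has_meta, meta_len = struct.unpack_from("<BI", buf, 0)
    pos = struct.calcsize("<BI")
    meta = buf[pos:pos + meta_len].decode("utf-8") if has_meta else None
    pos += meta_len
    (count,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    tiles = []
    for _ in range(count):
        z, x, y, n = struct.unpack_from("<iiiI", buf, pos)
        pos += struct.calcsize("<iiiI")
        tiles.append((z, x, y, bytes(buf[pos:pos + n])))
        pos += n
    return tiles, meta
```

Merging two buffers then amounts to deserializing both and concatenating the tile lists before the next serialize call.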

Co-authored-by: Isaac
One Spark expression case-class per CARTO quadbin operation, each
following the InvokedExpression + WithExpressionInfo pattern shared
with BNG: Quadbin_PointAsCell, Quadbin_AsWKB, Quadbin_Centroid,
Quadbin_Resolution, Quadbin_Polyfill, Quadbin_KRing, Quadbin_Tessellate,
Quadbin_CellUnion, Quadbin_Distance. EWKB output uses SRID=4326.

Co-authored-by: Isaac
TypedImperativeAggregate that materializes (z, x, y, bytes) rows
into a single PMTile v3 BINARY blob via PMTilesV3Encoder.

- PMTiles_Agg: 4-or-5-arity expression (bytes, z, x, y,
  [metadata_json]). Overrides children and withNewChildrenInternal
  manually since there is no Quaternary/QuintaryLike trait in
  Catalyst.
- Auto-detects tile_type from the first non-null payload's magic
  bytes: PNG (89 50 4E 47), JPEG (FF D8), WebP (RIFF...WEBP),
  otherwise MVT. Magic-byte sniffer is private[pmtiles] for unit
  tests.
- functions.scala: register() + Scala API (pmtiles_agg with
  4-arg, 5-arg Column, and String-literal-metadata overloads).
- Empty group returns a valid header-only PMTile (not null), so
  downstream consumers always see well-formed bytes.

All 7 e2e tests green (PMTiles_AggTest) — including multi-partition
shuffle-merge through a repartition(4) → agg path.
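
The magic-byte detection can be sketched in a few lines of Python. The function name is hypothetical; the numeric codes follow the PMTiles v3 tile-type enumeration (1 = MVT, 2 = PNG, 3 = JPEG, 4 = WebP), and the MVT fallback mirrors the behaviour described above.

```python
TILE_TYPE_MVT = 1
TILE_TYPE_PNG = 2
TILE_TYPE_JPEG = 3
TILE_TYPE_WEBP = 4


def detect_tile_type(payload):
    """Guess the PMTiles tile_type from the first bytes of a tile payload."""
    if payload.startswith(b"\x89PNG"):        # 89 50 4E 47
        return TILE_TYPE_PNG
    if payload.startswith(b"\xff\xd8"):       # JPEG start-of-image marker
        return TILE_TYPE_JPEG
    if payload[:4] == b"RIFF" and payload[8:12] == b"WEBP":
        return TILE_TYPE_WEBP
    return TILE_TYPE_MVT                      # no magic matched: assume MVT
```

Since MVT protobufs carry no magic number, anything unrecognised is classified as MVT, so callers with other payload types would need an explicit override.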

Co-authored-by: Isaac
Add gridx.quadbin.functions.register() (idempotent per session) and
the typed Column wrappers (quadbin_pointascell, _aswkb, _centroid,
_resolution, _polyfill, _kring, _tessellate, _cellunion, _distance)
with int-scalar overloads for ergonomics. Wire RegisterBatch to
dispatch "gridx.quadbin" and include it in "all".

Co-authored-by: Isaac
Register quadbin SQL functions with Spark and exercise each via the
Column API: pointascell + resolution round-trip, aswkb / centroid
EWKB parseback (SRID=4326), polyfill cells-at-z, kring cardinality,
tessellate chips, cellunion to MultiPolygon, and zero / one Chebyshev
distance assertions.

Co-authored-by: Isaac
Mirror the Scala API as databricks.labs.gbx.gridx.quadbin.functions
with one ColLike wrapper per function and an idempotent register()
that uses register_ds + functions=gridx.quadbin.

Co-authored-by: Isaac
PySpark infers schema from Python ints as LongType by default, which
caused dispatch to InvokedExpression to fail to find an Int-arg eval
method (verified with PyPI quadbin cross-check). Add Long-arg eval
overloads (delegating to .toInt) for pointascell, polyfill, kring,
and tessellate.

Co-authored-by: Isaac
Spark V2 DataSource for streaming larger pyramids to a single PMTile
file. Replaces the in-memory `gbx_pmtiles_agg` UDAF path when the
pyramid exceeds Spark's 2 GiB cell limit.

Wiring:
  - PMTiles_DataSource: TableProvider + DataSourceRegister
    (shortName "pmtiles"); registered via META-INF/services.
  - PMTiles_Table: capabilities BATCH_READ + BATCH_WRITE + TRUNCATE.
    BATCH_READ is declared only so the read code path lands in
    newScanBuilder where we throw a descriptive
    "Reading PMTiles archives is not supported in GeoBrix 0.4.0"
    error rather than letting Spark surface a vague "is not a valid
    Spark SQL Data Source".
  - PMTiles_WriteBuilder implements SupportsTruncate.

Schema validation:
  - Required write schema: exactly (z INT, x INT, y INT, bytes BINARY).
  - PMTiles_DataSource.validateWriteSchema mirrors the GDAL writer's
    exact-schema policy (gdal_writer_schema.md memory). Both missing
    and extra columns plus type mismatches surface a clear error
    naming the canonical schema.
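
A minimal Python sketch of an exact-schema check along these lines follows. It is not the Scala `validateWriteSchema`: the function name, the use of Spark `simpleString` type names, and the choice to match columns by name rather than by position are all assumptions made for the example.

```python
REQUIRED_SCHEMA = [("z", "int"), ("x", "int"), ("y", "int"), ("bytes", "binary")]
CANONICAL = "(z INT, x INT, y INT, bytes BINARY)"


def validate_write_schema(fields):
    """fields: list of (column_name, type_name) pairs from the DataFrame."""
    required = dict(REQUIRED_SCHEMA)
    actual = dict(fields)
    missing = [name for name in required if name not in actual]
    extra = [name for name in actual if name not in required]
    mismatched = [
        "%s is %s" % (name, actual[name])
        for name in required
        if name in actual and actual[name] != required[name]
    ]
    problems = []
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if extra:
        problems.append("unexpected columns: " + ", ".join(extra))
    if mismatched:
        problems.append("wrong types: " + ", ".join(mismatched))
    if problems:
        # Always name the canonical schema so the fix is obvious to the caller.
        raise ValueError(
            "pmtiles write schema must be exactly %s; %s"
            % (CANONICAL, "; ".join(problems))
        )
```

Collecting all three kinds of problem before raising lets a caller fix the schema in one pass.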

Partitioned commit protocol:
  - Per-task PMTiles_RowWriter:
    1. Streams tile bytes to {parent}/_part_{partId}_{taskId}.tdata.
    2. Maintains an in-task SHA-256 content map so duplicate tiles
       in the same partition share one blob; consecutive identical-
       content TileIDs RLE-merge into one entry with run_length > 1.
    3. On commit, writes a sidecar .entries file with
       (tileId, offsetWithinPart, length, runLength) tuples and
       returns a PMTiles_WriterMsg carrying the partitionId, basenames,
       and cumulative byte count.
    4. On abort, deletes both scratch files.
  - Driver-side PMTiles_BatchWrite.commit:
    1. Sorts committed messages by partitionId for deterministic
       layout, computes cumulative partition offsets.
    2. Reads each .entries file, rebases offsets into the global
       frame, sorts the merged entry list by tileId, then encodes
       the root directory via PMTilesV3Encoder.encodeDirectory.
    3. Streams the final file: header(127) || root_dir || metadata ||
       (leaf dirs empty in v0.4.0) || concatenated .tdata segments.
    4. Cleans up scratch.
  - abort(): deletes _part_* and .entries scratch files.
  - Errors out cleanly if the global root directory would exceed
    16,257 bytes (spec § 4) — v0.4.0 does not yet emit leaf dirs.
  - Tile-type detection: prefers explicit `tileType` option; else
    sniffs the first non-empty partition's first bytes via the
    PMTiles_Agg.detectTileType magic-byte matcher.
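
The partitioned commit protocol above can be modelled in memory with a short Python sketch. Everything here is illustrative: the function names are hypothetical, scratch files are replaced by byte strings, and the per-task sort by tile id is an assumption about input order. Offsets are relative to the start of the tile-data section.

```python
import hashlib


def write_partition(tiles):
    """Per-task step. tiles: iterable of (tile_id, payload) pairs."""
    data = bytearray()   # stands in for the .tdata scratch file
    seen = {}            # SHA-256 digest -> (offset, length) within this part
    entries = []         # [tile_id, offset, length, run_length]
    for tile_id, payload in sorted(tiles, key=lambda t: t[0]):
        digest = hashlib.sha256(payload).digest()
        if digest not in seen:
            seen[digest] = (len(data), len(payload))
            data += payload
        offset, length = seen[digest]
        last = entries[-1] if entries else None
        if last and last[1] == offset and last[0] + last[3] == tile_id:
            last[3] += 1   # consecutive id, same content: extend the run
        else:
            entries.append([tile_id, offset, length, 1])
    return [tuple(e) for e in entries], bytes(data)


def commit(parts):
    """Driver step. parts: list of (partition_id, entries, data)."""
    merged, blob, base = [], bytearray(), 0
    for _pid, entries, data in sorted(parts, key=lambda p: p[0]):
        # Rebase part-local offsets into the global tile-data frame.
        merged += [(t, base + off, ln, run) for t, off, ln, run in entries]
        blob += data
        base += len(data)
    merged.sort(key=lambda e: e[0])
    return merged, bytes(blob)
```

Deduplication in this sketch is scoped to a single partition, so identical tiles written by different tasks would still be stored once per task.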

All 7 DataSource tests green:
  - 100 tiles × 4 partitions → single output file with cleaned scratch
  - single-partition write
  - missing-column + wrong-type + extra-column schema rejections
  - metadataJson option round-trip
  - read path raises our own "not supported" error (asserts the
    specific message, not just "not supported" generic text).

Co-authored-by: Isaac
Add quadbin_*_sql_example() returning SQL strings for each quadbin
function, register gbx_quadbin_* in registered_functions.txt
(alphabetical), regenerate function-info.json (98/98 functions
covered), wire the docs conftest to register Quadbin alongside
BNG/RasterX/VectorX, and teach generate-function-info.py to scan
the quadbin_ prefix from the gridx module (and to look up
path_config from docs/tests/python on sys.path).

Co-authored-by: Isaac
The OGR RegisterAll() and GetDriverByName() native calls require
libgdalalljni.so to be System.load'd on the executor JVM. RasterX does
this via GDALManager.loadSharedObjects when its register(spark) runs,
but VectorX has no equivalent code path yet — and the load has to happen
on the executor (where eval runs), not just on the driver. Adds an
idempotent native-loader to MvtWriter.encode so st_asmvt works in
docs-test sessions that haven't initialized GDAL via rasterx.

Co-authored-by: Isaac
Add a "Quadbin (CARTO v0)" section to gridx.mdx parallel to the BNG
section, with function categories and SQL examples. Add a "What's
new in v0.4.0" bullet to beta-release-notes.mdx describing the new
gbx_quadbin_* family.

Co-authored-by: Isaac
Python entry point (databricks.labs.gbx.pmtiles.functions):
  - register(spark): wires into the existing register_ds DataSource
    plumbing — adds a new "pmtiles" branch to RegisterBatch alongside
    "gridx.bng", "vectorx.jts.legacy", "rasterx", "all" so the same
    `spark.read.format("register_ds").option("functions", ...)` pattern
    works.
  - pmtiles_agg(bytes, z, x, y, metadata_json=None): UDAF wrapper.
    metadata_json accepts Column or bare Python str (auto-wrapped via
    f.lit). Defaults to "{}" — passing None gets the default, NOT a
    column reference (we don't follow the pyspark string-as-col-ref
    convention here because the metadata default-of-"{}" was confusing
    otherwise).

Scala coercion fix in PMTiles_Agg.update:
  - PySpark's createDataFrame infers Python int as LongType. Previously
    the aggregator did .asInstanceOf[Int] on the z/x/y values, which
    threw ClassCastException for LongType columns. New
    PMTiles_Agg.toIntCoerce helper accepts Int, Long, java.lang.Integer,
    java.lang.Long; throws a clear error for any other type or null.
  - DataSource write schema still requires IntegerType strictly (it's
    a write-time contract, not a read-time coercion).

Python tests (6, all passing):
  - registration via register() + SHOW USER FUNCTIONS lookup
  - pmtiles_agg returns valid PMTile blob with correct magic + count
  - metadata JSON round-trips through the encoded archive
  - PNG magic bytes auto-detected into tile_type=2
  - .write.format("pmtiles").mode("overwrite").save(path) writes a
    single file with cleaned scratch
  - read path raises our "Reading PMTiles archives is not supported"
    error rather than class-not-found.

Co-authored-by: Isaac
9 quadbin grid-math functions in new gridx/quadbin/ subpackage plus
gridx/grid/Quadbin.scala cell-math helper. All 17 Scala tests + 9
Python tests + 9 function-info tests passing; Quadbin.scala coverage
96.4%; PyPI cross-check 5/5 match vs quadbin-py.

Wave 3 agent corrected the CARTO quadbin v0 bit layout against the
canonical Python reference (HEADER bit 62 / mode bits 59..61 / res bits
52..58 / 52-bit Morton). Scope doc at input/scoping-quadbin-spatial-binning.md
should be updated to match (follow-up).

Also added Long overloads for Int args (PySpark sends Python ints as
LongType), wired RegisterBatch dispatch, and fixed a pre-existing
path_config import in generate-function-info.py.

Co-authored-by: Isaac
…ase notes

- docs/tests/python/api/pmtiles_functions_sql.py: SQL examples for
  gbx_pmtiles_agg (5-arg with metadata, 4-arg default-metadata form).
  Tests in test_pmtiles_functions_sql.py exercise them against the live
  UDAF and assert the resulting PMTile v3 magic + addressed-tiles count.
- docs/tests-function-info/registered_functions.txt: add gbx_pmtiles_agg
  (DataSource format string `pmtiles` is NOT listed — it's not a SQL
  function).
- docs/scripts/generate-function-info.py: add PMTILES_MODULE + the
  ("pmtiles", "gbx_pmtiles_") package prefix so the generator picks up
  the new examples and writes them into function-info.json. The
  resulting JSON file is also committed; coverage tests now pass.
- docs/docs/packages/pmtiles.mdx: new standalone docs page. Covers when
  to pick UDAF vs DataSource, register/save patterns in Python+SQL+Scala,
  the exact write schema, tile-type detection table, tile-compression
  override, CORS/MapLibre snippet for serving from object storage, and
  the v0.4.0 limits (no leaf dirs, no read path, no cross-task dedup).
- docs/docs/beta-release-notes.mdx: insert "What's new in v0.4.0"
  section before v0.3.0 with the canonical PMTiles bullet from the
  plan.

Function-info coverage tests green (9/9); docs-SQL tests green (3/3).

Co-authored-by: Isaac
Michael Johns added 18 commits May 29, 2026 10:58
Migrated Overview, Key Features, Tile payload, VRT Python pixel
functions, and Usage Examples (Python/Scala/SQL) from
packages/rasterx.mdx into api/rasterx-functions.mdx. Added the
RasterX.png banner and rasterx-function-categories.png image with the
category listing. Removed the stale cross-reference to the package
page from the Pixel ops section. Dropped the back-link to
packages/rasterx from Next Steps (replaced with readers/gdal link).

Migrated Overview (BNG + Quadbin descriptions, import paths, registration
note), Key Features, BNG Structure, BNG Grid Reference Format (Standard
Format + Precision Levels table with examples), Major Grid Squares, and
Quadbin concept section from packages/gridx.mdx into the top of
api/gridx-functions.mdx. Added the GridX.png banner. Carried over the
Usage Examples block (Python/Scala/SQL) with the required packagesExamples
and gridxScalaCode raw-loader imports. Dropped the per-function Function
Categories bullet listings (redundant with per-function reference below) and
removed the back-link to packages/gridx from Next Steps.

Migrates Quick start (UDAF + DataSource), Schema contract, Tile-type
detection, Tile compression, Serving from object storage, Limits in
v0.4.0, and References from packages/pmtiles.mdx into the functions
page. Reconciles the two-paths table (functions page version kept, with
cross-links). Adds a prose note to the MapLibre embed snippet advising
users to pin the script version and add SRI attributes.

Co-authored-by: Isaac
Migrated Available Packages (RasterX/GridX/VectorX/PMTiles summaries
with images), Package Comparison table, Choosing the Right Package,
and Installation into api/overview.mdx. Unified Function Naming
Convention table to include PMTiles and quadbin prefixes. Removed
duplicate Registration prose and streamlined Choosing section from
per-package lists to inline paragraphs. Internal links in merged
content point to api/*-functions pages.

Co-authored-by: Isaac
… category; fix links

Remove docs/docs/packages/*.mdx (5 files); remove Packages category from
sidebars.js; rename 'API Reference' sidebar category to 'Functions'; update
footer title 'Packages' → 'Functions' in docusaurus.config.js and
homepage links in src/pages/index.js; repoint all packages/* hyperlinks in
14 .mdx files and 2 JS config files to their api/* equivalents. Build passes
with zero broken-link warnings.

Co-authored-by: Isaac
…rst_frombands_agg with representative outputs

Co-authored-by: Isaac
…epresentative outputs

Co-authored-by: Isaac
…ox,geom} with representative outputs

Add Triangulation and elevation section to vectorx-functions.mdx covering
gbx_st_triangulate, gbx_st_interpolateelevationbbox, and
gbx_st_interpolateelevationgeom with canonical signatures, per-param docs,
generator/LATERAL VIEW notes, and CodeFromTest blocks. Replace bare-string
placeholder _output constants with aligned single-column [BINARY] tables.

… 107-function set

Adds 42 functions missing from the old 65-function diagram. New cards:
Terrain Analysis (slope/aspect/hillshade/tri/tpi/roughness/color_relief/viewshed),
Spectral Indices (evi/savi/ndwi/nbr/index), Vector-Raster Bridge (rasterize/polygonize/
dtmfromgeoms/gridfrompoints), Quadbin Grid (quadbin_rastertogrid*), Web-Mercator Tile
Output (to_webmercator/tilexyz/xyzpyramid). Extended Operations with resample/threshold/
fillnodata/proximity/contour/band/buildoverviews/cog_convert. Extended Aggregators and
Accessors with _agg variants, sample, and histogram. Fixes count string 65->107.
Also converts pre-existing non-ASCII bytes (middle dots, em-dash) to XML entities so
the source file is fully ASCII per repo convention.

Co-authored-by: Isaac
…VectorX TIN, custom grids

Co-authored-by: Isaac
… no placeholder example outputs

Co-authored-by: Isaac
…rompoints, representative outputs for terrain/spectral/analysis ops

Co-authored-by: Isaac
…t match registered rst_ set

Co-authored-by: Isaac
…LLM release-notes check

Replaces the LLM-based release-notes-current check (timeout/leniency
failures) with a pure-stdlib deterministic script:
docs/scripts/check-release-notes-functions.py.

The new check diffs registered_functions.txt over QC_RANGE, collects
newly added gbx_* names, and verifies each appears in
docs/docs/beta-release-notes.mdx -- accepting full name, bare name
(strip gbx_), or brace-expansion shorthand (e.g.
gbx_rst_quadbin_rastertogrid{avg,count,...}). Git errors are
advisory (exit 0) so a bad range never hard-blocks a push.
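
The three accepted spellings can be illustrated with a small Python sketch. It is not the contents of check-release-notes-functions.py: the helper names are hypothetical, the git-diff step is omitted, and treating `...` inside braces as a placeholder to skip is an assumption.

```python
import re


def expand_braces(text):
    """Expand every prefix{a,b,...} shorthand in text into full names."""
    names = set()
    for match in re.finditer(r"(\w+)\{([^{}]*)\}", text):
        prefix, alternatives = match.group(1), match.group(2)
        for alt in alternatives.split(","):
            alt = alt.strip()
            if alt and alt != "...":
                names.add(prefix + alt)
    return names


def is_documented(func, notes):
    """True if func appears as a full name, bare name, or brace expansion."""
    bare = func[len("gbx_"):] if func.startswith("gbx_") else func
    expanded = expand_braces(notes)
    if func in expanded or bare in expanded:
        return True
    for name in (func, bare):
        # Match whole identifiers only, so st_asmvt does not match st_asmvt_pyramid.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])"
        if re.search(pattern, notes):
            return True
    return False
```

A caller would run `is_documented` for each newly added name and report the ones that return `False`.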

Co-authored-by: Isaac
Michael Johns added 3 commits May 29, 2026 14:36
…ng rasterx functions (not [BINARY])

Co-authored-by: Isaac
…how real array-of-struct values (polygonize, h3/quadbin rastertogrid)

Co-authored-by: Isaac
…ile struct (not [BINARY]); enforce ASCII table alignment

D4: classify rasterx functions as TILE-returning by scanning Scala
dataType RHS for tileDataType() calls or inline StructType with a
"raster" BinaryType field. Detected set size = 55. Assert that every
TILE function's _sql_example_output contains "<raster bytes>" and
does not render as a bare [BINARY] cell.

D5: for every _sql_example_output constant in all four SQL example
files (rasterx, gridx, vectorx, pmtiles), verify each ASCII table is
canonically aligned (per-column width = max stripped cell width; border
and row padding derived from that). Shared reformat_table() helper used
for both the check and to pre-fix the gridx file.
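
To make the alignment rule concrete, here is a Python sketch of a table normaliser. It is not the repository's `reformat_table()`: it assumes Spark-style tables with `+---+` border lines and `|`-delimited rows, no padding beyond the column width, and no `|` characters inside cells.

```python
def reformat_table(text):
    """Re-align an ASCII table so each column is as wide as its widest cell."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    rows = [
        [cell.strip() for cell in ln[1:-1].split("|")]
        for ln in lines
        if ln.startswith("|")
    ]
    if not rows:
        return text
    ncols = len(rows[0])
    if any(len(r) != ncols for r in rows):
        raise ValueError("rows have differing column counts")
    widths = [max(len(r[i]) for r in rows) for i in range(ncols)]
    border = "+" + "+".join("-" * w for w in widths) + "+"
    out = []
    row_iter = iter(rows)
    for ln in lines:
        if ln.startswith("|"):
            cells = next(row_iter)
            out.append("|" + "|".join(c.ljust(w) for c, w in zip(cells, widths)) + "|")
        else:
            out.append(border)   # any non-row line is treated as a border
    return "\n".join(out)


def is_aligned(text):
    """A table is canonical when reformatting leaves it unchanged."""
    return reformat_table(text) == text.strip()
```

Using one function for both the check and the fix means the two cannot disagree about what counts as aligned.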

Pre-fix: normalised alignment in gridx_functions_sql.py (19 tables);
vectorx and pmtiles were already aligned. rasterx was already normalised.

Co-authored-by: Isaac