beta(0.4.0): 45 new functions + PMTiles DataSource #33
Draft
mjohns-databricks wants to merge 178 commits into
added 30 commits
May 27, 2026 17:00
Bumps the canonical version (pom.xml, python __init__.py) and all
load-bearing references: JAR paths in 4 conftests (geobrix-0.4.0-jar-with-dependencies.jar),
docs site (docs/package.json), notebook bundle fallback, beta-release-notes
"Current version" banner, issue-template version line in support.mdx, and the
diagram-pill strings in rasterx-{function-categories,tile-structure}.py.
The committed PNGs still display v0.3.0 until resources/images/*.py is re-run;
that step is tracked on the 0.4.0 release checklist (see auto-memory
release_pill_regeneration.md).
Historical narrative references to v0.3.0 (release-notes "What's new" section,
"As of v0.3.0..." prose in api/overview, notebook READMEs, security.mdx) are
intentionally NOT changed.
Co-authored-by: Isaac
Initial TDD step for Wave 6: tests for the native Scala PMTiles v3 encoder. Asserts header magic+version, addressed_tiles_count field layout, Hilbert id determinism+uniqueness, monotonicity across zooms, tile-data round-trip, and RLE deduplication. Fails to compile until PMTilesV3Encoder lands in Task 2. Co-authored-by: Isaac
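The Hilbert-id properties these tests target (determinism, uniqueness, monotonicity across zooms) can be illustrated with a small Python sketch. It is not the Scala encoder: `hilbert_index` and `tile_id` are hypothetical names, and the curve is the textbook xy2d construction, which is assumed to match the orientation used by the PMTiles v3 spec.

```python
def hilbert_index(z: int, x: int, y: int) -> int:
    """Position of tile (x, y) along the Hilbert curve of a 2^z x 2^z grid."""
    n = 1 << z
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def tile_id(z: int, x: int, y: int) -> int:
    """Cumulative TileID: tiles in all lower zooms, plus the Hilbert index at z."""
    side = 1 << z
    if not (0 <= x < side and 0 <= y < side):
        raise ValueError(f"tile ({z}, {x}, {y}) is outside the zoom-{z} grid")
    return (4 ** z - 1) // 3 + hilbert_index(z, x, y)
```

Because zoom z owns the id range starting at (4^z - 1)/3, ids from a lower zoom always sort before ids from a higher zoom, which is the monotonicity the tests check.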
Pure-Scala CARTO quadbin v0 implementation in com.databricks.labs.gbx.gridx.grid.Quadbin, matching the reference quadbin-py bit layout (HEADER bit 62, mode bits 59..61, resolution bits 52..58, 52-bit Morton-interleaved x/y in bits 0..51, FOOTER tail-padding). Co-authored-by: Isaac
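A rough Python sketch of the bit layout described above, for readers who want to check cell ids by hand. It is not the Scala `Quadbin` implementation; `tile_to_cell` and `cell_to_tile` are hypothetical names, and two details are assumptions taken from the quadbin-py reference rather than from this commit: the mode value for cells is 1, and x occupies the even Morton bits.

```python
HEADER = 1 << 62           # header bit 62
MODE_CELL = 1 << 59        # assumption: mode value 1 in bits 59..61
FOOTER = (1 << 52) - 1     # 52 one-bits used as tail padding
MAX_RESOLUTION = 26        # 2 * 26 = 52 Morton bits


def tile_to_cell(z: int, x: int, y: int) -> int:
    """Pack a web-mercator tile (z, x, y) into a 64-bit quadbin cell id."""
    if not 0 <= z <= MAX_RESOLUTION:
        raise ValueError("resolution must be in 0..26")
    morton = 0
    for i in range(z):
        morton |= ((x >> i) & 1) << (2 * i)        # x on even bits (assumed)
        morton |= ((y >> i) & 1) << (2 * i + 1)    # y on odd bits (assumed)
    # The Morton code sits at the top of the 52-bit field; unused low bits are 1s.
    return (HEADER | MODE_CELL | (z << 52)
            | (morton << (52 - 2 * z)) | (FOOTER >> (2 * z)))


def cell_to_tile(cell: int):
    """Inverse of tile_to_cell: recover (z, x, y) from a cell id."""
    z = (cell >> 52) & 0x7F                        # resolution bits 52..58
    morton = (cell & FOOTER) >> (52 - 2 * z)
    x = y = 0
    for i in range(z):
        x |= ((morton >> (2 * i)) & 1) << i
        y |= ((morton >> (2 * i + 1)) & 1) << i
    return z, x, y
```

Under these assumptions the zoom-0 cell is `0x480fffffffffffff`, and a round trip through both functions returns the original tile.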
Native Scala PMTiles v3 binary encoder for Wave 6. Produces a single
PMTile blob with the spec's five-section layout (header + root
directory + JSON metadata + leaf directories + tile data); no GDAL/OGR
dependency for the container itself — tile bytes are passed through
verbatim.
- PMTilesEntry: (tile_id, offset, length, run_length) tuple per spec § 4.1.
- PMTilesV3Encoder:
- hilbertId(z, x, y): cumulative TileID = (4^z - 1)/3 + xy2d Hilbert
index; matches spec table for z=0..2 exactly.
- encode(): sorts tiles by Hilbert id, dedups identical content
(SHA-256-keyed), RLE-merges consecutive identical-content runs.
- encodeDirectory(): five-part varint encoding per spec § 4.2
(count, delta tile_ids, run_lengths, lengths, offsets-with-0-for-
contiguous).
- 127-byte header, all uint64s little-endian, addressed_tiles_count
at offset 72, tile_data_offset at 56 (per spec § 3.1 layout table).
- Errors out clearly if root directory would exceed 16,257 bytes —
points caller to the DataSource writer path.
- Internal & tile compression default to none (0x01) for v0.4.0;
callers may override the tile-compression byte to advertise
already-applied gzip/brotli/zstd.
All 7 unit tests green (PMTilesV3EncoderTest).
Co-authored-by: Isaac
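The five-part directory encoding listed above can be sketched in Python as follows. This is an illustration, not the Scala `encodeDirectory`: `Entry`, `varint`, and `encode_directory` are hypothetical names, no internal compression is applied (matching the v0.4.0 default), and the convention of storing non-contiguous offsets as `offset + 1` is taken from my reading of the PMTiles v3 spec.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    tile_id: int
    offset: int
    length: int
    run_length: int


def varint(n: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def encode_directory(entries: List[Entry]) -> bytes:
    """Encode entries (already sorted by tile_id) as five consecutive varint runs."""
    buf = bytearray(varint(len(entries)))            # 1. entry count
    last_id = 0
    for e in entries:                                # 2. delta-encoded tile ids
        buf += varint(e.tile_id - last_id)
        last_id = e.tile_id
    for e in entries:                                # 3. run lengths
        buf += varint(e.run_length)
    for e in entries:                                # 4. byte lengths
        buf += varint(e.length)
    for i, e in enumerate(entries):                  # 5. offsets
        prev = entries[i - 1] if i > 0 else None
        contiguous = prev is not None and e.offset == prev.offset + prev.length
        buf += varint(0 if contiguous else e.offset + 1)
    return bytes(buf)
```

Writing 0 for a contiguous offset is what keeps clustered archives small: a run of back-to-back tiles costs one byte per offset regardless of file size.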
Mutable aggregation buffer for the gbx_pmtiles_agg TypedImperative
aggregator. Holds an ArrayBuffer[(z, x, y, bytes)] plus optional
metadata JSON; serialize/deserialize use length-prefixed payloads so
the buffer ships cleanly between executors during the merge phase.
A 100 MiB per-buffer payload cap guards against runaway pipelines —
the UDAF path is limited by Spark's 2 GiB cell size, so the error
explicitly points users to the .write.format("pmtiles") DataSource
for larger pyramids.
Co-authored-by: Isaac
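For illustration only, a length-prefixed buffer along these lines could look like the Python sketch below. The wire format here (little-endian int32 prefixes, -1 for absent metadata) and the names `serialize` and `deserialize` are assumptions made for the example; the actual Scala buffer layout is not shown in this commit message.

```python
import struct
from typing import List, Optional, Tuple

Tile = Tuple[int, int, int, bytes]          # (z, x, y, bytes)
MAX_PAYLOAD_BYTES = 100 * 1024 * 1024       # 100 MiB cap from the commit message


def serialize(tiles: List[Tile], metadata_json: Optional[str] = None,
              max_payload_bytes: int = MAX_PAYLOAD_BYTES) -> bytes:
    """Flatten the buffer so it can be shipped between executors."""
    total = sum(len(data) for _, _, _, data in tiles)
    if total > max_payload_bytes:
        raise ValueError(
            "tile payload exceeds the per-buffer cap; "
            'use .write.format("pmtiles") for larger pyramids')
    out = bytearray()
    if metadata_json is None:
        out += struct.pack("<i", -1)                 # -1 marks "no metadata"
    else:
        meta = metadata_json.encode("utf-8")
        out += struct.pack("<i", len(meta)) + meta
    out += struct.pack("<i", len(tiles))
    for z, x, y, data in tiles:
        out += struct.pack("<iiii", z, x, y, len(data)) + data
    return bytes(out)


def deserialize(blob: bytes) -> Tuple[List[Tile], Optional[str]]:
    """Inverse of serialize: walk the length prefixes to rebuild the buffer."""
    pos = 0
    (meta_len,) = struct.unpack_from("<i", blob, pos)
    pos += 4
    metadata = None
    if meta_len >= 0:
        metadata = blob[pos:pos + meta_len].decode("utf-8")
        pos += meta_len
    (count,) = struct.unpack_from("<i", blob, pos)
    pos += 4
    tiles = []
    for _ in range(count):
        z, x, y, n = struct.unpack_from("<iiii", blob, pos)
        pos += 16
        tiles.append((z, x, y, blob[pos:pos + n]))
        pos += n
    return tiles, metadata
```

With a format like this, merging two partial buffers reduces to deserializing both and concatenating their tile lists before the final encode.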
One Spark expression case-class per CARTO quadbin operation, each following the InvokedExpression + WithExpressionInfo pattern shared with BNG: Quadbin_PointAsCell, Quadbin_AsWKB, Quadbin_Centroid, Quadbin_Resolution, Quadbin_Polyfill, Quadbin_KRing, Quadbin_Tessellate, Quadbin_CellUnion, Quadbin_Distance. EWKB output uses SRID=4326. Co-authored-by: Isaac
TypedImperativeAggregate that materializes (z, x, y, bytes) rows into a single PMTile v3 BINARY blob via PMTilesV3Encoder.
- PMTiles_Agg: 4-or-5-arity expression (bytes, z, x, y, [metadata_json]). Overrides children and withNewChildrenInternal manually since there is no Quaternary/QuintaryLike trait in Catalyst.
- Auto-detects tile_type from the first non-null payload's magic bytes: PNG (89 50 4E 47), JPEG (FF D8), WebP (RIFF...WEBP), otherwise MVT. Magic-byte sniffer is private[pmtiles] for unit tests.
- functions.scala: register() + Scala API (pmtiles_agg with 4-arg, 5-arg Column, and String-literal-metadata overloads).
- Empty group returns a valid header-only PMTile (not null), so downstream consumers always see well-formed bytes.
All 7 e2e tests green (PMTiles_AggTest), including multi-partition shuffle-merge through a repartition(4) → agg path.
Co-authored-by: Isaac
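The magic-byte detection described above is simple enough to sketch in a few lines of Python. This is not the Scala sniffer: `detect_tile_type` is a hypothetical name, and the numeric codes other than PNG=2 are taken from my reading of the PMTiles v3 tile-type table, so treat them as assumptions.

```python
TILE_TYPE_MVT = 1    # fallback when no raster signature matches
TILE_TYPE_PNG = 2
TILE_TYPE_JPEG = 3
TILE_TYPE_WEBP = 4


def detect_tile_type(payload: bytes) -> int:
    """Guess the PMTiles tile_type code from the leading bytes of one tile."""
    if payload.startswith(b"\x89PNG"):                        # 89 50 4E 47
        return TILE_TYPE_PNG
    if payload.startswith(b"\xff\xd8"):                       # JPEG start-of-image
        return TILE_TYPE_JPEG
    if payload[:4] == b"RIFF" and payload[8:12] == b"WEBP":   # RIFF....WEBP
        return TILE_TYPE_WEBP
    return TILE_TYPE_MVT
```

MVT is the fallback rather than a positive match because protobuf payloads have no fixed signature.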
Add gridx.quadbin.functions.register() (idempotent per session) and the typed Column wrappers (quadbin_pointascell, _aswkb, _centroid, _resolution, _polyfill, _kring, _tessellate, _cellunion, _distance) with int-scalar overloads for ergonomics. Wire RegisterBatch to dispatch "gridx.quadbin" and include it in "all". Co-authored-by: Isaac
Register quadbin SQL functions with Spark and exercise each via the Column API: pointascell + resolution round-trip, aswkb / centroid EWKB parseback (SRID=4326), polyfill cells-at-z, kring cardinality, tessellate chips, cellunion to MultiPolygon, and zero / one Chebyshev distance assertions. Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Mirror the Scala API as databricks.labs.gbx.gridx.quadbin.functions with one ColLike wrapper per function and an idempotent register() that uses register_ds + functions=gridx.quadbin. Co-authored-by: Isaac
Co-authored-by: Isaac
PySpark infers schema from Python ints as LongType by default, which caused dispatch to InvokedExpression to fail to find an Int-arg eval method (verified with PyPI quadbin cross-check). Add Long-arg eval overloads (delegating to .toInt) for pointascell, polyfill, kring, and tessellate. Co-authored-by: Isaac
Spark V2 DataSource for streaming larger pyramids to a single PMTile
file. Replaces the in-memory `gbx_pmtiles_agg` UDAF path when the
pyramid exceeds Spark's 2 GiB cell limit.
Wiring:
- PMTiles_DataSource: TableProvider + DataSourceRegister
(shortName "pmtiles"); registered via META-INF/services.
- PMTiles_Table: capabilities BATCH_READ + BATCH_WRITE + TRUNCATE.
BATCH_READ is declared only so the read code path lands in
newScanBuilder where we throw a descriptive
"Reading PMTiles archives is not supported in GeoBrix 0.4.0"
error rather than letting Spark surface a vague "is not a valid
Spark SQL Data Source".
- PMTiles_WriteBuilder implements SupportsTruncate.
Schema validation:
- Required write schema: exactly (z INT, x INT, y INT, bytes BINARY).
- PMTiles_DataSource.validateWriteSchema mirrors the GDAL writer's
exact-schema policy (gdal_writer_schema.md memory). Both missing
and extra columns plus type mismatches surface a clear error
naming the canonical schema.
Partitioned commit protocol:
- Per-task PMTiles_RowWriter:
1. Streams tile bytes to {parent}/_part_{partId}_{taskId}.tdata.
2. Maintains an in-task SHA-256 content map so duplicate tiles
in the same partition share one blob; consecutive identical-
content TileIDs RLE-merge into one entry with run_length > 1.
3. On commit, writes a sidecar .entries file with
(tileId, offsetWithinPart, length, runLength) tuples and
returns a PMTiles_WriterMsg carrying the partitionId, basenames,
and cumulative byte count.
4. On abort, deletes both scratch files.
- Driver-side PMTiles_BatchWrite.commit:
1. Sorts committed messages by partitionId for deterministic
layout, computes cumulative partition offsets.
2. Reads each .entries file, rebases offsets into the global
frame, sorts the merged entry list by tileId, then encodes
the root directory via PMTilesV3Encoder.encodeDirectory.
3. Streams the final file: header(127) || root_dir || metadata ||
(leaf dirs empty in v0.4.0) || concatenated .tdata segments.
4. Cleans up scratch.
- abort(): deletes _part_* and .entries scratch files.
- Errors out cleanly if the global root directory would exceed
16,257 bytes (spec § 4) — v0.4.0 does not yet emit leaf dirs.
- Tile-type detection: prefers explicit `tileType` option; else
sniffs the first non-empty partition's first bytes via the
PMTiles_Agg.detectTileType magic-byte matcher.
All 7 DataSource tests green:
- 100 tiles × 4 partitions → single output file with cleaned scratch
- single-partition write
- missing-column + wrong-type + extra-column schema rejections
- metadataJson option round-trip
- read path raises our own "not supported" error (asserts the
specific message, not just "not supported" generic text).
Co-authored-by: Isaac
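The driver-side merge in the commit protocol above comes down to rebasing per-partition offsets into one global frame. A minimal Python sketch, assuming the sidecar entries have already been read into memory (`WriterMsg`, `Entry`, and `merge_committed` are hypothetical names, not the Scala classes):

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    tile_id: int
    offset: int        # offset within the partition's .tdata segment
    length: int
    run_length: int


class WriterMsg(NamedTuple):
    partition_id: int
    byte_count: int    # total bytes this task wrote to its .tdata file
    entries: List[Entry]


def merge_committed(msgs: List[WriterMsg]) -> Tuple[List[Entry], int]:
    """Rebase per-partition offsets and return (entries sorted by tile_id, total bytes)."""
    merged: List[Entry] = []
    base = 0
    # Partition order fixes the physical layout of the concatenated tile data.
    for msg in sorted(msgs, key=lambda m: m.partition_id):
        for e in msg.entries:
            merged.append(e._replace(offset=base + e.offset))
        base += msg.byte_count
    # Directory order is by tile id, independent of physical layout.
    merged.sort(key=lambda e: e.tile_id)
    return merged, base
```

The returned total doubles as the tile-data length for the header. Since deduplication happens per task, identical tiles written by different partitions still occupy separate byte ranges after the merge.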
Add quadbin_*_sql_example() returning SQL strings for each quadbin function, register gbx_quadbin_* in registered_functions.txt (alphabetical), regenerate function-info.json (98/98 functions covered), wire the docs conftest to register Quadbin alongside BNG/RasterX/VectorX, and teach generate-function-info.py to scan the quadbin_ prefix from the gridx module (and to look up path_config from docs/tests/python on sys.path). Co-authored-by: Isaac
The OGR RegisterAll() and GetDriverByName() native calls require libgdalalljni.so to be System.load'd on the executor JVM. RasterX does this via GDALManager.loadSharedObjects when its register(spark) runs, but VectorX has no equivalent code path yet — and the load has to happen on the executor (where eval runs), not just on the driver. Adds an idempotent native-loader to MvtWriter.encode so st_asmvt works in docs-test sessions that haven't initialized GDAL via rasterx. Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Add a "Quadbin (CARTO v0)" section to gridx.mdx parallel to the BNG section, with function categories and SQL examples. Add a "What's new in v0.4.0" bullet to beta-release-notes.mdx describing the new gbx_quadbin_* family. Co-authored-by: Isaac
Python entry point (databricks.labs.gbx.pmtiles.functions):
- register(spark): wires into the existing register_ds DataSource
plumbing — adds a new "pmtiles" branch to RegisterBatch alongside
"gridx.bng", "vectorx.jts.legacy", "rasterx", "all" so the same
`spark.read.format("register_ds").option("functions", ...)` pattern
works.
- pmtiles_agg(bytes, z, x, y, metadata_json=None): UDAF wrapper.
metadata_json accepts a Column or a bare Python str (auto-wrapped via
f.lit). It defaults to "{}": passing None yields that default, and a
str is treated as a literal, NOT a column reference (we don't follow
the pyspark string-as-col-ref convention here because it made the
"{}" default confusing).
Scala coercion fix in PMTiles_Agg.update:
- PySpark's createDataFrame infers Python int as LongType. Previously
the aggregator did .asInstanceOf[Int] on the z/x/y values, which
threw ClassCastException for LongType columns. New
PMTiles_Agg.toIntCoerce helper accepts Int, Long, java.lang.Integer,
java.lang.Long; throws a clear error for any other type or null.
- DataSource write schema still requires IntegerType strictly (it's
a write-time contract, not a read-time coercion).
Python tests (6, all passing):
- registration via register() + SHOW USER FUNCTIONS lookup
- pmtiles_agg returns valid PMTile blob with correct magic + count
- metadata JSON round-trips through the encoded archive
- PNG magic bytes auto-detected into tile_type=2
- .write.format("pmtiles").mode("overwrite").save(path) writes a
single file with cleaned scratch
- read path raises our "Reading PMTiles archives is not supported"
error rather than class-not-found.
Co-authored-by: Isaac
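A header check like the "correct magic + count" assertion above can be done with the standard library alone. The sketch below is a hypothetical helper, not the test code in this PR. It relies on the 127-byte little-endian header with tile_data_offset at byte 56 and addressed_tiles_count at byte 72 (both stated earlier in this PR), and assumes the tile_type byte sits at offset 99 per my reading of the PMTiles v3 spec.

```python
import struct

HEADER_LENGTH = 127


def read_pmtiles_header(blob: bytes) -> dict:
    """Pull a few fields out of a PMTiles v3 header, validating magic and version."""
    if len(blob) < HEADER_LENGTH:
        raise ValueError("blob is shorter than the 127-byte PMTiles header")
    if blob[:7] != b"PMTiles":
        raise ValueError("missing PMTiles magic")
    version = blob[7]
    if version != 3:
        raise ValueError(f"unsupported PMTiles version {version}")
    (tile_data_offset,) = struct.unpack_from("<Q", blob, 56)
    (addressed_tiles,) = struct.unpack_from("<Q", blob, 72)
    return {
        "version": version,
        "tile_data_offset": tile_data_offset,
        "addressed_tiles_count": addressed_tiles,
        "tile_type": blob[99],    # assumed offset; 2 = PNG
    }
```

In a test, the result of `pmtiles_agg` collected to the driver can be passed straight to this helper and compared against the input row count.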
9 quadbin grid-math functions in new gridx/quadbin/ subpackage plus gridx/grid/Quadbin.scala cell-math helper. All 17 Scala tests + 9 Python tests + 9 function-info tests passing; Quadbin.scala coverage 96.4%; PyPI cross-check 5/5 match vs quadbin-py. Wave 3 agent corrected the CARTO quadbin v0 bit layout against the canonical Python reference (HEADER bit 62 / mode bits 59..61 / res bits 52..58 / 52-bit Morton). Scope doc at input/scoping-quadbin-spatial-binning.md should be updated to match (follow-up). Also added Long overloads for Int args (PySpark sends Python ints as LongType), wired RegisterBatch dispatch, and fixed a pre-existing path_config import in generate-function-info.py. Co-authored-by: Isaac
…ase notes
- docs/tests/python/api/pmtiles_functions_sql.py: SQL examples for
gbx_pmtiles_agg (5-arg with metadata, 4-arg default-metadata form).
Tests in test_pmtiles_functions_sql.py exercise them against the live
UDAF and assert the resulting PMTile v3 magic + addressed-tiles count.
- docs/tests-function-info/registered_functions.txt: add gbx_pmtiles_agg
(DataSource format string `pmtiles` is NOT listed — it's not a SQL
function).
- docs/scripts/generate-function-info.py: add PMTILES_MODULE + the
("pmtiles", "gbx_pmtiles_") package prefix so the generator picks up
the new examples and writes them into function-info.json. The
resulting JSON file is also committed; coverage tests now pass.
- docs/docs/packages/pmtiles.mdx: new standalone docs page. Covers when
to pick UDAF vs DataSource, register/save patterns in Python+SQL+Scala,
the exact write schema, tile-type detection table, tile-compression
override, CORS/MapLibre snippet for serving from object storage, and
the v0.4.0 limits (no leaf dirs, no read path, no cross-task dedup).
- docs/docs/beta-release-notes.mdx: insert "What's new in v0.4.0"
section before v0.3.0 with the canonical PMTiles bullet from the
plan.
Function-info coverage tests green (9/9); docs-SQL tests green (3/3).
Co-authored-by: Isaac
added 18 commits
May 29, 2026 10:58
Co-authored-by: Isaac
Migrated Overview, Key Features, Tile payload, VRT Python pixel functions, and Usage Examples (Python/Scala/SQL) from packages/rasterx.mdx into api/rasterx-functions.mdx. Added the RasterX.png banner and rasterx-function-categories.png image with the category listing. Removed the stale cross-reference to the package page from the Pixel ops section. Dropped the back-link to packages/rasterx from Next Steps (replaced with readers/gdal link).
Migrated Overview (BNG + Quadbin descriptions, import paths, registration note), Key Features, BNG Structure, BNG Grid Reference Format (Standard Format + Precision Levels table with examples), Major Grid Squares, and Quadbin concept section from packages/gridx.mdx into the top of api/gridx-functions.mdx. Added the GridX.png banner. Carried over the Usage Examples block (Python/Scala/SQL) with the required packagesExamples and gridxScalaCode raw-loader imports. Dropped the per-function Function Categories bullet listings (redundant with per-function reference below) and removed the back-link to packages/gridx from Next Steps.
Migrates Quick start (UDAF + DataSource), Schema contract, Tile-type detection, Tile compression, Serving from object storage, Limits in v0.4.0, and References from packages/pmtiles.mdx into the functions page. Reconciles the two-paths table (functions page version kept, with cross-links). Adds a prose note to the MapLibre embed snippet advising users to pin the script version and add SRI attributes. Co-authored-by: Isaac
Migrated Available Packages (RasterX/GridX/VectorX/PMTiles summaries with images), Package Comparison table, Choosing the Right Package, and Installation into api/overview.mdx. Unified Function Naming Convention table to include PMTiles and quadbin prefixes. Removed duplicate Registration prose and streamlined Choosing section from per-package lists to inline paragraphs. Internal links in merged content point to api/*-functions pages. Co-authored-by: Isaac
… category; fix links
Remove docs/docs/packages/*.mdx (5 files); remove Packages category from sidebars.js; rename 'API Reference' sidebar category to 'Functions'; update footer title 'Packages' → 'Functions' in docusaurus.config.js and homepage links in src/pages/index.js; repoint all packages/* hyperlinks in 14 .mdx files and 2 JS config files to their api/* equivalents. Build passes with zero broken-link warnings.
Co-authored-by: Isaac
Co-authored-by: Isaac
…rst_frombands_agg with representative outputs Co-authored-by: Isaac
…epresentative outputs Co-authored-by: Isaac
…ox,geom} with representative outputs Add Triangulation and elevation section to vectorx-functions.mdx covering gbx_st_triangulate, gbx_st_interpolateelevationbbox, and gbx_st_interpolateelevationgeom with canonical signatures, per-param docs, generator/LATERAL VIEW notes, and CodeFromTest blocks. Replace bare-string placeholder _output constants with aligned single-column [BINARY] tables.
… 107-function set Adds 42 functions missing from the old 65-function diagram. New cards: Terrain Analysis (slope/aspect/hillshade/tri/tpi/roughness/color_relief/viewshed), Spectral Indices (evi/savi/ndwi/nbr/index), Vector-Raster Bridge (rasterize/polygonize/ dtmfromgeoms/gridfrompoints), Quadbin Grid (quadbin_rastertogrid*), Web-Mercator Tile Output (to_webmercator/tilexyz/xyzpyramid). Extended Operations with resample/threshold/ fillnodata/proximity/contour/band/buildoverviews/cog_convert. Extended Aggregators and Accessors with _agg variants, sample, and histogram. Fixes count string 65->107. Also converts pre-existing non-ASCII bytes (middle dots, em-dash) to XML entities so the source file is fully ASCII per repo convention. Co-authored-by: Isaac
…VectorX TIN, custom grids Co-authored-by: Isaac
… no placeholder example outputs Co-authored-by: Isaac
…rompoints, representative outputs for terrain/spectral/analysis ops Co-authored-by: Isaac
…t match registered rst_ set Co-authored-by: Isaac
…LLM release-notes check
Replaces the LLM-based release-notes-current check (timeout/leniency
failures) with a pure-stdlib deterministic script:
docs/scripts/check-release-notes-functions.py.
The new check diffs registered_functions.txt over QC_RANGE, collects
newly added gbx_* names, and verifies each appears in
docs/docs/beta-release-notes.mdx -- accepting full name, bare name
(strip gbx_), or brace-expansion shorthand (e.g.
gbx_rst_quadbin_rastertogrid{avg,count,...}). Git errors are
advisory (exit 0) so a bad range never hard-blocks a push.
Co-authored-by: Isaac
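The name-matching rule described above (full name, bare name, or brace-expansion shorthand) can be sketched as below. This is a hypothetical reimplementation for illustration, not the contents of check-release-notes-functions.py, and it leaves out the git diff over QC_RANGE that produces the list of newly added names.

```python
import re
from typing import List, Set

_TOKEN = re.compile(r"[A-Za-z0-9_]+")
_BRACE = re.compile(r"([A-Za-z0-9_]+)\{([^{}]*)\}")


def mentioned_names(notes: str) -> Set[str]:
    """Collect every identifier in the notes, expanding prefix{a,b} shorthand."""
    names = set(_TOKEN.findall(notes))
    for match in _BRACE.finditer(notes):
        prefix = match.group(1)
        for suffix in match.group(2).split(","):
            suffix = suffix.strip()
            if _TOKEN.fullmatch(suffix):      # skips placeholders such as "..."
                names.add(prefix + suffix)
    return names


def missing_from_notes(new_functions: List[str], notes: str) -> List[str]:
    """Return the newly registered gbx_* names that the notes never mention."""
    names = mentioned_names(notes)
    missing = []
    for fn in new_functions:
        bare = fn[len("gbx_"):] if fn.startswith("gbx_") else fn
        if fn not in names and bare not in names:
            missing.append(fn)
    return missing
```

Matching on whole identifier tokens rather than substrings avoids a false pass when one function name is a prefix of another.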
…ram for slides Co-authored-by: Isaac
added 3 commits
May 29, 2026 14:36
…ng rasterx functions (not [BINARY]) Co-authored-by: Isaac
…how real array-of-struct values (polygonize, h3/quadbin rastertogrid) Co-authored-by: Isaac
…ile struct (not [BINARY]); enforce ASCII table alignment
D4: classify rasterx functions as TILE-returning by scanning Scala dataType RHS for tileDataType() calls or inline StructType with a "raster" BinaryType field. Detected set size = 55. Assert that every TILE function's _sql_example_output contains "<raster bytes>" and does not render as a bare [BINARY] cell.
D5: for every _sql_example_output constant in all four SQL example files (rasterx, gridx, vectorx, pmtiles), verify each ASCII table is canonically aligned (per-column width = max stripped cell width; border and row padding derived from that). Shared reformat_table() helper used for both the check and to pre-fix the gridx file.
Pre-fix: normalised alignment in gridx_functions_sql.py (19 tables); vectorx and pmtiles were already aligned. rasterx was already normalised.
Co-authored-by: Isaac
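The D5 alignment rule (per-column width = max stripped cell width) can be sketched as follows. This is an illustrative stand-in for the shared helper, not its actual code; it assumes `+---+`-style border lines, `|`-delimited left-justified rows, and no `|` characters inside cells.

```python
from typing import List


def reformat_table(lines: List[str]) -> List[str]:
    """Rebuild an ASCII table so every column is exactly as wide as its widest cell."""
    rows = [
        [cell.strip() for cell in line.strip()[1:-1].split("|")]
        for line in lines
        if line.lstrip().startswith("|")
    ]
    if not rows:
        return list(lines)
    ncols = len(rows[0])
    if any(len(row) != ncols for row in rows):
        raise ValueError("rows have differing column counts")
    widths = [max(len(row[i]) for row in rows) for i in range(ncols)]
    border = "+" + "+".join("-" * w for w in widths) + "+"
    out = []
    remaining = iter(rows)
    for line in lines:
        if line.lstrip().startswith("|"):
            cells = next(remaining)
            out.append("|" + "|".join(c.ljust(w) for c, w in zip(cells, widths)) + "|")
        else:
            out.append(border)    # every non-row line is assumed to be a border
    return out


def is_aligned(lines: List[str]) -> bool:
    """A table passes the check when reformatting it would change nothing."""
    return reformat_table(lines) == list(lines)
```

Using one function for both the check and the fix means a table that fails can always be repaired by writing back the reformatted lines.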
Summary
GeoBrix v0.4.0 — 45 new functions + 1 new DataSource across 12 implementation waves merged between 2026-05-27 and 2026-05-28. This PR opens as DRAFT for review while the wave-by-wave commit history is fresh; mark ready when the diff has been audited.
What's new
See docs/docs/beta-release-notes.mdx § What's new in v0.4.0 for the full per-feature changelog (12 bullets). High-level groupings:
VectorX (new expression surface): first expression-level functions in VectorX:
- gbx_st_asmvt: UDAF aggregating features into Mapbox Vector Tile (MVT) protobuf
- gbx_st_asmvt_pyramid: generator, feature → many (z, x, y, mvt_bytes) rows
GridX (new quadbin subpackage):
- gridx/quadbin/ (pointascell, aswkb, centroid, resolution, polyfill, kring, tessellate, cellunion, distance): CARTO quadbin v0 spec, 64-bit Long cell IDs aligned with the web-mercator XYZ tile grid.
RasterX (29 new functions):
- gbx_rst_rasterize, gbx_rst_polygonize
- to_webmercator, tilexyz, xyzpyramid
- slope, aspect, hillshade, tri, tpi, roughness, color_relief
- evi, savi, ndwi, nbr, plus generic index dispatcher
- resample/_to_size/_to_res, gridfrompoints + _agg
- fillnodata, sample, setsrid, histogram, threshold, buildoverviews, band
- cog_convert, proximity, contour, viewshed
PMTiles (new top-level package):
- gbx_pmtiles_agg UDAF: returns BINARY PMTile v3 blob
- .write.format("pmtiles") DataSource: streams larger pyramids to file via partitioned commit protocol
Test plan
- build main) green on beta/0.4.0 (latest run)
- maven-keys.list unchanged
- .cursor/rules/user-facing-docs-voice.mdc rule)
- META-INF/services/...DataSourceRegister registrations
This pull request and its description were written by Isaac.