
feat: unified chunk grid with rectilinear chunk/shard support #3802

Open
maxrjones wants to merge 123 commits into zarr-developers:main from maxrjones:poc/unified-chunk-grid



@maxrjones maxrjones commented Mar 21, 2026

Summary

This PR contains an alternative implementation of the rectilinear chunk grid extension, building on the work in #3534 (RLE helpers, validation logic, and test cases were directly adopted). While the core feature of variable-sized chunks is the same, the internal architecture differs in ways that impact extensibility, performance, and release safety.

I appreciate the patience of those who contributed to #3534, and everyone who's been waiting on this feature. I know it's frustrating to see a new PR after #3534 was so close. That PR provided fundamental components, and I hope people will see the value here. I really believe it is worth the churn for the following reasons:

Key differences from #3534

  1. Extensibility. Each dimension is represented by a type implementing the DimensionGrid protocol (FixedDimension, VaryingDimension). Adding a new dimension type (e.g. TiledDimension for periodic patterns like days-per-month) requires implementing that protocol — no changes to indexing, codecs, or the ChunkGrid class. A prototype was built to verify this.
  2. Performance. The indexing pipeline queries each dimension independently with scalar calls rather than constructing N-d coordinate tuples per chunk lookup. This avoids allocation overhead in the inner loop of every indexer. VaryingDimension uses precomputed prefix sums for O(log n) lookups via binary search. See https://github.com/maxrjones/zarr-chunk-grid-tests for a performance comparison.
  3. Feature flag. Rectilinear chunk grids are gated behind zarr.config.set({'array.rectilinear_chunks': True}) (or ZARR_ARRAY__RECTILINEAR_CHUNKS=True) and are disabled by default. This gives downstream libraries time to adapt and gives us room to refine the API before it is finalized.
  4. Rectilinear sharding. Shard boundaries can be rectilinear while inner chunks remain regular, with validation that each shard edge is divisible by the inner chunk size. This is tested end-to-end and documented in the user guide.
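
To make points 1 and 2 concrete, here is a minimal sketch of what a per-dimension protocol with prefix-sum lookups could look like. It is not the code in this PR: the method names (`index_to_chunk`, `chunk_offset`, `nchunks`) and the constructor arguments are assumptions chosen for illustration, and only the class names DimensionGrid, FixedDimension, and VaryingDimension come from the description above.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from itertools import accumulate
from typing import Protocol


class DimensionGrid(Protocol):
    """Hypothetical per-dimension interface (method names are assumptions)."""

    @property
    def nchunks(self) -> int: ...

    def index_to_chunk(self, index: int) -> int: ...

    def chunk_offset(self, chunk: int) -> int: ...


@dataclass(frozen=True)
class FixedDimension:
    size: int    # regular chunk edge length
    extent: int  # array length along this dimension

    @property
    def nchunks(self) -> int:
        return -(-self.extent // self.size)  # ceiling division

    def index_to_chunk(self, index: int) -> int:
        if not 0 <= index < self.extent:
            raise IndexError(index)
        return index // self.size  # O(1) scalar arithmetic

    def chunk_offset(self, chunk: int) -> int:
        return chunk * self.size


@dataclass(frozen=True)
class VaryingDimension:
    edges: tuple[int, ...]  # explicit chunk edge lengths
    _offsets: tuple[int, ...] = field(init=False, repr=False, compare=False)

    def __post_init__(self) -> None:
        # Prefix sums, computed once: (0, e0, e0+e1, ...).
        object.__setattr__(self, "_offsets", (0, *accumulate(self.edges)))

    @property
    def nchunks(self) -> int:
        return len(self.edges)

    def index_to_chunk(self, index: int) -> int:
        if not 0 <= index < self._offsets[-1]:
            raise IndexError(index)
        # Binary search over the prefix sums: O(log n) per lookup.
        return bisect_right(self._offsets, index) - 1

    def chunk_offset(self, chunk: int) -> int:
        return self._offsets[chunk]


# Each dimension is queried independently with a scalar call.
months = VaryingDimension(edges=(31, 28, 31))
columns = FixedDimension(size=100, extent=250)
chunk_coord = (months.index_to_chunk(58), columns.index_to_chunk(199))
```

In this sketch a new dimension type only has to provide the same three members, which is the extensibility argument made in point 1.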

Design document: docs/design/chunk-grid.md covers the full design, rationale, and a suggested PR sequence for splitting this into reviewable increments, if needed.

Downstream POCs (all passing):

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

maxrjones and others added 2 commits March 30, 2026 09:47
…metadata (#7)

* chore: simplify sharding codec validation against varying chunk grid metadata

* test: restore test strength
d-v-b and others added 2 commits March 30, 2026 10:45
…chunk grid (#8)

* refactor: allow regular-style chunk grid declaration for rectilinear chunk grid

The rectilinear chunk grid spec allows bare integers per dimension (meaning
"regular step size"), distinct from explicit single-element edge lists. This
commit widens `RectilinearChunkGrid.chunk_shapes` to `tuple[int | tuple[int, ...], ...]`
so bare ints are preserved for faithful JSON round-tripping.
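
A rough illustration of why the widened type matters for round-tripping. The helper names `normalize_chunk_shapes` and `to_jsonable` are hypothetical and are not functions from this PR; the sketch also ignores any run-length encoding the spec may allow.

```python
from __future__ import annotations

from collections.abc import Sequence


def normalize_chunk_shapes(
    raw: Sequence[int | Sequence[int]],
) -> tuple[int | tuple[int, ...], ...]:
    """Keep bare ints as ints and edge lists as tuples (hypothetical helper)."""
    out: list[int | tuple[int, ...]] = []
    for axis, dim in enumerate(raw):
        if isinstance(dim, int):
            if dim < 1:
                raise ValueError(f"axis {axis}: chunk size must be positive")
            out.append(dim)  # bare int: "regular step size"
        else:
            edges = tuple(int(e) for e in dim)
            if not edges or any(e < 1 for e in edges):
                raise ValueError(f"axis {axis}: edges must be positive and non-empty")
            out.append(edges)  # explicit edge list, even if it has one element
    return tuple(out)


def to_jsonable(shapes: Sequence[int | tuple[int, ...]]) -> list[int | list[int]]:
    """Inverse of the normalization above, producing a JSON-compatible list."""
    return [s if isinstance(s, int) else list(s) for s in shapes]


shapes = normalize_chunk_shapes([10, [10]])
round_tripped = to_jsonable(shapes)
```

Here `10` and `[10]` stay distinguishable after normalization, so writing the metadata back out reproduces the input rather than collapsing both forms into one.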

Additionally:
- unifies `_validate_chunk_shapes` to handle both regular and rectilinear validation;
  `_parse_chunk_shape` now delegates to it
- adds `from_sizes` method to `ChunkGrid`, accepting `int | Sequence[int]` per dimension
- removes `from_regular` and `from_rectilinear` methods from `ChunkGrid`
- removes `parse_chunk_grid` from `chunk_grids.py` (JSON → ChunkGrid shortcut that
  bypassed the metadata layer)
- removes `serialize_chunk_grid`, `_infer_chunk_grid_name`, and serialization helpers
  from `chunk_grids.py` (ChunkGrid never needs to be serialized; metadata DTOs handle it)
- renames `parse_chunk_grid` in `v3.py` to `parse_chunk_grid_metadata` to disambiguate
- moves the rectilinear feature flag to `RectilinearChunkGrid.__post_init__`
- simplifies sharding codec validation into a single divisibility check for both
  regular and rectilinear grids
- updates `validate_rectilinear_edges` to skip bare-int dimensions
- refactors chunk grid tests to functional style with parametrization
- adds docstrings to all test functions

* chore: remove .claude

* refactor: rename chunk_grid parsing function

---------

Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
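
For reference, the "single divisibility check for both regular and rectilinear grids" mentioned in the commit message above could look roughly like the following. This is a sketch under assumptions: `validate_shard_edges` is a hypothetical name, and the argument layout (one entry per dimension, either a bare int or a tuple of edges) is borrowed from the `chunk_shapes` type described in the same commit.

```python
from __future__ import annotations

from collections.abc import Sequence


def validate_shard_edges(
    shard_shapes: Sequence[int | Sequence[int]],
    inner_chunk_shape: Sequence[int],
) -> None:
    """Raise ValueError unless every shard edge is a multiple of the inner chunk size.

    Hypothetical helper: a bare int is treated as a one-element edge list,
    so regular and rectilinear dimensions go through the same check.
    """
    if len(shard_shapes) != len(inner_chunk_shape):
        raise ValueError("shard grid and inner chunk shape differ in dimensionality")
    for axis, (dim, inner) in enumerate(zip(shard_shapes, inner_chunk_shape)):
        edges = (dim,) if isinstance(dim, int) else tuple(dim)
        bad = [e for e in edges if e % inner != 0]
        if bad:
            raise ValueError(
                f"axis {axis}: shard edges {bad} are not divisible by inner chunk size {inner}"
            )


# Regular first axis (100 / 50), rectilinear second axis (20, 40, 60 / 20).
validate_shard_edges((100, (20, 40, 60)), (50, 20))
```

A shard edge of 30 with an inner chunk size of 20 would be rejected by this check, since the shard could not be tiled by whole inner chunks.
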
    if spec is not None:
        yield spec

def all_chunk_coords(

imo the name all_chunk_coords doesn't convey that this is an iterator, but iter_chunk_coords would. if we need all_chunk_coords for backwards compat, can we deprecate that method and make it call iter_chunk_coords instead?
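
One way the suggested deprecation could be wired up, as a sketch only. `GridSketch` is a stand-in class invented for this example, not the ChunkGrid in this PR; only the two method names come from the comment above.

```python
import warnings
from collections.abc import Iterator
from itertools import product


class GridSketch:
    """Stand-in for a chunk grid; holds the number of chunks per dimension."""

    def __init__(self, chunks_per_dim: tuple[int, ...]) -> None:
        self._chunks_per_dim = chunks_per_dim

    def iter_chunk_coords(self) -> Iterator[tuple[int, ...]]:
        """Lazily yield every chunk coordinate in C order."""
        return product(*(range(n) for n in self._chunks_per_dim))

    def all_chunk_coords(self) -> Iterator[tuple[int, ...]]:
        """Deprecated alias kept for backwards compatibility."""
        # Not a generator function, so the warning fires at call time
        # rather than on first iteration.
        warnings.warn(
            "all_chunk_coords is deprecated; use iter_chunk_coords instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.iter_chunk_coords()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    coords = list(GridSketch((2, 2)).all_chunk_coords())
```

Existing callers keep working and see a DeprecationWarning, while new code can use the name that signals lazy iteration.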

    if spec is not None:
        yield spec.slices

def get_nchunks(self) -> int:

nit: i imagine this name is here for backwards compatibility, but nchunks (like ndim) feels like a tighter name.


BytesLike = bytes | bytearray | memoryview
ShapeLike = Iterable[int | np.integer[Any]] | int | np.integer[Any]
ChunksLike = ShapeLike | Sequence[Sequence[int]] | None

what would we lose if we used Iterable[Iterable[int]]? we'd have to call tuple before we could get a length, but we could also accept more inputs.
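
A small sketch of the trade-off being asked about. `materialize_edges` is a hypothetical helper, not part of zarr: it accepts any `Iterable[Iterable[int]]` and pays for that flexibility by calling `tuple` once up front so that lengths become available.

```python
from collections.abc import Iterable


def materialize_edges(chunks: Iterable[Iterable[int]]) -> tuple[tuple[int, ...], ...]:
    """Consume arbitrary nested iterables once and return nested tuples."""
    return tuple(tuple(int(edge) for edge in dim) for dim in chunks)


# Generators and ranges have no usable len() as a nested structure,
# but they are accepted here.
lazy_input = iter([range(1, 4), (n for n in (5, 5))])
edges = materialize_edges(lazy_input)
ndim = len(edges)
chunks_per_dim = tuple(len(dim) for dim in edges)
```

The cost is that one-shot iterators are consumed by the call, so the caller cannot reuse `lazy_input` afterwards.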

@maxrjones

Thanks for the thorough reviews, @d-v-b!

I'll make DimensionGrid an internal implementation detail by making _dimensions private.

I'd prefer to address the naming/API comments on all_chunk_coords, get_nchunks, and the Iterable[Iterable[int]] type in common.py in follow-ups to keep the scope here focused because those all predate this PR.

6 participants