Skip to content

Supervoxel splitting with base+fork support and locks#534

Open
akhileshh wants to merge 51 commits into
pcgv3from
akhilesh/sv-splitting-locks
Open

Supervoxel splitting with base+fork support and locks#534
akhileshh wants to merge 51 commits into
pcgv3from
akhilesh/sv-splitting-locks

Conversation

@akhileshh
Copy link
Copy Markdown
Contributor

@akhileshh akhileshh commented Apr 23, 2026

Summary

Supervoxel splits in OCDBT-backed segmentations are now atomic under worker failure, with an operator tool for recovering stuck ops.

Locks

Two indefinite locks, both holding their cells on exception:

  • Root lock wraps multicut → split detection → voxel-level split → commit.
  • L2 chunk lock nests inside, scoped to rewritten chunks + 1-chunk margin for the edge-routing shell.

On unclean exit, both hold and the L2 scope is written to the op-log row.

Stuck-op signal

Non-empty scope field on the op-log row. Covers both CREATED (worker crash) and FAILED (caught exception) paths; status alone missed FAILED. Clean exit clears it.

Cleanup-then-replay

Concurrent ops on non-overlapping chunks advance OCDBT during the outage, so recovery can't use one pinned view.

  • Cleanup: read pre-op voxels from an OCDBT handle pinned at the op's timestamp, write them back unpinned at the same chunk keys. Chunks outside the scope are untouched.
  • Replay: reruns via the privileged-repair path against latest state. Indefinite-lock exit releases cells by value-match on the original op ID.

OCDBT time-travel

Tensorstore version field in the spec (integer generation or ISO-8601 Z timestamp, interpreted as commit_time ≤ T). Threaded as pinned_at. Pinned handles are read-only.

Reader contract

dataset_info now publishes the full kvstore spec (kvstack layers + OCDBT config + data prefixes) instead of path fragments, so readers pass it verbatim to tensorstore. The multi-scale open asserts fork presence, failing with a clear message instead of a tensorstore internal.

Drop the inline propagate_to_coarser_scales call from write_seg; coarser
mips are now the async downsample worker's responsibility. write_seg is
back to a single base-scale tensorstore write, so SV splits no longer
block on the full pyramid update. TestWriteSeg updated to assert the
coarser scales stay zero after write_seg (propagation tested separately
via TestPropagateToCoarserScales).
split_supervoxel now returns its base-resolution bbox. split_with_sv_splits
collects one per call and attaches the list to Result.seg_bbox (new
optional field). publish_edit includes the list in the payload and sets
downsample="true" so the worker only runs on edits that touched base seg.

List kept unmerged — lets the worker skip tiles outside the actual
change region.
workers/downsample_worker.py consumes edits-exchange messages flagged
downsample="true" and writes each non-base mip within the SV-split
bbox. graph/downsample.py splits the region into pyramid_blocks (sized
so no two blocks share a storage chunk at any mip), then either
tinybrain'd in one call (fast path, typical small edits) or per-mip
(fallback when base read exceeds memory budget). Write filtering keeps
OCDBT delta proportional to the actual change.

DownsampleBlockLock serializes overlapping jobs via kvdbclient's
row-key lock API; 26-byte hash-prefixed keys avoid tablet hot-spots.
Depends on kvdbclient lock_by_row_key / unlock_by_row_key /
renew_lock_by_row_key landing first.
Used by the async downsample worker as the mip-pyramid kernel.
Closes a concurrency gap where two SV splits on overlapping L2 chunks
but distinct roots can't be serialized by root locks — they acquire
disjoint root-lock sets and race on seg state. L2ChunkLock serializes
them via the kvdbclient row-key lock primitive.

Row key = 2-byte blake2b hash + 8-byte uint64 chunk_id (10 bytes).
Hash prefix keeps spatially-clustered L2 chunks from hot-spotting a
single bigtable tablet under concurrent load.

Primitive only — callers land separately.

RowKeyLockRegistry helper moved to tests/helpers.py so L2ChunkLock and
DownsampleBlockLock tests share it instead of duplicating.
Callers now get Cut(atomic_edges) | PreviewCut(ccs, illegal_split) |
SvSplitRequired(sv_remapping) instead of unwrapping-by-convention or
catching SupervoxelSplitRequiredError. The exception still unwinds
inside LocalMincutGraph — cheapest way to bail out of deep path code —
but it's caught once at the run_multicut boundary and never escapes,
so callers don't use raise/catch for control flow.
MulticutOperation._apply dispatches on the tagged multicut result and,
when an SV split is needed, calls the new edits_sv.split_supervoxels
under its surrounding RootLock, refreshes source/sink SV IDs from seg,
and retries multicut against the post-split graph. Root lock spans the
whole critical section; L2ChunkLock held only around the split loop.

This closes two races that existed when split_with_sv_splits handled
the flow outside any lock: same-root (root lock now never released
between multicut and commit) and cross-root (L2ChunkLock serializes
overlapping split regions). split_with_sv_splits is deleted;
handle_split calls cg.remove_edges directly.
Pre-compute each rep's bbox from the chunk coords of its CC members in
sv_remapping (no coord-padding, no resolution-axis assumption).
split_supervoxels builds the union lock set across reps — sparse chunks
plus one L2-chunk margin for update_edges's 1-voxel overlap read —
acquires once, then loops per-rep splits.

_update_chunks surfaces the change_chunks that actually got new SV IDs;
write_seg_chunks fires one tensorstore future per change-chunk and
awaits together, so only chunks with real label changes hit OCDBT.
Gap chunks between CC pieces and neighbor chunks read for the overlap
never get rewritten, keeping the delta proportional to the edit.

split_supervoxels also threads back the fresh source/sink SV IDs from
the in-memory new_seg block (same bytes that just landed on storage),
so the retry multicut sees current IDs without an extra seg read.

Drops _get_whole_sv (dead since the sv_remapping switch).

Adds a high-level architecture doc covering the end-to-end flow,
concurrency design, and durable invariants.
Enable opening a CG's OCDBT at a prior commit via the driver's `version`
spec field (int generation or ISO-8601 commit_time upper bound).
Groundwork for operator-driven replay of failed SV splits, which needs
clean pre-op reads against append-only storage.

OCDBT stamps commits from absl::Now() with no caller-override hook, so
pins will use OperationTimeStamp captured under the L2 chunk lock rather
than aligning OCDBT commit times to operation time.
@codecov

This comment was marked as spam.

split_supervoxels is now a pure planner returning a SplitResult
(seg_bboxes, source_ids_fresh, sink_ids_fresh, seg_writes,
bigtable_rows). No lock acquisition, no writes. The caller
(MulticutOperation._apply) holds the L2 chunk locks and fires
the consolidated persist — OCDBT chunks + bigtable rows — inside
an inner lock scope.

seg_writes is a flat list of (voxel_slices, data) pairs across all
reps so write_seg_chunks fires every chunk write as one parallel
tensorstore batch. Removes the per-rep serialization in the old
write_seg_chunks loop.

get_seg_source_and_destination_ocdbt gains a pinned_at kwarg, forwarded
to build_cg_ocdbt_spec — used later by the recovery path.
New L2 chunk counterpart to IndefiniteRootLock, keyed by chunk row.
L2ChunkLock now acquires via lock_by_row_key_with_indefinite so a
temporal acquire sees a crashed op's indefinite cell and refuses.
IndefiniteL2ChunkLock records its chunk scope on the op-log row's
L2ChunkLockScope column at __enter__ and clears it on clean exit,
giving recovery a durable scope without a bigtable-wide scan.

Both indefinite locks (root and L2 chunk) now short-circuit __exit__
when an exception is propagating: cells stay held, scope stays set.
Partial writes may exist after an exception; leaving the cells forces
subsequent ops to refuse at lock-acquire and the operator to run
recovery explicitly.

privileged_mode=True on either lock is the operator recovery escape
hatch: skips acquire, pre-populates acquired_keys so __exit__'s
value-matched release deletes the crashed op's cells.

RowKeyLockRegistry (test helper) gains the three new kvdbclient
primitives.
Operator recovery for SV-split ops that crashed mid-write. A worker
death inside IndefiniteL2ChunkLock leaves per-chunk indefinite cells
set and records the chunk scope on the op-log row. Recovery reverts
partial OCDBT writes using a version-pinned read of pre-op voxels,
then replays the op normally.

list_stuck scans OperationLogs for ops still at CREATED past a
min-age threshold. replay(cg, op_id) runs cleanup_partial_writes
followed by repair.edits.repair_operation(..., unlock=True);
IndefiniteL2ChunkLock's privileged-mode __exit__ deletes the crashed
op's pre-existing cells after the replay's writes land.

Architecture-level operator guide at docs/sv_splitting_recovery.md,
linked from docs/sv_splitting.md's Concurrency section.
Pass the op's timestamp from execute → _apply → split_supervoxels →
split_supervoxel → copy_parents_and_add_lineage / add_new_edges so
every new-SV bigtable mutation lands at the op's logical write time.
Gets atomic visibility under a parent_ts filter and makes replay's
override_ts actually control what time-filtered readers see after a
repair_operation. Parent-copy and Child-list writes deliberately keep
the old cell's timestamp so pre-op readers still see the old
hierarchy.

Replace the seven-tuple in/out soup in split_supervoxels with named
dataclasses: SvSplitTask (plan_sv_splits → split_supervoxel input)
and SvSplitOutcome (split_supervoxel's per-task output). Drop the
two unused return fields on split_supervoxel.
list_stuck filter switches from Status==CREATED to "L2ChunkLockScope
set and Status != SUCCESS past min_age". The authoritative signal for
stuck-ness is "scope recorded, not cleared" — worker crash (Status
stays CREATED) and Python exception during persist (Status=EXCEPTION
after Fix 1) both fall under it. Ops without scope aren't blocking
other ops and are outside stuck_ops' concern.

replay now verifies each chunk in the recorded scope actually has
Concurrency.IndefiniteLock held by this op_id before running cleanup
or repair. If cells are missing or held by a different op, raises
with a clear error. Protects against double-replay (first run already
released cells) and out-of-band clearing (manual bigtable edit,
buggy release path) — both would have cleanup_partial_writes revert
chunks that aren't ours.
fork_base_manifest is now an explicit step — invoked from the ingest
CLI's --ocdbt path or the seg_ocdbt notebook — rather than being
auto-triggered on first ws_ocdbt_scales access. ws_ocdbt_scales
asserts fork_exists() so a missing fork fails with a clear pointer
instead of a tensorstore mismatch/not-found error.
Stuck-op detection keys off L2ChunkLockScope being populated, not
Status=CREATED — that filter also covers caught-exception paths
(Status=FAILED) where cells are held but the row isn't CREATED.

Recovery now verifies each chunk's IndefiniteLock is actually held
by the op before cleaning up, so a stale scope can't have us revert
chunks another op owns. Reflect both in docs and the list CLI help.
Drop "mode-downsample" from the SV-split diagram — tinybrain owns
the algorithm; the doc shouldn't pin it.
@akhileshh akhileshh changed the title sv splitting locks Supervoxel splitting with locks Apr 24, 2026
@akhileshh akhileshh changed the title Supervoxel splitting with locks Supervoxel splitting with base+fork support and locks Apr 24, 2026
@akhileshh akhileshh requested review from fcollman and sdorkenw April 24, 2026 02:29
`_rep_bbox` enveloped every piece of the cross-chunk-connected rep —
for physical SVs split into many pieces across chunks, the bbox grew
far wider than the cut surface needs. Replace with `_coords_bbox`:
envelope of the user-placed source/sink coords plus a one-chunk
margin (matches the existing L2 lock margin and 1-voxel shell).

After the seg read, `cut_supervoxels` is intersected with the IDs
present in seg, so the "whole sv" set names only the rep pieces the
bbox actually touches. Pieces of the rep outside the bbox keep their
existing IDs — their cross-chunk edges to in-bbox split fragments
are routed via the 1-voxel shell, edges between two unsplit pieces
don't change.

Adds `TestCoordsBbox` covering envelope+margin, volume-bound clipping,
and that `plan_sv_splits` returns a tight bbox regardless of how
distant the rep's other pieces sit.
handle_supervoxel_id_lookup and id_helpers.get_atomic_ids_from_coords
both short-circuit to lookup_svs_from_seg whenever ocdbt_seg is true,
regardless of node-id layer. 2D slice clicks send L1 IDs from a view
that may be stale after an SV split; 3D mesh clicks send a root and
no L1 at all. Either way the current SV is what matters, so we read
seg at the click coords and let downstream same/different-root
checks surface any staleness with the sv_id->root diagnostic.

Also vectorize lookup_svs_from_seg's per-coord indexing into a single
advanced-index op.
`_schema_from_src` was passing both `domain` and `shape` to ts.open
when cloning the source schema to the destination. For sources with
a non-zero `voxel_offset`, `domain` carries absolute bounds (e.g.
[17756, 62244)) while `shape` implies an origin of 0, and tensorstore
refuses to merge them — base creation fails on any precomputed source
that doesn't start at the origin.

`domain` already encodes both extent and offset, so passing only it
is sufficient and avoids the conflict.
cloudbuild now uses BuildKit registry cache (`:buildcache`) so
unchanged stages reuse the prior build's layer artifacts and
already-warm nodes skip re-downloading them on pull.

The fresh-ingest CLI no longer infers `ocdbt_populate_base` from
`base_exists` — manifest presence didn't reflect whether chunks
were actually copied. Replaced with an explicit `--populate-base`
flag; operator sets it on first ingest and omits on subsequent
runs.
akhileshh added 22 commits May 19, 2026 22:44
`docker buildx create --use` stores the "current builder" client-side
in ~/.docker/buildx/, which doesn't persist across cloudbuild steps —
each step runs in a fresh container. The build step then fell back
to the default `docker` driver, which doesn't support
`--cache-to type=registry`. Cache export silently no-op'd, so the
:buildcache image never got pushed and every subsequent build's
--cache-from missed.

Move the create + use into the same step as the build and explicitly
set --driver docker-container.
create_base_ocdbt and wipe_base_ocdbt both wiped through the OCDBT
driver, which on an empty dir creates a default-config manifest stub
(max_inline_value_bytes=100) and on a populated dir only clears the
B+tree leaving the manifest in place. Either way the subsequent open
with OCDBT_CONFIG mismatches. Wipe through the underlying GCS/file
driver so the manifest itself gets deleted.

Move OCDBT base setup before cg.create() so a failure in OCDBT setup
doesn't leave behind an orphan bigtable table.
…mic txn

Move the precomputed→OCDBT base copy out of L2 atomic tasks into a single
parent-layer task per chunk, layer configurable via --populate-layer
(default 4). Each task batches every underlying L2 chunk across every MIP
scale into one OCDBT commit via ts.Transaction(atomic=True) — without
atomic=True the precomputed driver splits a multi-chunk write into two
OCDBT sub-commits regardless of the surrounding python-level transaction.

The chosen populate-layer is persisted in <ws>/ocdbt/.populated/meta.json
alongside the per-chunk markers; mixing layers is unsupported, so a
mismatch errors out and tells the operator to delete <ws>/ocdbt/ if they
want to repopulate at a different layer.
… status + per-chunk node/edge logs

`ingest graph --retry` now loads the existing IngestionManager from redis
instead of requiring the OCDBT flags to be re-passed; `dataset` becomes
optional under --retry. The base/populate-meta/cg.create one-time setup
is skipped — only the per-CG fork manifest is re-wiped and L2 tasks are
re-enqueued. The job_type_guard already handles cross-mode misuse.

create_parent_chunk fires the OCDBT populate on a worker thread before
add_parent_chunk so the GCS copy overlaps with the BigTable hierarchy
build; the populate future is awaited (and exceptions propagated)
before marking the task complete.

print_status now surfaces ocdbt_seg / ocdbt_populate_base /
ocdbt_populate_layer so operators can see at-a-glance what mode the
running job is in. add_atomic_chunk and add_parent_chunk each log node
and edge counts so each task's BigTable write volume is visible in the
worker logs.
…le refresh

print_status now renders a single rounded Rich Panel via Live. Two
header mini-tables (graph fields, ocdbt fields when enabled) sit above
the per-layer progress/queue/busy table; both header groups share
computed column widths so their boundaries line up. job_type lives in
the Panel title so it stays out of the row data.

Busy-worker count comes straight from Redis: SMEMBERS rq:workers:l{N}
to enumerate worker keys per queue, then a single pipelined HGET state
across all of them and a sum of state == b'busy'. Two round-trips per
refresh regardless of worker count, displayed as busy/total.

Refresh cadence defaults to 5s and is overridable via --refresh on
both `ingest status` and `upgrade status`.
…ations

Each refresh was constructing a fresh Queue plus a fresh
FailedJobRegistry per layer, and the first FailedJobRegistry access
triggered a lazy `from rq.registry import FailedJobRegistry` import,
which dominated first-call latency. The relevant keys are stable
strings (`rq:queue:l{N}`, `rq:failed:l{N}`, `rq:workers:l{N}`), so
compute them once before the live loop and pass them straight into
the pipeline.
Overlapping the GCS copy with add_parent_chunk via a worker thread
doubled peak memory inside the worker (the multi-scale read buffer
and the BigTable hierarchy state are both live), which was OOM-killing
workers. Drop the ThreadPoolExecutor and let the populate run inline
after add_parent_chunk completes; _populate_ocdbt_chunk stays as a
named helper.
…ntermediate numpy

Passing the source TensorStore directly into write() lets tensorstore
copy through its own pipeline rather than materializing a uint64
numpy array in Python and then handing it back to write(). On a 512
MiB region the duplicate-buffer cost goes away — peak RSS drops by
roughly one scale's worth. Inside the atomic transaction the txn
itself still buffers staged writes, so this trims the avoidable half
of memory, not all of it.
… config

Split ocdbt.py into ocdbt/{__init__,meta,utils,main}.py. OcdbtConfig
dataclass replaces the OCDBT_CONFIG module constant; spec builders
take it explicitly. Per-CG settings persist in
custom_data["ocdbt_config"]; legacy custom_data["seg"] still reads.

Drop --ocdbt, --populate-base, --populate-layer, --sv-split-threshold
from `ingest graph`; those move to yaml under a top-level
`ocdbt_config:` section. bootstrap() parses it.
IngestionManager carries one ocdbt_config dict with @Property
accessors so cluster.py callers don't change.

max_inline_value_bytes default bumped to 1 MiB (tensorstore's hard
ceiling).
…hared with ws_cv

ChunkedGraphMeta.ocdbt_config now consults <ws>/ocdbt/.populated/meta.json
at runtime and resolves it on top of custom_data per the info-file >
custom_data > defaults order. New module-level _redis_cached_json(key,
loader) caches the JSON dict in Redis so distributed workers hit Redis
once per process instead of GCS per CG instance; ws_cv refactored to
use the same helper.
…p edges-OCDBT path

New ingest/ocdbt.py owns the per-chunk populate task, the shared base/fork
setup, and a coordinator-server context manager that the populate-layer
CLI will wire in next. populate_chunk fails loudly when the coordinator
address isn't advertised in Redis — uncoordinated parallel commits race
the shared manifest and leak orphan d/ files, the exact bug this module
exists to prevent.

The --ocdbt-edges branch and start_ocdbt_server (edges-only writer with
a different base path) go away with their tests; no caller is left.
`flask ingest layer N` now wraps queue_layer_helper in a
DistributedCoordinatorServer when N matches the OCDBT populate layer:
the server starts in-process, its address is advertised in Redis, and
the foreground blocks so workers route every commit through it.
Eliminates the manifest-CAS races that were emitting orphan d/ files at
~47× the rate of committed data.

Both CLIs lose --reset-ocdbt — wipe out of band with
`gcloud storage rm -r gs://<ws>/ocdbt/`. The shared base/fork/persist
dance moves out of both `graph` subcommands into the single setup_base
helper. upgrade's legacy custom_data["seg"] write is also replaced by
the canonical ocdbt_config schema setup_base persists.
…s.requeue_chunk

ingest_chunk and upgrade_chunk were identical aside from their L2 / L3+
function pair. The shared body lives in utils.requeue_chunk; each CLI's
chunk command is one line that passes its own (atomic_fn, parent_fn).
bootstrap returns (meta, ingest_config, client_info, ocdbt_config_dict)
since the OCDBT-package refactor, but the test still unpacked three
values and raised at call site. Update the unpack and assert the new
dict element is present.
queued sits at position 2 — the column you actually watch during a run.
Layer centers under its header now that it's the lone left column. %
becomes progress; done becomes completed. The composite workers [busy]
column splits into two right-aligned columns (workers, busy) so the
total and the active count line up by digit and don't visually fuse.
Global padding bumps to (0, 2) for breathing room.

Headers go through rich.text.Text so any future bracketed header would
render literally instead of being parsed as Rich markup.
Workers were setting OCDBT_COORDINATOR_HOST/PORT env vars before opening
the base kvstore. The tensorstore binary doesn't reference those strings —
the OCDBT driver looks at the spec's `coordinator: {address: ...}` field
(or the `ocdbt_coordinator` Context resource), nothing else. So every
"distributed" populate worker has been writing direct, racing the shared
manifest, and emitting orphan d/ files: ~634 GiB of d/ at under 3% of
layer 3 done before this was caught.

open_base_ocdbt takes an optional coordinator_address and injects it into
the kvstore spec. populate_chunk forwards an address when one is passed
(distributed) and writes direct when none is (single-process / notebooks),
since uncoordinated single-writer is fine. cluster.create_parent_chunk
is the only distributed caller and is where the fail-loud lives: it
reads the address from Redis via get_coordinator_address and raises if
unset, so a worker can never silently regress to the env-var no-op.

Verified by opening an OCDBT kvstore with coordinator=localhost:1 — the
first write raises UNAVAILABLE, confirming the spec field is the actual
routing knob.
tensorstore's distributed-OCDBT path rejects cross-key atomic
transactions — every populate task with the coordinator wired in was
raising "Cannot read/write ... as single atomic transaction" the moment
copy_ws_bbox_multiscale tried to span the info key and chunk keys (and
across scales).

The fix is just to drop atomic=True. The transaction object itself is
what batches every per-scale .write() into one OCDBT commit; atomic
only adds cross-key isolation, which we don't need here and which the
coordinator path forbids. Verified by reproducing both shapes:

  * coord + atomic=True  → fails immediately on info+chunk
  * coord + Transaction() → single d/ file per task (multi-scale)

Concurrent-writer check under the coordinator: 8 threads × 20 commits
produced 183 d/ files (≈1.14× overhead from manifest updates). Without
the coordinator the same workload produced 1017 d/ files (6.4× orphan
ratio), which matches the manifest-CAS-race behavior the coordinator
fixes.
Each populate task now prints a layer/coords header line, then the
multiscale commit prints a tight summary of how many voxels landed
across how many scales. Per-scale detail stays at debug level so the
default operator view stays one informative line per task — enough to
see what's being done and how much, without scrolling.
New pychunkedgraph/graph/ocdbt/debug.py holds all the diagnostic
plumbing — humanize_count, failure_envelope (host/pod/versions/
traceback/timestamp/coordinator env), bbox_failure_payload (per-scale
shape/chunk/key-count/raw-bytes + src+dst kvstore specs), and
dump_failure_to_gcs (writes JSON to $ERROR_DUMP/<dump_tag>__<utc>.json).

copy_ws_bbox_multiscale stays slim: when ERROR_DUMP is unset, no
per-scale bookkeeping, no payload, no GCS write — happy path runs
unchanged. When ERROR_DUMP is set, per-scale rows accumulate and on
failure get serialized via the helpers. populate_chunk threads a
<graph_id>/L<layer>/<x>_<y>_<z> dump_tag so each failing task self-
identifies in the bucket and multiple experiments share one dump root
without collisions.
`flask ingest purge_layer <N>` drops all per-layer redis state — the RQ
queue (deletes jobs along with it), every per-layer registry (failed,
started, deferred, scheduled, finished, canceled, addressed by the
registry's own .key so we don't hardcode RQ's internal key naming),
and the pychunkedgraph completion set f"{N}c". Confirmation prompt
gates the destructive action.

Used when you restore the table from a previous layer's backup and
need to re-enqueue from scratch — without this, stale queue + completion
state from the aborted run would shadow the fresh re-run.
Restructure `flask ingest graph`: --retry/-r reuses the IngestionManager
from redis but otherwise runs the same setup path as a fresh ingest —
opens the existing table via imanager.cg, calls setup_base when OCDBT
is enabled (idempotent: recreates the base + info if absent, reconciles
config, forks for this CG), and enqueues. Skipping only cg.create()
means an externally-wiped OCDBT base now recovers on retry without
touching Bigtable.

New --skip-queue/-s does setup but no L2 enqueue, for operators who
want to verify the OCDBT base / fork state before queueing work.
Short forms for retry/-r, skip-queue/-s, test/-t.
@akhileshh akhileshh force-pushed the akhilesh/sv-splitting-locks branch 3 times, most recently from c1c51ce to 9b86434 Compare May 21, 2026 21:14
In-repo notes on every OCDBT spec field, config sub-field,
ocdbt_coordinator resource field, DistributedCoordinatorServer
constructor field, and runtime defaults — verified empirically against
the venv's tensorstore binary by probing with intentional-bad-value
plus spec round-trips.

Captures the distributed-mode constraints: atomic transactions
forbidden, cooperator-forwarded RPC bounded at gRPC's 4 MiB default
with no Python knob to raise it, leases held per btree node so
disjoint user-key writes still cross-forward. Catalogs the levers that
do exist (max_decoded_node_bytes, max_inline_value_bytes,
lease_duration) and ones that look promising but aren't fields.
@akhileshh akhileshh force-pushed the akhilesh/sv-splitting-locks branch from 9b86434 to 5873aa8 Compare May 21, 2026 21:30
akhileshh added 3 commits May 21, 2026 21:36
Reading tensorstore's distributed btree_writer.cc StagePending: values
≤ max_inline_value_bytes are carried inline in the encoded mutation, so
the cooperator's per-leaf WriteRequest packs all of them together. With
the previous 1 MiB threshold, every compressed_segmentation chunk that
landed below 1 MiB stayed inline and contributed its full bytes to the
gRPC forward — which blows past tensorstore's hardcoded 4 MiB gRPC
max-receive whenever a few inline chunks land on the same btree leaf.

Dropping to 4 KiB pushes every chunk value through WriteData into a d/
file and replaces it in the mutation with an IndirectDataReference, so
forwarded RPCs carry only small refs regardless of value size. Small
metadata (info JSON, populate-marker files) still stays inline.

Tradeoff: the old comment cited "7× GCS bloat" at the 100-byte default
— that ratio came from per-chunk independent zstd framing dominating
tiny payloads, not chunk-sized ones. At 4 KiB+ payloads the per-chunk
zstd overhead is in the single-digit-percent range. Reference doc
updated to credit max_inline_value_bytes as the actual RPC-size lever
and correct the cooperator-batching section.
tensorstore raises ValueError with an absl status-code prefix on
transport-level failures. A DNS hiccup or a GCS 5xx in is_chunk_populated
or mark_chunk_populated was killing the whole populate task. Wrap the
four marker / populate-meta helpers in a tenacity retry that matches
UNAVAILABLE / DEADLINE_EXCEEDED / ABORTED / INTERNAL prefixes only —
NOT_FOUND, INVALID_ARGUMENT, RESOURCE_EXHAUSTED still propagate.
…ask-body mode flags

`ingest layer N` now owns the whole OCDBT lifecycle when N matches
ocdbt_populate_layer: idempotently runs setup_base (base + fork +
config reconcile), starts the coordinator, queues tasks. Moves
setup_base out of `ingest graph` so there's a single command that
manages OCDBT. Re-pickles the IngestionManager so workers see the
resolved config.

Adds three flags to `ingest layer`:
  --queue-only/-q  : skip the coordinator (one is running elsewhere).
  --ocdbt-only/-o  : workers run only OCDBT populate, skipping
    add_parent_chunk. For re-populating a freshly created/wiped OCDBT
    base against an already-built bigtable graph.
  --ingest-only/-i : workers run only add_parent_chunk, skipping
    OCDBT populate. For rebuilding bigtable against an existing
    OCDBT base.

Refactors create_parent_chunk to a single function with a `mode`
parameter bound at queue time via functools.partial — no duplicated
OCDBT-populate block. _post_task_completion runs unconditionally so
layer-progress tracking in redis stays consistent across modes.

Adds IngestionManager.is_ocdbt_populate_layer(layer) so the compound
guard isn't repeated at every call site.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant