Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,11 +17,13 @@ All notable changes to this project will be documented in this file.

### Fixed

- **@overeng/megarepo**: Harden the cold named-branch `mr store gc` reclamation path against an adversarial review. (1) **Archive freed no branch for production worktrees**: a production `refs/heads/*` worktree is NON-DETACHED, so after `git worktree move` the moved worktree still has the branch checked out and `git branch -D` is REFUSED (`cannot delete branch used by worktree`), leaving the branch unfreed and a later `mr apply` re-add broken (invariant 4). Fixed by detaching the moved worktree's HEAD (`git checkout --detach`, new `Git.detachWorktreeHead`) before freeing the ref; covered by a new integration test using a non-detached worktree (the prior fixtures only used `--detach`, masking the defect). (2) **Liveness registry writes were non-atomic**: `refreshWorkspaceRegistry` and the under-lock reconcile rewrite used plain `writeFileString`, so a torn read during a concurrent gc reconcile could drop a workspace's live-set veto (decision 0010 hard veto); both now route through `writeFileAtomic`. (3) **Partial-archive mis-reporting**: once the move succeeded the result is now `archived` with the real `recoverPath` (post-move branch-free + README steps are best-effort-but-reported via a warning), instead of falsely reporting `error`/"left intact". The `.archive/README.md` append is now an atomic write. New tests cover the archive-time live-set veto re-check, `loadStoreGcConfig` file load + corrupt-file degradation, a CLOSED-PR archive, dry-run reap intent, the unclean-reconcile grace withholding/restart, and `writeFileAtomic` temp-cleanup on a rename failure (#771).
- **@overeng/otelite**: Honor durability-before-ack — flush each export to the kernel before the 200/OK. `tokio::fs::File` buffers writes, so `write_all` alone did NOT guarantee the bytes reached the kernel before the sink acked; an independent reader (or a crash) before the next flush could miss them, contradicting R05 ("flush … before acking") and the `append_line` doc's own "durably reaching the kernel before returning" promise. This surfaced as a CI flake in the `durable_before_ack` gate (a read immediately after the 200 occasionally saw an empty file under thread contention — reproduced ~1/60 at 16 test threads). Fix: `SignalFile::append_line` / `append_json` now `flush()` after `write_all`, before returning. This is a flush, not an fsync — `sync_all` (physical-disk durability) stays deferred to shutdown, so the M2 "no per-export fsync under the lock" throughput decision is preserved. Verified: 0 failures over 200 × 16-thread runs (was ~1/60).
- **@overeng/otelite**: Make the HTTP-JSON metrics receive path lossless, fixing two silent data-loss bugs a stress test surfaced. The upstream `opentelemetry-proto` `with-serde` deserialize — which the receiver used to BUILD the proto value the sink then re-serialized — silently drops several metric JSON shapes: a `sum`/`gauge` `NumberDataPoint` whose int64 value is the default string form (`"asInt":"7"`) lost its value entirely (captured null), and a regular `histogram` metric was dropped down to `{name,description,unit,metadata}` (its data oneof gone). Both returned HTTP 200 + bumped `counts.metrics` → a silent mis-capture that violates the lossless + "loud, never silent" contracts (decisions/0011). Fix: on the JSON metrics path, `with-serde` still runs purely as the dialect VALIDATOR (Err → 400 + `note_rejected`, gate unchanged), but on success the receiver now persists the VALIDATED RAW JSON body verbatim (re-emitted through `serde_json::Value` via the new `Sink::write_metrics_json`, counting metrics from the JSON structure) instead of the lossy proto re-serialization. Since the body is already canonical OTLP/JSON and `inspect` walks raw JSON, the JSON metrics path is now lossless for string-int64 sums/gauges, regular histograms, AND exponential histograms — the last also RESOLVING the previously-documented exp-histogram-on-JSON limitation for the receive path. Traces/logs JSON paths and all protobuf/gRPC paths are unchanged (already lossless). New gates (real receiver, no mocks): an HTTP-JSON round-trip of a string-int64 sum + histogram + exponential histogram all survive receive → capture → `inspect`; cross-transport equivalence extended to metrics (the same logical string-int64-sum + histogram over HTTP-JSON vs HTTP-protobuf vs gRPC flattens to equivalent `inspect` rows, the proto/gRPC fixture built natively to avoid the lossy `with-serde` source); and a loud-rejection guard that a malformed metrics JSON body still 400s + is captured nowhere. KNOWN RESIDUAL: the upstream metrics `with-serde` is more lenient than the trace one, so for metrics the JSON dialect gate is effectively structural (malformed JSON / hard field-type mismatches), tolerating some non-default dialect shapes (numeric int64 nanos, string enums) rather than rejecting them loudly — a stricter metrics dialect gate is a follow-up (#769, #772).

### Added

- **@overeng/megarepo**: `mr store gc` now reclaims **cold** named-branch (`refs/heads/*`) worktrees, the dominant store accumulation that default gc previously protected unconditionally. A named worktree is reclaimed only after passing layered, short-circuiting gates: (1) a hard cross-megarepo live-set veto (`collectStoreLiveSet`, store-wide — a `repos/` symlink alone never protects; only a recorded `livePaths` entry does); (2) staleness — the branch's GitHub PR is merged or closed (an open PR, no PR, or any `gh`/resolver failure ⇒ keep); (3) a lossless floor — every local commit is reachable on a remote (`git rev-list <head> --not --remotes` empty after `fetch --prune`), no stash, no unpushed commits, and a per-repo fetch failure keeps all of that repo's worktrees; (4) three grace timers tracked against a persisted observation ledger — _absence grace_ (default 14d) for every cold worktree, plus a _post-merge grace_ (default 7d after `mergedAt`) for merged branches (closed-unmerged has no post-close grace). Capture is two-phase: a qualifying worktree is `git worktree move`d to `<repo>/.archive/<branch>--<ISO8601>/` and its local ref freed (so `mr apply` re-materializes it), then **reaped** (hard-deleted) once it ages past the _archive retention TTL_ (default 30d); gc also reaps pre-existing `.archive/` worktrees it formerly ignored. Timers are overridable via `$STORE/.state/gc-config.json`. Before any deletion gc reconciles ALL registered workspaces (not just the current one), and `mr apply`/`sync`/`pull`/`pin` now refresh the liveness record — closing a bug where a repinned-but-unre-registered workspace's _live_ worktree could be deleted; reconcile-all fails safe (a present-but-unreadable workspace keeps its last-known live paths, and grace is not advanced for an unclean reconcile). Archive and reap both re-check the live-set veto under the worktree lock immediately before acting. Provably lossless and conservative: absence of evidence never licenses deletion. `mr store gc --json` results gain `archived`/`reaped`/`kept` statuses with a stable `reason` tag and (for `archived`) a `recoverPath`. `--all` is unchanged (nuclear, live-set-bypassing). Design in `docs/decisions/0001`–`0011`, terms in `docs/glossary.md`, spec in `docs/spec.md` (#771).
- **@overeng/utils-dev/otelite**: Add a vitest ↔ otelite capture bridge that wires an in-process `Otelite.capture` receiver to a vitest test's OTLP trace exporter, so spans emitted IN-PROCESS through the normal `@effect/opentelemetry` `OtlpTracer` layer land in a capture the test can assert over. `makeOteliteCaptureLayer(options?)` is a scoped `Layer` that boots ONE receiver, exposes its `CaptureHandle` via the new `OteliteCapture` `Context.Tag`, AND installs the trace exporter pointed at `${handle.endpoints.http}/v1/traces`; used with `@effect/vitest`'s `layer(...)` it gives a PER-FILE lifecycle (one receiver per test file, shared across that file's tests; tests disambiguate by a unique `service.name` / span name) — the cheap default per decision 0015, with per-test available by giving each `layer(...)` its own instance. A test does `const cap = yield* OteliteCapture; …; yield* cap.inspect({ signal: 'traces', name })`. `flushCaptureSpans()` force-flushes the exporter (the emitter's job) before inspecting. Silent-failure guard: a misrouted exporter (the `/v1/traces` suffix bug, see Fixed) lands nothing, so the demonstrator's non-zero `inspect`/`span_count` assertions FAIL the test rather than pass vacuously. Real-binary tests emit a span in-process through the REAL exporter and assert it round-trips, plus a regression that a bare un-suffixed URL captures nothing (#769, #772).
- **@overeng/notion-effect-client**: Add a real-consumer span-assertion demonstrator (D3, decision 0015) co-located in this client's own test suite (`src/test/otelite-span-shape.test.ts`). It drives the REAL instrumented query path (`NotionDatabases.query` → `executeRequest` → `Effect.withSpan('NotionHttp.POST')`) against a STUB upstream — a `HttpClient.make(...)` answering the one `POST /data_sources/{id}/query` endpoint with a canned empty paginated list + `x-ratelimit-remaining`/`x-request-id` headers — under the `@overeng/utils-dev/otelite` capture bridge, with NO secrets and NO network. It asserts the emitted span shape: exactly one `NotionHttp.POST` span carrying the templated `notion.http.route` = `/data_sources/{data_source_id}/query` + `notion.http.method`/`operation`/`status_code` (200) + `notion.rate_limit.remaining` (42); exactly one auto `http.client POST` child from `@effect/platform` whose `url.path` proves the stub served the request; a non-zero `span_count` (silent-export guard); and a public-repo leak guard that NO captured span attribute carries an `authorization` header or the token value (`@effect/platform` records only a header subset and excludes Authorization). The churn-coupled `notion.http.*` assertions sit next to the instrumentation that churns; the bridge stays a lean shared helper. The shadowing gotcha (the bridge re-exports the exporter's `FetchHttpClient` as `HttpClient.HttpClient`) is resolved by providing the stub to the effect-under-test directly (`Effect.provide`, innermost-wins) so the consumer sees the stub. Runs the real nix-built `otelite` binary on `PATH` (#769, #772).
- **@overeng/utils-dev/otelite**: Add a scoped in-process `Otelite.capture` primitive for harnesses that own the system-under-test lifecycle themselves (vs `run`, which spawns the child). `capture(options?)` spawns `otelite capture` with piped stdin/stdout and yields a scoped `CaptureHandle` (`endpoints`, `outDir`, `inspect`, `summary`). It learns the ephemeral receiver endpoints by decoding the FIRST stdout line against a new `EndpointsEvent`/`otelite.endpoints/v1` `Schema` — dispatching on the `schema` tag, never string-scraping. Closing the scope stops the receiver by closing the child's stdin (EOF), drains in-flight exports, and resolves `handle.summary` (the final `otelite.summary/v1` line); teardown is interrupt-safe so an interrupted scope leaves no orphaned child. The handle's `inspect` pins `src` to the out-dir and does a small bounded short-poll retry on a transient 0-row read (the receiver writes each export straight to the file with `write_all` before acking, so a captured span is durable immediately — but an independent reader can briefly observe 0 rows from pure scheduler/fs-visibility latency). Real-binary tests raw-POST a known OTLP/JSON span and assert the typed `SpanRow` round-trips through `inspect` and `summary.counts.spans` (#769, #772).
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# GC reclaims cold named-branch worktrees by deletion, not artifact-pruning

## Status

accepted (supersedes the original scope of issue #771)

## Context

`mr store gc` default mode protects every `refs/heads/*` and `refs/tags/*`
worktree unconditionally (`classifyStoreWorktreePolicy` → `named_branch_ref`).
Only detached `refs/commits/*` worktrees outside the live set are collectable.
A real-store survey (2026-06-10) found 323 named-branch worktrees across the
store (122 in effect-utils alone), most cold, so default GC structurally cannot
reclaim the dominant accumulation.

Issue #771 originally proposed the conservative path: keep every worktree, delete
only its regenerable artifacts in place (`--prune-artifacts`).

## Decision

Target **full deletion of cold named-branch worktrees** instead. Refine the
staleness classification so default GC can safely delete a cold named-branch
worktree (reclaiming source, `.git`, and artifacts together). The
artifact-prune-in-place mode from #771 is **deferred**, not pursued in this work.

## Consequences

- The hard problem moves from "which artifacts are regenerable" to "which
worktrees hold no irreplaceable state" — a safety-classification problem.
- A false-positive deletion can lose un-pushed/uncommitted work, so the safety
gate must be conservative (see later decisions on the deletion invariant).
- Worktrees we want to keep but that carry fat artifacts are NOT addressed here;
artifact-pruning remains available as future work under #771.
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# Cross-megarepo membership vetoes stale-worktree deletion

## Status

accepted (safety invariant)

## Context

The store is shared by independent megarepo workspaces. A worktree that looks
stale in isolation (merged PR, old, clean) may still be an active member of a
_different_ megarepo. Deleting it would break that workspace.

Protection today rides on the store liveness registry (`.state/workspaces/
<hash>.json`): each workspace records its consumed store paths (`livePaths`,
derived from `repos/` symlinks + lock). `collectStoreLiveSet` unions all
registered records. Verified live: in `default` mode a detached commit worktree
consumed by workspace B is skipped when B is registered.

Two structural limits (verified / being verified end-to-end):

1. The registry is a per-workspace **cache**, refreshed only when that workspace
runs an `mr` command. A workspace that exists but has never run `mr` (or whose
record is stale) contributes nothing to the live set — its members are
unprotected.
2. The two existing GC modes can't express the needed gate: `default` blanket-
protects every named branch (so liveness is moot for them); `--all` ignores
the live set entirely (protects nothing). Neither honors "delete a named
branch _only if_ no workspace consumes it."

## Decision

Cross-megarepo membership is a **hard veto** on deletion: a worktree referenced
by ANY workspace's live set is never deleted, even if it independently satisfies
the lossless+staleness gate. The new stale-deletion policy is a THIRD mode
(distinct from `default` and `--all`) that consults the live set for named
branches too.

The registry-completeness gap (limit 1) is itself a safety problem and must be
closed or bounded before stale named-branch deletion is enabled (see the
freshness/heartbeat decision).

## Verified (end-to-end, isolated store — tmp/gc-exp/xmatrix-findings.md)

Real `mr` binary, isolated store, gc run from a workspace that does NOT consume
the target detached-commit worktree C:

- Registered consumer ⇒ C `skipped_in_use` ("referenced by workspace root set").
Protection unions livePaths of ALL registered workspaces. Works.
- Unregistered / deleted-record consumer ⇒ C `removed` (real gc physically
deleted it). A `repos/` symlink ALONE gives zero protection — gc never
live-scans other workspaces' symlinks.
- Only `mr status` / `mr store status` refresh a record; `mr store gc` (even
dry-run), `ls`, `check`, `root` do NOT. Records go stale easily.
- **Latent pre-existing bug:** after a workspace repins to a new target without
re-registering, gc over-protects the abandoned worktree AND _under-protects
the new in-use target_ (deletes a worktree a live workspace is actually using).
This already exists for commit worktrees today, independent of this feature.

## Consequences

- The live-set gate must precede the lossless/staleness checks and use the
store-wide registry (`collectStoreLiveSet`), not just the current workspace.
- Stale deletion cannot reuse `--all` semantics.
- A consumer that never registers is the dominant residual risk; mitigations
(more commands refresh the record; freshness gate; conservative default)
are required, not optional.
- The repin-without-reregister under-protection (verified) must be closed: more
commands must refresh, and/or gc must reconcile registered workspaces before
deleting.
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Trust the liveness registry, bounded by safety margins

## Status

accepted

## Context

Cross-megarepo protection rides on the per-workspace liveness registry, which is
a cache (only fresh for workspaces that have run `mr`). Building an authoritative
global workspace index was considered and rejected as heavy new infrastructure
with the same chicken-and-egg for never-seen workspaces.

A key mitigating fact: the lossless floor (fully pushed + no uncommitted source)
already prevents _data loss_ in the cross-megarepo case — a wrongly deleted
member that passed the floor is re-materializable via `mr apply`. The veto is
therefore mostly about _availability_ (don't disrupt an active consumer) plus one
real edge: a squash-merged branch deleted from its remote may have an
unreachable commit, so re-fetch can fail.

## Decision

Trust the registry as the cross-megarepo signal, bounded by margins rather than
replaced by new infrastructure:

- Refresh the current workspace's registry record on more `mr` commands (cheap)
so records stay fresh in normal use.
- Gate stale named-branch deletion on registry freshness (a TTL / heartbeat) and
refuse-when-uncertain (fall back to keeping the worktree).
- Require a worktree be continuously absent from ALL live sets across a grace
window before it is deletable (see staleness/grace-window decision), not just
absent in one snapshot.

## Consequences

- The residual risk is a consumer that has literally never run `mr`; this is
accepted, bounded by the grace window and the re-apply recoverability of
lossless worktrees.
- The deleted-remote-branch edge needs explicit handling in the lossless floor
(prefer "commit reachable on remote", not merely "branch was pushed once").
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Capture-then-delete: safety never depends on classifying dirt

## Status

accepted — capture-before-delete principle stands; the holding location moved
from `$STORE/.state/trash/` to `<repo>/.archive/` (superseded on location by
[0007](0007-archive-is-the-trash.md)).

## Context

A real-store survey proved that classifying uncommitted changes as "generated"
vs "source" by path is unreliable in both directions (`src/build/app.ts` matched
a `build/` pattern but is hand-written; `*.d.ts.map` / `*.genie.js` are generated
but matched no pattern). `mr` is generic and cannot reliably know a repo's
generated-file set. Yet nearly every cold worktree carries ~10 dirty files of
regenerated drift, so "any dirt blocks deletion" reclaims almost nothing.

## Decision

Deletion safety must NOT depend on the gen/source classifier. Before deleting a
cold worktree that has any uncommitted change, capture the uncommitted state into
a recoverable store-side trash with a retention TTL (e.g. move the worktree under
`$STORE/.state/trash/<repo>/<ref>-<ts>/`, or persist a diff patch + untracked
tarball). Only then remove it. Clean worktrees (nothing to lose, and committed
work already durable per the lossless floor) may be hard-deleted directly.

"Generated vs source" is demoted to a UX-only filter: known-regenerable drift
(lockfiles, declared genie outputs) need not be stashed and need not be reported
as risk — but mis-classifying it never causes data loss.

## Consequences

- Provably lossless regardless of classifier accuracy.
- Trash consumes disk until its TTL expires, partially deferring reclaim for
dirty worktrees; the dominant win (clean, merged worktrees → hard delete) is
unaffected. Trash is itself GC'd by age.
- Recovery story: a wrongly-deleted dirty worktree is restorable from trash
within the TTL.
Loading