feat: fabric session protocol with firmware transfer support #40

Draft
cpunt wants to merge 20 commits into fabric-protocol from fabric-update

cpunt commented Apr 13, 2026

Summary

  • Adds the MCU updater runtime: retained state/self/* facts, prepare/commit RPCs, staged descriptor handling, and updater/main transfer staging.
  • Routes Fabric update calls through the documented cmd/self/updater/{prepare,commit} and xfer_* flow, with legacy HAL/config routes removed from the update surface.
  • Streams incoming firmware data to the A/B update path on TinyGo and keeps host builds/test fakes in memory.

Why

This branch makes MCU updates work over the Fabric contract from docs/updating.md without adding extra wire routes beyond the Lua-side contract.

Testing

  • go test ./...

cpunt changed the base branch from main to fabric-protocol April 13, 2026 11:21
cpunt marked this pull request as draft April 13, 2026 11:23
cpunt changed the base branch from fabric-protocol to main April 28, 2026 20:21
cpunt changed the base branch from main to fabric-protocol April 28, 2026 20:21
cpunt added 20 commits May 19, 2026 09:06
Implements the W1/W4/W5/W6 backbone of the fabric-update plan:

W1 — Verifier interface + production stub
- services/updater/verifier.go defines the Verifier interface with
  Verify(io.Reader, SlotSink) (Manifest, error). Production wiring
  passes StubVerifier(), which always rejects with
  ErrUnsignedNotSupported. Test fakes (fakeVerifierAccept,
  fakeVerifierReject) exercise both branches without touching the
  production stub.
- SlotSink keeps the verifier-to-staging seam tiny so fabric-security
  can swap in an abupdate-backed sink without API churn.
- Manifest is the small subset of the signed-image manifest the
  receiver propagates onward (version, build_id, image_id, sha256,
  length). Full canonical manifest stays in pico2-a-b/imagev1 which
  fabric-security adds.

W4 — Updater state machine + RPC handlers
- 7 canonical states + the empty "" (= nil) quiescent state, with
  Allowed() guard for tests that try invalid transitions.
- handlePrepare clears last_error and transitions to ready, returning
  {ok=true, accepted=true}. Refuses with err="busy" when another
  prepare is in flight or state==applying.
- handleCommit transitions to applying when a staged descriptor or
  in-state staged transition is present; refuses with
  err="nothing_staged" otherwise. Reboot/slot-switch is left to the
  abupdate metadata writer that lands in W11/fabric-security.
- Local-bus topics rpc/updater/{prepare,commit} bind here; the wire
  topics cmd/self/updater/{prepare,commit} are routed in via fabric's
  importCallRules (services/fabric/remap.go).

W5 — Identity facts (state/self/{software, updater, health})
- SoftwareFact = {version, build, image_id, boot_id, payload_sha256}.
- UpdaterFact = {state, last_error, pending_version}, flat shape.
- HealthFact = {state, reason?} with omitempty on reason.
- Service emits the three facts on Run() startup and re-emits via
  Republish() — the latter is the hook fabric will call on every
  hello_ack edge (W10 will wire that in).

W6 — boot_id RAM-only generator
- 8 bytes from crypto/rand, hex-encoded to 16 lower-case chars.
- Generated once at boot via main.go's call site between HAL ready
  and reactor.Run — i.e. before fabric opens, so the first
  software-fact publish carries a populated boot_id.
- Held in RAM only via atomic.Pointer cache. Idempotent — repeat
  calls return the cached value.
- Fallback: process-startup counter packed into 16 hex chars when
  crypto/rand fails or returns all-zero. Logged so the failure-mode
  hardware test suite (master R3) can grep for it.
- Per-test reset via resetBootIDForTest() exercised by 4 unit tests
  covering len-16-hex, idempotence, uniqueness across "boots", and
  all-zero-sentinel rejection.
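
A minimal sketch of the happy path described above, assuming Go 1.19+ for `atomic.Pointer`. The `(string, bool)` return shape is an assumption, and the counter-based fallback is deliberately left out: the sketch only reports `ok=false` where the real code would take the fallback.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"sync/atomic"
)

// bootID caches the generated value in RAM only.
var bootID atomic.Pointer[string]

// GenerateBootID returns the cached boot_id, generating it on first use.
// ok is false when crypto/rand fails or yields all-zero bytes; the real
// implementation falls back to another source at that point (not shown).
func GenerateBootID() (id string, ok bool) {
	if p := bootID.Load(); p != nil {
		return *p, true
	}
	var raw [8]byte
	if _, err := rand.Read(raw[:]); err != nil || raw == ([8]byte{}) {
		return "", false
	}
	s := hex.EncodeToString(raw[:]) // 8 bytes -> 16 lower-case hex chars
	if !bootID.CompareAndSwap(nil, &s) {
		s = *bootID.Load() // another caller won the race; reuse its value
	}
	return s, true
}

// resetBootIDForTest clears the cache so a test can simulate a new boot.
func resetBootIDForTest() { bootID.Store(nil) }
```

Repeat calls return the same string until `resetBootIDForTest()` clears the cache.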

Receiver wiring (W2 placeholder)
- services/updater/receiver.go binds to the literal local topic
  raw/member/mcu/cap/updater/main/rpc/receive that CM5 sends as
  meta.receiver. Decodes ReceiverPayload, calls the verifier with a
  bytes.Reader over the artefact and a memorySink. On success
  publishes state=staged with manifest.version as pending_version
  and replies {ok:true, stage:"staged"}. On failure publishes
  state=failed with last_error and replies {ok:false, err:...}.
- Stub verifier produces the failure path on every artefact, so the
  production build cannot stage; fakeVerifierAccept exercises the
  staged-publish path in tests.
- The fabric-side glue that actually CALLS this receiver after
  xfer_commit lands in phase 2 (W2 + W3 stage/apply split).

Glue
- services/fabric/remap.go importCallRules: cmd/self/updater/prepare
  -> rpc/updater/prepare; same for commit.
- services/reactor/reactor.go starts the updater service in its own
  goroutine before the env+power subscriptions, so retained facts
  land before the fabric link opens.
- main.go calls updater.GenerateBootID() between HAL ready and
  reactor.Run, logging the chosen value.

Tests (services/updater/updater_test.go)
- Initial fact publish: software/updater/health all retain on Run
  startup with the configured Identity + 16-char boot_id.
- Prepare happy path: ok=true, state -> ready, last_error cleared.
- Commit without staged: refusal err="nothing_staged".
- Commit with staged descriptor: ok=true, state -> applying,
  pending_version mirrors descriptor.Version.
- Receiver under stub verifier: state -> failed, last_error contains
  the verifier_stub sentinel; reply ok=false.
- Receiver under fakeVerifierAccept: state -> staged,
  pending_version = manifest.Version; reply ok=true stage=staged.
- Receiver under fakeVerifierReject: state -> failed with the
  custom error string.
- jsonDecode tolerates typed payloads, json.RawMessage, []byte, and
  nil — the four shapes the bus actually delivers in practice.
- memorySink Abort clears the buffer; Commit closes further writes.

Build sizes (pico_bb_proto_1):
  release default     : code 184640  (was 280364 — TinyGo dead-code
                                       elimination cleaning up after
                                       the receiver/verifier refactor)
  release flash_unsafe: code 184628
  qa_reactor          : code 136232

Out of scope for this commit (later phases):
- W2/W3 fabric-side stage/apply split + receiver invocation from
  transfer.go after xfer_commit. Receiver is wired but not yet
  invoked by fabric.
- W7/W8 retained telemetry publishers + charger alert FSM.
- W9 lane assignment routing into fabric's 3-lane writer.
- W10 post-hello_ack republish hook.
- W11 abupdate metadata extension (payload_sha256 + staged
  descriptor read/write).
- W12 legacy Go remap removal.
- W13 L4 acceptance prep on devicecode-lua.

Implements W2/W3 of the fabric-update plan: fabric's onTransferCommit
now hands the committed artefact to the local-bus receiver topic
named in meta.receiver, gates xfer_done on the receiver's reply, and
defers slot-switch/reboot to the updater commit RPC.

W2 — Receiver invocation
- transfer.go's onTransferCommit, after sink.Commit() succeeds and
  before sending xfer_done, decodes meta.receiver from the xfer_begin
  meta blob and calls that local-bus topic with
  {link_id, xfer_id, size, checksum, meta, artefact}.
- The reply gates the next frame: ok=true -> xfer_done; ok=false ->
  xfer_abort with err = receiver's err string. Synchronous call with
  receiverCallTimeout (5s); session goroutine blocks intentionally
  during the call because we MUST decide xfer_done vs xfer_abort
  before processing further frames.
- decodeReceiverReply tolerates two reply shapes — typed Go struct
  (in-process tests) and JSON bus payload (real receiver via
  services/updater.ReceiverReply).

W3 — Stage/apply split
- transferSink interface grows Bytes() []byte. nil means the sink
  streamed elsewhere (e.g. flash_unsafe writes directly to flash and
  doesn't keep a RAM copy). Buffer sink returns its committed slice;
  flash_unsafe sink and fakeTransferSink return nil.
- onTransferCommit calls the receiver only when (a) Bytes() returns
  non-nil AND (b) meta.receiver is present. Otherwise it falls
  through to the legacy xfer_done -> sink.Apply() flow — keeping
  flash_unsafe builds and tests that don't model the receiver
  unaffected.
- The receiver path NO LONGER calls sink.Apply(). Slot-switch and
  reboot are deferred to cmd/self/updater/commit, which the updater
  state machine in services/updater handles. fabric-security wires
  the actual reboot through abupdate when it lands.

New default sink (transfer_sink_buffer.go)
- Replaces the rejecting transfer_sink_rp2350.go default and the
  rejecting transfer_sink_stub.go host build with an in-RAM buffer
  capped at 64 KB. Hitting the cap returns ErrArtefactTooLarge from
  WriteChunk, which propagates to xfer_abort.
- Bytes() returns the committed slice for the receiver. Apply() is
  a no-op (slot-switch is the updater RPC's job).
- The 64 KB cap is conservative — large-firmware streaming-into-flash
  for unverified artefacts is fabric-security's job; this branch only
  needs enough to round-trip the smoke-test traffic and the receiver
  path with a stub verifier.
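
A capped in-RAM sink along these lines would satisfy the description above. `WriteChunk`, `Commit`, `Bytes`, `Apply`, `ErrArtefactTooLarge` and the 64 KB cap come from the commit message; the exact signatures, the error text, and the "write after commit" error are assumptions.

```go
package main

import "errors"

// maxArtefactBytes is the 64 KB cap described above.
const maxArtefactBytes = 64 * 1024

// ErrArtefactTooLarge is returned when a chunk would exceed the cap.
var ErrArtefactTooLarge = errors.New("artefact_too_large")

// bufferSink accumulates transfer chunks in RAM.
type bufferSink struct {
	buf       []byte
	committed bool
}

// WriteChunk appends p unless the sink is committed or the cap would be exceeded.
func (s *bufferSink) WriteChunk(p []byte) error {
	if s.committed {
		return errors.New("sink already committed") // assumed behaviour
	}
	if len(s.buf)+len(p) > maxArtefactBytes {
		return ErrArtefactTooLarge
	}
	s.buf = append(s.buf, p...)
	return nil
}

// Commit freezes the buffer so Bytes can hand it to the receiver.
func (s *bufferSink) Commit() error {
	s.committed = true
	return nil
}

// Bytes returns the committed slice, or nil before Commit.
func (s *bufferSink) Bytes() []byte {
	if !s.committed {
		return nil
	}
	return s.buf
}

// Apply is a no-op: slot-switch belongs to the updater commit RPC.
func (s *bufferSink) Apply() error { return nil }
```

A write that would push the total past 64 KB leaves the buffer untouched, so the caller can abort cleanly.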

Tests
- TestTransferReceiverInvokedAfterCommit: registers a local-bus
  receiver handler, drives a 4-byte transfer with
  meta.receiver=["test","receiver"], asserts the receiver got the
  expected payload (link_id, xfer_id, checksum, artefact bytes) and
  fabric sent xfer_done after the receiver replied ok=true.
- TestTransferReceiverRejectAbortsTransfer: receiver replies
  {ok=false, err="manifest_check_failed"}; fabric sends xfer_abort
  with err mirroring the receiver reason.
- bufferingSinkAdapter wraps the production bufferSink for these
  tests so the receiver path is exercised against the real sink, not
  fakeTransferSink (whose nil Bytes() is the legacy fallback path).
- All existing transfer tests (legacy fallback path with
  fakeTransferSink) continue to pass — the change is additive.

Build sizes (pico_bb_proto_1):
  release default     : 184616  (effectively unchanged)
  release flash_unsafe: 184628

Out of scope for this commit (still later):
- W7 retained telemetry publishers + W8 charger alert FSM.
- W9 lane assignment routing into the 3-lane writer.
- W10 post-hello_ack republish hook (Republish() exists; needs to be
  invoked from the fabric session lifecycle).
- W11 abupdate metadata extension (payload_sha256 + staged
  descriptor read/write).
- W12 legacy Go remap removal.
- W13 L4 acceptance prep on devicecode-lua.

The updater service now watches the fabric link-state retain at
state/fabric/link/+ and calls Republish() on every !Ready -> Ready
transition. Mirrors the spec in
docs/firmware-alignment-update.md:

  Retained `state/self/*` is republished:
    - immediately after every successful boot
    - on every newly established session (`hello_ack`), warm or cold

Implementation:
- Run() grows a fourth subscription on TopicFabricLink
  (state/fabric/link/+), tracking per-link Ready flag in a small map
  to detect edges and avoid double-firing on subsequent Ready=true
  retains that keep the same edge state.
- decodeLinkState pulls the link_id from the topic tail and Ready
  from the payload (typed map or JSON-marshalable struct — covers
  both the in-process linkStatePayload published by
  services/fabric/session.go and any future test-side fakes).
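
The per-link edge detection described above reduces to a small map of last-seen Ready flags. The type and method names below are hypothetical; only the "!Ready -> Ready fires once" behaviour is taken from the commit message.

```go
package main

// linkEdgeTracker remembers the last Ready flag seen for each link_id.
type linkEdgeTracker struct {
	ready map[string]bool
}

func newLinkEdgeTracker() *linkEdgeTracker {
	return &linkEdgeTracker{ready: map[string]bool{}}
}

// observe records the new flag and reports true only on a
// !Ready -> Ready transition for that link. Unknown links start as !Ready.
func (t *linkEdgeTracker) observe(linkID string, ready bool) bool {
	prev := t.ready[linkID]
	t.ready[linkID] = ready
	return ready && !prev
}
```

A caller would invoke `Republish()` whenever `observe` returns true; repeated Ready=true retains return false and therefore do not double-fire.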

Test (TestRepublishOnLinkReadyEdge):
- Drains the initial fact retains, publishes a Ready=false link
  state (no republish expected), then a Ready=true edge (republish
  asserted via the next state/self/software emit), then a second
  Ready=true retain (no double-fire — channel must stay quiet for
  the settle window).

New services/telemetry package subscribes to the existing HAL value
topics (hal/cap/env/..., hal/cap/power/...) and republishes them
under the canonical state/self/* surface using integer engineering
units. Also runs the W8 charger alert FSM emitting
event/self/power/charger/alert with 14 canonical kinds.

W7 — retained-state publishers
- state/self/power/battery: pack_mV, per_cell_mV, ibat_mA, temp_mC,
  bsr_uohm_per_cell. Sourced from types.BatteryValue.
- state/self/power/charger: vin_mV, vsys_mV, iin_mA, raw bitfields
  (state_bits/status_bits/system_bits) AND 3 decoded boolean maps
  (state{}, status{}, system{}) carrying the canonical names from
  the existing types.ChargerStateTable / ChargeStatusTable /
  SystemStatusTable. The 27-boolean spec (11 + 4 + 12) follows from
  the size of those tables.
- state/self/environment/temperature: deci_c.
- state/self/environment/humidity: rh_x100.
- state/self/runtime/memory: alloc_bytes from runtime.MemStats. Tick
  every 3 s — matches the existing reactor mem-stat cadence.
- state/self/power/charger/config: deferred until W7 stretch (LTC4015
  effective config tracking is a separate read path through the
  charger driver). When it lands, it'll feed the analog-threshold
  branch of the alert FSM.

Every fact carries a monotonic per-topic seq counter and a
service-monotonic uptime_ms — recommended in the spec for
"where cheap" cases, which all of these are.

W8 — charger alert FSM (services/telemetry/alerts.go)
- 14 canonical AlertKind values, frozen by the spec.
  AllAlertKinds is a guarded slice; the unit test compares it
  byte-for-byte against the spec's snake_case list to catch
  accidental rename.
- chargerAlertFSM holds the previous ChargerValue and on each new
  observation walks the bit-set transitions:
    state bits  -> bat_missing, bat_short, max_charge_time_fault,
                   absorb, equalize, cccv, precharge
    status bits -> iin_limited, uvcl_active, cc_phase, cv_phase
  emitting one AlertEvent per !set->set edge, sparse (retained=false).
- Analog-threshold kinds (vin_lo, vin_hi, bsr_high) are stubbed
  pending the charger-config publisher; the FSM scaffolding is
  there so dropping the threshold compare in is mechanical.
- AlertEvent payload carries kind, source="ltc4015", a snapshot of
  state_bits/status_bits/system_bits at emit time, plus seq +
  uptime_ms.
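
The "one alert per !set->set edge" rule for the bit-driven kinds can be sketched with a single bit operation. The function names, the 16-bit register width, and the bit-position table passed in are hypothetical; real positions come from the LTC4015 register layout, which is not reproduced here.

```go
package main

import "math/bits"

// risingEdges returns the bits that are set in cur but were clear in prev.
func risingEdges(prev, cur uint16) uint16 { return cur &^ prev }

// kindsForEdges maps each rising bit to an alert kind using names,
// a bit-position table supplied by the caller. Bits without an entry
// are skipped. Results are ordered from the lowest bit upwards.
func kindsForEdges(edges uint16, names map[uint]string) []string {
	var out []string
	for edges != 0 {
		bit := uint(bits.TrailingZeros16(edges))
		if name, ok := names[bit]; ok {
			out = append(out, name)
		}
		edges &= edges - 1 // clear the lowest set bit
	}
	return out
}
```

Feeding the same value twice yields no edges, which is what keeps subsequent retains with the bit still set from re-firing.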

Glue
- services/fabric/remap.go grows export rules for state/self/* and
  event/self/* (no remapping — local topic = wire topic). Legacy
  hal/* exports stay until W12 cleanup so the transition doesn't
  break any consumers that haven't moved to the new surface yet.
- services/reactor/reactor.go starts the telemetry service in its
  own goroutine after the updater service.

Tests (services/telemetry/telemetry_test.go)
- TestPublishesBatteryFact: BatteryValue from HAL -> BatteryFact at
  state/self/power/battery with the right field mapping + seq=1 +
  non-negative uptime.
- TestPublishesChargerWithDecodedBooleans: ChargerValue with mixed
  set bits -> raw bits preserved AND decoded booleans agree
  (set bits true, unset bits false).
- TestPublishesEnvironmentFacts: temp + humidity round-trip.
- TestAllAlertKindsCount: exactly 14 entries; canonical strings
  match the spec byte-for-byte.
- TestChargerAlertFSMEdgeOnly: bit !set -> set fires once;
  subsequent retains keeping the bit set do NOT re-fire; clear
  -> set re-fires.
- TestChargerAlertFSMMultipleBitsTransitionTogether: two state
  bits flipping in the same publish emit two alerts.

Build sizes (pico_bb_proto_1):
  release default     : 185180  (+540 over phase 3)
  release flash_unsafe: 185176
  qa_reactor          : 136232

Adds the writer half of the abupdate metadata block surface so the
receiver actually persists a verified manifest's fields after the
verifier succeeds. The fabric-update branch ships an in-memory
implementation; fabric-security swaps in a flash-backed
implementation that survives reboots.

W11 — MetadataWriter
- New MetadataWriter interface with WriteStagedDescriptor and
  ClearStagedDescriptor. Symmetric with the existing MetadataReader
  so a single MemoryMetadata value can implement both.
- NewMemoryMetadata() returns a process-lifetime in-memory
  reader+writer. Used as the default in Options when neither
  Metadata nor MetadataWrite is supplied — gets the receiver
  end-to-end working without external wiring.
- noopMetadataWriter is the fallback when a caller supplied a
  MetadataReader (legacy) but no writer; it logs each call so the
  gap is visible during bring-up.

Receiver
- handleReceiver now calls metadataWrite.WriteStagedDescriptor with
  the manifest's {Version, BuildID, ImageID, PayloadLength,
  PayloadSHA256} after the sink commits. On write failure
  transitions to failed with last_error="metadata_write_failed:..."
  so the failure mode is observable on the wire.
- Republishes the software fact after a successful staging so
  payload_sha256 reaches the CM5 immediately rather than waiting
  for the next session-restart Republish edge.

Test
- TestReceiverFakeAcceptWritesStagedDescriptor: drives the receiver
  through fakeVerifierAccept, asserts the MemoryMetadata reader sees
  the descriptor + payload_sha256, then exercises the commit RPC
  which now succeeds (was returning nothing_staged before W11)
  because the reader has the descriptor — pending_version flows
  manifest.Version -> staged -> applying.

Build sizes (pico_bb_proto_1):
  release default: 185176 (tiny shrink from refactor)

Five fixes flagged in the post-phase-5 review.

H1 — commit RPC must not lie about apply success
- New Applier interface for the slot-switch + reboot hook. Default
  RefusingApplier returns ErrApplyUnavailable so the commit RPC
  refuses with `error: "apply_unavailable"` rather than transitioning
  to applying and silently never rebooting on a branch where the
  apply path doesn't exist. fabric-security supplies a real
  abupdate-backed implementation; tests pass fakeApplier when they
  need to drive the success path.
- handleCommit calls applier.Apply(desc) and translates failures into
  the wire reply directly. Only on Applier success do we transition
  to applying.

H2 — stale staged-descriptor leak
- handlePrepare clears the persisted staged descriptor before
  transitioning to ready. This guards (stage A) -> (prepare B) ->
  (stage B fails) from leaving descriptor A persisted and
  committable.
- handleCommit now requires BOTH a persisted descriptor AND
  state==staged before allowing apply. Either alone refuses with
  nothing_staged.
- handleReceiver clears the staged descriptor on every failure path
  (verifier reject, sink commit fail, metadata write fail) so a
  rejected new transfer can never accidentally commit a stale
  prior stage.

H3 — canonical charger boolean key names
- ChargerFact's state{}/status{}/system{} maps now emit the
  spec-canonical names from
  docs/firmware-alignment-update.md §"Telemetry/state facts" rather
  than the existing types.* display-table abbreviations. Wire keys
  are spec-frozen because the Lua import side keys off them:
    state{}: equalize_charge, absorb_charge, charger_suspended,
             precharge, cccv_charge, ntc_pause, timer_term,
             c_over_x_term, max_charge_time_fault,
             bat_missing_fault, bat_short_fault
    status{}: vin_uvcl_active, iin_limit_active, const_current,
              const_voltage
    system{}: charger_enabled, mppt_en_pin, equalize_req, drvcc_good,
              cell_count_error, ok_to_charge, no_rt, thermal_shutdown,
              vin_ovlo, vin_gt_vbat, intvcc_gt_4p3v, intvcc_gt_2p8v
- Three internal name tables (chargerStateNames,
  chargerStatusNames, chargerSystemNames) replace the indirect
  use of types.ChargerStateTable[]. The 27-boolean spec total
  (11+4+12) is checked at test time.
- TestPublishesChargerWithDecodedBooleans updated to the canonical
  names plus exact-size assertions on each map.

M1 — boot_id fallback now varies per cold boot
- The previous fallback returned the process-startup counter packed
  into 16 hex chars, so every cold boot got the same value when
  crypto/rand was unavailable. Master plan R3 explicitly forbids
  that.
- Fallback now mixes monotonic clock at generation time, MemStats
  fields (Alloc / Mallocs / HeapInuse / Frees — vary with HAL init
  timing), the per-call counter, and a final 11/17/5-shift mix step
  so each output byte depends on every input bit. Still not
  cryptographically random, but gives reliable per-boot variation
  even when crypto/rand is broken. Logged on entry so the
  failure-mode test suite can grep for it.

M2 — alert severity field
- AlertEvent gains a non-omitempty `severity` string. New
  alertSeverity(kind) returns "warning" for fault-class kinds
  (bat_missing, bat_short, max_charge_time_fault, bsr_high) and
  "info" for charge-phase / control-loop transitions. This addresses
  the Codex note that the planned alert payload includes severity
  but the publisher was emitting it as omitempty (so it fell off
  the wire when default).
- charger_config publisher and the 3 analog kinds (vin_lo, vin_hi,
  bsr_high) remain stubbed pending the LTC4015 effective-config
  read path; the FSM plumbing is in place so dropping thresholds
  in is mechanical when that lands.
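
The severity mapping and the non-omitempty tag can be sketched as below. `AlertEvent` is trimmed to two fields for brevity (the real payload also carries source, bit snapshots, seq and uptime_ms). The commit message does not state the severity of vin_lo and vin_hi, so treating them as "info" via the default branch is an assumption, as are the helper names `newAlertEvent` and `encodeAlert`.

```go
package main

import "encoding/json"

// AlertKind is one of the 14 canonical alert names.
type AlertKind string

// alertSeverity returns "warning" for the fault-class kinds named above
// and "info" for everything else (default branch is an assumption).
func alertSeverity(kind AlertKind) string {
	switch kind {
	case "bat_missing", "bat_short", "max_charge_time_fault", "bsr_high":
		return "warning"
	default:
		return "info"
	}
}

// AlertEvent is a trimmed payload. Severity has no omitempty, so the
// key always reaches the wire, even when the value is empty.
type AlertEvent struct {
	Kind     AlertKind `json:"kind"`
	Severity string    `json:"severity"`
}

// newAlertEvent fills in the severity for kind.
func newAlertEvent(kind AlertKind) AlertEvent {
	return AlertEvent{Kind: kind, Severity: alertSeverity(kind)}
}

// encodeAlert marshals the event as it would appear on the wire.
func encodeAlert(ev AlertEvent) ([]byte, error) { return json.Marshal(ev) }
```

Marshalling an event whose `Severity` is left empty still produces a `"severity"` key, which is the behaviour the M2 fix is after.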

New tests
- TestCommitWithoutStagedStateRefusesEvenWithDescriptor: descriptor
  in metadata + state != staged refuses with nothing_staged.
- TestCommitWithoutApplierReturnsApplyUnavailable: production
  default refuses; state stays at staged.
- TestCommitWithFakeApplierTransitionsToApplying: fakeApplier
  exercises the future fabric-security wiring.
- TestReceiverFailureClearsStaleStagedDescriptor: pre-staged
  descriptor + verifier reject leaves no committable descriptor.
- TestPrepareClearsStaleStagedDescriptor: prepare clears stale
  descriptor.

Build sizes (pico_bb_proto_1, code):
  release default     : 185176 (unchanged)
  release flash_unsafe: 185168 (unchanged)
  release debug_uart  : 185176 (unchanged)
  qa_reactor          : 136232 (unchanged)

Three follow-up fixes from the second Codex review pass.

H1 — commit RPC must publish state=applying + reply ok BEFORE reboot
- Applier interface split into CanApply (validation, returns error)
  and ArmReboot (schedules reboot, may not return). The previous
  single-method Apply() let real implementations reboot inside the
  call before handleCommit had a chance to publish the canonical
  applying retain or send the ok reply.
- handleCommit reordered:
    desc, present := metadata.StagedDescriptor()
    if !present || state != staged       -> refuse nothing_staged
    if applier.CanApply(desc) returns err -> refuse with err.Error()
    transitionTo(applying, "", desc.Version)  // retain published
    reply(ok, accepted)                        // wire reply sent
    applier.ArmReboot(desc)                    // may not return
- RefusingApplier.CanApply returns ErrApplyUnavailable; ArmReboot is
  the contract-required no-op that the commit handler never reaches
  on this branch.
- fakeApplier in tests tracks canCalls + rebootCalls separately so
  tests can verify the ordering. Existing
  TestCommitWithFakeApplierTransitionsToApplying updated to assert
  both hooks fire exactly once and ArmReboot got the right
  descriptor.
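
The reordered commit flow can be expressed as a runnable sketch. `Applier`, `CanApply`, `ArmReboot`, `RefusingApplier`, `ErrApplyUnavailable` and the step order come from the commit message; the `Descriptor` and `commitReply` shapes, the callback parameters standing in for the retained publish and wire reply, and this `fakeApplier` are assumptions.

```go
package main

import "errors"

// Descriptor is a trimmed staged descriptor.
type Descriptor struct{ Version string }

// Applier splits validation from the reboot hook.
type Applier interface {
	CanApply(d Descriptor) error
	ArmReboot(d Descriptor) // may not return on real hardware
}

var ErrApplyUnavailable = errors.New("apply_unavailable")

// RefusingApplier is the default on this branch: it never allows apply.
type RefusingApplier struct{}

func (RefusingApplier) CanApply(Descriptor) error { return ErrApplyUnavailable }
func (RefusingApplier) ArmReboot(Descriptor)      {}

// fakeApplier records call order into a shared log, like a test double.
type fakeApplier struct{ log *[]string }

func (f fakeApplier) CanApply(Descriptor) error { *f.log = append(*f.log, "can_apply"); return nil }
func (f fakeApplier) ArmReboot(Descriptor)      { *f.log = append(*f.log, "arm_reboot") }

// commitReply is an assumed shape for the RPC reply.
type commitReply struct {
	OK, Accepted bool
	Err          string
}

// handleCommit validates, publishes state=applying, replies, then arms the reboot.
func handleCommit(state string, desc *Descriptor, a Applier,
	publish func(state, pending string), reply func(commitReply)) {
	if desc == nil || state != "staged" {
		reply(commitReply{Err: "nothing_staged"})
		return
	}
	if err := a.CanApply(*desc); err != nil {
		reply(commitReply{Err: err.Error()})
		return
	}
	publish("applying", desc.Version)            // retain goes out first
	reply(commitReply{OK: true, Accepted: true}) // then the wire reply
	a.ArmReboot(*desc)                           // last, since it may not return
}
```

Keeping `ArmReboot` as the final statement is what guarantees the peer sees both the applying retain and the ok reply before the device restarts.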

M2 — payload_sha256 separation
- MemoryMetadata previously had one payloadSHA field, mutated by
  WriteStagedDescriptor and read by SoftwareFact. So a stage write
  promoted the staged hash into the running fact, AND a subsequent
  ClearStagedDescriptor (from prepare or receiver failure) left the
  invalidated hash sitting on the wire.
- Split into runningPayloadSHA and (descriptor-embedded)
  staged-payload hash. WriteStagedDescriptor no longer touches the
  running hash; ClearStagedDescriptor only nulls the descriptor.
  PayloadSHA256() reads runningPayloadSHA, which is set once at boot
  via SetRunningPayloadSHA (fabric-security wires this from the
  active slot's flash metadata).
- Receiver no longer republishes the software fact on stage success
  — the staged hash isn't a property of the running image.
- TestReceiverFakeAcceptWritesStagedDescriptor now asserts that
  WriteStagedDescriptor leaves PayloadSHA256() at "" (running hash
  unset), guarding the regression.
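
The separation of the running hash from the staged descriptor can be illustrated with a small in-memory store. Method names follow the commit message; the `Descriptor` field set, the return types, and the mutex are assumptions.

```go
package main

import "sync"

// Descriptor is the staged-image record; the staged hash lives inside it.
type Descriptor struct {
	Version       string
	BuildID       string
	ImageID       string
	PayloadLength uint32
	PayloadSHA256 string
}

// MemoryMetadata is a process-lifetime reader+writer.
type MemoryMetadata struct {
	mu                sync.Mutex
	staged            *Descriptor
	runningPayloadSHA string
}

func NewMemoryMetadata() *MemoryMetadata { return &MemoryMetadata{} }

// WriteStagedDescriptor stores d without touching the running hash.
func (m *MemoryMetadata) WriteStagedDescriptor(d Descriptor) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.staged = &d
	return nil
}

// ClearStagedDescriptor only nulls the descriptor.
func (m *MemoryMetadata) ClearStagedDescriptor() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.staged = nil
	return nil
}

// StagedDescriptor returns the staged record and whether one is present.
func (m *MemoryMetadata) StagedDescriptor() (Descriptor, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.staged == nil {
		return Descriptor{}, false
	}
	return *m.staged, true
}

// SetRunningPayloadSHA is called once at boot for the active image.
func (m *MemoryMetadata) SetRunningPayloadSHA(sha string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.runningPayloadSHA = sha
}

// PayloadSHA256 reports the running image's hash, never the staged one.
func (m *MemoryMetadata) PayloadSHA256() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.runningPayloadSHA
}
```

With this split, staging and clearing a descriptor cannot change what the software fact reports for the running image.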

M3 — boot_id fallback comment accuracy
- Removed the misleading "stack-address mixing" reference from the
  earlier comment (the code never did that). Replaced with an
  accurate description of what the mix actually does (UnixNano +
  MemStats fields + per-call counter + 3-stage shift mix) plus a
  blunt note that the fallback is NOT contract-grade and the right
  long-term fix is HAL RNG or a persisted boot counter in the
  abupdate metadata block. The `[updater] boot_id fallback engaged`
  log line stays as the canonical signal for the failure-mode test
  suite to detect the engagement.

Verified
- go test ./... clean.
- tinygo builds pass for: pico_bb_proto_1, pico_bb_proto_1 flash_unsafe,
  pico_bb_proto_1 debug_uart, qa_reactor pico_bb_proto_1.
- Build sizes unchanged across all four paths (185176 release default).

Remaining (still deferred, unchanged):
- pico-side flash backing for MemoryMetadata (W11 finish on
  pico2-a-b/abupdate).
- state/self/power/charger/config publisher + the 3 analog alert
  kinds (W7/W8 finish, blocked on LTC4015 effective-config read path).
- W12 legacy remap removal.
- W13 lua-side L4 acceptance work.

Codex review pass 3: the in-line comment above the fallback's tick++
still claimed the implementation mixed "the address of a stack-local
variable (varies due to stack layout / ASLR-equivalent on TinyGo)".
The code never actually did that — an earlier draft tried but the
mid-expression `unsafe.Pointer` casts didn't compile, so the line
was dropped without a corresponding doc update.

Replaced with a one-paragraph note that the fallback is best-effort
per-boot jitter (not contract-grade) and points readers to the more
detailed mix description below. The
`[updater] boot_id fallback engaged` log line is left explicit as
the failure-mode test grep target.

No code change.

Closes the remaining W7/W8 gaps Codex flagged in the second review pass.

W7 finish — state/self/power/charger/config publisher
- New ChargerConfigFact carrying { schema=1, source="ltc4015",
  thresholds{vin_lo_mV, vin_hi_mV, bsr_high_uohm_per_cell},
  alert_mask_bits, alert_mask{14 booleans}, seq, uptime_ms }.
  Strict no-operating-state-booleans contract per the spec.
- DefaultChargerConfig() returns conservative bring-up values
  (vin_lo=10500 mV, vin_hi=17000 mV, bsr_high=5000 uohm/cell). The
  LTC4015 driver isn't yet tracking effective programmed config —
  Service.SetChargerConfig swaps the values once that lands.
- Published at service startup AND on every state/fabric/link/+
  Ready-edge (mirrors updater W10) so the CM5 sees a fresh config
  retain on every newly established session, warm or cold.

W8 finish — analog alert kinds
- chargerAlertFSM now consumes BatteryValue too via observeBattery,
  tracking BSR_uOhmPerCell across observations.
- vin_lo: emits when ChargerValue.VIN_mV crosses below
  thresholds.vin_lo_mV (edge from >=threshold to <threshold).
- vin_hi: emits when VIN_mV crosses above thresholds.vin_hi_mV.
- bsr_high: emits when BatteryValue.BSR_uOhmPerCell crosses above
  thresholds.bsr_high_uohm_per_cell. Severity is "warning" per
  alertSeverity().
- All three honour the same edge-only contract as the bit-driven
  kinds: subsequent observations past the threshold do NOT re-emit.
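
The edge-only threshold rule can be sketched as two pure comparisons plus a tiny tracker. The vin_lo comparison (>= threshold to < threshold) follows the commit message; the mirrored vin_hi comparison, the names, and the choice that the first observation only primes the tracker are assumptions.

```go
package main

// crossedBelow is true on a >=threshold to <threshold transition.
func crossedBelow(prev, cur, threshold int32) bool {
	return prev >= threshold && cur < threshold
}

// crossedAbove is true on a <=threshold to >threshold transition (assumed mirror).
func crossedAbove(prev, cur, threshold int32) bool {
	return prev <= threshold && cur > threshold
}

// vinLoTracker emits once per downward crossing of its threshold.
type vinLoTracker struct {
	thresholdMV int32
	prevMV      int32
	primed      bool
}

// observe returns true when this reading crosses below the threshold.
// The first reading only primes the tracker (assumption).
func (t *vinLoTracker) observe(vinMV int32) bool {
	fire := t.primed && crossedBelow(t.prevMV, vinMV, t.thresholdMV)
	t.prevMV, t.primed = vinMV, true
	return fire
}
```

Readings that stay below the threshold compare "below to below" and therefore do not re-emit until the input recovers and crosses again.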

ChargerAlertMask carries the 14 spec-frozen booleans
(vin_lo / vin_hi / bsr_high / bat_missing / bat_short /
max_charge_time_fault / absorb / equalize / cccv / precharge /
iin_limited / uvcl_active / cc_phase / cv_phase). On this branch
the mask is informational only — the FSM emits regardless of the
mask state. Once the LTC4015 driver actually programs the chip's
alert-enable register and reports it back, mask wiring through to
emission becomes a small Service.publishChargerConfig follow-up.

Tests
- TestPublishesChargerConfigAtStartup: config retain lands on Run
  start with schema=1, source=ltc4015, non-zero defaults.
- TestChargerAlertFSMVinLoEdge: prime above threshold, cross below,
  verify vin_lo emit; subsequent observation below threshold
  doesn't duplicate.
- TestChargerAlertFSMVinHiEdge: cross above, verify vin_hi.
- TestChargerAlertFSMBSRHighEdge: BatteryValue.BSR rising past
  threshold emits bsr_high with severity="warning".

Build sizes (pico_bb_proto_1, code):
  release default     : 184664 (-512 from phase 5 despite the new code:
                                dead-code elimination removed the receiver
                                republishSoftware path that M2 dropped)
  release flash_unsafe: 184656
  qa_reactor          : 136232

Drops the pre-fabric-update MCU surface — config/device -> config/hal
import, rpc/hal/dump inline RPC handler, hal/cap/env / hal/cap/power /
hal/state exports, and fabric/out/rpc/hal/dump call export — in
favour of the new state/self/* + cmd/self/updater/{prepare,commit} +
event/self/* + receiver-topic surface added across the rest of this
branch.

remap.go
- importPublishRules: emptied. config/device -> config/hal is gone;
  config-like data flows through cmd/self/updater/prepare's metadata
  field.
- exportPublishRules: keeps state/self/* and event/self/* (W7/W8);
  drops the three legacy hal/* -> state/* exports.
- exportCallRules: emptied. The MCU is no longer an outbound RPC
  caller on this branch.
- importCallRules unchanged: still routes cmd/self/updater/{prepare,
  commit} -> rpc/updater/{prepare,commit}.
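
After this change the only wire-to-local call routing left is the updater pair. A sketch, assuming the rules can be modelled as a flat string map (the real rule representation in remap.go is not shown in this PR text, and the `importCallTopic` signature is hypothetical):

```go
package main

// importCallRules maps wire call topics to local-bus topics.
// A flat string map is an assumption made for illustration.
var importCallRules = map[string]string{
	"cmd/self/updater/prepare": "rpc/updater/prepare",
	"cmd/self/updater/commit":  "rpc/updater/commit",
}

// importCallTopic resolves a wire topic; ok is false when there is no route.
func importCallTopic(wire string) (local string, ok bool) {
	local, ok = importCallRules[wire]
	return local, ok
}
```

Under this model a legacy call such as rpc/hal/dump resolves to no route, matching the updated TestImportCallTopic expectation described below.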

session.go
- onPub no longer has the config/device -> config/hal special case;
  every imported pub now flows through the generic
  conn.Publish + trackImportedRetain path.
- onCall no longer has the inline rpc/hal/dump handler. dumpReply,
  dumpCallTopic, tConfigHAL, configApplied, configCount,
  lastConfigErr fields all removed from session.
- Now-unused imports of devicecode-go/types pruned.

config.go
- decodeHALConfig / decodeHALConfigBytes / decodeHALState helpers
  removed (sole callers were the dump handler + config/hal special
  case). decodePayload survives — still used by session.onReply for
  forwarding RPC replies onto the originating Request's reply path.

Tests deleted (legacy-path-only, replaced by services/updater and
services/telemetry coverage):
  - TestPubImport, TestPubExport, TestUnretainExport, TestUnretain
  - TestDumpCallReturnsConfigState, TestDumpCallDoesNotBlockPing
  - TestCallExport, TestCallExportOnlyConfiguredRule,
    TestCallExportPeerReset
  - TestPendingWireCallsTimeout
  - TestEchoedHelloAckIgnoredDuringOutgoingCall, TestPreHelloDump
  - TestDrainOutgoingWireCallsReportsMarshalFailure,
    TestDrainOutgoingWireCallsReportsWriteFailure

Tests updated to canonical post-W12 shape:
  - TestImportPublishTopic: asserts every input now returns nil.
  - TestImportCallTopic: asserts the cmd/self/updater/{prepare,commit}
    routes, plus rpc/hal/dump now returning empty.
  - TestExportTopic: state/self/* and event/self/* paths; legacy
    hal/cap/* paths assert nil.
  - TestExportCallTopic, TestExportCallPatterns: assert empty rules.
  - TestCallImport: switched from the legacy rpc/hal/dump path to
    cmd/self/updater/prepare via the canonical importCallRules
    routing.
  - TestSessionResetUnretainsImports: installs a t.Cleanup-scoped
    temp importPublishRule (same harness pattern as
    TestInboundCallBusyAtCapacity) since the production rules are
    empty post-W12. The mechanism under test (retain tracking +
    session-reset teardown) is generic; the topic was just an
    example.

KNOWN: this commit breaks the Lua-side smoke test on
chris-fabric-new-update, which still pushes a config/mcu retain (dropped
on the MCU as no_route) and calls rpc/peer/mcu-1/hal/dump (also dropped
as no_route). The next commit on devicecode-lua updates smoke.lua to
observe state/self/* retains and call cmd/self/updater/prepare via
new outbound_call_rules in mcu-dev.json.

Build sizes (pico_bb_proto_1, code):
  release default     : 184664 (unchanged: TinyGo's dead-code elimination
                                had already trimmed the inline dump path)
  release flash_unsafe: 184656
  release debug_uart  : 184664
  qa_reactor          : 136232

Codex review M2 follow-up: ChargerConfigFact was emitting
source="ltc4015" while actually carrying DefaultChargerConfig() —
that misrepresented the data on the wire. CM5 consumers couldn't
tell whether the config fact was effective programmed values or
fallback bring-up defaults.

Fix: ChargerConfig grows a Source string field. DefaultChargerConfig()
sets Source="ltc4015-default" so the wire value is honest about its
provenance. When the LTC4015 driver wires effective programmed
values via Service.SetChargerConfig, the caller passes
Source="ltc4015" and the next publish carries that on the wire.
publishChargerConfig defensively falls back to "ltc4015-default" if
a caller leaves Source empty so the wire never carries an empty
source field.
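The provenance rule above can be sketched as follows. This is a minimal illustration, not the branch's code: only the Source field, DefaultChargerConfig and the two source strings come from the commit message, while VinLowMilliV and wireSource are hypothetical names invented for the example.

```go
package main

import "fmt"

// ChargerConfig is a trimmed-down, hypothetical stand-in for the real struct.
// Only the Source field is taken from the commit message.
type ChargerConfig struct {
	Source       string // provenance of the values carried on the wire
	VinLowMilliV uint32 // illustrative threshold field (assumed name)
}

const (
	sourceDefault = "ltc4015-default" // fallback bring-up defaults
	sourceDriver  = "ltc4015"         // effective programmed values
)

// DefaultChargerConfig labels its values as defaults so consumers can tell
// them apart from values read back from the chip.
func DefaultChargerConfig() ChargerConfig {
	return ChargerConfig{Source: sourceDefault, VinLowMilliV: 9000}
}

// wireSource returns the source string to publish. An empty Source falls
// back to the default label so the wire never carries an empty field.
func wireSource(cfg ChargerConfig) string {
	if cfg.Source == "" {
		return sourceDefault
	}
	return cfg.Source
}

func main() {
	fmt.Println(wireSource(DefaultChargerConfig()))
	fmt.Println(wireSource(ChargerConfig{Source: sourceDriver, VinLowMilliV: 10500}))
	fmt.Println(wireSource(ChargerConfig{VinLowMilliV: 10500}))
}
```

Keeping the fallback at publish time, rather than in every caller, means a driver that forgets to set Source degrades to the conservative label instead of claiming programmed values.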

Tests
- TestPublishesChargerConfigAtStartup updated: asserts the initial
  publish carries source="ltc4015-default" rather than the
  misleading "ltc4015".
- TestSetChargerConfigPromotesSourceString: drives a separate
  Service through SetChargerConfig with Source="ltc4015" and
  asserts the resulting fact carries that source plus the
  driver-supplied thresholds. Documents the contract for fabric-
  security / charger-driver integration: pass Source="ltc4015" once
  effective register state is being read.

This is the smallest correct fix for the misrepresentation —
W7/W8's analog FSM still uses the threshold values regardless of
source, since the alert FSM operates on the actual numeric
thresholds the chip is programmed with (or the conservative
defaults until SetChargerConfig fires).

Hardware bring-up on RP2350 with the latest fabric-update build
panicked with "panic: no rng" before HAL even started:

  0.000 [main] bootstrapping bus …
  0.000 [main] starting hal.Run …
  panic: no rng

Root cause: TinyGo's crypto/rand on RP2350 (and other targets
without a hardware-RNG backend wired into its runtime) panics with
"no rng" inside rand.Read rather than returning an error. Our
boot_id generate() guarded against err != nil and against all-zero
output, but neither check can catch a panic, so the firmware crashed
at the first GenerateBootID call in main.go.

Fix
- boot_id.go now defers the crypto/rand attempt to a per-build helper
  tryCryptoRand(buf) bool. The shared file only handles the
  deterministic-mix fallback path.
- boot_id_host.go (//go:build !tinygo): real crypto/rand.Read for
  host tests + non-TinyGo builds where crypto/rand actually works.
- boot_id_tinygo.go (//go:build tinygo): always returns false, so
  generate() falls straight through to the deterministic mix.
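A rough single-file sketch of the split described above, showing only the host variant of tryCryptoRand. In the real layout that helper would live in boot_id_host.go, and the TinyGo file would supply a version that returns false. The fallbackMix helper, its xorshift constants and the 16-byte ID size are assumptions made for illustration; the actual deterministic mix is not reproduced here.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// tryCryptoRand is the host-side variant (//go:build !tinygo in the real
// layout). The TinyGo variant would return false without calling rand.Read.
func tryCryptoRand(buf []byte) bool {
	if _, err := rand.Read(buf); err != nil {
		return false
	}
	for _, b := range buf {
		if b != 0 {
			return true
		}
	}
	return false // all-zero output is treated as a failed read
}

// fallbackMix is a hypothetical stand-in for the deterministic mix: a plain
// xorshift64 generator seeded by the caller. It is not contract-grade.
func fallbackMix(buf []byte, seed uint64) {
	x := seed | 1 // xorshift state must not be zero
	for i := range buf {
		x ^= x << 13
		x ^= x >> 7
		x ^= x << 17
		buf[i] = byte(x >> 32)
	}
}

// generate tries the per-build helper first and reports whether the
// fallback path was taken so the caller can log it.
func generate() (id [16]byte, usedFallback bool) {
	if tryCryptoRand(id[:]) {
		return id, false
	}
	fallbackMix(id[:], uint64(time.Now().UnixNano()))
	return id, true
}

func main() {
	id, fallback := generate()
	fmt.Println(hex.EncodeToString(id[:]), "fallback:", fallback)
}
```

Because the shared code only sees a bool, it never needs to know whether the platform lacks an RNG or merely returned bad output.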

Why not defer/recover
- Wrapping rand.Read in defer/recover does work, but pulls in
  TinyGo's panic-handling runtime and grew the binary by ~110 KB
  (measured pico_bb_proto_1 release default). Skipping crypto/rand
  entirely on TinyGo avoids that overhead — and the deterministic
  mix (UnixNano + MemStats fields + tick + 3-stage shift) is what
  master plan R3 already documented as the not-contract-grade
  fallback. The `[updater] boot_id fallback engaged` log line will
  fire on every RP2350 boot until TinyGo grows an RP2350 RNG
  backend or we route a HAL-supplied RNG into services/updater.

Verified
- go test ./services/updater/... clean (host build still uses real
  crypto/rand via boot_id_host.go).
- tinygo build paths all pass: pico_bb_proto_1,
  pico_bb_proto_1 flash_unsafe, pico_bb_proto_1 debug_uart,
  qa_reactor pico_bb_proto_1.

The fabric-update branch defaults to StubVerifier+RefusingApplier so
no commit can ever lie about apply success without a real verifier in
place. That blocks fw-update-e2e -- the harness needs a working
end-to-end stage+commit+reboot path before fabric-security ships
imagev1.

Land a bringup-only path:

- PassthroughVerifier(identity) accepts any artefact, streams it into
  the SlotSink while computing SHA-256, returns a synthetic Manifest
  populated from the build-time identity + computed hash + artefact
  length. No envelope, no signature -- temporary until imagev1 lands.

- newSlotSink is build-tag-split:
    !tinygo: returns the existing memorySink (RAM buffer; tests use it).
    tinygo && rp2350: returns abupdateSink that streams straight into
      the inactive A/B slot via abupdate.WriteChunk, and FlushFinals on
      Commit. A package-level sharedUpdater persists across the
      receiver path (writes the slot) and the applier path (reboots
      into it).

- ProductionApplier is build-tag-split:
    !tinygo: still RefusingApplier (host can't reboot).
    tinygo && rp2350: abupdateApplier whose ArmReboot calls
      abupdate.RebootIntoSlot (does not return on success).

- Reactor wires PassthroughVerifier + ProductionApplier as the
  defaults. When fabric-security lands the real verifier+manifest, the
  reactor switches over without touching the receiver.
When a frame got mangled (single-byte UART corruption mid-transfer),
logMalformed dropped it silently. CM5 kept streaming subsequent
chunks; the receiver rejected each as out-of-order; transfer died on
the phase timeout. Hardware proof: 33% transfer at 115200 baud, one
flipped byte, transfer abandoned 15s later.

If an incoming transfer is active when a frame fails to parse, emit
xfer_need(bytesWritten) so CM5 restarts from the next expected byte.
Standard recovery path — protocol already supports it on the sender
side; the receiver was just failing to issue the signal.

A spurious need (sent for a malformed frame that was not actually a
chunk) is harmless: CM5 ignores need.next values it has already passed.
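The receiver-side rule can be sketched like this. The frame layout is an assumption: only xfer_need, need.next and bytesWritten appear in the commit message, while the "type" key, the transfer struct and onLine are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// transfer tracks the active incoming transfer (reduced to one field).
type transfer struct {
	bytesWritten uint32
}

// frame is the minimal envelope needed to decide whether a line parses.
type frame struct {
	Type string `json:"type"` // assumed key name
}

// need is the recovery request sent back to the sender.
type need struct {
	Type string `json:"type"`
	Next uint32 `json:"next"`
}

// onLine returns the reply to send for one received line, or nil when no
// reply is needed. cur is nil when no incoming transfer is active.
func onLine(cur *transfer, line []byte) *need {
	var f frame
	if err := json.Unmarshal(line, &f); err == nil {
		return nil // well-formed: normal dispatch would happen here
	}
	if cur == nil {
		return nil // no transfer in flight: drop the frame as before
	}
	// Malformed frame during a transfer: ask for the next expected byte.
	return &need{Type: "xfer_need", Next: cur.bytesWritten}
}

func main() {
	cur := &transfer{bytesWritten: 4096}
	reply := onLine(cur, []byte(`{"type":"xfer_chunk","da`)) // truncated frame
	out, _ := json.Marshal(reply)
	fmt.Println(string(out))
}
```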
A single-byte UART flip inside the base64 data string of an
xfer_chunk doesn't break JSON parsing -- json.Unmarshal still
succeeds; base64 still decodes; we silently absorb the wrong bytes.
The whole-payload xxHash mismatch only surfaces at xfer_commit, by
which point the entire transfer has been wasted (~1 min at 115200
baud: it runs to 99% and only then does the whole-binary checksum
reveal the rot).

Add an optional `crc` field to xfer_chunk carrying xxHash32 of the
raw bytes. On mismatch send xfer_need(bytesWritten) -- same recovery
path the malformed-frame handler uses, so the sender retransmits the
gap. crc is `omitempty`, so a chunk that omits it is still accepted
(we just lose the per-chunk check).
Both recovery paths (malformed frame, per-chunk crc mismatch) send
xfer_need but were not bumping cur.deadline. A burst of bad frames
during recovery could still trip the phase_timeout idle watchdog and
abort mid-recovery. Treat recovery as progress.

Codex review M: malformed-frame xfer_need does not refresh deadline.
The default tinygo+rp2350 transfer sink buffered the whole artefact
in RAM before handing it to the receiver. ~424KB binaries blew past
what's safe on a Pico 2's SRAM and made the receiver interface
gratuitously stateful.

Switch to streaming-during-transfer:
- BeginStreamedStage prepares the inactive A/B slot via abupdate.
- WriteStreamedStage writes chunks straight to flash and updates a
  running SHA-256 of the streamed bytes.
- CommitStreamedStage flushes the final partial page and freezes
  the descriptor (length + payload sha).

The fabric transfer sink is now a thin shim over those entry points.
The receiver detects an empty payload.Artefact (the wire shape stays
the same) and consumes the pre-staged descriptor; on host builds the
streamed-stage stub returns "not present" so the existing
verifier+sink path still drives unit tests.

Pairs with fabric's recovery work: per-chunk crc + xfer_need on
malformed/corrupt frames now feed a sink that can't lose bytes
silently to RAM pressure.