feat: fabric session protocol with firmware transfer support by cpunt · Pull Request #40 · jangala-dev/devicecode-go

cpunt · 2026-04-13T11:20:45Z

Summary

Adds the MCU updater runtime: retained state/self/* facts, prepare/commit RPCs, staged descriptor handling, and updater/main transfer staging.
Routes Fabric update calls through the documented cmd/self/updater/{prepare,commit} and xfer_* flow, with legacy HAL/config routes removed from the update surface.
Streams incoming firmware data to the A/B update path on TinyGo and keeps host builds/test fakes in memory.

Why

This branch makes MCU updates work over the Fabric contract from docs/updating.md without adding extra wire routes beyond the Lua-side contract.

Testing

go test ./...

Implements the W1/W4/W5/W6 backbone of the fabric-update plan:

W1: Verifier interface + production stub
- services/updater/verifier.go defines the Verifier interface with Verify(io.Reader, SlotSink) (Manifest, error). Production wiring passes StubVerifier(), which always rejects with ErrUnsignedNotSupported. Test fakes (fakeVerifierAccept, fakeVerifierReject) exercise both branches without touching the production stub.
- SlotSink keeps the verifier-to-staging seam tiny so fabric-security can swap in an abupdate-backed sink without API churn.
- Manifest is the small subset of the signed-image manifest the receiver propagates onward (version, build_id, image_id, sha256, length). Full canonical manifest stays in pico2-a-b/imagev1 which fabric-security adds.

W4: Updater state machine + RPC handlers
- 7 canonical states + the empty "" (= nil) quiescent state, with Allowed() guard for tests that try invalid transitions.
- handlePrepare clears last_error and transitions to ready, returning {ok=true, accepted=true}. Refuses with err="busy" when another prepare is in flight or state==applying.
- handleCommit transitions to applying when a staged descriptor or in-state staged transition is present; refuses with err="nothing_staged" otherwise. Reboot/slot-switch is left to the abupdate metadata writer that lands in W11/fabric-security.
- Local-bus topics rpc/updater/{prepare,commit} bind here; the wire topics cmd/self/updater/{prepare,commit} are routed in via fabric's importCallRules (services/fabric/remap.go).

W5: Identity facts (state/self/{software, updater, health})
- SoftwareFact = {version, build, image_id, boot_id, payload_sha256}.
- UpdaterFact = {state, last_error, pending_version}, flat shape.
- HealthFact = {state, reason?} with omitempty on reason.
- Service emits the three facts on Run() startup and re-emits via Republish(); the latter is the hook fabric will call on every hello_ack edge (W10 will wire that in).

W6: boot_id RAM-only generator
- 8 bytes from crypto/rand, hex-encoded to 16 lower-case chars.
- Generated once at boot via main.go's call site between HAL ready and reactor.Run, i.e. before fabric opens, so the first software-fact publish carries a populated boot_id.
- Held in RAM only via atomic.Pointer cache. Idempotent: repeat calls return the cached value.
- Fallback: process-startup counter packed into 16 hex chars when crypto/rand fails or returns all-zero. Logged so the failure-mode hardware test suite (master R3) can grep for it.
- Per-test reset via resetBootIDForTest() exercised by 4 unit tests covering len-16-hex, idempotence, uniqueness across "boots", and all-zero-sentinel rejection.

Receiver wiring (W2 placeholder)
- services/updater/receiver.go binds to the literal local topic raw/member/mcu/cap/updater/main/rpc/receive that CM5 sends as meta.receiver. Decodes ReceiverPayload, calls the verifier with a bytes.Reader over the artefact and a memorySink. On success publishes state=staged with manifest.version as pending_version and replies {ok:true, stage:"staged"}. On failure publishes state=failed with last_error and replies {ok:false, err:...}.
- Stub verifier produces the failure path on every artefact, so the production build cannot stage; fakeVerifierAccept exercises the staged-publish path in tests.
- The fabric-side glue that actually CALLS this receiver after xfer_commit lands in phase 2 (W2 + W3 stage/apply split).

Glue
- services/fabric/remap.go importCallRules: cmd/self/updater/prepare -> rpc/updater/prepare; same for commit.
- services/reactor/reactor.go starts the updater service in its own goroutine before the env+power subscriptions, so retained facts land before the fabric link opens.
- main.go calls updater.GenerateBootID() between HAL ready and reactor.Run, logging the chosen value.

Tests (services/updater/updater_test.go)
- Initial fact publish: software/updater/health all retain on Run startup with the configured Identity + 16-char boot_id.
- Prepare happy path: ok=true, state -> ready, last_error cleared.
- Commit without staged: refusal err="nothing_staged".
- Commit with staged descriptor: ok=true, state -> applying, pending_version mirrors descriptor.Version.
- Receiver under stub verifier: state -> failed, last_error contains the verifier_stub sentinel; reply ok=false.
- Receiver under fakeVerifierAccept: state -> staged, pending_version = manifest.Version; reply ok=true stage=staged.
- Receiver under fakeVerifierReject: state -> failed with the custom error string.
- jsonDecode tolerates typed payloads, json.RawMessage, []byte, and nil: the four shapes the bus actually delivers in practice.
- memorySink Abort clears the buffer; Commit closes further writes.

Build sizes (pico_bb_proto_1):
    release default     : code 184640 (was 280364; TinyGo dead-code elimination cleaning up after the receiver/verifier refactor)
    release flash_unsafe: code 184628
    qa_reactor          : code 136232

Out of scope for this commit (later phases):
- W2/W3 fabric-side stage/apply split + receiver invocation from transfer.go after xfer_commit. Receiver is wired but not yet invoked by fabric.
- W7/W8 retained telemetry publishers + charger alert FSM.
- W9 lane assignment routing into fabric's 3-lane writer.
- W10 post-hello_ack republish hook.
- W11 abupdate metadata extension (payload_sha256 + staged descriptor read/write).
- W12 legacy Go remap removal.
- W13 L4 acceptance prep on devicecode-lua.
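
As an illustration of the W6 boot_id behaviour described above (8 random bytes, 16 lower-case hex chars, cached in RAM, idempotent), here is a minimal host-side sketch. It is not the code from services/updater: the names cachedBootID and generateBootID are hypothetical, and the counter fallback and all-zero rejection are left out for brevity.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync/atomic"
)

// cachedBootID holds the boot_id in RAM only; it is nil until first use.
// (Hypothetical name; the real cache lives in services/updater.)
var cachedBootID atomic.Pointer[string]

// generateBootID returns a 16-char lower-case hex string. The first call
// draws 8 bytes from crypto/rand; later calls return the cached value.
func generateBootID() (string, error) {
	if p := cachedBootID.Load(); p != nil {
		return *p, nil
	}
	var buf [8]byte
	if _, err := rand.Read(buf[:]); err != nil {
		// Assumption: this sketch just reports the error. The commit
		// describes a counter-based fallback at this point instead.
		return "", err
	}
	id := hex.EncodeToString(buf[:]) // EncodeToString emits lower-case hex
	// If two callers race, keep whichever value was stored first.
	if !cachedBootID.CompareAndSwap(nil, &id) {
		return *cachedBootID.Load(), nil
	}
	return id, nil
}

func main() {
	a, _ := generateBootID()
	b, _ := generateBootID()
	fmt.Println(len(a), a == b)
}
```

Calling the function once early in startup, as the commit does from main.go, means every later reader sees the same value without any locking.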

Implements W2/W3 of the fabric-update plan: fabric's onTransferCommit now hands the committed artefact to the local-bus receiver topic named in meta.receiver, gates xfer_done on the receiver's reply, and defers slot-switch/reboot to the updater commit RPC.

W2: Receiver invocation
- transfer.go's onTransferCommit, after sink.Commit() succeeds and before sending xfer_done, decodes meta.receiver from the xfer_begin meta blob and calls that local-bus topic with {link_id, xfer_id, size, checksum, meta, artefact}.
- The reply gates the next frame: ok=true -> xfer_done; ok=false -> xfer_abort with err = receiver's err string. Synchronous call with receiverCallTimeout (5s); session goroutine blocks intentionally during the call because we MUST decide xfer_done vs xfer_abort before processing further frames.
- decodeReceiverReply tolerates two reply shapes: typed Go struct (in-process tests) and JSON bus payload (real receiver via services/updater.ReceiverReply).

W3: Stage/apply split
- transferSink interface grows Bytes() []byte. nil means the sink streamed elsewhere (e.g. flash_unsafe writes directly to flash and doesn't keep a RAM copy). Buffer sink returns its committed slice; flash_unsafe sink and fakeTransferSink return nil.
- onTransferCommit calls the receiver only when (a) Bytes() returns non-nil AND (b) meta.receiver is present. Otherwise it falls through to the legacy xfer_done -> sink.Apply() flow, keeping flash_unsafe builds and tests that don't model the receiver unaffected.
- The receiver path NO LONGER calls sink.Apply(). Slot-switch and reboot are deferred to cmd/self/updater/commit, which the updater state machine in services/updater handles. fabric-security wires the actual reboot through abupdate when it lands.

New default sink (transfer_sink_buffer.go)
- Replaces the rejecting transfer_sink_rp2350.go default and the rejecting transfer_sink_stub.go host build with an in-RAM buffer capped at 64 KB. Hitting the cap returns ErrArtefactTooLarge from WriteChunk, which propagates to xfer_abort.
- Bytes() returns the committed slice for the receiver. Apply() is a no-op (slot-switch is the updater RPC's job).
- The 64 KB cap is conservative: large-firmware streaming-into-flash for unverified artefacts is fabric-security's job; this branch only needs enough to round-trip the smoke-test traffic and the receiver path with a stub verifier.

Tests
- TestTransferReceiverInvokedAfterCommit: registers a local-bus receiver handler, drives a 4-byte transfer with meta.receiver=["test","receiver"], asserts the receiver got the expected payload (link_id, xfer_id, checksum, artefact bytes) and fabric sent xfer_done after the receiver replied ok=true.
- TestTransferReceiverRejectAbortsTransfer: receiver replies {ok=false, err="manifest_check_failed"}; fabric sends xfer_abort with err mirroring the receiver reason.
- bufferingSinkAdapter wraps the production bufferSink for these tests so the receiver path is exercised against the real sink, not fakeTransferSink (whose nil Bytes() is the legacy fallback path).
- All existing transfer tests (legacy fallback path with fakeTransferSink) continue to pass; the change is additive.

Build sizes (pico_bb_proto_1):
    release default     : 184616 (effectively unchanged)
    release flash_unsafe: 184628

Out of scope for this commit (still later):
- W7 retained telemetry publishers + W8 charger alert FSM.
- W9 lane assignment routing into the 3-lane writer.
- W10 post-hello_ack republish hook (Republish() exists; needs to be invoked from the fabric session lifecycle).
- W11 abupdate metadata extension (payload_sha256 + staged descriptor read/write).
- W12 legacy Go remap removal.
- W13 L4 acceptance prep on devicecode-lua.
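
The capped in-RAM sink described above might look roughly like the sketch below. The method names WriteChunk, Commit, Bytes and Apply and the error name ErrArtefactTooLarge come from the commit message; the error text, the field layout, and the choice to return nil from Bytes() before Commit are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrArtefactTooLarge is returned once the cap would be exceeded.
// The error text is an assumption; only the name appears in the commit.
var ErrArtefactTooLarge = errors.New("artefact_too_large")

const maxArtefactBytes = 64 * 1024 // 64 KB cap from the commit message

// bufferSink accumulates a transfer in RAM.
type bufferSink struct {
	buf       []byte
	committed bool
}

// WriteChunk appends p unless that would push the artefact past the cap.
func (s *bufferSink) WriteChunk(p []byte) error {
	if s.committed {
		return errors.New("sink already committed")
	}
	if len(s.buf)+len(p) > maxArtefactBytes {
		return ErrArtefactTooLarge
	}
	s.buf = append(s.buf, p...)
	return nil
}

// Commit closes the sink to further writes.
func (s *bufferSink) Commit() error { s.committed = true; return nil }

// Bytes returns the committed artefact for the receiver.
// Assumption: nil is returned before Commit.
func (s *bufferSink) Bytes() []byte {
	if !s.committed {
		return nil
	}
	return s.buf
}

// Apply is a no-op: slot-switch belongs to the updater commit RPC.
func (s *bufferSink) Apply() error { return nil }

func main() {
	s := &bufferSink{}
	_ = s.WriteChunk([]byte{1, 2, 3, 4})
	_ = s.Commit()
	fmt.Println(len(s.Bytes()))
}
```

Checking the cap before appending means a rejected chunk leaves the buffer untouched, so the caller can abort the transfer without partially written state.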

The updater service now watches the fabric link-state retain at state/fabric/link/+ and calls Republish() on every !Ready -> Ready transition. Mirrors the spec in docs/firmware-alignment-update.md:

    Retained `state/self/*` is republished:
    - immediately after every successful boot
    - on every newly established session (`hello_ack`), warm or cold

Implementation:
- Run() grows a fourth subscription on TopicFabricLink (state/fabric/link/+), tracking per-link Ready flag in a small map to detect edges and avoid double-firing on subsequent Ready=true retains that keep the same edge state.
- decodeLinkState pulls the link_id from the topic tail and Ready from the payload (typed map or JSON-marshalable struct; covers both the in-process linkStatePayload published by services/fabric/session.go and any future test-side fakes).

Test (TestRepublishOnLinkReadyEdge):
- Drains the initial fact retains, publishes a Ready=false link state (no republish expected), then a Ready=true edge (republish asserted via the next state/self/software emit), then a second Ready=true retain (no double-fire; channel must stay quiet for the settle window).

New services/telemetry package subscribes to the existing HAL value topics (hal/cap/env/..., hal/cap/power/...) and republishes them under the canonical state/self/* surface using integer engineering units. Also runs the W8 charger alert FSM emitting event/self/power/charger/alert with 14 canonical kinds.

W7: retained-state publishers
- state/self/power/battery: pack_mV, per_cell_mV, ibat_mA, temp_mC, bsr_uohm_per_cell. Sourced from types.BatteryValue.
- state/self/power/charger: vin_mV, vsys_mV, iin_mA, raw bitfields (state_bits/status_bits/system_bits) AND 3 decoded boolean maps (state{}, status{}, system{}) carrying the canonical names from the existing types.ChargerStateTable / ChargeStatusTable / SystemStatusTable. The 27-boolean spec (11 + 4 + 12) follows from the size of those tables.
- state/self/environment/temperature: deci_c.
- state/self/environment/humidity: rh_x100.
- state/self/runtime/memory: alloc_bytes from runtime.MemStats. Tick every 3 s, matching the existing reactor mem-stat cadence.
- state/self/power/charger/config: deferred until W7 stretch (LTC4015 effective config tracking is a separate read path through the charger driver). When it lands, it'll feed the analog-threshold branch of the alert FSM.

Every fact carries a monotonic per-topic seq counter and a service-monotonic uptime_ms, recommended in the spec for "where cheap" cases, which all of these are.

W8: charger alert FSM (services/telemetry/alerts.go)
- 14 canonical AlertKind values, frozen by the spec. AllAlertKinds is a guarded slice; the unit test compares it byte-for-byte against the spec's snake_case list to catch accidental rename.
- chargerAlertFSM holds the previous ChargerValue and on each new observation walks the bit-set transitions:
      state bits  -> bat_missing, bat_short, max_charge_time_fault, absorb, equalize, cccv, precharge
      status bits -> iin_limited, uvcl_active, cc_phase, cv_phase
  emitting one AlertEvent per !set->set edge, sparse (retained=false).
- Analog-threshold kinds (vin_lo, vin_hi, bsr_high) are stubbed pending the charger-config publisher; the FSM scaffolding is there so dropping the threshold compare in is mechanical.
- AlertEvent payload carries kind, source="ltc4015", a snapshot of state_bits/status_bits/system_bits at emit time, plus seq + uptime_ms.

Glue
- services/fabric/remap.go grows export rules for state/self/* and event/self/* (no remapping; local topic = wire topic). Legacy hal/* exports stay until W12 cleanup so the transition doesn't break any consumers that haven't moved to the new surface yet.
- services/reactor/reactor.go starts the telemetry service in its own goroutine after the updater service.

Tests (services/telemetry/telemetry_test.go)
- TestPublishesBatteryFact: BatteryValue from HAL -> BatteryFact at state/self/power/battery with the right field mapping + seq=1 + non-negative uptime.
- TestPublishesChargerWithDecodedBooleans: ChargerValue with mixed set bits -> raw bits preserved AND decoded booleans agree (set bits true, unset bits false).
- TestPublishesEnvironmentFacts: temp + humidity round-trip.
- TestAllAlertKindsCount: exactly 14 entries; canonical strings match the spec byte-for-byte.
- TestChargerAlertFSMEdgeOnly: bit !set -> set fires once; subsequent retains keeping the bit set do NOT re-fire; clear -> set re-fires.
- TestChargerAlertFSMMultipleBitsTransitionTogether: two state bits flipping in the same publish emit two alerts.

Build sizes (pico_bb_proto_1):
    release default     : 185180 (+540 over phase 3)
    release flash_unsafe: 185176
    qa_reactor          : 136232
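
The edge-only rule for the bit-driven alert kinds above comes down to computing which bits are newly set relative to the previous observation. The sketch below shows that idea for three state-bit kinds. The bit positions, the bitAlert type, and the decision to treat the initial state as all-clear are illustrative assumptions, not taken from services/telemetry/alerts.go.

```go
package main

import "fmt"

// bitAlert pairs a bit mask with the alert kind it maps to.
type bitAlert struct {
	mask uint16
	kind string
}

// stateAlerts lists a few state-bit kinds. Bit positions here are
// placeholders for illustration only.
var stateAlerts = []bitAlert{
	{1 << 0, "bat_short"},
	{1 << 1, "bat_missing"},
	{1 << 2, "max_charge_time_fault"},
}

// chargerAlertFSM keeps the previously observed state bits.
// Assumption: the zero value (all bits clear) is the starting state.
type chargerAlertFSM struct {
	prevState uint16
}

// observe returns one kind per bit that went !set -> set since the last
// observation. Bits that stay set, or that clear, produce nothing.
func (f *chargerAlertFSM) observe(stateBits uint16) []string {
	rising := stateBits &^ f.prevState // set now AND not set before
	f.prevState = stateBits
	var kinds []string
	for _, a := range stateAlerts {
		if rising&a.mask != 0 {
			kinds = append(kinds, a.kind)
		}
	}
	return kinds
}

func main() {
	var fsm chargerAlertFSM
	fmt.Println(fsm.observe(0b001)) // one bit rises
	fmt.Println(fsm.observe(0b001)) // bit stays set: nothing emitted
	fmt.Println(fsm.observe(0b000)) // bit clears: nothing emitted
	fmt.Println(fsm.observe(0b011)) // two bits rise together
}
```

The AND-NOT (`&^`) step is what makes repeated retains of the same bits silent while still letting a clear-then-set sequence fire again.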

Adds the writer half of the abupdate metadata block surface so the receiver actually persists a verified manifest's fields after the verifier succeeds. The fabric-update branch ships an in-memory implementation; fabric-security swaps in a flash-backed implementation that survives reboots.

W11: MetadataWriter
- New MetadataWriter interface with WriteStagedDescriptor and ClearStagedDescriptor. Symmetric with the existing MetadataReader so a single MemoryMetadata value can implement both.
- NewMemoryMetadata() returns a process-lifetime in-memory reader+writer. Used as the default in Options when neither Metadata nor MetadataWrite is supplied, which gets the receiver end-to-end working without external wiring.
- noopMetadataWriter is the fallback when a caller supplied a MetadataReader (legacy) but no writer; it logs each call so the gap is visible during bring-up.

Receiver
- handleReceiver now calls metadataWrite.WriteStagedDescriptor with the manifest's {Version, BuildID, ImageID, PayloadLength, PayloadSHA256} after the sink commits. On write failure transitions to failed with last_error="metadata_write_failed:..." so the failure mode is observable on the wire.
- Republishes the software fact after a successful staging so payload_sha256 reaches the CM5 immediately rather than waiting for the next session-restart Republish edge.

Test
- TestReceiverFakeAcceptWritesStagedDescriptor: drives the receiver through fakeVerifierAccept, asserts the MemoryMetadata reader sees the descriptor + payload_sha256, then exercises the commit RPC which now succeeds (was returning nothing_staged before W11) because the reader has the descriptor; pending_version flows manifest.Version -> staged -> applying.

Build sizes (pico_bb_proto_1):
    release default: 185176 (tiny shrink from refactor)

Five fixes flagged in the post-phase-5 review.

H1: commit RPC must not lie about apply success
- New Applier interface for the slot-switch + reboot hook. Default RefusingApplier returns ErrApplyUnavailable so the commit RPC refuses with `error: "apply_unavailable"` rather than transitioning to applying and silently never rebooting on a branch where the apply path doesn't exist. fabric-security supplies a real abupdate-backed implementation; tests pass fakeApplier when they need to drive the success path.
- handleCommit calls applier.Apply(desc) and translates failures into the wire reply directly. Only on Applier success do we transition to applying.

H2: stale staged-descriptor leak
- handlePrepare clears the persisted staged descriptor before transitioning to ready. This guards (stage A) -> (prepare B) -> (stage B fails) from leaving descriptor A persisted and committable.
- handleCommit now requires BOTH a persisted descriptor AND state==staged before allowing apply. Either alone refuses with nothing_staged.
- handleReceiver clears the staged descriptor on every failure path (verifier reject, sink commit fail, metadata write fail) so a rejected new transfer can never accidentally commit a stale prior stage.

H3: canonical charger boolean key names
- ChargerFact's state{}/status{}/system{} maps now emit the spec-canonical names from docs/firmware-alignment-update.md §"Telemetry/state facts" rather than the existing types.* display-table abbreviations. Wire keys are spec-frozen because the Lua import side keys off them:
      state{}:  equalize_charge, absorb_charge, charger_suspended, precharge, cccv_charge, ntc_pause, timer_term, c_over_x_term, max_charge_time_fault, bat_missing_fault, bat_short_fault
      status{}: vin_uvcl_active, iin_limit_active, const_current, const_voltage
      system{}: charger_enabled, mppt_en_pin, equalize_req, drvcc_good, cell_count_error, ok_to_charge, no_rt, thermal_shutdown, vin_ovlo, vin_gt_vbat, intvcc_gt_4p3v, intvcc_gt_2p8v
- Three internal name tables (chargerStateNames, chargerStatusNames, chargerSystemNames) replace the indirect use of types.ChargerStateTable[]. The 27-boolean spec total (11+4+12) is checked at test time.
- TestPublishesChargerWithDecodedBooleans updated to the canonical names plus exact-size assertions on each map.

M1: boot_id fallback now varies per cold boot
- The previous fallback returned the process-startup counter packed into 16 hex chars, so every cold boot got the same value when crypto/rand was unavailable. Master plan R3 explicitly forbids that.
- Fallback now mixes monotonic clock at generation time, MemStats fields (Alloc / Mallocs / HeapInuse / Frees, which vary with HAL init timing), the per-call counter, and a final 11/17/5-shift mix step so each output byte depends on every input bit. Still not cryptographically random, but gives reliable per-boot variation even when crypto/rand is broken. Logged on entry so the failure-mode test suite can grep for it.

M2: alert severity field
- AlertEvent gains a non-omitempty `severity` string. New alertSeverity(kind) returns "warning" for fault-class kinds (bat_missing, bat_short, max_charge_time_fault, bsr_high) and "info" for charge-phase / control-loop transitions. This addresses the Codex note that the planned alert payload includes severity but the publisher was emitting it as omitempty (so it fell off the wire when default).
- charger_config publisher and the 3 analog kinds (vin_lo, vin_hi, bsr_high) remain stubbed pending the LTC4015 effective-config read path; the FSM plumbing is in place so dropping thresholds in is mechanical when that lands.

New tests
- TestCommitWithoutStagedStateRefusesEvenWithDescriptor: descriptor in metadata + state != staged refuses with nothing_staged.
- TestCommitWithoutApplierReturnsApplyUnavailable: production default refuses; state stays at staged.
- TestCommitWithFakeApplierTransitionsToApplying: fakeApplier exercises the future fabric-security wiring.
- TestReceiverFailureClearsStaleStagedDescriptor: pre-staged descriptor + verifier reject leaves no committable descriptor.
- TestPrepareClearsStaleStagedDescriptor: prepare clears stale descriptor.

Build sizes (pico_bb_proto_1, code):
    release default     : 185176 (unchanged)
    release flash_unsafe: 185168 (unchanged)
    release debug_uart  : 185176 (unchanged)
    qa_reactor          : 136232 (unchanged)
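
To make the M2 severity change above concrete, here is a small sketch of a kind-to-severity mapping together with a struct tag that has no omitempty, so the field always reaches the wire. The AlertEvent shown is cut down to two fields for illustration, and encodeAlert is a hypothetical helper; the real payload carries more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AlertEvent is a reduced, illustrative version of the alert payload.
type AlertEvent struct {
	Kind     string `json:"kind"`
	Severity string `json:"severity"` // no omitempty: always serialised
}

// alertSeverity maps fault-class kinds to "warning" and everything else
// (charge-phase and control-loop transitions) to "info".
func alertSeverity(kind string) string {
	switch kind {
	case "bat_missing", "bat_short", "max_charge_time_fault", "bsr_high":
		return "warning"
	default:
		return "info"
	}
}

// encodeAlert is a hypothetical helper that builds and serialises an event.
func encodeAlert(kind string) string {
	b, err := json.Marshal(AlertEvent{Kind: kind, Severity: alertSeverity(kind)})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encodeAlert("bat_short"))
	fmt.Println(encodeAlert("cccv"))
}
```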

Three follow-up fixes from the second Codex review pass.

H1: commit RPC must publish state=applying + reply ok BEFORE reboot
- Applier interface split into CanApply (validation, returns error) and ArmReboot (schedules reboot, may not return). The previous single-method Apply() let real implementations reboot inside the call before handleCommit had a chance to publish the canonical applying retain or send the ok reply.
- handleCommit reordered:
      desc, present := metadata.StagedDescriptor()
      if !present || state != staged -> refuse nothing_staged
      if applier.CanApply(desc) returns err -> refuse with err.Error()
      transitionTo(applying, "", desc.Version)   // retain published
      reply(ok, accepted)                        // wire reply sent
      applier.ArmReboot(desc)                    // may not return
- RefusingApplier.CanApply returns ErrApplyUnavailable; ArmReboot is the contract-required no-op that the commit handler never reaches on this branch.
- fakeApplier in tests tracks canCalls + rebootCalls separately so tests can verify the ordering. Existing TestCommitWithFakeApplierTransitionsToApplying updated to assert both hooks fire exactly once and ArmReboot got the right descriptor.

M2: payload_sha256 separation
- MemoryMetadata previously had one payloadSHA field, mutated by WriteStagedDescriptor and read by SoftwareFact. So a stage write promoted the staged hash into the running fact, AND a subsequent ClearStagedDescriptor (from prepare or receiver failure) left the invalidated hash sitting on the wire.
- Split into runningPayloadSHA and (descriptor-embedded) staged-payload hash. WriteStagedDescriptor no longer touches the running hash; ClearStagedDescriptor only nulls the descriptor. PayloadSHA256() reads runningPayloadSHA, which is set once at boot via SetRunningPayloadSHA (fabric-security wires this from the active slot's flash metadata).
- Receiver no longer republishes the software fact on stage success; the staged hash isn't a property of the running image.
- TestReceiverFakeAcceptWritesStagedDescriptor now asserts that WriteStagedDescriptor leaves PayloadSHA256() at "" (running hash unset), guarding the regression.

M3: boot_id fallback comment accuracy
- Removed the misleading "stack-address mixing" reference from the earlier comment (the code never did that). Replaced with an accurate description of what the mix actually does (UnixNano + MemStats fields + per-call counter + 3-stage shift mix) plus a blunt note that the fallback is NOT contract-grade and the right long-term fix is HAL RNG or a persisted boot counter in the abupdate metadata block. The `[updater] boot_id fallback engaged` log line stays as the canonical signal for the failure-mode test suite to detect the engagement.

Verified
- go test ./... clean.
- tinygo builds pass for: pico_bb_proto_1, pico_bb_proto_1 flash_unsafe, pico_bb_proto_1 debug_uart, qa_reactor pico_bb_proto_1.
- Build sizes unchanged across all four paths (185176 release default).

Remaining (still deferred, unchanged):
- pico-side flash backing for MemoryMetadata (W11 finish on pico2-a-b/abupdate).
- state/self/power/charger/config publisher + the 3 analog alert kinds (W7/W8 finish, blocked on LTC4015 effective-config read path).
- W12 legacy remap removal.
- W13 lua-side L4 acceptance work.
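
The reordered commit flow from H1 above can be expressed as a small, runnable sketch. The Applier method names, RefusingApplier, ErrApplyUnavailable and the refusal strings follow the commit message; the Descriptor type, the recordingApplier test double, and the step log that stands in for the real retain publish and wire reply are assumptions made so the ordering can be observed.

```go
package main

import (
	"errors"
	"fmt"
)

// Descriptor is a stand-in for the staged descriptor (illustrative shape).
type Descriptor struct{ Version string }

// Applier is split so validation happens before anything irreversible.
type Applier interface {
	CanApply(d Descriptor) error
	ArmReboot(d Descriptor) // a real implementation may not return
}

var ErrApplyUnavailable = errors.New("apply_unavailable")

// RefusingApplier is the default when no apply path exists.
type RefusingApplier struct{}

func (RefusingApplier) CanApply(Descriptor) error { return ErrApplyUnavailable }
func (RefusingApplier) ArmReboot(Descriptor)      {}

// recordingApplier is a hypothetical test double that logs its calls.
type recordingApplier struct{ log *[]string }

func (a recordingApplier) CanApply(Descriptor) error {
	*a.log = append(*a.log, "can_apply")
	return nil
}
func (a recordingApplier) ArmReboot(Descriptor) {
	*a.log = append(*a.log, "arm_reboot")
}

// handleCommit returns the resulting state and a refusal string ("" on
// success). Publish and reply are recorded in log instead of performed.
func handleCommit(state string, desc *Descriptor, ap Applier, log *[]string) (string, string) {
	if desc == nil || state != "staged" {
		return state, "nothing_staged"
	}
	if err := ap.CanApply(*desc); err != nil {
		return state, err.Error()
	}
	*log = append(*log, "publish_applying") // retain published first
	*log = append(*log, "reply_ok")         // then the wire reply
	ap.ArmReboot(*desc)                     // only now arm the reboot
	return "applying", ""
}

func main() {
	var log []string
	state, errStr := handleCommit("staged", &Descriptor{Version: "1.2.3"}, recordingApplier{&log}, &log)
	fmt.Println(state, errStr, log)
}
```

Keeping ArmReboot as the last statement is the whole point of the split: a reboot that never returns can no longer pre-empt the applying retain or the ok reply.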

Codex review pass 3: the in-line comment above the fallback's tick++ still claimed the implementation mixed "the address of a stack-local variable (varies due to stack layout / ASLR-equivalent on TinyGo)". The code never actually did that — an earlier draft tried but the mid-expression `unsafe.Pointer` casts didn't compile, so the line was dropped without a corresponding doc update. Replaced with a one-paragraph note that the fallback is best-effort per-boot jitter (not contract-grade) and points readers to the more detailed mix description below. The `[updater] boot_id fallback engaged` log line is left explicit as the failure-mode test grep target. No code change.

Closes the remaining W7/W8 gaps Codex flagged in the second review pass.

W7 finish: state/self/power/charger/config publisher
- New ChargerConfigFact carrying { schema=1, source="ltc4015", thresholds{vin_lo_mV, vin_hi_mV, bsr_high_uohm_per_cell}, alert_mask_bits, alert_mask{14 booleans}, seq, uptime_ms }. Strict no-operating-state-booleans contract per the spec.
- DefaultChargerConfig() returns conservative bring-up values (vin_lo=10500 mV, vin_hi=17000 mV, bsr_high=5000 uohm/cell). The LTC4015 driver isn't yet tracking effective programmed config; Service.SetChargerConfig swaps the values once that lands.
- Published at service startup AND on every state/fabric/link/+ Ready-edge (mirrors updater W10) so the CM5 sees a fresh config retain on every newly established session, warm or cold.

W8 finish: analog alert kinds
- chargerAlertFSM now consumes BatteryValue too via observeBattery, tracking BSR_uOhmPerCell across observations.
- vin_lo: emits when ChargerValue.VIN_mV crosses below thresholds.vin_lo_mV (edge from >= threshold to < threshold).
- vin_hi: emits when VIN_mV crosses above thresholds.vin_hi_mV.
- bsr_high: emits when BatteryValue.BSR_uOhmPerCell crosses above thresholds.bsr_high_uohm_per_cell. Severity is "warning" per alertSeverity().
- All three honour the same edge-only contract as the bit-driven kinds: subsequent observations past the threshold do NOT re-emit.

ChargerAlertMask carries the 14 spec-frozen booleans (vin_lo / vin_hi / bsr_high / bat_missing / bat_short / max_charge_time_fault / absorb / equalize / cccv / precharge / iin_limited / uvcl_active / cc_phase / cv_phase). On this branch the mask is informational only: the FSM emits regardless of the mask state. Once the LTC4015 driver actually programs the chip's alert-enable register and reports it back, mask wiring through to emission becomes a small Service.publishChargerConfig follow-up.

Tests
- TestPublishesChargerConfigAtStartup: config retain lands on Run start with schema=1, source=ltc4015, non-zero defaults.
- TestChargerAlertFSMVinLoEdge: prime above threshold, cross below, verify vin_lo emit; subsequent observation below threshold doesn't duplicate.
- TestChargerAlertFSMVinHiEdge: cross above, verify vin_hi.
- TestChargerAlertFSMBSRHighEdge: BatteryValue.BSR rising past threshold emits bsr_high with severity="warning".

Build sizes (pico_bb_proto_1, code):
    release default     : 184664 (-512 from phase 5; new code, dead-code elimination cleaned up the receiver republishSoftware path that M2 dropped)
    release flash_unsafe: 184656
    qa_reactor          : 136232
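
The vin_lo rule above (emit once when the reading crosses from at-or-above the threshold to below it) can be sketched as a tiny stateful detector. The type name thresholdEdge and the decision not to fire on the very first observation are assumptions for illustration; the 10500 mV figure is the bring-up default quoted in the commit.

```go
package main

import "fmt"

// thresholdEdge detects a downward crossing of a threshold.
type thresholdEdge struct {
	primed bool // true once at least one observation has been seen
	below  bool // whether the last observation was below the threshold
}

// crossedBelow reports true only when the previous observation was at or
// above the threshold and this one is below it.
// Assumption: the first observation only primes the detector.
func (e *thresholdEdge) crossedBelow(value, threshold int32) bool {
	nowBelow := value < threshold
	fired := e.primed && nowBelow && !e.below
	e.primed = true
	e.below = nowBelow
	return fired
}

func main() {
	const vinLoMV = 10500 // bring-up default from the commit message
	var e thresholdEdge
	for _, mv := range []int32{12000, 10000, 9900, 11000, 10400} {
		fmt.Println(mv, e.crossedBelow(mv, vinLoMV))
	}
}
```

The vin_hi and bsr_high kinds would use the mirrored comparison (crossing above), with the same one-shot behaviour.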

Drops the pre-fabric-update MCU surface (config/device -> config/hal import, rpc/hal/dump inline RPC handler, hal/cap/env / hal/cap/power / hal/state exports, and fabric/out/rpc/hal/dump call export) in favour of the new state/self/* + cmd/self/updater/{prepare,commit} + event/self/* + receiver-topic surface added across the rest of this branch.

remap.go
- importPublishRules: emptied. config/device -> config/hal is gone; config-like data flows through cmd/self/updater/prepare's metadata field.
- exportPublishRules: keeps state/self/* and event/self/* (W7/W8); drops the three legacy hal/* -> state/* exports.
- exportCallRules: emptied. The MCU is no longer an outbound RPC caller on this branch.
- importCallRules unchanged: still routes cmd/self/updater/{prepare, commit} -> rpc/updater/{prepare,commit}.

session.go
- onPub no longer has the config/device -> config/hal special case; every imported pub now flows through the generic conn.Publish + trackImportedRetain path.
- onCall no longer has the inline rpc/hal/dump handler. dumpReply, dumpCallTopic, tConfigHAL, configApplied, configCount, lastConfigErr fields all removed from session.
- Now-unused imports of devicecode-go/types pruned.

config.go
- decodeHALConfig / decodeHALConfigBytes / decodeHALState helpers removed (sole callers were the dump handler + config/hal special case). decodePayload survives; it is still used by session.onReply for forwarding RPC replies onto the originating Request's reply path.

Tests deleted (legacy-path-only, replaced by services/updater and services/telemetry coverage):
- TestPubImport, TestPubExport, TestUnretainExport, TestUnretain
- TestDumpCallReturnsConfigState, TestDumpCallDoesNotBlockPing
- TestCallExport, TestCallExportOnlyConfiguredRule, TestCallExportPeerReset
- TestPendingWireCallsTimeout
- TestEchoedHelloAckIgnoredDuringOutgoingCall, TestPreHelloDump
- TestDrainOutgoingWireCallsReportsMarshalFailure, TestDrainOutgoingWireCallsReportsWriteFailure

Tests updated to canonical post-W12 shape:
- TestImportPublishTopic: asserts every input now returns nil.
- TestImportCallTopic: asserts the cmd/self/updater/{prepare,commit} routes, plus rpc/hal/dump now returning empty.
- TestExportTopic: state/self/* and event/self/* paths; legacy hal/cap/* paths assert nil.
- TestExportCallTopic, TestExportCallPatterns: assert empty rules.
- TestCallImport: switched from the legacy rpc/hal/dump path to cmd/self/updater/prepare via the canonical importCallRules routing.
- TestSessionResetUnretainsImports: installs a t.Cleanup-scoped temp importPublishRule (same harness pattern as TestInboundCallBusyAtCapacity) since the production rules are empty post-W12. The mechanism under test (retain tracking + session-reset teardown) is generic; the topic was just an example.

KNOWN: this commit breaks the lua-side smoke test on chris-fabric-new-update which still pushes config/mcu retain (drops on the MCU as no_route) and calls rpc/peer/mcu-1/hal/dump (drops as no_route). The next commit on devicecode-lua updates smoke.lua to observe state/self/* retains and call cmd/self/updater/prepare via new outbound_call_rules in mcu-dev.json.

Build sizes (pico_bb_proto_1, code):
    release default     : 184664 (unchanged; dead code already trimmed by tinygo elimination of the inline dump path)
    release flash_unsafe: 184656
    release debug_uart  : 184664
    qa_reactor          : 136232

Codex review M2 follow-up: ChargerConfigFact was emitting source="ltc4015" while actually carrying DefaultChargerConfig(), which misrepresented the data on the wire. CM5 consumers couldn't tell whether the config fact was effective programmed values or fallback bring-up defaults.

Fix: ChargerConfig grows a Source string field. DefaultChargerConfig() sets Source="ltc4015-default" so the wire value is honest about its provenance. When the LTC4015 driver wires effective programmed values via Service.SetChargerConfig, the caller passes Source="ltc4015" and the next publish carries that on the wire. publishChargerConfig defensively falls back to "ltc4015-default" if a caller leaves Source empty so the wire never carries an empty source field.

Tests
- TestPublishesChargerConfigAtStartup updated: asserts the initial publish carries source="ltc4015-default" rather than the misleading "ltc4015".
- TestSetChargerConfigPromotesSourceString: drives a separate Service through SetChargerConfig with Source="ltc4015" and asserts the resulting fact carries that source plus the driver-supplied thresholds. Documents the contract for fabric-security / charger-driver integration: pass Source="ltc4015" once effective register state is being read.

This is the smallest correct fix for the misrepresentation. W7/W8's analog FSM still uses the threshold values regardless of source, since the alert FSM operates on the actual numeric thresholds the chip is programmed with (or the conservative defaults until SetChargerConfig fires).
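
A minimal sketch of the provenance rule above: defaults are labelled "ltc4015-default", driver-supplied values carry "ltc4015", and an empty Source is never allowed onto the wire. Only Source, DefaultChargerConfig and the two source strings come from the commit; the other field names and the wireSource helper are hypothetical.

```go
package main

import "fmt"

const defaultSource = "ltc4015-default"

// ChargerConfig is an illustrative shape; only Source is named in the
// commit, the threshold field names are assumptions.
type ChargerConfig struct {
	Source             string
	VinLoMV            int32
	VinHiMV            int32
	BSRHighUohmPerCell int32
}

// DefaultChargerConfig returns the bring-up defaults, labelled as such.
func DefaultChargerConfig() ChargerConfig {
	return ChargerConfig{
		Source:             defaultSource,
		VinLoMV:            10500,
		VinHiMV:            17000,
		BSRHighUohmPerCell: 5000,
	}
}

// wireSource is a hypothetical helper mirroring the defensive fallback:
// an empty Source is reported as the default provenance.
func wireSource(cfg ChargerConfig) string {
	if cfg.Source == "" {
		return defaultSource
	}
	return cfg.Source
}

func main() {
	fmt.Println(wireSource(DefaultChargerConfig()))
	fmt.Println(wireSource(ChargerConfig{Source: "ltc4015"}))
	fmt.Println(wireSource(ChargerConfig{}))
}
```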

Hardware bring-up on RP2350 with the latest fabric-update build panicked with "panic: no rng" before HAL even started:

  0.000 [main] bootstrapping bus …
  0.000 [main] starting hal.Run …
  panic: no rng

Root cause: TinyGo's crypto/rand on RP2350 (and other targets without a hardware-RNG backend wired into its runtime) panics with "no rng" inside rand.Read rather than returning an error. Our boot_id generate() guarded against err != nil + all-zero bytes. Neither catches the panic, so the firmware crashed at the first GenerateBootID call in main.go.

Fix
- boot_id.go now defers the crypto/rand attempt to a per-build helper tryCryptoRand(buf) bool. The shared file only handles the deterministic-mix fallback path.
- boot_id_host.go (//go:build !tinygo): real crypto/rand.Read for host tests + non-TinyGo builds where crypto/rand actually works.
- boot_id_tinygo.go (//go:build tinygo): always returns false, so generate() falls straight through to the deterministic mix.

Why not defer/recover
- Wrapping rand.Read in defer/recover does work, but pulls in TinyGo's panic-handling runtime and grew the binary by ~110 KB (measured pico_bb_proto_1 release default). Skipping crypto/rand entirely on TinyGo avoids that overhead, and the deterministic mix (UnixNano + MemStats fields + tick + 3-stage shift) is what master plan R3 already documented as the not-contract-grade fallback.

The `[updater] boot_id fallback engaged` log line will fire on every RP2350 boot until TinyGo grows an RP2350 RNG backend or we route a HAL-supplied RNG into services/updater.

Verified
- go test ./services/updater/... clean (host build still uses real crypto/rand via boot_id_host.go).
- tinygo build paths all pass: pico_bb_proto_1, pico_bb_proto_1 flash_unsafe, pico_bb_proto_1 debug_uart, qa_reactor pico_bb_proto_1.
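The split described above can be sketched in a single file by passing the per-build helper in as a function. This is an illustration under stated assumptions, not the branch's code: in the PR the two helpers live in separate build-tagged files under the one name tryCryptoRand, whereas the names tryCryptoRandHost and tryCryptoRandTinyGo, the 16-byte ID size, and the xorshift mix seeded only from UnixNano are simplifications made here.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"time"
)

// tryCryptoRandHost stands in for the !tinygo helper: use crypto/rand and
// report whether it succeeded.
func tryCryptoRandHost(buf []byte) bool {
	_, err := rand.Read(buf)
	return err == nil
}

// tryCryptoRandTinyGo stands in for the tinygo helper: never call crypto/rand,
// so the "no rng" panic cannot be reached.
func tryCryptoRandTinyGo(buf []byte) bool { return false }

// generate returns an ID plus a flag saying whether the fallback was used.
// The 16-byte size is an assumption made for this sketch.
func generate(try func([]byte) bool) (id [16]byte, fallback bool) {
	if try(id[:]) && id != ([16]byte{}) {
		return id, false
	}
	// Simplified deterministic mix: a three-shift xorshift seeded from the
	// clock. The branch also mixes MemStats fields and a tick counter.
	seed := uint64(time.Now().UnixNano()) | 1 // never zero, so output is never all-zero
	for i := 0; i < len(id); i += 8 {
		seed ^= seed << 13
		seed ^= seed >> 7
		seed ^= seed << 17
		binary.LittleEndian.PutUint64(id[i:], seed)
	}
	return id, true
}
```

Injecting the helper keeps the example testable on a host; the build-tag approach in the commit achieves the same selection at compile time, which is what avoids linking the panic-recovery runtime on TinyGo.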

The fabric-update branch defaults to StubVerifier+RefusingApplier so no commit can ever lie about apply success without a real verifier in place. That blocks fw-update-e2e: the harness needs a working end-to-end stage+commit+reboot path before fabric-security ships imagev1.

Land a bringup-only path:
- PassthroughVerifier(identity) accepts any artefact, streams it into the SlotSink while computing SHA-256, and returns a synthetic Manifest populated from the build-time identity + computed hash + artefact length. No envelope, no signature; temporary until imagev1 lands.
- newSlotSink is build-tag-split:
  - !tinygo: returns the existing memorySink (RAM buffer; tests use it).
  - tinygo && rp2350: returns abupdateSink that streams straight into the inactive A/B slot via abupdate.WriteChunk, and FlushFinals on Commit. A package-level sharedUpdater persists across the receiver path (writes the slot) and the applier path (reboots into it).
- ProductionApplier is build-tag-split:
  - !tinygo: still RefusingApplier (host can't reboot).
  - tinygo && rp2350: abupdateApplier whose ArmReboot calls abupdate.RebootIntoSlot (does not return on success).
- Reactor wires PassthroughVerifier + ProductionApplier as the defaults. When fabric-security lands the real verifier+manifest, the reactor switches over without touching the receiver.
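A hedged sketch of what a passthrough verifier of this shape could look like. The Verify signature follows the one quoted in the PR description; everything else is an assumption made for illustration: the Identity fields, the Go field names on Manifest, the exact SlotSink method set, and the memorySink body. Who calls Commit on the sink is not stated in the commit text, so this sketch leaves it to the caller.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"io"
)

// SlotSink is a simplified stand-in for the verifier-to-staging seam.
type SlotSink interface {
	io.Writer
	Commit() error
}

// Manifest carries the fields listed in the PR description; the Go field
// names here are assumptions.
type Manifest struct {
	Version, BuildID, ImageID, SHA256 string
	Length                            int64
}

// Identity is a hypothetical holder for build-time identity values.
type Identity struct{ Version, BuildID, ImageID string }

type Verifier interface {
	Verify(r io.Reader, sink SlotSink) (Manifest, error)
}

type passthrough struct{ id Identity }

// PassthroughVerifier accepts any artefact. There is no envelope and no
// signature check, so it is suitable for bring-up only.
func PassthroughVerifier(id Identity) Verifier { return passthrough{id: id} }

func (p passthrough) Verify(r io.Reader, sink SlotSink) (Manifest, error) {
	h := sha256.New()
	// Every byte goes to the sink and to the hash in a single pass.
	n, err := io.Copy(io.MultiWriter(sink, h), r)
	if err != nil {
		return Manifest{}, err
	}
	return Manifest{
		Version: p.id.Version, BuildID: p.id.BuildID, ImageID: p.id.ImageID,
		SHA256: hex.EncodeToString(h.Sum(nil)),
		Length: n,
	}, nil
}

// memorySink is a RAM-backed sink of the kind a host build might use.
type memorySink struct {
	buf       bytes.Buffer
	committed bool
}

func (m *memorySink) Write(p []byte) (int, error) { return m.buf.Write(p) }
func (m *memorySink) Commit() error               { m.committed = true; return nil }
```

Because the hash is computed while streaming, the verifier never needs the whole artefact in memory; only the sink decides where the bytes land.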

When a frame got mangled (single-byte UART corruption mid-transfer), logMalformed dropped it silently. CM5 kept streaming subsequent chunks; the receiver rejected each as out-of-order; the transfer died on the phase timeout. Hardware proof: 33% transfer at 115200 baud, one flipped byte, transfer abandoned 15s later.

If an incoming transfer is active when a frame fails to parse, emit xfer_need(bytesWritten) so CM5 restarts from the next expected byte. This is the standard recovery path: the protocol already supports it on the sender side; the receiver was just failing to issue the signal. A stale need (sent for a malformed frame that was not actually a chunk) is harmless: CM5 ignores need.next values it has already passed.
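The receiver-side rule can be sketched as follows. The JSON field names (type, off, data, next) and the handleLine helper are hypothetical; the commit text only confirms that the reply is xfer_need carrying bytesWritten and that the sender reads need.next.

```go
package main

import "encoding/json"

// frame and xferNeed use assumed wire field names for illustration.
type frame struct {
	Type string `json:"type"`
	Off  int    `json:"off"`
	Data string `json:"data"`
}

type xferNeed struct {
	Type string `json:"type"`
	Next int    `json:"next"`
}

// transfer tracks how many payload bytes have been written so far.
type transfer struct{ bytesWritten int }

// handleLine parses one incoming line. On a parse failure it returns an
// xfer_need reply when a transfer is active (cur != nil) and nothing otherwise,
// so a mangled frame outside a transfer is still just dropped.
func handleLine(line []byte, cur *transfer) (*frame, *xferNeed) {
	var f frame
	if err := json.Unmarshal(line, &f); err != nil {
		if cur == nil {
			return nil, nil
		}
		return nil, &xferNeed{Type: "xfer_need", Next: cur.bytesWritten}
	}
	return &f, nil
}
```

Asking for bytesWritten rather than guessing what the mangled frame contained keeps the receiver stateless about the damage: the sender simply resumes from the first byte the receiver does not yet hold.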

A single-byte UART flip inside the base64 data string of an xfer_chunk doesn't break JSON parsing: json.Unmarshal still succeeds, base64 still decodes, and we silently absorb the wrong bytes. The whole-payload xxHash mismatch only surfaces at xfer_commit, by which point the entire transfer has been wasted (~1min at 115200 baud: it reaches 99%, then the whole-binary checksum reveals the rot).

Add an optional `crc` field to xfer_chunk carrying xxHash32 of the raw bytes. On mismatch send xfer_need(bytesWritten), the same recovery path the malformed-frame handler uses, so the sender retransmits the gap. crc is `omitempty`, and a chunk without it still works (we just lose the per-chunk check).
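A small sketch of the per-chunk check. Note the swap: the commit uses xxHash32, which is not in Go's standard library, so this example plugs in CRC-32 (IEEE) from hash/crc32 behind an injectable sum function purely as a stand-in. The chunk field names, the receiver type, and makeChunk are likewise hypothetical.

```go
package main

import (
	"encoding/base64"
	"hash/crc32"
)

// chunk models xfer_chunk with an optional checksum; nil means "absent".
type chunk struct {
	Data string  `json:"data"`
	CRC  *uint32 `json:"crc,omitempty"`
}

type receiver struct {
	bytesWritten int
	sum          func([]byte) uint32 // xxHash32 in the PR; CRC-32 here as a stand-in
	out          []byte
}

func newReceiver() *receiver { return &receiver{sum: crc32.ChecksumIEEE} }

// accept decodes a chunk and checks its checksum when one is present.
// It returns ok=false plus the offset to request via xfer_need on any failure.
func (r *receiver) accept(c chunk) (ok bool, next int) {
	raw, err := base64.StdEncoding.DecodeString(c.Data)
	if err != nil || (c.CRC != nil && r.sum(raw) != *c.CRC) {
		return false, r.bytesWritten
	}
	r.out = append(r.out, raw...)
	r.bytesWritten += len(raw)
	return true, r.bytesWritten
}

// makeChunk is a sender-side helper; a nil sum omits the checksum field.
func makeChunk(raw []byte, sum func([]byte) uint32) chunk {
	c := chunk{Data: base64.StdEncoding.EncodeToString(raw)}
	if sum != nil {
		v := sum(raw)
		c.CRC = &v
	}
	return c
}
```

Using a pointer for the checksum lets omitempty distinguish "no checksum sent" from a checksum that happens to be zero, which matters because zero is a legitimate hash value.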

Both recovery paths (malformed frame, per-chunk crc mismatch) send xfer_need but were not bumping cur.deadline. A burst of bad frames during recovery could still trip the phase_timeout idle watchdog and abort mid-recovery. Treat recovery as progress. Codex review M: malformed-frame xfer_need does not refresh deadline.
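One way to picture "treat recovery as progress" is a transfer whose idle deadline is pushed out whenever a resend is requested. This is a sketch only: the millisecond tick clock, the method names, and the 15 s timeout value (suggested by, but not confirmed as, the phase timeout in the earlier hardware report) are assumptions.

```go
package main

// phaseTimeoutMs is an assumed idle-watchdog period for illustration.
const phaseTimeoutMs = 15000

// transfer uses a plain millisecond tick instead of wall-clock time.
type transfer struct {
	bytesWritten int
	deadlineMs   int64
}

// refresh pushes the idle deadline out from the current tick.
func (t *transfer) refresh(nowMs int64) { t.deadlineMs = nowMs + phaseTimeoutMs }

// timedOut reports whether the idle watchdog would abort the transfer.
func (t *transfer) timedOut(nowMs int64) bool { return nowMs > t.deadlineMs }

// needAfterBadFrame is called from both recovery paths (malformed frame and
// checksum mismatch). It refreshes the deadline before returning the offset to
// put in xfer_need, so a burst of bad frames cannot starve the watchdog.
func (t *transfer) needAfterBadFrame(nowMs int64) int {
	t.refresh(nowMs)
	return t.bytesWritten
}
```

With this shape, a transfer that sees one bad frame every 10 s keeps running indefinitely, and it still times out once the line goes completely quiet for longer than the timeout.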

The default tinygo+rp2350 transfer sink buffered the whole artefact in RAM before handing it to the receiver. ~424KB binaries blew past what's safe on a Pico 2's SRAM and made the receiver interface gratuitously stateful.

Switch to streaming-during-transfer:
- BeginStreamedStage prepares the inactive A/B slot via abupdate.
- WriteStreamedStage writes chunks straight to flash and updates a running SHA-256 of the streamed bytes.
- CommitStreamedStage flushes the final partial page and freezes the descriptor (length + payload sha).

The fabric transfer sink is now a thin shim over those entry points. The receiver detects an empty payload.Artefact (the wire shape stays the same) and consumes the pre-staged descriptor; on host builds the streamed-stage stub returns "not present" so the existing verifier+sink path still drives unit tests.

Pairs with fabric's recovery work: per-chunk crc + xfer_need on malformed/corrupt frames now feed a sink that can't lose bytes silently to RAM pressure.
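The three entry points can be sketched against an in-memory stand-in for the flash slot. The function names come from the commit message; their signatures, the stagedStream and descriptor types, the 256-byte page size, and the byte slice standing in for abupdate-managed flash are all assumptions made for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"hash"
)

// pageSize is an assumed flash program granularity.
const pageSize = 256

// descriptor is the frozen result of a streamed stage.
type descriptor struct {
	Length int
	SHA256 string
}

type stagedStream struct {
	flash []byte    // stand-in for the inactive A/B slot
	page  []byte    // bytes waiting to fill a whole page
	sum   hash.Hash // running SHA-256 of everything streamed
	n     int
	done  bool
}

// BeginStreamedStage starts a new stage with an empty slot and a fresh hash.
func BeginStreamedStage() *stagedStream { return &stagedStream{sum: sha256.New()} }

// WriteStreamedStage hashes the chunk and writes out every full page at once,
// so at most one partial page is ever held in RAM.
func (s *stagedStream) WriteStreamedStage(p []byte) error {
	if s.done {
		return errors.New("stage already committed")
	}
	s.sum.Write(p)
	s.n += len(p)
	s.page = append(s.page, p...)
	for len(s.page) >= pageSize {
		s.flash = append(s.flash, s.page[:pageSize]...)
		s.page = s.page[pageSize:]
	}
	return nil
}

// CommitStreamedStage flushes the final partial page and freezes the descriptor.
func (s *stagedStream) CommitStreamedStage() descriptor {
	s.flash = append(s.flash, s.page...)
	s.page = nil
	s.done = true
	return descriptor{Length: s.n, SHA256: hex.EncodeToString(s.sum.Sum(nil))}
}
```

The memory held between chunks is bounded by one page regardless of artefact size, which is the property that makes a ~424KB image workable where a whole-artefact RAM buffer was not.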

cpunt changed the base branch from main to fabric-protocol April 13, 2026 11:21

cpunt marked this pull request as draft April 13, 2026 11:23

cpunt mentioned this pull request Apr 16, 2026

fix silent RX drops: enlarge ring to 4 KiB, add drop counters jangala-dev/tinygo-uartx#5

Closed

cpunt changed the base branch from fabric-protocol to main April 28, 2026 20:21

cpunt changed the base branch from main to fabric-protocol April 28, 2026 20:21

cpunt force-pushed the fabric-update branch from acc086e to 491e732 Compare May 18, 2026 19:07

cpunt added 20 commits May 19, 2026 09:06

fabric: stabilize MCU firmware updates

8f1c827

updater: tighten update branch comments

fc19ef9

updater: remove test-only production helpers

d2ec15b

cpunt force-pushed the fabric-update branch from 491e732 to d2ec15b Compare May 19, 2026 09:14

cpunt wants to merge 20 commits into fabric-protocol from fabric-update
