feat: fabric session protocol with firmware transfer support #40
Draft
cpunt wants to merge 20 commits into
Implements the W1/W4/W5/W6 backbone of the fabric-update plan:
W1 — Verifier interface + production stub
- services/updater/verifier.go defines the Verifier interface with
Verify(io.Reader, SlotSink) (Manifest, error). Production wiring
passes StubVerifier(), which always rejects with
ErrUnsignedNotSupported. Test fakes (fakeVerifierAccept,
fakeVerifierReject) exercise both branches without touching the
production stub.
- SlotSink keeps the verifier-to-staging seam tiny so fabric-security
can swap in an abupdate-backed sink without API churn.
- Manifest is the small subset of the signed-image manifest the
receiver propagates onward (version, build_id, image_id, sha256,
length). Full canonical manifest stays in pico2-a-b/imagev1 which
fabric-security adds.
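A minimal sketch of the W1 seam, for orientation only. The Verify signature, ErrUnsignedNotSupported, StubVerifier and the Manifest field list come from the description above; the SlotSink method set, the Go field names and the error text are assumptions, not the contents of services/updater/verifier.go.

```go
package main

import (
	"errors"
	"io"
)

// Manifest is the small subset propagated onward. Field names are assumed
// from the list above (version, build_id, image_id, sha256, length).
type Manifest struct {
	Version string
	BuildID string
	ImageID string
	SHA256  string
	Length  int64
}

// SlotSink is the verifier-to-staging seam. This method set is hypothetical;
// the real interface may differ.
type SlotSink interface {
	io.Writer
	Commit() error
	Abort()
}

// Verifier matches the signature quoted in the description.
type Verifier interface {
	Verify(r io.Reader, sink SlotSink) (Manifest, error)
}

// ErrUnsignedNotSupported is what the production stub always returns.
// The message text here is a placeholder.
var ErrUnsignedNotSupported = errors.New("verifier_stub: unsigned artefacts not supported")

type stubVerifier struct{}

// Verify rejects every artefact without reading it or touching the sink.
func (stubVerifier) Verify(io.Reader, SlotSink) (Manifest, error) {
	return Manifest{}, ErrUnsignedNotSupported
}

// StubVerifier returns the always-rejecting production verifier.
func StubVerifier() Verifier { return stubVerifier{} }
```

Keeping the stub free of any I/O means a test fake only has to satisfy the same one-method interface to drive the accept branch.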
W4 — Updater state machine + RPC handlers
- 7 canonical states + the empty "" (= nil) quiescent state, with
Allowed() guard for tests that try invalid transitions.
- handlePrepare clears last_error and transitions to ready, returning
{ok=true, accepted=true}. Refuses with err="busy" when another
prepare is in flight or state==applying.
- handleCommit transitions to applying when a staged descriptor or
in-state staged transition is present; refuses with
err="nothing_staged" otherwise. Reboot/slot-switch is left to the
abupdate metadata writer that lands in W11/fabric-security.
- Local-bus topics rpc/updater/{prepare,commit} bind here; the wire
topics cmd/self/updater/{prepare,commit} are routed in via fabric's
importCallRules (services/fabric/remap.go).
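The transition guard can be pictured as a small lookup table. The sketch below is an assumption throughout: only the state names ready, staged, applying, failed and the empty quiescent state appear in this description, so the remaining canonical states are omitted, and the transition set is inferred from the prepare/commit/receiver behaviour rather than copied from the service.

```go
package main

// State is the updater state as carried in UpdaterFact.state.
type State string

const (
	StateQuiescent State = "" // empty string, the quiescent state
	StateReady     State = "ready"
	StateStaged    State = "staged"
	StateApplying  State = "applying"
	StateFailed    State = "failed"
)

// allowedTransitions is a hypothetical table inferred from the handlers:
// prepare moves to ready, the receiver moves to staged or failed, and
// commit moves staged to applying.
var allowedTransitions = map[State][]State{
	StateQuiescent: {StateReady},
	StateReady:     {StateReady, StateStaged, StateFailed},
	StateStaged:    {StateReady, StateApplying, StateFailed},
	StateFailed:    {StateReady},
	StateApplying:  {}, // terminal until reboot in this sketch
}

// Allowed reports whether the table permits moving from one state to another.
func Allowed(from, to State) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```

A table like this lets tests probe invalid transitions (for example applying back to ready) without driving the RPC handlers.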
W5 — Identity facts (state/self/{software, updater, health})
- SoftwareFact = {version, build, image_id, boot_id, payload_sha256}.
- UpdaterFact = {state, last_error, pending_version}, flat shape.
- HealthFact = {state, reason?} with omitempty on reason.
- Service emits the three facts on Run() startup and re-emits via
Republish() — the latter is the hook fabric will call on every
hello_ack edge (W10 will wire that in).
W6 — boot_id RAM-only generator
- 8 bytes from crypto/rand, hex-encoded to 16 lower-case chars.
- Generated once at boot via main.go's call site between HAL ready
and reactor.Run — i.e. before fabric opens, so the first
software-fact publish carries a populated boot_id.
- Held in RAM only via atomic.Pointer cache. Idempotent — repeat
calls return the cached value.
- Fallback: process-startup counter packed into 16 hex chars when
crypto/rand fails or returns all-zero. Logged so the failure-mode
hardware test suite (master R3) can grep for it.
- Per-test reset via resetBootIDForTest() exercised by 4 unit tests
covering len-16-hex, idempotence, uniqueness across "boots", and
all-zero-sentinel rejection.
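The W6 generator can be sketched as below, assuming a host build where crypto/rand works. GenerateBootID is the name used in this PR; encodeBootID, cachedBootID and the constant fallback value are illustrative stand-ins (the real fallback is a counter or mix, described elsewhere in this PR).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"sync/atomic"
)

// cachedBootID holds the RAM-only value; nil until the first call.
var cachedBootID atomic.Pointer[string]

// encodeBootID hex-encodes 8 bytes to 16 lower-case characters and rejects
// the all-zero sentinel.
func encodeBootID(raw [8]byte) (string, bool) {
	if raw == ([8]byte{}) {
		return "", false
	}
	return hex.EncodeToString(raw[:]), true
}

// GenerateBootID returns the cached boot_id, generating it on first use.
func GenerateBootID() string {
	if p := cachedBootID.Load(); p != nil {
		return *p // idempotent: repeat calls return the cached value
	}
	id := "0000000000000001" // placeholder fallback, not the real mix
	var raw [8]byte
	if _, err := rand.Read(raw[:]); err == nil {
		if s, ok := encodeBootID(raw); ok {
			id = s
		}
	}
	// Only the first writer wins; later callers read the stored value.
	cachedBootID.CompareAndSwap(nil, &id)
	return *cachedBootID.Load()
}
```

Calling GenerateBootID twice in the same process returns the same 16-character string, which is the idempotence property the unit tests check.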
Receiver wiring (W2 placeholder)
- services/updater/receiver.go binds to the literal local topic
raw/member/mcu/cap/updater/main/rpc/receive that CM5 sends as
meta.receiver. Decodes ReceiverPayload, calls the verifier with a
bytes.Reader over the artefact and a memorySink. On success
publishes state=staged with manifest.version as pending_version
and replies {ok:true, stage:"staged"}. On failure publishes
state=failed with last_error and replies {ok:false, err:...}.
- Stub verifier produces the failure path on every artefact, so the
production build cannot stage; fakeVerifierAccept exercises the
staged-publish path in tests.
- The fabric-side glue that actually CALLS this receiver after
xfer_commit lands in phase 2 (W2 + W3 stage/apply split).
Glue
- services/fabric/remap.go importCallRules: cmd/self/updater/prepare
-> rpc/updater/prepare; same for commit.
- services/reactor/reactor.go starts the updater service in its own
goroutine before the env+power subscriptions, so retained facts
land before the fabric link opens.
- main.go calls updater.GenerateBootID() between HAL ready and
reactor.Run, logging the chosen value.
Tests (services/updater/updater_test.go)
- Initial fact publish: software/updater/health all retain on Run
startup with the configured Identity + 16-char boot_id.
- Prepare happy path: ok=true, state -> ready, last_error cleared.
- Commit without staged: refusal err="nothing_staged".
- Commit with staged descriptor: ok=true, state -> applying,
pending_version mirrors descriptor.Version.
- Receiver under stub verifier: state -> failed, last_error contains
the verifier_stub sentinel; reply ok=false.
- Receiver under fakeVerifierAccept: state -> staged,
pending_version = manifest.Version; reply ok=true stage=staged.
- Receiver under fakeVerifierReject: state -> failed with the
custom error string.
- jsonDecode tolerates typed payloads, json.RawMessage, []byte, and
nil — the four shapes the bus actually delivers in practice.
- memorySink Abort clears the buffer; Commit closes further writes.
Build sizes (pico_bb_proto_1):
release default : code 184640 (was 280364; TinyGo dead-code elimination
cleaning up after the receiver/verifier refactor)
release flash_unsafe: code 184628
qa_reactor : code 136232
Out of scope for this commit (later phases):
- W2/W3 fabric-side stage/apply split + receiver invocation from
transfer.go after xfer_commit. Receiver is wired but not yet
invoked by fabric.
- W7/W8 retained telemetry publishers + charger alert FSM.
- W9 lane assignment routing into fabric's 3-lane writer.
- W10 post-hello_ack republish hook.
- W11 abupdate metadata extension (payload_sha256 + staged
descriptor read/write).
- W12 legacy Go remap removal.
- W13 L4 acceptance prep on devicecode-lua.
Implements W2/W3 of the fabric-update plan: fabric's onTransferCommit
now hands the committed artefact to the local-bus receiver topic
named in meta.receiver, gates xfer_done on the receiver's reply, and
defers slot-switch/reboot to the updater commit RPC.
W2 — Receiver invocation
- transfer.go's onTransferCommit, after sink.Commit() succeeds and
before sending xfer_done, decodes meta.receiver from the xfer_begin
meta blob and calls that local-bus topic with
{link_id, xfer_id, size, checksum, meta, artefact}.
- The reply gates the next frame: ok=true -> xfer_done; ok=false ->
xfer_abort with err = receiver's err string. Synchronous call with
receiverCallTimeout (5s); session goroutine blocks intentionally
during the call because we MUST decide xfer_done vs xfer_abort
before processing further frames.
- decodeReceiverReply tolerates two reply shapes — typed Go struct
(in-process tests) and JSON bus payload (real receiver via
services/updater.ReceiverReply).
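A sketch of the reply handling described in W2, under stated assumptions: ReceiverReply's JSON keys (ok, stage, err) follow the reply shapes quoted in this PR, while the Go field names, the helper decodeJSONReply, and nextFrame are hypothetical and not taken from transfer.go.

```go
package main

import (
	"encoding/json"
	"errors"
)

// ReceiverReply mirrors the {ok, stage, err} reply shape; Go names assumed.
type ReceiverReply struct {
	OK    bool   `json:"ok"`
	Stage string `json:"stage,omitempty"`
	Err   string `json:"err,omitempty"`
}

// decodeReceiverReply accepts a typed struct (in-process tests) or a JSON
// payload (bus delivery) and normalises both to ReceiverReply.
func decodeReceiverReply(payload any) (ReceiverReply, error) {
	switch v := payload.(type) {
	case ReceiverReply:
		return v, nil
	case *ReceiverReply:
		if v == nil {
			return ReceiverReply{}, errors.New("nil receiver reply")
		}
		return *v, nil
	case []byte:
		return decodeJSONReply(v)
	case json.RawMessage:
		return decodeJSONReply(v)
	default:
		return ReceiverReply{}, errors.New("unsupported receiver reply shape")
	}
}

func decodeJSONReply(raw []byte) (ReceiverReply, error) {
	var r ReceiverReply
	if err := json.Unmarshal(raw, &r); err != nil {
		return ReceiverReply{}, err
	}
	return r, nil
}

// nextFrame maps the reply to the frame that gates the session:
// ok=true -> xfer_done, ok=false -> xfer_abort carrying the receiver's err.
func nextFrame(r ReceiverReply) (frame string, reason string) {
	if r.OK {
		return "xfer_done", ""
	}
	return "xfer_abort", r.Err
}
```

Normalising both shapes in one place keeps the done/abort decision independent of whether the receiver ran in-process or behind the bus.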
W3 — Stage/apply split
- transferSink interface grows Bytes() []byte. nil means the sink
streamed elsewhere (e.g. flash_unsafe writes directly to flash and
doesn't keep a RAM copy). Buffer sink returns its committed slice;
flash_unsafe sink and fakeTransferSink return nil.
- onTransferCommit calls the receiver only when (a) Bytes() returns
non-nil AND (b) meta.receiver is present. Otherwise it falls
through to the legacy xfer_done -> sink.Apply() flow — keeping
flash_unsafe builds and tests that don't model the receiver
unaffected.
- The receiver path NO LONGER calls sink.Apply(). Slot-switch and
reboot are deferred to cmd/self/updater/commit, which the updater
state machine in services/updater handles. fabric-security wires
the actual reboot through abupdate when it lands.
New default sink (transfer_sink_buffer.go)
- Replaces the rejecting transfer_sink_rp2350.go default and the
rejecting transfer_sink_stub.go host build with an in-RAM buffer
capped at 64 KB. Hitting the cap returns ErrArtefactTooLarge from
WriteChunk, which propagates to xfer_abort.
- Bytes() returns the committed slice for the receiver. Apply() is
a no-op (slot-switch is the updater RPC's job).
- The 64 KB cap is conservative — large-firmware streaming-into-flash
for unverified artefacts is fabric-security's job; this branch only
needs enough to round-trip the smoke-test traffic and the receiver
path with a stub verifier.
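The capped buffer sink might look roughly like this. ErrArtefactTooLarge, WriteChunk, Bytes, Commit and Apply are named in this PR; the exact signatures, the error text, the constant name and the "nil before commit" behaviour are assumptions.

```go
package main

import "errors"

// maxArtefactBytes is the 64 KB cap described above (constant name assumed).
const maxArtefactBytes = 64 * 1024

// ErrArtefactTooLarge is returned from WriteChunk when the cap would be
// exceeded; the message text is a placeholder.
var ErrArtefactTooLarge = errors.New("artefact_too_large")

// bufferSink keeps the whole artefact in RAM for the receiver.
type bufferSink struct {
	buf       []byte
	committed bool
}

// WriteChunk appends a chunk unless doing so would exceed the cap.
func (s *bufferSink) WriteChunk(p []byte) error {
	if len(s.buf)+len(p) > maxArtefactBytes {
		return ErrArtefactTooLarge
	}
	s.buf = append(s.buf, p...)
	return nil
}

// Commit marks the buffer as complete.
func (s *bufferSink) Commit() error {
	s.committed = true
	return nil
}

// Bytes returns the committed slice; nil before Commit in this sketch.
func (s *bufferSink) Bytes() []byte {
	if !s.committed {
		return nil
	}
	return s.buf
}

// Apply is a no-op: slot-switch is the updater RPC's job.
func (s *bufferSink) Apply() error { return nil }
```

Rejecting the oversized chunk before appending means a failed write never leaves a partially grown buffer behind.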
Tests
- TestTransferReceiverInvokedAfterCommit: registers a local-bus
receiver handler, drives a 4-byte transfer with
meta.receiver=["test","receiver"], asserts the receiver got the
expected payload (link_id, xfer_id, checksum, artefact bytes) and
fabric sent xfer_done after the receiver replied ok=true.
- TestTransferReceiverRejectAbortsTransfer: receiver replies
{ok=false, err="manifest_check_failed"}; fabric sends xfer_abort
with err mirroring the receiver reason.
- bufferingSinkAdapter wraps the production bufferSink for these
tests so the receiver path is exercised against the real sink, not
fakeTransferSink (whose nil Bytes() is the legacy fallback path).
- All existing transfer tests (legacy fallback path with
fakeTransferSink) continue to pass — the change is additive.
Build sizes (pico_bb_proto_1):
release default : 184616 (effectively unchanged)
release flash_unsafe: 184628
Out of scope for this commit (still later):
- W7 retained telemetry publishers + W8 charger alert FSM.
- W9 lane assignment routing into the 3-lane writer.
- W10 post-hello_ack republish hook (Republish() exists; needs to be
invoked from the fabric session lifecycle).
- W11 abupdate metadata extension (payload_sha256 + staged
descriptor read/write).
- W12 legacy Go remap removal.
- W13 L4 acceptance prep on devicecode-lua.
The updater service now watches the fabric link-state retain at
state/fabric/link/+ and calls Republish() on every !Ready -> Ready
transition. Mirrors the spec in
docs/firmware-alignment-update.md:
Retained `state/self/*` is republished:
- immediately after every successful boot
- on every newly established session (`hello_ack`), warm or cold
Implementation:
- Run() grows a fourth subscription on TopicFabricLink
(state/fabric/link/+), tracking per-link Ready flag in a small map
to detect edges and avoid double-firing on subsequent Ready=true
retains that keep the same edge state.
- decodeLinkState pulls the link_id from the topic tail and Ready
from the payload (typed map or JSON-marshalable struct — covers
both the in-process linkStatePayload published by
services/fabric/session.go and any future test-side fakes).
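The per-link edge detection can be sketched as follows. The type and method names (linkEdges, observe) are hypothetical; only the behaviour, firing once per !Ready -> Ready transition per link, is taken from the description.

```go
package main

// linkEdges tracks the last seen Ready flag per link_id.
type linkEdges struct {
	ready map[string]bool
}

// observe records the new Ready value for a link and reports whether this
// observation is a !Ready -> Ready edge. An unseen link counts as !Ready.
func (e *linkEdges) observe(linkID string, ready bool) bool {
	if e.ready == nil {
		e.ready = make(map[string]bool)
	}
	prev := e.ready[linkID]
	e.ready[linkID] = ready
	return ready && !prev
}
```

A repeated Ready=true retain returns false, so Republish() would not double-fire; a drop to Ready=false re-arms the edge for the next session.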
Test (TestRepublishOnLinkReadyEdge):
- Drains the initial fact retains, publishes a Ready=false link
state (no republish expected), then a Ready=true edge (republish
asserted via the next state/self/software emit), then a second
Ready=true retain (no double-fire — channel must stay quiet for
the settle window).
New services/telemetry package subscribes to the existing HAL value
topics (hal/cap/env/..., hal/cap/power/...) and republishes them
under the canonical state/self/* surface using integer engineering
units. Also runs the W8 charger alert FSM emitting
event/self/power/charger/alert with 14 canonical kinds.
W7 — retained-state publishers
- state/self/power/battery: pack_mV, per_cell_mV, ibat_mA, temp_mC,
bsr_uohm_per_cell. Sourced from types.BatteryValue.
- state/self/power/charger: vin_mV, vsys_mV, iin_mA, raw bitfields
(state_bits/status_bits/system_bits) AND 3 decoded boolean maps
(state{}, status{}, system{}) carrying the canonical names from
the existing types.ChargerStateTable / ChargeStatusTable /
SystemStatusTable. The 27-boolean spec (11 + 4 + 12) follows from
the size of those tables.
- state/self/environment/temperature: deci_c.
- state/self/environment/humidity: rh_x100.
- state/self/runtime/memory: alloc_bytes from runtime.MemStats. Tick
every 3 s — matches the existing reactor mem-stat cadence.
- state/self/power/charger/config: deferred until W7 stretch (LTC4015
effective config tracking is a separate read path through the
charger driver). When it lands, it'll feed the analog-threshold
branch of the alert FSM.
Every fact carries a monotonic per-topic seq counter and a
service-monotonic uptime_ms — recommended in the spec for
"where cheap" cases, which all of these are.
W8 — charger alert FSM (services/telemetry/alerts.go)
- 14 canonical AlertKind values, frozen by the spec.
AllAlertKinds is a guarded slice; the unit test compares it
byte-for-byte against the spec's snake_case list to catch
accidental rename.
- chargerAlertFSM holds the previous ChargerValue and on each new
observation walks the bit-set transitions:
state bits -> bat_missing, bat_short, max_charge_time_fault,
absorb, equalize, cccv, precharge
status bits -> iin_limited, uvcl_active, cc_phase, cv_phase
emitting one AlertEvent per !set->set edge, sparse (retained=false).
- Analog-threshold kinds (vin_lo, vin_hi, bsr_high) are stubbed
pending the charger-config publisher; the FSM scaffolding is
there so dropping the threshold compare in is mechanical.
- AlertEvent payload carries kind, source="ltc4015", a snapshot of
state_bits/status_bits/system_bits at emit time, plus seq +
uptime_ms.
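The edge-only walk over the bitfields reduces to a mask of newly set bits. In this sketch the kind names are the canonical ones listed above, but the bit positions are placeholders and do NOT reflect the LTC4015 register layout; bitKind and risingEdges are hypothetical names.

```go
package main

// bitKind pairs a bit mask with the alert kind it raises.
type bitKind struct {
	mask uint16
	kind string
}

// stateBitKinds uses placeholder bit positions for illustration only.
var stateBitKinds = []bitKind{
	{1 << 0, "bat_missing"},
	{1 << 1, "bat_short"},
	{1 << 2, "max_charge_time_fault"},
}

// risingEdges returns the kinds whose bit was clear in prev and is set in
// cur, in table order. Bits that stay set produce nothing.
func risingEdges(prev, cur uint16, table []bitKind) []string {
	rising := cur &^ prev // set now AND NOT set before
	var kinds []string
	for _, b := range table {
		if rising&b.mask != 0 {
			kinds = append(kinds, b.kind)
		}
	}
	return kinds
}
```

Two bits flipping in the same observation yield two entries, and a bit that clears and later sets again fires a second time.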
Glue
- services/fabric/remap.go grows export rules for state/self/* and
event/self/* (no remapping — local topic = wire topic). Legacy
hal/* exports stay until W12 cleanup so the transition doesn't
break any consumers that haven't moved to the new surface yet.
- services/reactor/reactor.go starts the telemetry service in its
own goroutine after the updater service.
Tests (services/telemetry/telemetry_test.go)
- TestPublishesBatteryFact: BatteryValue from HAL -> BatteryFact at
state/self/power/battery with the right field mapping + seq=1 +
non-negative uptime.
- TestPublishesChargerWithDecodedBooleans: ChargerValue with mixed
set bits -> raw bits preserved AND decoded booleans agree
(set bits true, unset bits false).
- TestPublishesEnvironmentFacts: temp + humidity round-trip.
- TestAllAlertKindsCount: exactly 14 entries; canonical strings
match the spec byte-for-byte.
- TestChargerAlertFSMEdgeOnly: bit !set -> set fires once;
subsequent retains keeping the bit set do NOT re-fire; clear
-> set re-fires.
- TestChargerAlertFSMMultipleBitsTransitionTogether: two state
bits flipping in the same publish emit two alerts.
Build sizes (pico_bb_proto_1):
release default : 185180 (+540 over phase 3)
release flash_unsafe: 185176
qa_reactor : 136232
Adds the writer half of the abupdate metadata block surface so the
receiver actually persists a verified manifest's fields after the
verifier succeeds. The fabric-update branch ships an in-memory
implementation; fabric-security swaps in a flash-backed
implementation that survives reboots.
W11 — MetadataWriter
- New MetadataWriter interface with WriteStagedDescriptor and
ClearStagedDescriptor. Symmetric with the existing MetadataReader
so a single MemoryMetadata value can implement both.
- NewMemoryMetadata() returns a process-lifetime in-memory
reader+writer. Used as the default in Options when neither
Metadata nor MetadataWrite is supplied — gets the receiver
end-to-end working without external wiring.
- noopMetadataWriter is the fallback when a caller supplied a
MetadataReader (legacy) but no writer; it logs each call so the
gap is visible during bring-up.
Receiver
- handleReceiver now calls metadataWrite.WriteStagedDescriptor with
the manifest's {Version, BuildID, ImageID, PayloadLength,
PayloadSHA256} after the sink commits. On write failure
transitions to failed with last_error="metadata_write_failed:..."
so the failure mode is observable on the wire.
- Republishes the software fact after a successful staging so
payload_sha256 reaches the CM5 immediately rather than waiting
for the next session-restart Republish edge.
Test
- TestReceiverFakeAcceptWritesStagedDescriptor: drives the receiver
through fakeVerifierAccept, asserts the MemoryMetadata reader sees
the descriptor + payload_sha256, then exercises the commit RPC
which now succeeds (was returning nothing_staged before W11)
because the reader has the descriptor — pending_version flows
manifest.Version -> staged -> applying.
Build sizes (pico_bb_proto_1):
release default: 185176 (tiny shrink from refactor)
Five fixes flagged in the post-phase-5 review.
H1 — commit RPC must not lie about apply success
- New Applier interface for the slot-switch + reboot hook. Default
RefusingApplier returns ErrApplyUnavailable so the commit RPC
refuses with `error: "apply_unavailable"` rather than transitioning
to applying and silently never rebooting on a branch where the
apply path doesn't exist. fabric-security supplies a real
abupdate-backed implementation; tests pass fakeApplier when they
need to drive the success path.
- handleCommit calls applier.Apply(desc) and translates failures into
the wire reply directly. Only on Applier success do we transition
to applying.
H2 — stale staged-descriptor leak
- handlePrepare clears the persisted staged descriptor before
transitioning to ready. This guards (stage A) -> (prepare B) ->
(stage B fails) from leaving descriptor A persisted and
committable.
- handleCommit now requires BOTH a persisted descriptor AND
state==staged before allowing apply. Either alone refuses with
nothing_staged.
- handleReceiver clears the staged descriptor on every failure path
(verifier reject, sink commit fail, metadata write fail) so a
rejected new transfer can never accidentally commit a stale
prior stage.
H3 — canonical charger boolean key names
- ChargerFact's state{}/status{}/system{} maps now emit the
spec-canonical names from
docs/firmware-alignment-update.md §"Telemetry/state facts" rather
than the existing types.* display-table abbreviations. Wire keys
are spec-frozen because the Lua import side keys off them:
state{}: equalize_charge, absorb_charge, charger_suspended,
precharge, cccv_charge, ntc_pause, timer_term,
c_over_x_term, max_charge_time_fault,
bat_missing_fault, bat_short_fault
status{}: vin_uvcl_active, iin_limit_active, const_current,
const_voltage
system{}: charger_enabled, mppt_en_pin, equalize_req, drvcc_good,
cell_count_error, ok_to_charge, no_rt, thermal_shutdown,
vin_ovlo, vin_gt_vbat, intvcc_gt_4p3v, intvcc_gt_2p8v
- Three internal name tables (chargerStateNames,
chargerStatusNames, chargerSystemNames) replace the indirect
use of types.ChargerStateTable[]. The 27-boolean spec total
(11+4+12) is checked at test time.
- TestPublishesChargerWithDecodedBooleans updated to the canonical
names plus exact-size assertions on each map.
M1 — boot_id fallback now varies per cold boot
- The previous fallback returned the process-startup counter packed
into 16 hex chars, so every cold boot got the same value when
crypto/rand was unavailable. Master plan R3 explicitly forbids
that.
- Fallback now mixes monotonic clock at generation time, MemStats
fields (Alloc / Mallocs / HeapInuse / Frees — vary with HAL init
timing), the per-call counter, and a final 11/17/5-shift mix step
so each output byte depends on every input bit. Still not
cryptographically random, but gives reliable per-boot variation
even when crypto/rand is broken. Logged on entry so the
failure-mode test suite can grep for it.
M2 — alert severity field
- AlertEvent gains a non-omitempty `severity` string. New
alertSeverity(kind) returns "warning" for fault-class kinds
(bat_missing, bat_short, max_charge_time_fault, bsr_high) and
"info" for charge-phase / control-loop transitions. This addresses
the Codex note that the planned alert payload includes severity
but the publisher was emitting it as omitempty (so it fell off
the wire when default).
- charger_config publisher and the 3 analog kinds (vin_lo, vin_hi,
bsr_high) remain stubbed pending the LTC4015 effective-config
read path; the FSM plumbing is in place so dropping thresholds
in is mechanical when that lands.
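The severity mapping and the non-omitempty field can be sketched as below. alertSeverity and the four fault-class kinds come from the description; the AlertEvent field set is trimmed to three fields for illustration and encodeAlert is a hypothetical helper.

```go
package main

import "encoding/json"

// alertSeverity returns "warning" for fault-class kinds and "info" for
// charge-phase / control-loop transitions.
func alertSeverity(kind string) string {
	switch kind {
	case "bat_missing", "bat_short", "max_charge_time_fault", "bsr_high":
		return "warning"
	default:
		return "info"
	}
}

// AlertEvent is a trimmed illustration of the payload. Severity has no
// omitempty tag, so it is always present on the wire.
type AlertEvent struct {
	Kind     string `json:"kind"`
	Source   string `json:"source"`
	Severity string `json:"severity"`
}

// encodeAlert builds and marshals an event for the given kind.
func encodeAlert(kind string) ([]byte, error) {
	return json.Marshal(AlertEvent{
		Kind:     kind,
		Source:   "ltc4015",
		Severity: alertSeverity(kind),
	})
}
```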
New tests
- TestCommitWithoutStagedStateRefusesEvenWithDescriptor: descriptor
in metadata + state != staged refuses with nothing_staged.
- TestCommitWithoutApplierReturnsApplyUnavailable: production
default refuses; state stays at staged.
- TestCommitWithFakeApplierTransitionsToApplying: fakeApplier
exercises the future fabric-security wiring.
- TestReceiverFailureClearsStaleStagedDescriptor: pre-staged
descriptor + verifier reject leaves no committable descriptor.
- TestPrepareClearsStaleStagedDescriptor: prepare clears stale
descriptor.
Build sizes (pico_bb_proto_1, code):
release default : 185176 (unchanged)
release flash_unsafe: 185168 (unchanged)
release debug_uart : 185176 (unchanged)
qa_reactor : 136232 (unchanged)
Three follow-up fixes from the second Codex review pass.
H1 — commit RPC must publish state=applying + reply ok BEFORE reboot
- Applier interface split into CanApply (validation, returns error)
and ArmReboot (schedules reboot, may not return). The previous
single-method Apply() let real implementations reboot inside the
call before handleCommit had a chance to publish the canonical
applying retain or send the ok reply.
- handleCommit reordered:
    desc, present := metadata.StagedDescriptor()
    if !present || state != staged      -> refuse nothing_staged
    if applier.CanApply(desc) returns err -> refuse with err.Error()
    transitionTo(applying, "", desc.Version)  // retain published
    reply(ok, accepted)                       // wire reply sent
    applier.ArmReboot(desc)                   // may not return
- RefusingApplier.CanApply returns ErrApplyUnavailable; ArmReboot is
the contract-required no-op that the commit handler never reaches
on this branch.
- fakeApplier in tests tracks canCalls + rebootCalls separately so
tests can verify the ordering. Existing
TestCommitWithFakeApplierTransitionsToApplying updated to assert
both hooks fire exactly once and ArmReboot got the right
descriptor.
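The reordered commit path can be sketched in Go as follows. Applier, CanApply, ArmReboot, RefusingApplier and ErrApplyUnavailable are named in this PR; Descriptor's fields, commitDeps, the callback signatures and the trace-recording fakeApplier are assumptions made so the ordering can be observed.

```go
package main

import "errors"

// Descriptor is a trimmed stand-in for the staged descriptor.
type Descriptor struct{ Version string }

// Applier splits validation from the reboot hook.
type Applier interface {
	CanApply(Descriptor) error
	ArmReboot(Descriptor) // may not return in a real implementation
}

var ErrApplyUnavailable = errors.New("apply_unavailable")

// RefusingApplier is the default on a branch with no apply path.
type RefusingApplier struct{}

func (RefusingApplier) CanApply(Descriptor) error { return ErrApplyUnavailable }
func (RefusingApplier) ArmReboot(Descriptor)      {}

// fakeApplier records the order in which its hooks are called.
type fakeApplier struct{ trace *[]string }

func (f fakeApplier) CanApply(Descriptor) error { *f.trace = append(*f.trace, "can_apply"); return nil }
func (f fakeApplier) ArmReboot(Descriptor)      { *f.trace = append(*f.trace, "arm_reboot") }

// commitDeps bundles the side effects handleCommit needs (hypothetical).
type commitDeps struct {
	applier Applier
	publish func(state, pendingVersion string)
	reply   func(ok bool, errStr string)
}

// handleCommit returns the resulting state. It publishes applying and sends
// the ok reply BEFORE arming the reboot.
func handleCommit(state string, desc *Descriptor, d commitDeps) string {
	if desc == nil || state != "staged" {
		d.reply(false, "nothing_staged")
		return state
	}
	if err := d.applier.CanApply(*desc); err != nil {
		d.reply(false, err.Error())
		return state
	}
	d.publish("applying", desc.Version)
	d.reply(true, "")
	d.applier.ArmReboot(*desc)
	return "applying"
}
```

With RefusingApplier the handler replies apply_unavailable and the state stays at staged, which matches the production default described above.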
M2 — payload_sha256 separation
- MemoryMetadata previously had one payloadSHA field, mutated by
WriteStagedDescriptor and read by SoftwareFact. So a stage write
promoted the staged hash into the running fact, AND a subsequent
ClearStagedDescriptor (from prepare or receiver failure) left the
invalidated hash sitting on the wire.
- Split into runningPayloadSHA and (descriptor-embedded)
staged-payload hash. WriteStagedDescriptor no longer touches the
running hash; ClearStagedDescriptor only nulls the descriptor.
PayloadSHA256() reads runningPayloadSHA, which is set once at boot
via SetRunningPayloadSHA (fabric-security wires this from the
active slot's flash metadata).
- Receiver no longer republishes the software fact on stage success
— the staged hash isn't a property of the running image.
- TestReceiverFakeAcceptWritesStagedDescriptor now asserts that
WriteStagedDescriptor leaves PayloadSHA256() at "" (running hash
unset), guarding the regression.
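A sketch of the split between the running hash and the staged descriptor. The method names follow this PR (WriteStagedDescriptor, ClearStagedDescriptor, PayloadSHA256, SetRunningPayloadSHA, StagedDescriptor); the struct layout, the trimmed descriptor fields and the mutex are assumptions.

```go
package main

import "sync"

// StagedDescriptor is trimmed to the fields needed for the illustration.
type StagedDescriptor struct {
	Version       string
	PayloadSHA256 string
}

// MemoryMetadata keeps the running image's hash separate from the staged
// descriptor so staging never changes what the software fact reports.
type MemoryMetadata struct {
	mu                sync.Mutex
	runningPayloadSHA string
	staged            *StagedDescriptor
}

// SetRunningPayloadSHA is called once at boot for the active slot.
func (m *MemoryMetadata) SetRunningPayloadSHA(sha string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.runningPayloadSHA = sha
}

// PayloadSHA256 reports the running image's hash only.
func (m *MemoryMetadata) PayloadSHA256() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.runningPayloadSHA
}

// WriteStagedDescriptor stores the descriptor without touching the running hash.
func (m *MemoryMetadata) WriteStagedDescriptor(d StagedDescriptor) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.staged = &d
	return nil
}

// ClearStagedDescriptor only nulls the descriptor.
func (m *MemoryMetadata) ClearStagedDescriptor() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.staged = nil
	return nil
}

// StagedDescriptor returns the staged descriptor, if any.
func (m *MemoryMetadata) StagedDescriptor() (StagedDescriptor, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.staged == nil {
		return StagedDescriptor{}, false
	}
	return *m.staged, true
}
```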
M3 — boot_id fallback comment accuracy
- Removed the misleading "stack-address mixing" reference from the
earlier comment (the code never did that). Replaced with an
accurate description of what the mix actually does (UnixNano +
MemStats fields + per-call counter + 3-stage shift mix) plus a
blunt note that the fallback is NOT contract-grade and the right
long-term fix is HAL RNG or a persisted boot counter in the
abupdate metadata block. The `[updater] boot_id fallback engaged`
log line stays as the canonical signal for the failure-mode test
suite to detect the engagement.
Verified
- go test ./... clean.
- tinygo builds pass for: pico_bb_proto_1, pico_bb_proto_1 flash_unsafe,
pico_bb_proto_1 debug_uart, qa_reactor pico_bb_proto_1.
- Build sizes unchanged across all four paths (185176 release default).
Remaining (still deferred, unchanged):
- pico-side flash backing for MemoryMetadata (W11 finish on
pico2-a-b/abupdate).
- state/self/power/charger/config publisher + the 3 analog alert
kinds (W7/W8 finish, blocked on LTC4015 effective-config read path).
- W12 legacy remap removal.
- W13 lua-side L4 acceptance work.
Codex review pass 3: the in-line comment above the fallback's tick++ still claimed the implementation mixed "the address of a stack-local variable (varies due to stack layout / ASLR-equivalent on TinyGo)". The code never actually did that — an earlier draft tried but the mid-expression `unsafe.Pointer` casts didn't compile, so the line was dropped without a corresponding doc update. Replaced with a one-paragraph note that the fallback is best-effort per-boot jitter (not contract-grade) and points readers to the more detailed mix description below. The `[updater] boot_id fallback engaged` log line is left explicit as the failure-mode test grep target. No code change.
Closes the remaining W7/W8 gaps Codex flagged in the second review pass.
W7 finish — state/self/power/charger/config publisher
- New ChargerConfigFact carrying { schema=1, source="ltc4015",
thresholds{vin_lo_mV, vin_hi_mV, bsr_high_uohm_per_cell},
alert_mask_bits, alert_mask{14 booleans}, seq, uptime_ms }.
Strict no-operating-state-booleans contract per the spec.
- DefaultChargerConfig() returns conservative bring-up values
(vin_lo=10500 mV, vin_hi=17000 mV, bsr_high=5000 uohm/cell). The
LTC4015 driver isn't yet tracking effective programmed config —
Service.SetChargerConfig swaps the values once that lands.
- Published at service startup AND on every state/fabric/link/+
Ready-edge (mirrors updater W10) so the CM5 sees a fresh config
retain on every newly established session, warm or cold.
W8 finish — analog alert kinds
- chargerAlertFSM now consumes BatteryValue too via observeBattery,
tracking BSR_uOhmPerCell across observations.
- vin_lo: emits when ChargerValue.VIN_mV crosses below
thresholds.vin_lo_mV (edge from >=threshold to <threshold).
- vin_hi: emits when VIN_mV crosses above thresholds.vin_hi_mV.
- bsr_high: emits when BatteryValue.BSR_uOhmPerCell crosses above
thresholds.bsr_high_uohm_per_cell. Severity is "warning" per
alertSeverity().
- All three honour the same edge-only contract as the bit-driven
kinds: subsequent observations past the threshold do NOT re-emit.
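The analog edge rules reduce to comparing the previous and current observation against the threshold. The function names below are hypothetical; the comparisons follow the wording above (vin_lo fires on >= threshold to < threshold, vin_hi and bsr_high on crossing above), and the example values are the bring-up defaults quoted in this PR.

```go
package main

// crossedBelow reports an edge from >= threshold to < threshold (vin_lo).
func crossedBelow(prev, cur, threshold int32) bool {
	return prev >= threshold && cur < threshold
}

// crossedAbove reports an edge from <= threshold to > threshold
// (vin_hi, bsr_high).
func crossedAbove(prev, cur, threshold int32) bool {
	return prev <= threshold && cur > threshold
}
```

For example, with vin_lo at 10500 mV, an observation sequence of 12000, 10000, 9000 fires once (on the 10000 reading) and stays quiet on the 9000 reading because both neighbours are already below the threshold.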
ChargerAlertMask carries the 14 spec-frozen booleans
(vin_lo / vin_hi / bsr_high / bat_missing / bat_short /
max_charge_time_fault / absorb / equalize / cccv / precharge /
iin_limited / uvcl_active / cc_phase / cv_phase). On this branch
the mask is informational only — the FSM emits regardless of the
mask state. Once the LTC4015 driver actually programs the chip's
alert-enable register and reports it back, mask wiring through to
emission becomes a small Service.publishChargerConfig follow-up.
Tests
- TestPublishesChargerConfigAtStartup: config retain lands on Run
start with schema=1, source=ltc4015, non-zero defaults.
- TestChargerAlertFSMVinLoEdge: prime above threshold, cross below,
verify vin_lo emit; subsequent observation below threshold
doesn't duplicate.
- TestChargerAlertFSMVinHiEdge: cross above, verify vin_hi.
- TestChargerAlertFSMBSRHighEdge: BatteryValue.BSR rising past
threshold emits bsr_high with severity="warning".
Build sizes (pico_bb_proto_1, code):
release default : 184664 (-512 from phase 5; despite the new code,
dead-code elimination cleaned up the receiver republishSoftware path
that M2 dropped)
release flash_unsafe: 184656
qa_reactor : 136232
Drops the pre-fabric-update MCU surface — config/device -> config/hal
import, rpc/hal/dump inline RPC handler, hal/cap/env / hal/cap/power /
hal/state exports, and fabric/out/rpc/hal/dump call export — in
favour of the new state/self/* + cmd/self/updater/{prepare,commit} +
event/self/* + receiver-topic surface added across the rest of this
branch.
remap.go
- importPublishRules: emptied. config/device -> config/hal is gone;
config-like data flows through cmd/self/updater/prepare's metadata
field.
- exportPublishRules: keeps state/self/* and event/self/* (W7/W8);
drops the three legacy hal/* -> state/* exports.
- exportCallRules: emptied. The MCU is no longer an outbound RPC
caller on this branch.
- importCallRules unchanged: still routes cmd/self/updater/{prepare,
commit} -> rpc/updater/{prepare,commit}.
session.go
- onPub no longer has the config/device -> config/hal special case;
every imported pub now flows through the generic
conn.Publish + trackImportedRetain path.
- onCall no longer has the inline rpc/hal/dump handler. dumpReply,
dumpCallTopic, tConfigHAL, configApplied, configCount,
lastConfigErr fields all removed from session.
- Now-unused imports of devicecode-go/types pruned.
config.go
- decodeHALConfig / decodeHALConfigBytes / decodeHALState helpers
removed (sole callers were the dump handler + config/hal special
case). decodePayload survives — still used by session.onReply for
forwarding RPC replies onto the originating Request's reply path.
Tests deleted (legacy-path-only, replaced by services/updater and
services/telemetry coverage):
- TestPubImport, TestPubExport, TestUnretainExport, TestUnretain
- TestDumpCallReturnsConfigState, TestDumpCallDoesNotBlockPing
- TestCallExport, TestCallExportOnlyConfiguredRule,
TestCallExportPeerReset
- TestPendingWireCallsTimeout
- TestEchoedHelloAckIgnoredDuringOutgoingCall, TestPreHelloDump
- TestDrainOutgoingWireCallsReportsMarshalFailure,
TestDrainOutgoingWireCallsReportsWriteFailure
Tests updated to canonical post-W12 shape:
- TestImportPublishTopic: asserts every input now returns nil.
- TestImportCallTopic: asserts the cmd/self/updater/{prepare,commit}
routes, plus rpc/hal/dump now returning empty.
- TestExportTopic: state/self/* and event/self/* paths; legacy
hal/cap/* paths assert nil.
- TestExportCallTopic, TestExportCallPatterns: assert empty rules.
- TestCallImport: switched from the legacy rpc/hal/dump path to
cmd/self/updater/prepare via the canonical importCallRules
routing.
- TestSessionResetUnretainsImports: installs a t.Cleanup-scoped
temp importPublishRule (same harness pattern as
TestInboundCallBusyAtCapacity) since the production rules are
empty post-W12. The mechanism under test (retain tracking +
session-reset teardown) is generic; the topic was just an
example.
KNOWN: this commit breaks the lua-side smoke test on
chris-fabric-new-update which still pushes config/mcu retain (drops
on the MCU as no_route) and calls rpc/peer/mcu-1/hal/dump (drops as
no_route). The next commit on devicecode-lua updates smoke.lua to
observe state/self/* retains and call cmd/self/updater/prepare via
new outbound_call_rules in mcu-dev.json.
Build sizes (pico_bb_proto_1, code):
release default : 184664 (unchanged — dead code already trimmed
by tinygo elim of the inline dump path)
release flash_unsafe: 184656
release debug_uart : 184664
qa_reactor : 136232
Codex review M2 follow-up: ChargerConfigFact was emitting
source="ltc4015" while actually carrying DefaultChargerConfig() --
that misrepresented the data on the wire. CM5 consumers couldn't tell
whether the config fact was effective programmed values or fallback
bring-up defaults.
Fix: ChargerConfig grows a Source string field. DefaultChargerConfig()
sets Source="ltc4015-default" so the wire value is honest about its
provenance. When the LTC4015 driver wires effective programmed values
via Service.SetChargerConfig, the caller passes Source="ltc4015" and
the next publish carries that on the wire. publishChargerConfig
defensively falls back to "ltc4015-default" if a caller leaves Source
empty so the wire never carries an empty source field.
Tests
- TestPublishesChargerConfigAtStartup updated: asserts the initial
publish carries source="ltc4015-default" rather than the misleading
"ltc4015".
- TestSetChargerConfigPromotesSourceString: drives a separate Service
through SetChargerConfig with Source="ltc4015" and asserts the
resulting fact carries that source plus the driver-supplied
thresholds. Documents the contract for fabric-security /
charger-driver integration: pass Source="ltc4015" once effective
register state is being read.
This is the smallest correct fix for the misrepresentation -- W7/W8's
analog FSM still uses the threshold values regardless of source, since
the alert FSM operates on the actual numeric thresholds the chip is
programmed with (or the conservative defaults until SetChargerConfig
fires).
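The provenance rule can be sketched as below. ChargerConfig, its Source field, DefaultChargerConfig and the two source strings come from the description; the threshold field names and the helper wireSource are assumptions standing in for the fallback inside publishChargerConfig.

```go
package main

// ChargerConfig carries thresholds plus a provenance string.
// Threshold field names are assumed for illustration.
type ChargerConfig struct {
	Source         string
	VinLoMilliV    int32
	VinHiMilliV    int32
	BSRHighMicroOh int32 // micro-ohm per cell
}

const (
	sourceDefault   = "ltc4015-default" // bring-up defaults
	sourceEffective = "ltc4015"         // effective programmed values
)

// DefaultChargerConfig returns the conservative bring-up values and marks
// them as defaults.
func DefaultChargerConfig() ChargerConfig {
	return ChargerConfig{
		Source:         sourceDefault,
		VinLoMilliV:    10500,
		VinHiMilliV:    17000,
		BSRHighMicroOh: 5000,
	}
}

// wireSource is the defensive fallback: an empty Source never reaches the wire.
func wireSource(c ChargerConfig) string {
	if c.Source == "" {
		return sourceDefault
	}
	return c.Source
}
```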
Hardware bring-up on RP2350 with the latest fabric-update build
panicked with "panic: no rng" before HAL even started:
0.000 [main] bootstrapping bus …
0.000 [main] starting hal.Run …
panic: no rng
Root cause: TinyGo's crypto/rand on RP2350 (and other targets without
a hardware-RNG backend wired into its runtime) PANICS with "no rng"
inside rand.Read rather than returning an error. Our boot_id
generate() guarded against err != nil + all-zero bytes -- neither
catches the panic -- so the firmware crashed at the first
GenerateBootID call in main.go.
Fix
- boot_id.go now defers the crypto/rand attempt to a per-build helper
tryCryptoRand(buf) bool. The shared file only handles the
deterministic-mix fallback path.
- boot_id_host.go (//go:build !tinygo): real crypto/rand.Read for host
tests + non-TinyGo builds where crypto/rand actually works.
- boot_id_tinygo.go (//go:build tinygo): always returns false, so
generate() falls straight through to the deterministic mix.
Why not defer/recover
- Wrapping rand.Read in defer/recover does work, but pulls in TinyGo's
panic-handling runtime and grew the binary by ~110 KB (measured
pico_bb_proto_1 release default). Skipping crypto/rand entirely on
TinyGo avoids that overhead -- and the deterministic mix (UnixNano +
MemStats fields + tick + 3-stage shift) is what master plan R3
already documented as the not-contract-grade fallback. The
`[updater] boot_id fallback engaged` log line will fire on every
RP2350 boot until TinyGo grows an RP2350 RNG backend or we route a
HAL-supplied RNG into services/updater.
Verified
- go test ./services/updater/... clean (host build still uses real
crypto/rand via boot_id_host.go).
- tinygo build paths all pass: pico_bb_proto_1, pico_bb_proto_1
flash_unsafe, pico_bb_proto_1 debug_uart, qa_reactor pico_bb_proto_1.
The fabric-update branch defaults to StubVerifier+RefusingApplier so
no commit can ever lie about apply success without a real verifier in
place. That blocks fw-update-e2e -- the harness needs a working
end-to-end stage+commit+reboot path before fabric-security ships
imagev1.
Land a bringup-only path:
- PassthroughVerifier(identity) accepts any artefact, streams it into
the SlotSink while computing SHA-256, returns a synthetic Manifest
populated from the build-time identity + computed hash + artefact
length. No envelope, no signature -- temporary until imagev1 lands.
- newSlotSink is build-tag-split:
!tinygo: returns the existing memorySink (RAM buffer; tests use it).
tinygo && rp2350: returns abupdateSink that streams straight into
the inactive A/B slot via abupdate.WriteChunk, and FlushFinals on
Commit. A package-level sharedUpdater persists across the
receiver path (writes the slot) and the applier path (reboots
into it).
- ProductionApplier is build-tag-split:
!tinygo: still RefusingApplier (host can't reboot).
tinygo && rp2350: abupdateApplier whose ArmReboot calls
abupdate.RebootIntoSlot (does not return on success).
- Reactor wires PassthroughVerifier + ProductionApplier as the
defaults. When fabric-security lands the real verifier+manifest, the
reactor switches over without touching the receiver.
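The passthrough verifier described above can be sketched as below. The type shapes are assumptions: `Identity`, the `Manifest` field names, and a `SlotSink` that is simply an `io.Writer` are hypothetical simplifications, and `verifyBytes` exists only to make the example easy to call.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// Identity and Manifest are hypothetical, trimmed-down shapes.
type Identity struct{ Version, BuildID, ImageID string }

type Manifest struct {
	Version, BuildID, ImageID string
	SHA256                    string
	Length                    int64
}

// SlotSink is assumed here to be a plain io.Writer.
type SlotSink interface{ io.Writer }

// memorySink is a RAM-backed sink like the one host builds use.
type memorySink struct{ buf bytes.Buffer }

func (m *memorySink) Write(p []byte) (int, error) { return m.buf.Write(p) }

// passthroughVerify accepts any artefact: it streams the bytes into the
// sink while hashing them, then builds a synthetic manifest from the
// build-time identity plus the computed hash and length. No signature is
// checked, so this is for bring-up only.
func passthroughVerify(id Identity, r io.Reader, sink SlotSink) (Manifest, error) {
	h := sha256.New()
	n, err := io.Copy(io.MultiWriter(sink, h), r)
	if err != nil {
		return Manifest{}, err
	}
	return Manifest{
		Version: id.Version, BuildID: id.BuildID, ImageID: id.ImageID,
		SHA256: hex.EncodeToString(h.Sum(nil)),
		Length: n,
	}, nil
}

// verifyBytes is a convenience wrapper that returns the staged bytes too.
func verifyBytes(id Identity, data []byte) (Manifest, []byte, error) {
	sink := &memorySink{}
	m, err := passthroughVerify(id, bytes.NewReader(data), sink)
	return m, sink.buf.Bytes(), err
}

func main() {
	m, staged, err := verifyBytes(Identity{Version: "0.0.0-dev"}, []byte("abc"))
	if err != nil {
		fmt.Println("verify failed:", err)
		return
	}
	fmt.Println(m.Length, m.SHA256, len(staged))
}
```

Hashing through `io.MultiWriter` keeps the artefact in a single pass, so the same function works whether the sink buffers in RAM or writes straight to a flash slot.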
When a frame got mangled (single-byte UART corruption mid-transfer),
logMalformed dropped it silently. CM5 kept streaming subsequent chunks,
the receiver rejected each as out-of-order, and the transfer died on the
phase timeout. Hardware proof: 33% transfer at 115200 baud, one flipped
byte, transfer abandoned 15s later.
If an incoming transfer is active when a frame fails to parse, emit
xfer_need(bytesWritten) so CM5 restarts from the next expected byte. This
is the standard recovery path: the protocol already supports it on the
sender side; the receiver was just failing to issue the signal.
A stale need (the mangled frame was not actually a chunk) is harmless:
CM5 ignores need.next values it has already passed.
A single-byte UART flip inside the base64 data string of an xfer_chunk
doesn't break JSON parsing -- json.Unmarshal still succeeds, base64
still decodes, and we silently absorb the wrong bytes. The whole-payload
xxHash mismatch only surfaces at xfer_commit, by which point the entire
transfer has been wasted (~1min at 115200 baud, with the bin-checksum
revealing the rot only at 99%).
Add an optional `crc` field to xfer_chunk carrying xxHash32 of the raw
bytes. On mismatch send xfer_need(bytesWritten) -- same recovery path
the malformed-frame handler uses, so the sender retransmits the gap.
crc is `omitempty` and "absent" still works (we just lose the per-chunk
check).
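A sketch of the per-chunk check, with two loud assumptions. First, the Go standard library has no xxHash32, so `crc32.ChecksumIEEE` is swapped in purely as a stand-in checksum. Second, the frame shape (`type`, `data`, `crc` as a nullable number) and the helper names are hypothetical; only the optional `crc` field and the base64 payload come from the commit.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"hash/crc32"
)

// chunkFrame is a hypothetical wire shape for xfer_chunk.
type chunkFrame struct {
	Type string  `json:"type"`
	Data string  `json:"data"`          // base64 of the raw bytes
	CRC  *uint32 `json:"crc,omitempty"` // optional per-chunk checksum
}

// encodeChunk builds a frame carrying payload. The checksum is computed
// over crcOf, so passing different bytes simulates corruption in transit;
// a nil crcOf omits the field entirely.
func encodeChunk(payload, crcOf []byte) []byte {
	f := chunkFrame{Type: "xfer_chunk", Data: base64.StdEncoding.EncodeToString(payload)}
	if crcOf != nil {
		c := crc32.ChecksumIEEE(crcOf) // stand-in for xxHash32
		f.CRC = &c
	}
	raw, _ := json.Marshal(f) // cannot fail for this struct
	return raw
}

// acceptChunk returns the new bytesWritten and whether the receiver should
// send xfer_need(bytesWritten) instead of accepting the chunk.
func acceptChunk(raw []byte, bytesWritten int) (next int, need bool) {
	var f chunkFrame
	if err := json.Unmarshal(raw, &f); err != nil {
		return bytesWritten, true // malformed frame
	}
	payload, err := base64.StdEncoding.DecodeString(f.Data)
	if err != nil {
		return bytesWritten, true
	}
	if f.CRC != nil && crc32.ChecksumIEEE(payload) != *f.CRC {
		return bytesWritten, true // corrupt but well-formed chunk
	}
	return bytesWritten + len(payload), false
}

func main() {
	good := encodeChunk([]byte("hello"), []byte("hello"))
	bad := encodeChunk([]byte("hellp"), []byte("hello")) // one flipped byte
	fmt.Println(acceptChunk(good, 0))
	fmt.Println(acceptChunk(bad, 0))
}
```

The corrupted frame still parses and decodes cleanly, which is exactly the case the whole-payload hash catches too late; only the per-chunk comparison rejects it before the bytes are written.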
Both recovery paths (malformed frame, per-chunk crc mismatch) send
xfer_need but were not bumping cur.deadline. A burst of bad frames
during recovery could still trip the phase_timeout idle watchdog and
abort mid-recovery. Treat recovery as progress.
Codex review M: malformed-frame xfer_need does not refresh deadline.
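The "treat recovery as progress" rule can be sketched like this. The `transfer` type, the millisecond clock, and the 15-second timeout value are assumptions made for the example; only `bytesWritten`, the deadline, and the xfer_need(bytesWritten) reply come from the commit messages.

```go
package main

import "fmt"

// phaseTimeoutMs is an assumed value for the idle watchdog.
const phaseTimeoutMs int64 = 15000

// transfer is a hypothetical stand-in for the receiver's current transfer.
// Time is modelled as plain milliseconds to keep the sketch deterministic.
type transfer struct {
	bytesWritten int
	deadlineMs   int64
}

func newTransfer(nowMs int64) *transfer {
	return &transfer{deadlineMs: nowMs + phaseTimeoutMs}
}

// onBadFrame handles both recovery paths (malformed frame and per-chunk
// crc mismatch). It refreshes the deadline before returning the offset to
// put in xfer_need, so a burst of bad frames cannot starve the watchdog.
func (t *transfer) onBadFrame(nowMs int64) (needNext int) {
	t.deadlineMs = nowMs + phaseTimeoutMs
	return t.bytesWritten
}

// expired reports whether the idle watchdog would abort the transfer.
func (t *transfer) expired(nowMs int64) bool {
	return nowMs > t.deadlineMs
}

func main() {
	tr := newTransfer(0)
	tr.bytesWritten = 4096
	fmt.Println("need.next =", tr.onBadFrame(14000))
	fmt.Println("expired at 20s:", tr.expired(20000))
}
```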
The default tinygo+rp2350 transfer sink buffered the whole artefact in
RAM before handing it to the receiver. ~424KB binaries blew past what's
safe on a Pico 2's SRAM and made the receiver interface gratuitously
stateful.
Switch to streaming-during-transfer:
- BeginStreamedStage prepares the inactive A/B slot via abupdate.
- WriteStreamedStage writes chunks straight to flash and updates a
running SHA-256 of the streamed bytes.
- CommitStreamedStage flushes the final partial page and freezes the
descriptor (length + payload sha).
The fabric transfer sink is now a thin shim over those entry points. The
receiver detects an empty payload.Artefact (the wire shape stays the
same) and consumes the pre-staged descriptor; on host builds the
streamed-stage stub returns "not present" so the existing verifier+sink
path still drives unit tests.
Pairs with fabric's recovery work: per-chunk crc + xfer_need on
malformed/corrupt frames now feed a sink that can't lose bytes silently
to RAM pressure.
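The begin/write/commit flow above can be sketched with an in-memory stand-in for the flash slot. The 256-byte page size, the type and method names, and the `flash` slice are assumptions for illustration; the real entry points go through abupdate, whose API is not shown here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"hash"
)

const pageSize = 256 // assumed flash page size

// descriptor is the frozen result: length plus payload SHA-256.
type descriptor struct {
	Length int
	SHA256 string
}

// streamedStage keeps at most one page in RAM at a time.
type streamedStage struct {
	flash  []byte // stand-in for the inactive A/B slot
	page   [pageSize]byte
	fill   int
	hash   hash.Hash
	length int
	frozen bool
}

func beginStreamedStage() *streamedStage {
	return &streamedStage{hash: sha256.New()}
}

// write hashes the chunk and programs every page that becomes full.
func (s *streamedStage) write(chunk []byte) error {
	if s.frozen {
		return errors.New("stage already committed")
	}
	s.hash.Write(chunk) // hash.Hash writes never return an error
	s.length += len(chunk)
	for len(chunk) > 0 {
		n := copy(s.page[s.fill:], chunk)
		s.fill += n
		chunk = chunk[n:]
		if s.fill == pageSize {
			s.flash = append(s.flash, s.page[:]...)
			s.fill = 0
		}
	}
	return nil
}

// commit flushes the final partial page and freezes the descriptor.
func (s *streamedStage) commit() descriptor {
	if s.fill > 0 {
		s.flash = append(s.flash, s.page[:s.fill]...)
		s.fill = 0
	}
	s.frozen = true
	return descriptor{Length: s.length, SHA256: hex.EncodeToString(s.hash.Sum(nil))}
}

func main() {
	s := beginStreamedStage()
	_ = s.write(make([]byte, 300)) // one full page programmed, 44 bytes pending
	fmt.Println("programmed before commit:", len(s.flash))
	d := s.commit()
	fmt.Println("length:", d.Length, "programmed after commit:", len(s.flash))
}
```

RAM use stays bounded by one page regardless of artefact size, which is the property the ~424KB binaries needed.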
Summary
- state/self/* facts, prepare/commit RPCs, staged descriptor handling, and updater/main transfer staging.
- cmd/self/updater/{prepare,commit} and xfer_* flow, with legacy HAL/config routes removed from the update surface.
Why
This branch makes MCU updates work over the Fabric contract from
docs/updating.md without adding extra wire routes beyond the Lua-side contract.
Testing
go test ./...