ci: restructure pipeline — sequential fail-fast with fixture DB by Kpa-clawbot · Pull Request #1 · efiten/meshcore-analyzer

Kpa-clawbot · 2026-03-29T17:39:25Z

Summary

Restructures the entire CI/CD pipeline into a clean sequential fail-fast chain.

PR pipeline (`pull_request` to master):

1. Go unit tests: run first, fail-fast.
2. Playwright E2E tests: run ONLY if the Go tests pass. Uses the Go server (not Node.js) with a real-data fixture DB (test-fixtures/e2e-fixture.db). Fail-fast on first failure. Frontend coverage is collected via Istanbul instrumentation.
3. Docker build: runs ONLY if both stages above pass. Verifies that the containers build; does not push.

No deployment on PRs.

Master pipeline (`push` to master):

Same chain + deploy to staging + badge publishing.

Removed

ALL Node.js server-side unit tests (test-server-routes, test-server-helpers, test-decoder, test-decoder-spec, test-aging, test-packet-filter, etc.) — these tested the deprecated JS server
npm ci / npm run test steps
JS server coverage collection (server.js with COVERAGE=1)
"Detect changed files" logic — just run everything
"Skip if docs-only change" logic — just run everything
Cancel-workflow-on-failure API hacks
The disabled node-test job (if: false)

Added

Real data fixture DB (test-fixtures/e2e-fixture.db) captured from staging — 200 nodes, 31 observers, 500 packets for deterministic E2E tests
scripts/capture-fixture.sh — script to refresh the fixture DB from staging API
.gitignore exception for the fixture DB

Kept

Go test job with coverage + badges
Playwright E2E against Go server with fixture DB
Frontend JS coverage (instrument public/, collect window.__coverage__)
Docker build verification
Deploy + publish (master only)


…failure

The test() helper was catching errors and collecting them into results[], only checking for failures after all tests completed. This meant the runner kept going through the remaining tests even after a failure, wasting CI time and obscuring the root cause. Now process.exit(1) is called immediately when a test fails, giving true fail-fast behavior.


…conflicts (Kpa-clawbot#481)

## Problem

Every PR that touches `public/` files requires manually bumping cache buster timestamps in `index.html` (e.g. `?v=1775111407`). Since all PRs change the same lines in the same file, this causes **constant merge conflicts**; it's been the #1 source of unnecessary PR friction.

## Solution

Replace all hardcoded `?v=TIMESTAMP` values in `index.html` with a `?v=__BUST__` placeholder. The Go server replaces `__BUST__` with the current Unix timestamp **once at startup** when it reads `index.html`, then serves the pre-processed HTML from memory. Every server restart automatically picks up fresh cache busters, with no manual intervention needed.

## What changed

| File | Change |
|------|--------|
| `public/index.html` | All `v=1775111407` → `v=__BUST__` (28 occurrences) |
| `cmd/server/main.go` | `spaHandler` reads index.html at init, replaces `__BUST__` with Unix timestamp, serves from memory for `/`, `/index.html`, and SPA fallback |
| `cmd/server/helpers_test.go` | New `TestSpaHandlerCacheBust`: verifies placeholder replacement works for root, SPA fallback, and direct `/index.html` requests. Also added tests for root `/` and `/index.html` routes |
| `AGENTS.md` | Rule 3 updated: cache busters are now automatic, agents should not manually edit them |

## Testing

- `go build ./...`: compiles cleanly
- `go test ./...`: all tests pass (including new cache-bust tests)
- `node test-frontend-helpers.js && node test-packet-filter.js && node test-aging.js`: all frontend tests pass
- No hardcoded timestamps remain in `index.html`

---------

Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: you <you@example.com>
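The placeholder replacement described in the Kpa-clawbot#481 message can be sketched roughly as follows. This is a minimal illustration, not the code in `cmd/server/main.go`: the names `applyCacheBust` and `newIndexHandler` are hypothetical, and the sketch assumes the page is held as a byte slice and processed once before any request is served.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// bustPlaceholder is the marker named in the PR description.
const bustPlaceholder = "__BUST__"

// applyCacheBust replaces every placeholder with the given Unix timestamp.
// (Hypothetical helper; the real server code may be structured differently.)
func applyCacheBust(html []byte, unix int64) []byte {
	ts := strconv.FormatInt(unix, 10)
	return bytes.ReplaceAll(html, []byte(bustPlaceholder), []byte(ts))
}

// newIndexHandler pre-processes the page once and then serves the
// already-substituted bytes from memory on every request.
func newIndexHandler(rawIndex []byte, startedAt time.Time) http.Handler {
	page := applyCacheBust(rawIndex, startedAt.Unix())
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		w.Write(page)
	})
}

func main() {
	raw := []byte(`<script src="app.js?v=__BUST__"></script>`)
	h := newIndexHandler(raw, time.Unix(1775111407, 0))

	// Exercise the handler in-process instead of opening a socket.
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	fmt.Println(rec.Body.String())
}
```

Doing the substitution at startup rather than per request keeps the hot path to a single in-memory write, at the cost that the cache buster only changes when the server restarts, which matches the behavior the message describes.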

…alysis

Weak passphrases with no KDF stretching are the #1 practical threat. Timestamp in plaintext block 0 serves as known-plaintext oracle for instant key verification from a single captured packet.

Key findings:
- decode_base64() output used directly as AES key, no KDF
- Short passphrases produce <16 byte keys (reduced key space)
- No salt means global precomputed attacks work
- 3-word passphrase crackable in ~2 min on commodity GPU

Reviewed by djb and Dijkstra personas. Corrections applied:
- GPU throughput upgraded from 10^9 to 10^10 AES/sec baseline
- Oracle strengthened: bytes 4+ (type byte, sender name) also predictable
- Dictionary size assumptions made explicit
- Zipf's law caveat added (humans don't choose uniformly)
- base64 short-passphrase key truncation issue documented
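The "~2 min" figure can be sanity-checked with a short back-of-the-envelope calculation. The sketch below assumes a hypothetical 10,000-word dictionary (the message does not state the size it used), words chosen uniformly, and one AES operation per candidate because there is no KDF; only the 10^10 AES/sec rate comes from the message.

```go
package main

import "fmt"

// candidateCount returns dictSize^words: the number of passphrases when
// each word is drawn uniformly from the dictionary (an assumption; real
// users choose non-uniformly, so this is an upper bound on the work).
func candidateCount(dictSize, words int) float64 {
	n := 1.0
	for i := 0; i < words; i++ {
		n *= float64(dictSize)
	}
	return n
}

// exhaustSeconds is the worst-case time to try every candidate when each
// trial costs one key test at keysPerSec trials per second.
func exhaustSeconds(dictSize, words int, keysPerSec float64) float64 {
	return candidateCount(dictSize, words) / keysPerSec
}

func main() {
	const dictSize = 10000 // hypothetical dictionary size, not from the source
	const rate = 1e10      // AES/sec baseline quoted in the message
	fmt.Printf("3 words: %.0f s worst case\n", exhaustSeconds(dictSize, 3, rate))
}
```

Under these assumptions the full search takes 10^12 / 10^10 = 100 seconds, which is in line with the quoted estimate. A larger dictionary or a slower rate shifts the number proportionally.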

…event silent reconnect-loop death (Kpa-clawbot#1216)

RED commit: `1cd25f7b`. CI (failing on assertion): https://github.com/Kpa-clawbot/CoreScope/actions?query=sha%3A1cd25f7b1bdd0091f689dd64ce1bfec6d031191f

Fixes Kpa-clawbot#1212

## Root cause

NOT that `AutoReconnect` was off: it was set; `MaxReconnectInterval=30s` was set (PR Kpa-clawbot#949); a `SetReconnectingHandler` was wired.

The defect was an **observability gap**: `SetReconnectingHandler` fires only INSIDE paho's reconnect goroutine. If that goroutine never iterates (status race after the recovered handler panic at 21:07:13, or an internal abort), operators see ONLY the `disconnected: pingresp not received` line and then total silence. They cannot distinguish "paho is patiently retrying" from "paho gave up and the goroutine is gone." That ambiguity is what turned a 30s blip into 6h of downtime.

## Changes

### `cmd/ingestor/main.go`: `SetConnectionAttemptHandler`

Fires on every TCP/TLS dial (the initial `Connect()` AND every reconnect), independent of paho's internal reconnect-loop state. Logs:

```
MQTT [staging] connection attempt #1 to tcp://broker:1883
MQTT [staging] connection attempt #2 to tcp://broker:1883
```

Per-source attempt counter via `atomic.AddInt64`.

### `cmd/ingestor/mqtt_watchdog.go` (new): per-source stall watchdog

Satisfies the watchdog acceptance criterion. Even when paho reports `connected`, if no MQTT messages have flowed for >5m, log a WARN line every 60s:

```
MQTT [staging] WATCHDOG: client reports connected to tcp://broker:1883 but no messages received for 7m30s (threshold 5m) — possible half-open socket or upstream stall
```

Catches half-open TCP and broker-accepted-but-not-forwarding scenarios that look "connected" to paho. Hot-path cost: one `atomic.StoreInt64` per inbound message. Watchdog scans the registry once a minute.

### Tests (`cmd/ingestor/mqtt_reconnect_test.go`, new)

- `TestBuildMQTTOpts_InstrumentsConnectionAttempt`: asserts `OnConnectAttempt` is wired in `buildMQTTOpts`.
- `TestMQTTStallWatchdog_FiresOnSilentSource`: connected + 10m silent + 5m threshold → stall flagged.
- `TestMQTTStallWatchdog_QuietWhenRecent`: recent message → no stall.
- `TestMQTTStallWatchdog_QuietWhenDisconnected`: disconnected → no stall (paho's reconnect logging covers it).

## TDD

- RED `1cd25f7b`: 2 assertion failures (compile OK, stub returns no-stall, `OnConnectAttempt` nil).
- GREEN `2527be6f`: implementation; all ingestor tests pass.

## Out of scope

- Slice-bounds decode panic (Kpa-clawbot#1211, separate PR).
- A full in-process MQTT broker integration test would require a new dep (mochi-mqtt). The observability and watchdog behaviors are independently verifiable by the unit tests above, and the reconnect path itself is paho's responsibility (we already test it's configured via `mqtt_opts_test.go`).

---------

Co-authored-by: bot <bot@example.com>
Co-authored-by: OpenClaw Bot <bot@openclaw.local>
Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com>

… <500ms (Kpa-clawbot#1226)

## Summary

Fixes Kpa-clawbot#1225: channel messages endpoint took ~30s on staging.

## Root cause

`(*DB).GetChannelMessages` SELECTed every observation row for the channel (one row per observation, not per transmission), JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender, packetHash)`, then sliced the tail in Go for pagination.

On staging `#wardriving`:

- `transmissions` rows with `channel_hash='#wardriving' AND payload_type=5`: **5,703**
- `observations` joined to those: **274,632** (~48× amplification)
- `time curl /api/channels/%23wardriving/messages?limit=50`: **30.04s / 31.41s / 31.48s / 35.33s / 34.05s** (5 calls before I killed the loop)

`EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being used; the cost was entirely in fetching, unmarshalling, and folding the full observation set per request even for `limit=50`. Hypothesis #1 from the issue (full table scan on `messages/decoded`) is rejected; #2 (missing index) is rejected; the actual cause was **pagination in Go instead of SQL**: request cost was O(observations), not O(limit).

## Fix

Move pagination into SQL on the `transmissions` table. Because `transmissions.hash` is `UNIQUE` and the original dedup key was `(sender, hash)`, each transmission collapses to exactly one logical message, so paginating on transmissions is semantically equivalent to the prior in-Go dedup + tail slice.

New shape:

1. `COUNT(*)` on transmissions for total (uses `idx_tx_channel_hash`).
2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ? OFFSET ?` to pick the page of newest transmissions.
3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)`: typically 50 ids → a few hundred observation rows.
4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API contract.

Region filtering, observation-count-as-`repeats`, and "first observation wins for hops/snr/observer" semantics are preserved (observations are scanned `ORDER BY o.id ASC`).

## Perf measurements

**Before** (staging `#wardriving`, limit=50, 5 samples killed mid-loop): 30.04s, 31.41s, 31.48s, 35.33s, 34.05s.

**Synthetic regression test** (`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs.

- Broken impl: ~4.5s (test fails the 500ms budget; this is the RED commit).
- Fixed impl: well under 500ms (test passes).

**After (staging)**: will measure post-deploy and post-comment on issue with numbers. Synthetic scaling: staging is ~2× the test's transmission count, fixed-path cost scales with `limit` (50) + `COUNT(*)` (~5k rows on index); expect <100ms p99.

## TDD

- RED: `697c290d`: perf test asserts <500ms on 3k×50 dataset; fails at ~4.5s.
- GREEN: `3f1f82d3`: fix; full suite green, perf test passes.

## Hypotheses status

| # | Hypothesis | Verdict |
|---|---|---|
| 1 | Endpoint slow on prod-sized data | **CONFIRMED** (different mechanism; see root cause) |
| 2 | Missing channel_hash index | Rejected (`idx_tx_channel_hash` exists & used) |
| 3 | Frontend re-render storm | Not investigated (backend was clearly the bottleneck) |
| 4 | Decode in request path | Rejected (decode is at ingest time; JSON unmarshal of cached `decoded_json` is the cost, addressed by reducing row count) |
| 5 | WS subscription failure | Rejected |
| 6 | Staging artifact | Rejected (reproducible) |

## Out of scope

- The in-memory `(*PacketStore).GetChannelMessages` path (used when `s.db == nil`) has the same shape but operates on bounded in-memory data; not touched. If we ever fall back to it in production we'll revisit.

---------

Co-authored-by: clawbot <bot@corescope>
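The "new shape" in the Kpa-clawbot#1226 message can be sketched as query construction plus a final reordering step. This is an illustration under stated assumptions, not the code in `(*DB).GetChannelMessages`: the helper names are hypothetical, the selected observation columns are placeholders, region filtering is omitted, and reversing the newest-first page is only one plausible way to honor the ascending contract. The table and column names in the `WHERE` and `ORDER BY` clauses are taken from the message.

```go
package main

import (
	"fmt"
	"strings"
)

// pageIDsQuery picks one page of the newest transmissions for a channel.
// Its cost depends on the page size, not on the number of observations.
const pageIDsQuery = `SELECT id FROM transmissions
WHERE channel_hash = ? AND payload_type = 5
ORDER BY first_seen DESC LIMIT ? OFFSET ?`

// observationsQuery builds the follow-up query restricted to the page ids.
// The selected columns are illustrative; the real query selects more.
func observationsQuery(pageIDs []int64) (string, []any) {
	if len(pageIDs) == 0 {
		return "", nil
	}
	marks := strings.TrimSuffix(strings.Repeat("?,", len(pageIDs)), ",")
	args := make([]any, len(pageIDs))
	for i, id := range pageIDs {
		args[i] = id
	}
	q := "SELECT o.transmission_id, o.id FROM observations o " +
		"WHERE o.transmission_id IN (" + marks + ") ORDER BY o.id ASC"
	return q, args
}

// ascending reverses a newest-first page so results come out oldest-first.
func ascending(pageIDs []int64) []int64 {
	out := make([]int64, len(pageIDs))
	for i, id := range pageIDs {
		out[len(pageIDs)-1-i] = id
	}
	return out
}

func main() {
	page := []int64{30, 20, 10} // newest first, as returned by pageIDsQuery
	q, args := observationsQuery(page)
	fmt.Println(q)
	fmt.Println(len(args), ascending(page))
}
```

With `limit=50` the second query touches only the observations of at most 50 transmissions, which is why the request cost scales with the page size rather than with the full observation count.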

…pa-clawbot#1279 P0+P1) (Kpa-clawbot#1280)

Addresses the four P0+P1 firmware reconciliation gaps from the umbrella audit (issue Kpa-clawbot#1279).

RED commit: `0a4c084e` (asserts on stub returns; all 13 assertions fail). GREEN commit: `13867681`.

## What's in this PR

### P0: silently dropped data

- **#1 GRP_DATA (0x06) decoder.** Outer envelope is the same shape as GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per `firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored `decryptChannelBlock(...)` helper used by both 5 and 6. When a channel key matches, the inner is parsed per `firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE) + data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType, dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC, encryptedData}` otherwise. Server-side decoder surfaces envelope only (no key store).
- **#2 MULTIPART (0x0A) decoder.** Per `firmware/src/Mesh.cpp:289`, byte0 = `(remaining<<4) | inner_type`. When `inner_type == PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per `firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType, innerTypeName, innerAckCrc | innerPayload}`.

### P1: mis-classified / opaque

- **#3 `advertRole()` raw-type fix.** Per `firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and 5-15 are FUTURE. The previous boolean fallback collapsed both into `"companion"`, silently relabelling unknown/reserved types. New behaviour: type 0 → `none`, 1 → `companion`, 2-4 → `repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts the new labels.
- **#4 CONTROL (0x0B) byte0 flags + length.** Per `firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`, byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags, ctrlZeroHop, ctrlLength}`.

### Drift fix

- `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA` and `10: MULTIPART` (previously omitted; canonical decoder map already had them).

## Lockstep & TDD

Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in the same commits; the same wire-vector tests live in both packages (`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible in `git log`.

| Item | Tests | RED proof |
|---|---|---|
| #1 GRP_DATA | ingestor: NoKey + DecryptedInner; server: Envelope | 6 assertions failed pre-impl |
| #2 MULTIPART | ingestor + server: Ack + NonAck | 8 assertions failed pre-impl |
| #3 advertRole | ingestor + server: 7-row table | 3 assertions failed pre-impl |
| #4 CONTROL | ingestor + server: ZeroHop + MultiHop | 6 assertions failed pre-impl |

## What's NOT in this PR

The umbrella issue lists P2 items that ship in follow-up PRs:

- Live + compare legend entries for the long tail of newly-named types (Kpa-clawbot#1274 + others).
- TransportCodes UI surface + filter grammar.
- feat1/feat2 capability badges.
- `payloadTypeNames` consolidation across server/ingestor (drift-prevention).

Leave the umbrella open after this merges.

Refs Kpa-clawbot#1279

---------

Co-authored-by: OpenClaw Bot <bot@openclaw.local>
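Two of the decoding rules in the Kpa-clawbot#1280 message are compact enough to sketch directly: the MULTIPART byte0 split with its little-endian ack_crc, and the `advertRole()` mapping. The struct, field names, and error messages below are hypothetical and do not mirror `cmd/ingestor/decoder.go`; only the bit layout and the label mapping come from the message, and the assignment of 2, 3, 4 to `repeater`, `room`, `sensor` assumes the labels are listed in numeric order.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const payloadTypeAck = 0x03 // PAYLOAD_TYPE_ACK as quoted in the message

// multipart is a hypothetical result type for illustration.
type multipart struct {
	Remaining    uint8  // high nibble of byte0
	InnerType    uint8  // low nibble of byte0
	HasAckCrc    bool   // true when InnerType is ACK
	InnerAckCrc  uint32 // little-endian, only valid when HasAckCrc
	InnerPayload []byte // raw remainder for non-ACK inner types
}

// decodeMultipart splits byte0 = (remaining<<4) | inner_type and, for an
// ACK inner type, reads the following 4 bytes as a little-endian CRC.
func decodeMultipart(p []byte) (multipart, error) {
	if len(p) < 1 {
		return multipart{}, errors.New("multipart: empty payload")
	}
	m := multipart{Remaining: p[0] >> 4, InnerType: p[0] & 0x0F}
	rest := p[1:]
	if m.InnerType == payloadTypeAck {
		if len(rest) < 4 {
			return m, errors.New("multipart: short ack_crc")
		}
		m.HasAckCrc = true
		m.InnerAckCrc = binary.LittleEndian.Uint32(rest[:4])
		return m, nil
	}
	m.InnerPayload = rest
	return m, nil
}

// advertRole maps the raw advert type to a label without collapsing
// unknown or reserved values into "companion".
func advertRole(t uint8) string {
	switch t {
	case 0:
		return "none"
	case 1:
		return "companion"
	case 2:
		return "repeater"
	case 3:
		return "room"
	case 4:
		return "sensor"
	default:
		return fmt.Sprintf("type-%d", t)
	}
}

func main() {
	m, err := decodeMultipart([]byte{0x23, 0x78, 0x56, 0x34, 0x12})
	fmt.Println(m.Remaining, m.InnerType, m.HasAckCrc, err)
	fmt.Println(advertRole(0), advertRole(3), advertRole(9))
}
```

Returning a distinct `type-N` label for reserved values keeps them visible downstream, which is the point of fix #3: an unknown type is reported as unknown instead of being silently relabelled.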

Kpa-clawbot and others added 6 commits March 29, 2026 17:39

ci: remove if:always() — fail fast on any step failure

2c857cc

merge master: skip flaky packet detail tests

987c378

merge master: resolve conflict, keep restructured pipeline with artif…

135e288

…act v6

ci: disable coverage collector, use E2E window.__coverage__ instead

8366790

The collector script takes 8+ min navigating every page for coverage. E2E tests already extract window.__coverage__ to .nyc_output/e2e-coverage.json. This cuts pipeline from ~11 min to ~2-3 min.

efiten force-pushed the master branch 4 times, most recently from 41bf3b9 to f0fc940 Compare April 1, 2026 21:20

efiten force-pushed the master branch 2 times, most recently from 493d9e8 to 568de4b Compare May 1, 2026 22:16

Kpa-clawbot wants to merge 6 commits into efiten:master from Kpa-clawbot:ci/sequential-pipeline-with-fixture-db
