Skip to content

ci: restructure pipeline — sequential fail-fast with fixture DB#1

Open
Kpa-clawbot wants to merge 6 commits into
efiten:masterfrom
Kpa-clawbot:ci/sequential-pipeline-with-fixture-db
Open

ci: restructure pipeline — sequential fail-fast with fixture DB#1
Kpa-clawbot wants to merge 6 commits into
efiten:masterfrom
Kpa-clawbot:ci/sequential-pipeline-with-fixture-db

Conversation

@Kpa-clawbot
Copy link
Copy Markdown

Summary

Restructures the entire CI/CD pipeline into a clean sequential fail-fast chain.

PR pipeline (pull_request to master):

  1. Go unit tests — run first, fail-fast
  2. Playwright E2E tests — runs ONLY if Go tests pass. Uses Go server (not Node.js) with real data fixture DB (test-fixtures/e2e-fixture.db). Fail-fast on first failure. Frontend coverage collection via Istanbul instrumentation.
  3. Docker build — runs ONLY if both above pass. Verify containers build, don't push.
  4. No deployment on PRs

Master pipeline (push to master):

Same chain + deploy to staging + badge publishing.

Removed

  • ALL Node.js server-side unit tests (test-server-routes, test-server-helpers, test-decoder, test-decoder-spec, test-aging, test-packet-filter, etc.) — these tested the deprecated JS server
  • npm ci / npm run test steps
  • JS server coverage collection (server.js with COVERAGE=1)
  • "Detect changed files" logic — just run everything
  • "Skip if docs-only change" logic — just run everything
  • Cancel-workflow-on-failure API hacks
  • The disabled node-test job (if: false)

Added

  • Real data fixture DB (test-fixtures/e2e-fixture.db) captured from staging — 200 nodes, 31 observers, 500 packets for deterministic E2E tests
  • scripts/capture-fixture.sh — script to refresh the fixture DB from staging API
  • .gitignore exception for the fixture DB

Kept

  • Go test job with coverage + badges
  • Playwright E2E against Go server with fixture DB
  • Frontend JS coverage (instrument public/, collect window.__coverage__)
  • Docker build verification
  • Deploy + publish (master only)

Kpa-clawbot and others added 6 commits March 29, 2026 17:39
Restructures the entire CI/CD pipeline:

PR pipeline (pull_request to master):
1. Go unit tests — run first, fail-fast
2. Playwright E2E — runs ONLY if Go tests pass, uses Go server
   with real data fixture DB (test-fixtures/e2e-fixture.db),
   fail-fast on first failure, with frontend coverage collection
3. Docker build — runs ONLY if both above pass, verify only

Master pipeline (push to master):
- Same chain + deploy to staging + badge publishing

Removed:
- ALL Node.js server-side unit tests (deprecated JS server)
- npm ci / npm run test steps
- JS server coverage collection (COVERAGE=1)
- Detect changed files logic — just run everything
- Skip if docs-only change logic — just run everything
- Cancel-workflow-on-failure API hacks

Added:
- Real data fixture DB captured from staging (200 nodes, 31
  observers, 500 packets) for deterministic E2E tests
- scripts/capture-fixture.sh to refresh the fixture from staging
- .gitignore exception for the fixture DB
…failure

The test() helper was catching errors and collecting them into results[],
only checking for failures after all tests completed. This meant the runner
kept going through remaining tests even after a failure, wasting CI time
and obscuring the root cause.

Now process.exit(1) is called immediately when a test fails, giving true
fail-fast behavior.
The collector script takes 8+ min navigating every page for coverage.
E2E tests already extract window.__coverage__ to .nyc_output/e2e-coverage.json.
This cuts pipeline from ~11 min to ~2-3 min.
@efiten efiten force-pushed the master branch 4 times, most recently from 41bf3b9 to f0fc940 Compare April 1, 2026 21:20
efiten pushed a commit that referenced this pull request Apr 2, 2026
…conflicts (Kpa-clawbot#481)

## Problem

Every PR that touches `public/` files requires manually bumping cache
buster timestamps in `index.html` (e.g. `?v=1775111407`). Since all PRs
change the same lines in the same file, this causes **constant merge
conflicts** — it's been the #1 source of unnecessary PR friction.

## Solution

Replace all hardcoded `?v=TIMESTAMP` values in `index.html` with a
`?v=__BUST__` placeholder. The Go server replaces `__BUST__` with the
current Unix timestamp **once at startup** when it reads `index.html`,
then serves the pre-processed HTML from memory.

Every server restart automatically picks up fresh cache busters — no
manual intervention needed.

## What changed

| File | Change |
|------|--------|
| `public/index.html` | All `v=1775111407` → `v=__BUST__` (28
occurrences) |
| `cmd/server/main.go` | `spaHandler` reads index.html at init, replaces
`__BUST__` with Unix timestamp, serves from memory for `/`,
`/index.html`, and SPA fallback |
| `cmd/server/helpers_test.go` | New `TestSpaHandlerCacheBust` —
verifies placeholder replacement works for root, SPA fallback, and
direct `/index.html` requests. Also added tests for root `/` and
`/index.html` routes |
| `AGENTS.md` | Rule 3 updated: cache busters are now automatic, agents
should not manually edit them |

## Testing

- `go build ./...` — compiles cleanly
- `go test ./...` — all tests pass (including new cache-bust tests)
- `node test-frontend-helpers.js && node test-packet-filter.js && node
test-aging.js` — all frontend tests pass
- No hardcoded timestamps remain in `index.html`

---------

Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: you <you@example.com>
Kpa-clawbot pushed a commit that referenced this pull request Apr 5, 2026
…alysis

Weak passphrases with no KDF stretching are the #1 practical threat.
Timestamp in plaintext block 0 serves as known-plaintext oracle for
instant key verification from a single captured packet.

Key findings:
- decode_base64() output used directly as AES key, no KDF
- Short passphrases produce <16 byte keys (reduced key space)
- No salt means global precomputed attacks work
- 3-word passphrase crackable in ~2 min on commodity GPU

Reviewed by djb and Dijkstra personas. Corrections applied:
- GPU throughput upgraded from 10^9 to 10^10 AES/sec baseline
- Oracle strengthened: bytes 4+ (type byte, sender name) also predictable
- Dictionary size assumptions made explicit
- Zipf's law caveat added (humans don't choose uniformly)
- base64 short-passphrase key truncation issue documented
@efiten efiten force-pushed the master branch 2 times, most recently from 493d9e8 to 568de4b Compare May 1, 2026 22:16
efiten pushed a commit that referenced this pull request May 17, 2026
…event silent reconnect-loop death (Kpa-clawbot#1216)

RED commit: `1cd25f7b` — CI (failing on assertion):
https://github.com/Kpa-clawbot/CoreScope/actions?query=sha%3A1cd25f7b1bdd0091f689dd64ce1bfec6d031191f

Fixes Kpa-clawbot#1212

## Root cause

NOT that `AutoReconnect` was off — it was set;
`MaxReconnectInterval=30s` was set (PR Kpa-clawbot#949); a `SetReconnectingHandler`
was wired. The defect was an **observability gap**:

`SetReconnectingHandler` fires only INSIDE paho's reconnect goroutine.
If that goroutine never iterates (status race after the recovered
handler panic at 21:07:13, or an internal abort), operators see ONLY the
`disconnected: pingresp not received` line and then total silence. They
cannot distinguish "paho is patiently retrying" from "paho gave up and
the goroutine is gone." That ambiguity is what turned a 30s blip into 6h
of downtime.

## Changes

### `cmd/ingestor/main.go` — `SetConnectionAttemptHandler`
Fires on every TCP/TLS dial — the initial `Connect()` AND every
reconnect — independent of paho's internal reconnect-loop state. Logs:

```
MQTT [staging] connection attempt #1 to tcp://broker:1883
MQTT [staging] connection attempt #2 to tcp://broker:1883
```

Per-source attempt counter via `atomic.AddInt64`.

### `cmd/ingestor/mqtt_watchdog.go` (new) — per-source stall watchdog
Satisfies the watchdog acceptance criterion. Even when paho reports
`connected`, if no MQTT messages have flowed for >5m, log a WARN line
every 60s:

```
MQTT [staging] WATCHDOG: client reports connected to tcp://broker:1883 but no messages received for 7m30s (threshold 5m) — possible half-open socket or upstream stall
```

Catches half-open TCP and broker-accepted-but-not-forwarding scenarios
that look "connected" to paho.

Hot-path cost: one `atomic.StoreInt64` per inbound message. Watchdog
scans the registry once a minute.

### Tests (`cmd/ingestor/mqtt_reconnect_test.go`, new)
- `TestBuildMQTTOpts_InstrumentsConnectionAttempt` — asserts
`OnConnectAttempt` is wired in `buildMQTTOpts`.
- `TestMQTTStallWatchdog_FiresOnSilentSource` — connected + 10m silent +
5m threshold → stall flagged.
- `TestMQTTStallWatchdog_QuietWhenRecent` — recent message → no stall.
- `TestMQTTStallWatchdog_QuietWhenDisconnected` — disconnected → no
stall (paho's reconnect logging covers it).

## TDD
- RED `1cd25f7b` — 2 assertion failures (compile OK, stub returns
no-stall, `OnConnectAttempt` nil).
- GREEN `2527be6f` — implementation; all ingestor tests pass.

## Out of scope
- Slice-bounds decode panic (Kpa-clawbot#1211, separate PR).
- A full in-process MQTT broker integration test would require a new dep
(mochi-mqtt) — the observability and watchdog behaviors are
independently verifiable by the unit tests above, and the reconnect path
itself is paho's responsibility (we already test it's configured via
`mqtt_opts_test.go`).

---------

Co-authored-by: bot <bot@example.com>
Co-authored-by: OpenClaw Bot <bot@openclaw.local>
Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com>
efiten pushed a commit that referenced this pull request May 17, 2026
… <500ms (Kpa-clawbot#1226)

## Summary
Fixes Kpa-clawbot#1225 — channel messages endpoint took ~30s on staging.

## Root cause
`(*DB).GetChannelMessages` SELECTed every observation row for the
channel (one row per observation, not per transmission),
JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender,
packetHash)`, then sliced the tail in Go for pagination.

On staging `#wardriving`:
- `transmissions` rows with `channel_hash='#wardriving' AND
payload_type=5`: **5,703**
- `observations` joined to those: **274,632** (~48× amplification)
- `time curl /api/channels/%23wardriving/messages?limit=50`: **30.04s /
31.41s / 31.48s / 35.33s / 34.05s** (5 calls before I killed the loop)

`EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being
used — the cost was entirely in fetching, unmarshalling, and folding the
full observation set per request even for `limit=50`.

Hypothesis #1 from the issue (full table scan on `messages/decoded`) is
rejected; #2 (missing index) is rejected; the actual cause was
**pagination in Go instead of SQL** — request cost was O(observations)
not O(limit).

## Fix
Move pagination into SQL on the `transmissions` table. Because
`transmissions.hash` is `UNIQUE` and the original dedup key was
`(sender, hash)`, each transmission collapses to exactly one logical
message — paginating on transmissions is semantically equivalent to the
prior in-Go dedup + tail slice.

New shape:
1. `COUNT(*)` on transmissions for total (uses `idx_tx_channel_hash`).
2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ?
OFFSET ?` to pick the page of newest transmissions.
3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)` —
typically 50 ids → a few hundred observation rows.
4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API
contract.

Region filtering, observation-count-as-`repeats`, and "first observation
wins for hops/snr/observer" semantics are preserved (observations are
scanned `ORDER BY o.id ASC`).

## Perf measurements
**Before** (staging `#wardriving`, limit=50, 5 samples killed mid-loop):
30.04s, 31.41s, 31.48s, 35.33s, 34.05s.
**Synthetic regression test**
(`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs.
- Broken impl: ~4.5s (test fails the 500ms budget — the RED commit).
- Fixed impl: well under 500ms (test passes).
**After (staging)**: will measure post-deploy and post-comment on issue
with numbers. Synthetic scaling: staging is ~2× the test's transmission
count, fixed-path cost scales with `limit` (50) + `COUNT(*)` (~5k rows
on index) — expect <100ms p99.

## TDD
- RED: `697c290d` — perf test asserts <500ms on 3k×50 dataset; fails at
~4.5s.
- GREEN: `3f1f82d3` — fix; full suite green, perf test passes.

## Hypotheses status
| # | Hypothesis | Verdict |
|---|---|---|
| 1 | Endpoint slow on prod-sized data | **CONFIRMED** (different
mechanism — see root cause) |
| 2 | Missing channel_hash index | Rejected (`idx_tx_channel_hash`
exists & used) |
| 3 | Frontend re-render storm | Not investigated (backend was clearly
the bottleneck) |
| 4 | Decode in request path | Rejected (decode is at ingest time; JSON
unmarshal of cached `decoded_json` is the cost, addressed by reducing
row count) |
| 5 | WS subscription failure | Rejected |
| 6 | Staging artifact | Rejected (reproducible) |

## Out of scope
- The in-memory `(*PacketStore).GetChannelMessages` path (used when
`s.db == nil`) has the same shape but operates on bounded in-memory
data; not touched. If we ever fall back to it in production we'll
revisit.

---------

Co-authored-by: clawbot <bot@corescope>
efiten pushed a commit that referenced this pull request May 19, 2026
…pa-clawbot#1279 P0+P1) (Kpa-clawbot#1280)

Addresses the four P0+P1 firmware reconciliation gaps from the umbrella
audit (issue Kpa-clawbot#1279). RED commit: `0a4c084e` (asserts on stub returns;
all 13 assertions fail). GREEN commit: `13867681`.

## What's in this PR

### P0 — silently dropped data

- **#1 GRP_DATA (0x06) decoder.** Outer envelope is the same shape as
GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per
`firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored
`decryptChannelBlock(...)` helper used by both 5 and 6. When a channel
key matches, the inner is parsed per
`firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE)
+ data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType,
dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC,
encryptedData}` otherwise. Server-side decoder surfaces envelope only
(no key store).
- **#2 MULTIPART (0x0A) decoder.** Per `firmware/src/Mesh.cpp:289`,
byte0 = `(remaining<<4) | inner_type`. When `inner_type ==
PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per
`firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType,
innerTypeName, innerAckCrc | innerPayload}`.

### P1 — mis-classified / opaque

- **#3 `advertRole()` raw-type fix.** Per
`firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and
5-15 are FUTURE. The previous boolean fallback collapsed both into
`"companion"`, silently relabelling unknown/reserved types. New
behaviour: type 0 → `none`, 1 → `companion`, 2-4 →
`repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts
the new labels.
- **#4 CONTROL (0x0B) byte0 flags + length.** Per
`firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`,
byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags,
ctrlZeroHop, ctrlLength}`.

### Drift fix

- `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA`
and `10: MULTIPART` (previously omitted; canonical decoder map already
had them).

## Lockstep & TDD

Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in
the same commits — same wire-vector tests live in both packages
(`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible
in `git log`.

| Item | Tests | RED proof |
|---|---|---|
| #1 GRP_DATA | ingestor: NoKey + DecryptedInner; server: Envelope | 6
assertions failed pre-impl |
| #2 MULTIPART | ingestor + server: Ack + NonAck | 8 assertions failed
pre-impl |
| #3 advertRole | ingestor + server: 7-row table | 3 assertions failed
pre-impl |
| #4 CONTROL | ingestor + server: ZeroHop + MultiHop | 6 assertions
failed pre-impl |

## What's NOT in this PR

The umbrella issue lists P2 items that ship in follow-up PRs:

- Live + compare legend entries for the long tail of newly-named types
(Kpa-clawbot#1274 + others).
- TransportCodes UI surface + filter grammar.
- feat1/feat2 capability badges.
- `payloadTypeNames` consolidation across server/ingestor
(drift-prevention).

Leave the umbrella open after this merges.

Refs Kpa-clawbot#1279

---------

Co-authored-by: OpenClaw Bot <bot@openclaw.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant