From 89fab0bd4fb619662379eb7734f54b83717d94d6 Mon Sep 17 00:00:00 2001 From: jevansnyc Date: Wed, 1 Apr 2026 13:17:15 -0500 Subject: [PATCH 1/2] Add JS Asset Auditor engineering spec Engineering spec for the /audit-js-assets . Covers sweep protocol, Chrome DevTools MCP tooling, heuristic filtering, slug generation, init and diff modes. Closes #606 --- .../2026-04-01-js-asset-auditor-design.md | 216 ++++++++++++++++++ 1 file changed, 216 insertions(+) create mode 100644 docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md diff --git a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md new file mode 100644 index 00000000..d6168592 --- /dev/null +++ b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md @@ -0,0 +1,216 @@ +# JS Asset Auditor — Engineering Spec + +**Date:** 2026-04-01 +**Status:** Approved for engineering breakdown +**Related:** [JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md) + +--- + +## Context + +The JS Asset Proxy requires a `js-assets.toml` file declaring which third-party JS assets to proxy. Without tooling, populating this file requires manually inspecting network requests in browser DevTools, extracting URLs, generating opaque slugs, and writing TOML — a tedious error-prone process that is a barrier to publisher onboarding. + +The Auditor eliminates this friction. It sweeps a publisher's page using the Chrome DevTools MCP, detects third-party JS assets, auto-generates `js-assets.toml` entries, and auto-detects `inject_in_head` from the page DOM. The operator's only remaining decision is reviewing the output before committing. + +It also runs as a monitoring tool — `--diff` mode compares a new sweep against the existing config and surfaces new or removed assets, giving publishers ongoing visibility into their third-party JS footprint. + +**Implementation:** Pure Claude Code skill — no Rust, no compiled code, no additional dependencies. 
Uses the Chrome DevTools MCP already configured in `.claude/settings.json`. + +--- + +## Command Interface + +```bash +/audit-js-assets https://www.publisher.com # init — generate js-assets.toml +/audit-js-assets https://www.publisher.com --diff # diff — compare against existing file +``` + +--- + +## Sweep Protocol + +1. Read `trusted-server.toml` → extract `publisher.domain` (defines first-party boundary) +2. Open Chrome via `mcp__chrome-devtools__new_page`, navigate to target URL via `mcp__chrome-devtools__navigate_page` +3. Wait for full page load + ~6s settle window for async script loads (`mcp__chrome-devtools__wait_for`) +4. In parallel: + - `mcp__chrome-devtools__list_network_requests` → filter for requests where URL ends in `.js` or `Content-Type: application/javascript`, and origin ≠ `publisher.domain` + - `mcp__chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs +5. Apply heuristic filter (see below) +6. For each surviving asset, generate a `[[js_assets]]` entry (see below) +7. Write output (init or diff mode) +8. Print terminal summary +9. Close page via `mcp__chrome-devtools__close_page` + +--- + +## Heuristic Filter + +The following origin categories are excluded silently. The terminal summary reports what was filtered and why so operators can manually add entries if needed. + +| Category | Excluded origins | +|---|---| +| Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` | +| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` | +| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` | +| Social embeds | `platform.twitter.com`, `connect.facebook.net` | + +**`googletagmanager.com` is not filtered** — GTM is ad tech and should be proxied. + +Everything else surfaces for operator review. 
+ +--- + +## Asset Entry Generation + +| Field | Derivation | +|---|---| +| `slug` | `{publisher_prefix}:{asset_stem}` — see slug algorithm below | +| `path` | `/{publisher_prefix}/{asset_stem}.js`, or wildcard variant if versioned path detected | +| `origin_url` | Full captured URL, with wildcard substitution applied if versioned | +| `ttl_sec` | Omitted — proxy defaults to 1800 (wildcard) or 3600 (fixed) | +| `inject_in_head` | `true` if URL appeared in head script list from DOM evaluation, else `false` | + +### Slug algorithm + +``` +publisher_prefix = first_8_chars(base62(sha256(publisher.domain + origin_url))) +asset_stem = filename_without_extension(origin_url) +slug = "{publisher_prefix}:{asset_stem}" +``` + +**Rationale:** Fully opaque and hash-derived — no human naming required, no ambiguity for cryptic vendor filenames. The KV metadata (`origin_url`, `content_type`, `asset_slug`) serves as the lookup table. Operators can query `js-asset:{slug}` in the KV store to retrieve full provenance. The terminal summary also prints slug → origin_url at generation time. + +**Important:** This algorithm must produce identical output to the Proxy's KV key derivation. Engineering should implement this as a shared utility (e.g., a small JS/TS helper in the skill, or a standalone `scripts/` utility) rather than duplicating the logic. + +### Wildcard detection + +Path segments matching either pattern are replaced with `*`: +- Semver: `\d+\.\d+[\.\d-]*` (e.g., `1.19.8-hcskhn`) +- Hash-like: `[a-f0-9]{6,}` or `[A-Za-z0-9]{8,}` between path separators + +The original URL is preserved as a comment above the generated entry so operators can verify the wildcard substitution is correct. 
+ +--- + +## Init Mode Output + +### `js-assets.toml` (written to repo root) + +```toml +# Generated by /audit-js-assets on 2026-04-01 +# Publisher: publisher.com +# Source URL: https://www.publisher.com + +[[js_assets]] +# https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js +slug = "aB3kR7mN:prebid-load" +path = "/sdk/aB3kR7mN.js" +origin_url = "https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js" +inject_in_head = true + +[[js_assets]] +# https://raven-static.vendor.io/prod/1.19.8-hcskhn/raven.js (wildcard detected) +slug = "xQ9pL2wY:raven" +path = "/raven-static/*" +origin_url = "https://raven-static.vendor.io/prod/*/raven.js" +inject_in_head = false +``` + +### Terminal summary + +``` +JS Asset Audit — publisher.com +──────────────────────────────── +Detected: 8 third-party JS requests +Filtered: 3 (cdnjs.cloudflare.com ×2, sentry.io ×1) +Surfaced: 5 assets → js-assets.toml + + aB3kR7mN inject_in_head=true web.prebidwrapper.com/.../prebid-load.js + xQ9pL2wY inject_in_head=false raven-static.vendor.io/prod/*/raven.js [wildcard] + zM4nK8vP inject_in_head=true googletagmanager.com/gtm.js + ... + +Review inject_in_head values and commit js-assets.toml when ready. +Diff mode: /audit-js-assets --diff +``` + +--- + +## Diff Mode Output + +Compares sweep results against the existing `js-assets.toml`. + +| Condition | Behavior | +|---|---| +| Asset in sweep, not in file | **New** — appended to `js-assets.toml` as a commented-out block | +| Asset in file, not in sweep | **Missing** — flagged in terminal summary with `⚠`. Never auto-removed. | +| Asset in both | **Confirmed** — listed as present | + +New entries are appended as TOML comments so the file stays valid and nothing is activated without the operator explicitly uncommenting. 
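The three diff-mode conditions in the table above amount to a set comparison. The following is a minimal sketch, not the skill's actual logic: it assumes assets are matched on their `origin_url` string (the spec does not say which field is the identity key), and the helper names `classify_assets` and `as_commented_block` are hypothetical.

```python
def classify_assets(sweep_origin_urls, file_origin_urls):
    """Split assets into new / missing / confirmed.

    Assumption: two assets are the same when their origin_url strings
    are equal. The spec does not name the identity key.
    """
    sweep = set(sweep_origin_urls)
    existing = set(file_origin_urls)
    return {
        "new": sorted(sweep - existing),        # in sweep, not in file
        "missing": sorted(existing - sweep),    # in file, not in sweep
        "confirmed": sorted(sweep & existing),  # in both
    }


def as_commented_block(toml_block):
    """Prefix every line with '# ' so an appended entry stays inert TOML."""
    return "\n".join("# " + line for line in toml_block.splitlines())


if __name__ == "__main__":
    result = classify_assets(
        ["https://googletagmanager.com/gtm.js"],
        ["https://raven-static.vendor.io/prod/*/raven.js"],
    )
    print(result["new"], result["missing"])
    print(as_commented_block('[[js_assets]]\nslug = "zM4nK8vP:gtm"'))
```

Entries in the `missing` bucket are only reported, never deleted, which matches the "Never auto-removed" rule.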
+ +### `js-assets.toml` (new entry appended as comment) + +```toml +# --- NEW (detected by /audit-js-assets --diff on 2026-04-01, uncomment to activate) --- +# [[js_assets]] +# # https://googletagmanager.com/gtm.js +# slug = "zM4nK8vP:gtm" +# path = "/sdk/zM4nK8vP.js" +# origin_url = "https://googletagmanager.com/gtm.js" +# inject_in_head = true +``` + +### Terminal summary (diff mode) + +``` +JS Asset Audit (diff) — publisher.com +──────────────────────────────── +Confirmed: 4 assets still present on page +New: 1 asset detected (appended as comment to js-assets.toml) +Missing: 1 asset no longer seen on page ⚠ + + NEW zM4nK8vP googletagmanager.com/gtm.js → review in js-assets.toml + MISSING xQ9pL2wY raven-static.vendor.io/... → may have been removed or renamed +``` + +--- + +## Implementation + +The Auditor is a Claude Code skill file. No compiled code. + +**Skill location:** `.claude/skills/audit-js-assets.md` + +**MCP tools used:** +- `mcp__chrome-devtools__new_page` — open browser tab +- `mcp__chrome-devtools__navigate_page` — load publisher URL +- `mcp__chrome-devtools__wait_for` — settle after page load +- `mcp__chrome-devtools__list_network_requests` — capture JS requests +- `mcp__chrome-devtools__evaluate_script` — detect head-loaded scripts via DOM query +- `mcp__chrome-devtools__close_page` — clean up tab + +**File tools used:** +- `Read` — read `trusted-server.toml` (publisher domain) and existing `js-assets.toml` (diff mode) +- `Write` — write generated/updated `js-assets.toml` + +--- + +## Delivery Order + +The Auditor should be delivered **after Proxy Phase 1** (so `js-assets.toml` schema is defined) and **before Proxy Phase 2** (so engineering has real populated entries to test the cache pipeline against actual vendor origins). + +See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md). 
+ +--- + +## Verification + +- Run `/audit-js-assets https://www.publisher.com` against a known test publisher page with identified third-party JS +- Verify generated entries match actual third-party JS observed on the page (cross-check in browser DevTools) +- Verify `inject_in_head = true` only for scripts that appear in `<head>` (not `<body>`) +- Verify wildcard detection fires for versioned path segments and not for stable paths +- Verify GTM (`googletagmanager.com`) is captured and not filtered +- Verify framework CDNs (`cdnjs.cloudflare.com` etc.) are filtered with reason in summary +- Run `--diff` against an unchanged page → all entries confirmed, no new/missing +- Run `--diff` after adding a new vendor script to the page → appears as `NEW` in summary +- Run `--diff` after removing a script → appears as `MISSING ⚠` in summary, file unchanged From d8a0d84c914261ecd3d6ffd1d4c95369b4a2de86 Mon Sep 17 00:00:00 2001 From: Christian Date: Fri, 10 Apr 2026 15:46:42 -0500 Subject: [PATCH 2/2] Address PR feedback on JS Asset Auditor spec Fix incorrect MCP tool name prefix, replace misused wait_for with evaluate_script setTimeout, correct list_network_requests filtering to use resourceTypes, resolve path derivation contradiction with consistent /js-assets/{prefix}/{stem}.js formula, pin slug separator and base62 charset, add URL Processing section with normalization rules and first-party boundary definition, tighten wildcard regex to require mixed character classes, and move skill location to .claude/commands/. 
--- .../2026-04-01-js-asset-auditor-design.md | 113 ++++++++++++------ 1 file changed, 74 insertions(+), 39 deletions(-) diff --git a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md index d6168592..aae3db57 100644 --- a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md +++ b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md @@ -2,7 +2,7 @@ **Date:** 2026-04-01 **Status:** Approved for engineering breakdown -**Related:** [JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md) +**Related:** [JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md) _(on `js-asset-proxy-spec` branch until merged)_ --- @@ -21,8 +21,9 @@ It also runs as a monitoring tool — `--diff` mode compares a new sweep against ## Command Interface ```bash -/audit-js-assets https://www.publisher.com # init — generate js-assets.toml -/audit-js-assets https://www.publisher.com --diff # diff — compare against existing file +/audit-js-assets https://www.publisher.com # init — generate js-assets.toml +/audit-js-assets https://www.publisher.com --diff # diff — compare against existing file +/audit-js-assets https://www.publisher.com --settle 15000 # longer settle for ad-tech-heavy pages ``` --- @@ -30,16 +31,38 @@ It also runs as a monitoring tool — `--diff` mode compares a new sweep against ## Sweep Protocol 1. Read `trusted-server.toml` → extract `publisher.domain` (defines first-party boundary) -2. Open Chrome via `mcp__chrome-devtools__new_page`, navigate to target URL via `mcp__chrome-devtools__navigate_page` -3. Wait for full page load + ~6s settle window for async script loads (`mcp__chrome-devtools__wait_for`) +2. Open Chrome via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__new_page`, navigate to target URL via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__navigate_page` +3. 
Wait for page load settle: `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` with `await new Promise(r => setTimeout(r, SETTLE_MS))` where `SETTLE_MS` defaults to 6000 (configurable via `--settle <ms>`) 4. In parallel: - `mcp__chrome-devtools__list_network_requests` → filter for requests where URL ends in `.js` or `Content-Type: application/javascript`, and origin ≠ `publisher.domain` - `mcp__chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs -5. Apply heuristic filter (see below) + - `mcp__plugin_chrome-devtools-mcp_chrome-devtools__list_network_requests` with `resourceTypes: ["script"]` → post-filter to exclude first-party hosts (see URL Processing below) + - `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs +5. Apply URL normalization (see below), then heuristic filter (see below) 6. For each surviving asset, generate a `[[js_assets]]` entry (see below) 7. Write output (init or diff mode) 8. Print terminal summary -9. Close page via `mcp__chrome-devtools__close_page` +9. Close page via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__close_page` + +**`inject_in_head` semantics:** The DOM snapshot in step 4 captures the final state of `<head>` after the settle window. Scripts that were briefly inserted and then removed by a loader will not appear. This is intentional: `inject_in_head = true` means "the script is present in `<head>` at page-stable state." If a loader removes it before the snapshot, the proxy should not re-inject it. + +--- + +## URL Processing + +### First-party boundary + +A network request is **first-party** if the request URL's host, after stripping a leading `www.`, matches `publisher.domain` (from `trusted-server.toml`) after the same stripping. Matching is exact on the resulting strings. 
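The first-party rule just stated (strip a leading `www.` from both sides, then compare exactly) can be sketched as below. This is an illustration only: the function names are hypothetical, and lowercasing the host before comparison is an assumption the spec does not state.

```python
from urllib.parse import urlsplit


def strip_www(host):
    """Lowercase the host (assumption) and drop one leading 'www.' label."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def is_first_party(request_url, publisher_domain):
    """True when the request host equals publisher.domain after www-stripping."""
    host = urlsplit(request_url).hostname or ""
    return strip_www(host) == strip_www(publisher_domain)


if __name__ == "__main__":
    print(is_first_party("https://www.publisher.com/app.js", "publisher.com"))
    print(is_first_party("https://cdn.publisher.com/app.js", "publisher.com"))
```

Because the comparison is exact, a host such as `cdn.publisher.com` falls outside the first-party boundary under this sketch.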
+ +Publisher-owned CDN subdomains (e.g., `cdn.publisher.com`, `static.publisher.com`) are treated as third-party by default. If the publisher wants to exclude them, they can be added to a `first_party_hosts` list in the command invocation (e.g., `--first-party cdn.publisher.com`). + +### URL normalization + +Applied to every captured script URL before slug generation and before persisting `origin_url`: + +1. Strip fragment (`#...`) +2. Strip all query parameters — cache-busters (`?v=123`, `?cb=timestamp`), consent params, and session tokens all live in query strings. JS asset versioning uses path segments, not query params. +3. Strip trailing slash from the path + +The normalized URL is what gets stored in `origin_url` and fed into the slug hash. --- @@ -47,12 +70,14 @@ It also runs as a monitoring tool — `--diff` mode compares a new sweep against The following origin categories are excluded silently. The terminal summary reports what was filtered and why so operators can manually add entries if needed. -| Category | Excluded origins | -|---|---| +**Matching:** Filter entries match if the request URL's host ends with the filter entry, with a dot-boundary check. For example, `googletagmanager.com` in the filter matches `www.googletagmanager.com` but not `evil-googletagmanager.com`. 
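The dot-boundary suffix match described in the paragraph above could look roughly like the sketch below. The origin lists are copied from the revised table in this patch; the dictionary layout and the name `filter_category` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit

# Origins taken from the revised Heuristic Filter table in this spec.
EXCLUDED_ORIGINS = {
    "Framework CDNs": ("cdnjs.cloudflare.com", "ajax.googleapis.com",
                       "cdn.jsdelivr.net", "unpkg.com"),
    "Error tracking": ("sentry.io", "bugsnag.com", "rollbar.com"),
    "Font services": ("fonts.googleapis.com", "fonts.gstatic.com"),
    "Social embeds": ("platform.twitter.com", "platform.x.com",
                      "connect.facebook.net"),
}


def filter_category(url):
    """Return the matching category name, or None if the URL should surface."""
    host = (urlsplit(url).hostname or "").lower()
    for category, origins in EXCLUDED_ORIGINS.items():
        for origin in origins:
            # Exact host, or a subdomain: the "." prefix is the dot-boundary
            # check that keeps "evil-sentry.io" from matching "sentry.io".
            if host == origin or host.endswith("." + origin):
                return category
    return None


if __name__ == "__main__":
    print(filter_category("https://cdnjs.cloudflare.com/lib.js"))
    print(filter_category("https://www.googletagmanager.com/gtm.js"))
```

Returning the category name rather than a boolean lets the terminal summary report why each request was filtered.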
+ +| Category | Excluded origins | +| -------------- | ------------------------------------------------------------------------------ | | Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` | -| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` | -| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` | -| Social embeds | `platform.twitter.com`, `connect.facebook.net` | +| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` | +| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` | +| Social embeds | `platform.twitter.com`, `platform.x.com`, `connect.facebook.net` | **`googletagmanager.com` is not filtered** — GTM is ad tech and should be proxied. @@ -62,31 +87,38 @@ Everything else surfaces for operator review. ## Asset Entry Generation -| Field | Derivation | -|---|---| -| `slug` | `{publisher_prefix}:{asset_stem}` — see slug algorithm below | -| `path` | `/{publisher_prefix}/{asset_stem}.js`, or wildcard variant if versioned path detected | -| `origin_url` | Full captured URL, with wildcard substitution applied if versioned | -| `ttl_sec` | Omitted — proxy defaults to 1800 (wildcard) or 3600 (fixed) | -| `inject_in_head` | `true` if URL appeared in head script list from DOM evaluation, else `false` | +| Field | Derivation | +| ---------------- | --------------------------------------------------------------------------------------------------- | +| `slug` | `{publisher_prefix}:{asset_stem}` — see slug algorithm below | +| `path` | Fixed: `/js-assets/{publisher_prefix}/{asset_stem}.js`. 
Wildcard: `/js-assets/{publisher_prefix}/*` | +| `origin_url` | Normalized URL (see URL Processing), with wildcard substitution applied if versioned | +| `ttl_sec` | Omitted — proxy defaults to 1800 (wildcard) or 3600 (fixed) | +| `stale_ttl_sec` | Omitted — proxy defaults to 86400 (24h) | +| `inject_in_head` | `true` if URL appeared in head script list from DOM evaluation, else `false` | ### Slug algorithm ``` -publisher_prefix = first_8_chars(base62(sha256(publisher.domain + origin_url))) +publisher_prefix = first_8_chars(base62(sha256(publisher.domain + "|" + origin_url))) asset_stem = filename_without_extension(origin_url) slug = "{publisher_prefix}:{asset_stem}" ``` +The pipe (`|`) separator is required — it cannot appear in domain names or at the start of a URL, so the hash input is unambiguous. The `origin_url` fed into the hash must be the normalized URL (see URL Processing). + +**base62 charset:** `0-9A-Za-z` (digits first, then uppercase, then lowercase). This matches the `base62` crate convention. + **Rationale:** Fully opaque and hash-derived — no human naming required, no ambiguity for cryptic vendor filenames. The KV metadata (`origin_url`, `content_type`, `asset_slug`) serves as the lookup table. Operators can query `js-asset:{slug}` in the KV store to retrieve full provenance. The terminal summary also prints slug → origin_url at generation time. **Important:** This algorithm must produce identical output to the Proxy's KV key derivation. Engineering should implement this as a shared utility (e.g., a small JS/TS helper in the skill, or a standalone `scripts/` utility) rather than duplicating the logic. 
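As a hedged illustration of the slug algorithm and the normalization rules it depends on, here is a minimal Python sketch. Several details are assumptions rather than statements from the spec: the full 32-byte SHA-256 digest is treated as one big-endian unsigned integer before base62 encoding, the hash input is UTF-8 encoded, and the function names are hypothetical. The spec also does not say whether wildcard substitution happens before hashing; this sketch hashes whatever URL it is given after normalization. The Proxy's actual KV key derivation remains the source of truth.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Charset order from the spec: digits, then uppercase, then lowercase.
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def normalize_url(url):
    """Strip fragment, all query parameters, and one trailing slash."""
    parts = urlsplit(url)
    path = parts.path[:-1] if parts.path.endswith("/") else parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def base62_encode(data):
    """Encode bytes as base62, reading them as a big-endian integer (assumption)."""
    number = int.from_bytes(data, "big")
    if number == 0:
        return BASE62[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits))


def make_slug(publisher_domain, origin_url):
    """Build '{publisher_prefix}:{asset_stem}' from the normalized URL."""
    normalized = normalize_url(origin_url)
    hash_input = f"{publisher_domain}|{normalized}".encode("utf-8")
    prefix = base62_encode(hashlib.sha256(hash_input).digest())[:8]
    filename = posixpath.basename(urlsplit(normalized).path)
    stem = posixpath.splitext(filename)[0]
    return f"{prefix}:{stem}"


if __name__ == "__main__":
    print(make_slug(
        "publisher.com",
        "https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js?v=123",
    ))
```

Because normalization runs inside `make_slug`, the same asset fetched with different cache-buster query strings yields the same slug.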
### Wildcard detection -Path segments matching either pattern are replaced with `*`: +Path segments matching any of these patterns are replaced with `*`: + - Semver: `\d+\.\d+[\.\d-]*` (e.g., `1.19.8-hcskhn`) -- Hash-like: `[a-f0-9]{6,}` or `[A-Za-z0-9]{8,}` between path separators +- Hex hash: `[a-f0-9]{8,}` between path separators (lowercase hex, minimum 8 characters) +- Mixed alphanumeric hash: `[A-Za-z0-9]{8,}` between path separators, **must contain at least one digit and at least one letter** — this excludes pure-alpha dictionary words like `analytics` or `bootstrap` The original URL is preserved as a comment above the generated entry so operators can verify the wildcard substitution is correct. @@ -104,14 +136,14 @@ The original URL is preserved as a comment above the generated entry so operator [[js_assets]] # https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js slug = "aB3kR7mN:prebid-load" -path = "/sdk/aB3kR7mN.js" +path = "/js-assets/aB3kR7mN/prebid-load.js" origin_url = "https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js" inject_in_head = true [[js_assets]] # https://raven-static.vendor.io/prod/1.19.8-hcskhn/raven.js (wildcard detected) slug = "xQ9pL2wY:raven" -path = "/raven-static/*" +path = "/js-assets/xQ9pL2wY/*" origin_url = "https://raven-static.vendor.io/prod/*/raven.js" inject_in_head = false ``` @@ -140,11 +172,11 @@ Diff mode: /audit-js-assets --diff Compares sweep results against the existing `js-assets.toml`. -| Condition | Behavior | -|---|---| -| Asset in sweep, not in file | **New** — appended to `js-assets.toml` as a commented-out block | +| Condition | Behavior | +| --------------------------- | ----------------------------------------------------------------------- | +| Asset in sweep, not in file | **New** — appended to `js-assets.toml` as a commented-out block | | Asset in file, not in sweep | **Missing** — flagged in terminal summary with `⚠`. Never auto-removed. 
| -| Asset in both | **Confirmed** — listed as present | +| Asset in both | **Confirmed** — listed as present | New entries are appended as TOML comments so the file stays valid and nothing is activated without the operator explicitly uncommenting. @@ -155,7 +187,7 @@ New entries are appended as TOML comments so the file stays valid and nothing is # [[js_assets]] # # https://googletagmanager.com/gtm.js # slug = "zM4nK8vP:gtm" -# path = "/sdk/zM4nK8vP.js" +# path = "/js-assets/zM4nK8vP/gtm.js" # origin_url = "https://googletagmanager.com/gtm.js" # inject_in_head = true ``` @@ -179,17 +211,20 @@ Missing: 1 asset no longer seen on page ⚠ The Auditor is a Claude Code skill file. No compiled code. -**Skill location:** `.claude/skills/audit-js-assets.md` +**Skill location:** `.claude/commands/audit-js-assets.md` **MCP tools used:** -- `mcp__chrome-devtools__new_page` — open browser tab -- `mcp__chrome-devtools__navigate_page` — load publisher URL -- `mcp__chrome-devtools__wait_for` — settle after page load -- `mcp__chrome-devtools__list_network_requests` — capture JS requests -- `mcp__chrome-devtools__evaluate_script` — detect head-loaded scripts via DOM query -- `mcp__chrome-devtools__close_page` — clean up tab + +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__new_page` — open browser tab +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__navigate_page` — load publisher URL +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__list_network_requests` — capture JS requests +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` — settle window + detect head-loaded scripts via DOM query +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__close_page` — clean up tab + +**Permission grants required:** `navigate_page`, `list_network_requests`, and `close_page` are not currently approved in `.claude/settings.json`. Add them to `permissions.allow` before running the skill, or expect interactive permission prompts on first run. 
**File tools used:** + - `Read` — read `trusted-server.toml` (publisher domain) and existing `js-assets.toml` (diff mode) - `Write` — write generated/updated `js-assets.toml` @@ -199,7 +234,7 @@ The Auditor is a Claude Code skill file. No compiled code. The Auditor should be delivered **after Proxy Phase 1** (so `js-assets.toml` schema is defined) and **before Proxy Phase 2** (so engineering has real populated entries to test the cache pipeline against actual vendor origins). -See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md). +See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md) _(on `js-asset-proxy-spec` branch until merged)_. ---
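The wildcard rules as tightened in this patch (semver, lowercase hex of at least 8 characters, mixed alphanumeric of at least 8 characters with both a digit and a letter) can be sketched as below. Two choices here are assumptions, not spec text: the semver pattern is anchored only at the start of a segment, because as written it would not consume the `-hcskhn` suffix of the spec's own example, and the final path segment (the filename) is never replaced, since `asset_stem` is derived from it. Function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

SEMVER = re.compile(r"\d+\.\d+[\.\d-]*")
HEX_HASH = re.compile(r"[a-f0-9]{8,}")
ALNUM_HASH = re.compile(r"[A-Za-z0-9]{8,}")


def is_versioned_segment(segment):
    """Decide whether one path segment looks like a version or a hash."""
    if SEMVER.match(segment):  # prefix match only (assumption)
        return True
    if HEX_HASH.fullmatch(segment):
        return True
    if ALNUM_HASH.fullmatch(segment):
        has_digit = any(ch.isdigit() for ch in segment)
        has_letter = any(ch.isalpha() for ch in segment)
        return has_digit and has_letter  # excludes words like "analytics"
    return False


def wildcard_url(url):
    """Replace versioned directory segments with '*', keeping the filename."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    directories, filename = segments[:-1], segments[-1]
    replaced = ["*" if is_versioned_segment(s) else s for s in directories]
    new_path = "/".join(replaced + [filename])
    return urlunsplit((parts.scheme, parts.netloc, new_path,
                       parts.query, parts.fragment))


if __name__ == "__main__":
    print(wildcard_url(
        "https://raven-static.vendor.io/prod/1.19.8-hcskhn/raven.js"))
```

A caller can detect that a wildcard fired by comparing the returned URL with its input, which is also the point at which the original URL would be kept as a comment above the generated entry.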