216 changes: 216 additions & 0 deletions docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md
@@ -0,0 +1,216 @@
# JS Asset Auditor — Engineering Spec

**Date:** 2026-04-01
**Status:** Approved for engineering breakdown
**Related:** [JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md)

nitpick — Relative link to Proxy spec may be broken at merge time

[JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md) assumes the Proxy spec exists in the same directory. If this PR merges first, the link is dead. Same issue on line 207.

---

## Context

The JS Asset Proxy requires a `js-assets.toml` file declaring which third-party JS assets to proxy. Without tooling, populating this file requires manually inspecting network requests in browser DevTools, extracting URLs, generating opaque slugs, and writing TOML. This tedious, error-prone process is a barrier to publisher onboarding.

The Auditor eliminates this friction. It sweeps a publisher's page using the Chrome DevTools MCP, detects third-party JS assets, auto-generates `js-assets.toml` entries, and auto-detects `inject_in_head` from the page DOM. The operator's only remaining decision is reviewing the output before committing.

It also runs as a monitoring tool — `--diff` mode compares a new sweep against the existing config and surfaces new or removed assets, giving publishers ongoing visibility into their third-party JS footprint.

**Implementation:** Pure Claude Code skill — no Rust, no compiled code, no additional dependencies. Uses the Chrome DevTools MCP already configured in `.claude/settings.json`.

---

## Command Interface

```bash
/audit-js-assets https://www.publisher.com # init — generate js-assets.toml
/audit-js-assets https://www.publisher.com --diff # diff — compare against existing file
```

---

## Sweep Protocol

1. Read `trusted-server.toml` → extract `publisher.domain` (defines first-party boundary)
2. Open Chrome via `mcp__chrome-devtools__new_page`, navigate to target URL via `mcp__chrome-devtools__navigate_page`
3. Wait for full page load + ~6s settle window for async script loads (`mcp__chrome-devtools__wait_for`)
4. In parallel:
- `mcp__chrome-devtools__list_network_requests` → filter for requests where URL ends in `.js` or `Content-Type: application/javascript`, and origin ≠ `publisher.domain`

question — Third-party boundary and filter matching semantics need to be explicit

origin ≠ publisher.domain leaves a few important cases open: www. variants, publisher-owned subdomains/CDNs, and how the heuristic filter list should match real hosts like www.googletagmanager.com instead of just apex domains.

Could we specify whether host matching here is exact-host, eTLD+1, or suffix-with-dot-boundary matching? That will directly affect what gets surfaced vs filtered.

🔧 wrench — MCP tool names are wrong throughout the spec

Every mcp__chrome-devtools__* reference uses an incorrect prefix. The actual tool names in .claude/settings.json and .claude/settings.local.json are:

mcp__plugin_chrome-devtools-mcp_chrome-devtools__new_page
mcp__plugin_chrome-devtools-mcp_chrome-devtools__navigate_page
mcp__plugin_chrome-devtools-mcp_chrome-devtools__list_network_requests
...etc

Affects 12 occurrences across the Sweep Protocol (lines 36-43) and Implementation (lines 188-193) sections. A skill implemented from this spec will fail on every MCP call.

Fix: Replace all mcp__chrome-devtools__ with mcp__plugin_chrome-devtools-mcp_chrome-devtools__.

- `mcp__chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs

🔧 wrench — `wait_for` does not provide a settle window

The spec says: "Wait for full page load + ~6s settle window for async script loads (`wait_for`)"

But `wait_for` waits for specific text to appear on the page (params: `text`, a list of strings; `timeout`, the maximum wait in ms). It is not a network-idle or timeout mechanism.

Fix: Use `evaluate_script` with `await new Promise(r => setTimeout(r, 6000))`, or poll `list_network_requests` to detect when no new script requests appear.


🤔 thinking — 6s may be too short for complex ad tech pages

Header bidding waterfalls, consent-gated loading, and lazy demand partners can take 10-15s. Consider making this configurable (e.g., --settle 10s) or documenting the tradeoff.
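
The polling alternative mentioned in this thread could look roughly like the sketch below. It is not part of the spec: `listScriptUrls` is a hypothetical async callback standing in for a `list_network_requests` call, and the option names `quietMs`, `maxMs`, and `intervalMs` are assumptions chosen for illustration.

```javascript
// Hypothetical helper: resolves once no new script URL has appeared for
// `quietMs`, or once `maxMs` has elapsed, whichever comes first.
async function waitForScriptQuiet(
  listScriptUrls,
  { quietMs = 6000, maxMs = 15000, intervalMs = 500 } = {}
) {
  const seen = new Set();
  const start = Date.now();
  let lastChange = start;

  while (true) {
    for (const url of await listScriptUrls()) {
      if (!seen.has(url)) {
        seen.add(url);
        lastChange = Date.now(); // a new script resets the quiet timer
      }
    }
    const now = Date.now();
    if (now - lastChange >= quietMs || now - start >= maxMs) break;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return [...seen];
}
```

A hard cap such as `maxMs` keeps the sweep bounded on pages that never go quiet, which is the tradeoff the configurable settle window would expose.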

5. Apply heuristic filter (see below)
6. For each surviving asset, generate a `[[js_assets]]` entry (see below)

🔧 wrench — `list_network_requests` filtering claim is inaccurate

The spec says: "filter for requests where URL ends in `.js` or `Content-Type: application/javascript`"

The tool supports `resourceTypes` filtering (e.g., `["script"]`) but does not support URL pattern or Content-Type filtering. Those must be done in post-processing.

Fix: Rewrite to: "`list_network_requests` with `resourceTypes: ["script"]` → post-filter to exclude first-party origins matching `publisher.domain`"

7. Write output (init or diff mode)

👍 praise — `inject_in_head` detection via DOM query is sound

document.head.querySelectorAll('script[src]') correctly captures what's in <head> at load time, which is exactly what the Proxy needs to replicate at runtime.

8. Print terminal summary
9. Close page via `mcp__chrome-devtools__close_page`
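
The post-filtering step in the protocol above depends on how "origin ≠ `publisher.domain`" is interpreted. As a minimal sketch of one option raised in review (suffix matching on a dot boundary), assuming `publisher.domain` holds a bare domain such as `publisher.com`; the function names are hypothetical:

```javascript
// Assumption: a host is first-party if it equals the publisher domain or is a
// subdomain of it. "notpublisher.com" does not match "publisher.com".
function hostMatches(host, domain) {
  const h = host.toLowerCase();
  const d = domain.toLowerCase();
  return h === d || h.endsWith("." + d);
}

// Keep only script request URLs whose host falls outside the publisher domain.
function thirdPartyScripts(requestUrls, publisherDomain) {
  return requestUrls.filter((raw) => {
    let host;
    try {
      host = new URL(raw).hostname;
    } catch {
      return false; // malformed URL: nothing to proxy
    }
    // data: and blob: URLs parse with an empty hostname and are skipped.
    return host !== "" && !hostMatches(host, publisherDomain);
  });
}
```

Under this rule publisher-owned subdomains and `www.` variants count as first-party; exact-host or eTLD+1 matching would draw the boundary differently.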

---

## Heuristic Filter

The following origin categories are excluded silently. The terminal summary reports what was filtered and why so operators can manually add entries if needed.

| Category | Excluded origins |
|---|---|
| Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` |
| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` |
| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` |
| Social embeds | `platform.twitter.com`, `connect.facebook.net` |

**`googletagmanager.com` is not filtered** — GTM is ad tech and should be proxied.

🤔 thinking — `platform.twitter.com` may be outdated

Twitter rebranded to X. The domain may already be platform.x.com or will change. Consider listing both, or verifying the current domain.

Everything else surfaces for operator review.
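
A minimal sketch of the filter, using the origins from the table above. Matching subdomains of a listed origin (for example `browser.sentry.io` against `sentry.io`) is an assumption, as are the function names and the summary format, which imitates the terminal summary shown later in the spec.

```javascript
// Categories and origins copied from the Heuristic Filter table.
const EXCLUDED = {
  "Framework CDNs": ["cdnjs.cloudflare.com", "ajax.googleapis.com", "cdn.jsdelivr.net", "unpkg.com"],
  "Error tracking": ["sentry.io", "bugsnag.com", "rollbar.com"],
  "Font services": ["fonts.googleapis.com", "fonts.gstatic.com"],
  "Social embeds": ["platform.twitter.com", "connect.facebook.net"],
};

// Returns the matching category and listed origin, or null if not excluded.
function findExclusion(host) {
  for (const [category, origins] of Object.entries(EXCLUDED)) {
    for (const origin of origins) {
      if (host === origin || host.endsWith("." + origin)) return { category, origin };
    }
  }
  return null;
}

// Splits URLs into surfaced assets and per-origin counts of filtered requests.
function applyHeuristicFilter(urls) {
  const surfaced = [];
  const filtered = new Map(); // listed origin -> number of filtered requests
  for (const url of urls) {
    const hit = findExclusion(new URL(url).hostname);
    if (hit) filtered.set(hit.origin, (filtered.get(hit.origin) || 0) + 1);
    else surfaced.push(url);
  }
  const summary = [...filtered].map(([origin, n]) => `${origin} ×${n}`).join(", ");
  return { surfaced, filtered, summary };
}
```

Because `googletagmanager.com` is absent from the table, both the apex host and `www.googletagmanager.com` pass through to the surfaced list.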

---

## Asset Entry Generation

| Field | Derivation |
|---|---|
| `slug` | `{publisher_prefix}:{asset_stem}` — see slug algorithm below |
| `path` | `/{publisher_prefix}/{asset_stem}.js`, or wildcard variant if versioned path detected |

question — The path derivation formula conflicts with the examples below

The table says /{publisher_prefix}/{asset_stem}.js, but the examples later use /sdk/aB3kR7mN.js and /raven-static/*. That leaves the implementation having to infer the canonical route shape.

Can we pick one formula and update the examples so they match it exactly?

🔧 wrench — Path derivation algorithm contradicts the examples

This table says path = "/{publisher_prefix}/{asset_stem}.js". For the first example (publisher_prefix = aB3kR7mN, asset_stem = prebid-load), this yields /aB3kR7mN/prebid-load.js.

But the example on line 119 shows path = "/sdk/aB3kR7mN.js" — a completely different structure. The Proxy spec shows the same contradictory examples.

Since path is the routing key between Auditor and Proxy, this must be resolved.

Fix: Align the algorithm description with the examples (or vice versa).

| `origin_url` | Full captured URL, with wildcard substitution applied if versioned |

question — `origin_url` needs an explicit normalization policy

This reads as though we would persist the captured URL more or less verbatim. That seems likely to commit cache-busters, per-session tokens, consent params, or other transient query values into js-assets.toml, which would create churn and could leak noisy values into config.

Could we define the normalization rules here: whether fragments are dropped, which query params are preserved or stripped, and whether wildcarding happens before or after query normalization?

| `ttl_sec` | Omitted — proxy defaults to 1800 (wildcard) or 3600 (fixed) |

🤔 thinking — `stale_ttl_sec` is not mentioned

The Proxy spec defines stale_ttl_sec: Option<u32> (default 86400) in the JsAsset struct. This field is omitted from the Auditor's Asset Entry Generation table and generated TOML examples. While the proxy defaults handle the omission, it would be consistent to mention it (like ttl_sec is): "Omitted — proxy defaults to 86400".

| `inject_in_head` | `true` if URL appeared in head script list from DOM evaluation, else `false` |

question — How should inject_in_head behave for dynamically inserted scripts that are no longer in document.head at snapshot time?

The current rule makes sense for static markup, but some loaders briefly insert or move script tags and then remove them. In those cases the network sweep may see the request even though the later DOM query does not.

Do we want the source of truth here to be the final DOM snapshot only, or should the implementation treat some dynamic head insertions as inject_in_head = true as well?
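
For reference, the rule as currently written (final DOM snapshot only) amounts to a set-membership check. The sketch below assumes URLs are compared after canonicalising them with the standard `URL` parser; `markInjectInHead` is a hypothetical name.

```javascript
// networkUrls: script URLs seen in the network sweep.
// headScriptUrls: result of the document.head DOM query at snapshot time.
function markInjectInHead(networkUrls, headScriptUrls) {
  const canonical = (u) => new URL(u).href;
  const inHead = new Set(headScriptUrls.map(canonical));
  return networkUrls.map((url) => ({
    origin_url: url,
    inject_in_head: inHead.has(canonical(url)),
  }));
}
```

A script that was inserted into `<head>` and removed before the snapshot would come out as `inject_in_head = false` here, which is exactly the case the question above asks about.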


### Slug algorithm

```
publisher_prefix = first_8_chars(base62(sha256(publisher.domain + origin_url)))
asset_stem = filename_without_extension(origin_url)
slug = "{publisher_prefix}:{asset_stem}"
```

**Rationale:** Fully opaque and hash-derived — no human naming required, no ambiguity for cryptic vendor filenames. The KV metadata (`origin_url`, `content_type`, `asset_slug`) serves as the lookup table. Operators can query `js-asset:{slug}` in the KV store to retrieve full provenance. The terminal summary also prints slug → origin_url at generation time.

**Important:** This algorithm must produce identical output to the Proxy's KV key derivation. Engineering should implement this as a shared utility (e.g., a small JS/TS helper in the skill, or a standalone `scripts/` utility) rather than duplicating the logic.
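
A sketch of what such a shared helper might look like in Node.js. Two details are assumptions because the pseudocode does not pin them: the domain and URL are concatenated with no separator, and the base62 alphabet is ordered `0-9A-Za-z`. Function names are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// Assumption: digit order 0-9, then A-Z, then a-z.
const BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Encodes a non-empty byte buffer as a big-endian base62 string.
function base62(bytes) {
  let n = BigInt("0x" + Buffer.from(bytes).toString("hex"));
  let out = "";
  while (n > 0n) {
    out = BASE62[Number(n % 62n)] + out;
    n /= 62n;
  }
  // 62^43 exceeds 2^256, so a SHA-256 digest needs at most 43 digits.
  // Left-pad so the encoding has a fixed width.
  return out.padStart(43, "0");
}

// Last path segment of the URL with its final extension removed.
function assetStem(originUrl) {
  const last = new URL(originUrl).pathname.split("/").pop() || "";
  const dot = last.lastIndexOf(".");
  return dot > 0 ? last.slice(0, dot) : last;
}

function makeSlug(publisherDomain, originUrl) {
  // Assumption: raw concatenation, no separator between domain and URL.
  const digest = createHash("sha256").update(publisherDomain + originUrl, "utf8").digest();
  const publisherPrefix = base62(digest).slice(0, 8);
  return `${publisherPrefix}:${assetStem(originUrl)}`;
}
```

Whatever separator and alphabet are finally chosen, the Proxy's KV key derivation has to make the same choices or the slugs will diverge.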

### Wildcard detection

Path segments matching either pattern are replaced with `*`:
- Semver: `\d+\.\d+[\.\d-]*` (e.g., `1.19.8-hcskhn`)

question — Slug algorithm has two ambiguities

1. Concatenation separator: sha256(publisher.domain + origin_url) — what separates the two? Raw concatenation produces example.comhttps://vendor.io/script.js. Since Auditor and Proxy must produce identical slugs, the separator (or its explicit absence) needs to be specified.

2. base62 character set: base62 is not standardized — different implementations use [0-9A-Za-z] vs [0-9a-zA-Z] vs [A-Za-z0-9]. The spec should pin the exact ordering or name a reference implementation.

- Hash-like: `[a-f0-9]{6,}` or `[A-Za-z0-9]{8,}` between path separators

question — The hash-like wildcard heuristic looks broad enough to overmatch stable paths

[A-Za-z0-9]{8,} will match a lot of stable path segments in the wild, not just hashes. That seems likely to over-wildcard legitimate routes and produce unstable proxy config.

Would it make sense to tighten this to a higher-entropy pattern or otherwise define a more hash-specific rule so engineering does not have to guess where the false-positive boundary should be?


The original URL is preserved as a comment above the generated entry so operators can verify the wildcard substitution is correct.
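
The two patterns can be exercised with a short sketch. Anchoring is an assumption: the semver pattern is matched at the start of a segment so that the spec's own example `1.19.8-hcskhn` qualifies, while the hash-like patterns must cover the whole segment ("between path separators"). `applyWildcards` is a hypothetical name.

```javascript
// Patterns from the spec; see the lead-in for the anchoring assumption.
const SEMVER = /^\d+\.\d+[.\d-]*/;
const HEX_HASH = /^[a-f0-9]{6,}$/;
const ALNUM_HASH = /^[A-Za-z0-9]{8,}$/;

function isVersionedSegment(segment) {
  return SEMVER.test(segment) || HEX_HASH.test(segment) || ALNUM_HASH.test(segment);
}

// Replaces each matching path segment with "*" and reports whether any matched.
function applyWildcards(capturedUrl) {
  const url = new URL(capturedUrl);
  let wildcard = false;
  const segments = url.pathname.split("/").map((segment) => {
    if (segment !== "" && isVersionedSegment(segment)) {
      wildcard = true;
      return "*";
    }
    return segment;
  });
  return { originUrl: url.origin + segments.join("/") + url.search, wildcard };
}
```

With these rules the raven example becomes `https://raven-static.vendor.io/prod/*/raven.js` and the prebid URL is left alone, but a stable segment such as `analytics` is also replaced, which illustrates the overmatching concern raised in review.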

---

## Init Mode Output

### `js-assets.toml` (written to repo root)

🤔 thinking — Hash-like regex [A-Za-z0-9]{8,} is too broad

The hex pattern [a-f0-9]{6,} is well-scoped, but [A-Za-z0-9]{8,} between path separators would match stable, legitimate segments like analytics (9), bootstrap (9), modernizr (9), dashboard (9). These would be incorrectly wildcarded.

Consider requiring mixed character classes (must contain both letters and digits), a higher minimum (12+), or excluding common dictionary words.

```toml
# Generated by /audit-js-assets on 2026-04-01
# Publisher: publisher.com
# Source URL: https://www.publisher.com

[[js_assets]]
# https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js
slug = "aB3kR7mN:prebid-load"
path = "/sdk/aB3kR7mN.js"
origin_url = "https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js"
inject_in_head = true

[[js_assets]]
# https://raven-static.vendor.io/prod/1.19.8-hcskhn/raven.js (wildcard detected)
slug = "xQ9pL2wY:raven"
path = "/raven-static/*"
origin_url = "https://raven-static.vendor.io/prod/*/raven.js"
inject_in_head = false
```
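
One way the entries above could be serialised is sketched here. `renderEntry` and its field names are hypothetical, and the `path` value is taken as an input because its derivation is still under discussion. Using `JSON.stringify` for quoting is an assumption that holds for ordinary URLs, whose JSON escapes are also valid in TOML basic strings.

```javascript
// Renders one [[js_assets]] block in the layout shown in the example above.
function renderEntry({ capturedUrl, slug, path, originUrl, injectInHead, wildcard }) {
  const note = wildcard ? " (wildcard detected)" : "";
  return [
    "[[js_assets]]",
    `# ${capturedUrl}${note}`, // original URL kept as a comment for review
    `slug = ${JSON.stringify(slug)}`,
    `path = ${JSON.stringify(path)}`,
    `origin_url = ${JSON.stringify(originUrl)}`,
    `inject_in_head = ${injectInHead}`,
  ].join("\n");
}
```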

### Terminal summary

```
JS Asset Audit — publisher.com
────────────────────────────────
Detected: 8 third-party JS requests
Filtered: 3 (cdnjs.cloudflare.com ×2, sentry.io ×1)
Surfaced: 5 assets → js-assets.toml

aB3kR7mN inject_in_head=true web.prebidwrapper.com/.../prebid-load.js
xQ9pL2wY inject_in_head=false raven-static.vendor.io/prod/*/raven.js [wildcard]
zM4nK8vP inject_in_head=true googletagmanager.com/gtm.js
...

Review inject_in_head values and commit js-assets.toml when ready.
Diff mode: /audit-js-assets <url> --diff
```

---

👍 praise — Diff mode design is excellent

Appending new entries as commented-out TOML blocks keeps the file valid, requires explicit operator action, and provides full provenance. The "never auto-remove" policy for missing assets is the right default for a monitoring tool.

## Diff Mode Output

Compares sweep results against the existing `js-assets.toml`.

| Condition | Behavior |
|---|---|
| Asset in sweep, not in file | **New** — appended to `js-assets.toml` as a commented-out block |
| Asset in file, not in sweep | **Missing** — flagged in terminal summary with `⚠`. Never auto-removed. |
| Asset in both | **Confirmed** — listed as present |

New entries are appended as TOML comments so the file stays valid and nothing is activated without the operator explicitly uncommenting.
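
The three conditions in the table reduce to set operations. The sketch below assumes entries are compared by `slug` (the spec does not say which field is the comparison key) and uses hypothetical function names.

```javascript
// Classifies slugs from the existing file against slugs from the new sweep.
function diffAssets(existingSlugs, sweptSlugs) {
  const existing = new Set(existingSlugs);
  const swept = new Set(sweptSlugs);
  return {
    confirmed: [...existing].filter((s) => swept.has(s)),
    missing: [...existing].filter((s) => !swept.has(s)), // flagged, never removed
    added: [...swept].filter((s) => !existing.has(s)),
  };
}

// Turns a rendered TOML block into comments so it stays inert until uncommented.
function asCommentedBlock(tomlBlock) {
  return tomlBlock
    .split("\n")
    .map((line) => `# ${line}`)
    .join("\n");
}
```

Prefixing every line with `# ` yields the doubled `# #` on the original-URL line, so uncommenting one level restores a valid entry with its provenance comment intact.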

### `js-assets.toml` (new entry appended as comment)

```toml
# --- NEW (detected by /audit-js-assets --diff on 2026-04-01, uncomment to activate) ---
# [[js_assets]]
# # https://googletagmanager.com/gtm.js
# slug = "zM4nK8vP:gtm"
# path = "/sdk/zM4nK8vP.js"
# origin_url = "https://googletagmanager.com/gtm.js"
# inject_in_head = true
```

### Terminal summary (diff mode)

```
JS Asset Audit (diff) — publisher.com
────────────────────────────────
Confirmed: 4 assets still present on page
New: 1 asset detected (appended as comment to js-assets.toml)
Missing: 1 asset no longer seen on page ⚠

NEW zM4nK8vP googletagmanager.com/gtm.js → review in js-assets.toml
MISSING xQ9pL2wY raven-static.vendor.io/... → may have been removed or renamed
```

---

## Implementation

The Auditor is a Claude Code skill file. No compiled code.

**Skill location:** `.claude/skills/audit-js-assets.md`

**MCP tools used:**
- `mcp__chrome-devtools__new_page` — open browser tab
- `mcp__chrome-devtools__navigate_page` — load publisher URL

question — Skill location: .claude/skills/ vs .claude/commands/

The spec places the skill at .claude/skills/audit-js-assets.md, but the project has no .claude/skills/ directory. All existing slash commands live in .claude/commands/ (check-ci.md, verify.md, test-all.md, etc.).

Is there a reason to introduce a new directory? If not, .claude/commands/audit-js-assets.md would be consistent with existing patterns.

- `mcp__chrome-devtools__wait_for` — settle after page load
- `mcp__chrome-devtools__list_network_requests` — capture JS requests
- `mcp__chrome-devtools__evaluate_script` — detect head-loaded scripts via DOM query
- `mcp__chrome-devtools__close_page` — clean up tab

**File tools used:**
- `Read` — read `trusted-server.toml` (publisher domain) and existing `js-assets.toml` (diff mode)
- `Write` — write generated/updated `js-assets.toml`

---

## Delivery Order

The Auditor should be delivered **after Proxy Phase 1** (so `js-assets.toml` schema is defined) and **before Proxy Phase 2** (so engineering has real populated entries to test the cache pipeline against actual vendor origins).

See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md).

---

## Verification

- Run `/audit-js-assets https://www.publisher.com` against a known test publisher page with identified third-party JS
- Verify generated entries match actual third-party JS observed on the page (cross-check in browser DevTools)
- Verify `inject_in_head = true` only for scripts that appear in `<head>` (not `<body>`)
- Verify wildcard detection fires for versioned path segments and not for stable paths
- Verify GTM (`googletagmanager.com`) is captured and not filtered
- Verify framework CDNs (`cdnjs.cloudflare.com` etc.) are filtered with reason in summary
- Run `--diff` against an unchanged page → all entries confirmed, no new/missing
- Run `--diff` after adding a new vendor script to the page → appears as `NEW` in summary
- Run `--diff` after removing a script → appears as `MISSING ⚠` in summary, file unchanged