JS Asset Auditor engineering spec #608
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,216 @@ | ||
| # JS Asset Auditor — Engineering Spec | ||
|
|
||
| **Date:** 2026-04-01 | ||
| **Status:** Approved for engineering breakdown | ||
| **Related:** [JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md) | ||
|
|
||
| --- | ||
|
|
||
| ## Context | ||
|
|
||
| The JS Asset Proxy requires a `js-assets.toml` file declaring which third-party JS assets to proxy. Without tooling, populating this file requires manually inspecting network requests in browser DevTools, extracting URLs, generating opaque slugs, and writing TOML — a tedious error-prone process that is a barrier to publisher onboarding. | ||
|
|
||
| The Auditor eliminates this friction. It sweeps a publisher's page using the Chrome DevTools MCP, detects third-party JS assets, auto-generates `js-assets.toml` entries, and auto-detects `inject_in_head` from the page DOM. The operator's only remaining decision is reviewing the output before committing. | ||
|
|
||
| It also runs as a monitoring tool — `--diff` mode compares a new sweep against the existing config and surfaces new or removed assets, giving publishers ongoing visibility into their third-party JS footprint. | ||
|
|
||
| **Implementation:** Pure Claude Code skill — no Rust, no compiled code, no additional dependencies. Uses the Chrome DevTools MCP already configured in `.claude/settings.json`. | ||
|
|
||
| --- | ||
|
|
||
| ## Command Interface | ||
|
|
||
| ```bash | ||
| /audit-js-assets https://www.publisher.com # init — generate js-assets.toml | ||
| /audit-js-assets https://www.publisher.com --diff # diff — compare against existing file | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Sweep Protocol | ||
|
|
||
| 1. Read `trusted-server.toml` → extract `publisher.domain` (defines first-party boundary) | ||
| 2. Open Chrome via `mcp__chrome-devtools__new_page`, navigate to target URL via `mcp__chrome-devtools__navigate_page` | ||
| 3. Wait for full page load + ~6s settle window for async script loads (`mcp__chrome-devtools__wait_for`) | ||
| 4. In parallel: | ||
| - `mcp__chrome-devtools__list_network_requests` → filter for requests where URL ends in `.js` or `Content-Type: application/javascript`, and origin ≠ `publisher.domain` | ||
|
Collaborator
❓ question — Third-party boundary and filter matching semantics need to be explicit
Could we specify whether host matching here is exact-host, eTLD+1, or suffix-with-dot-boundary matching? That will directly affect what gets surfaced vs filtered.
Collaborator
🔧 wrench — MCP tool names are wrong throughout the spec
Every […] Affects 12 occurrences across the Sweep Protocol (lines 36-43) and Implementation (lines 188-193) sections. A skill implemented from this spec will fail on every MCP call.
Fix: Replace all […]
||
| - `mcp__chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs | ||
|
Collaborator
🔧 wrench — The spec says: "Wait for full page load + ~6s settle window for async script loads (`mcp__chrome-devtools__wait_for`)" But […]
Fix: Use `await new Promise(r => setTimeout(r, 6000))`. Or poll […]
🤔 thinking — 6s may be too short for complex ad tech pages
Header bidding waterfalls, consent-gated loading, and lazy demand partners can take 10-15s. Consider making this configurable (e.g., […]
||
| 5. Apply heuristic filter (see below) | ||
| 6. For each surviving asset, generate a `[[js_assets]]` entry (see below) | ||
|
Collaborator
🔧 wrench — The spec says: "filter for requests where URL ends in […] The tool supports […]
Fix: Rewrite to: "[…]
||
| 7. Write output (init or diff mode) | ||
|
Collaborator
👍 praise — […]
|
||
| 8. Print terminal summary | ||
| 9. Close page via `mcp__chrome-devtools__close_page` | ||
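The request filter in step 4 could be sketched as follows. This is a minimal illustration, not the skill's implementation: it assumes suffix-with-dot-boundary host matching (one of the options raised in review), a hypothetical request record with `url` and `content_type` keys, and a slightly wider set of JavaScript MIME types than the single one the spec names. It checks the URL path rather than the raw URL so that query strings such as `?v=3` do not hide a `.js` suffix.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Assumed MIME types treated as JavaScript; the spec names only application/javascript.
JS_MIME_TYPES = {"application/javascript", "text/javascript", "application/x-javascript"}


def is_first_party(host: str, publisher_domain: str) -> bool:
    """Suffix-with-dot-boundary match: the domain itself or any subdomain of it."""
    host = host.lower().rstrip(".")
    domain = publisher_domain.lower().rstrip(".")
    return host == domain or host.endswith("." + domain)


def is_javascript(url: str, content_type: str | None) -> bool:
    """True if the URL path ends in .js or the Content-Type is a JS MIME type."""
    if urlsplit(url).path.lower().endswith(".js"):
        return True
    if content_type is None:
        return False
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in JS_MIME_TYPES


def third_party_js(requests: list[dict], publisher_domain: str) -> list[str]:
    """Return URLs of JS requests whose host lies outside the publisher domain.

    Each request is a hypothetical dict: {"url": str, "content_type": str | None}.
    """
    urls = []
    for req in requests:
        host = urlsplit(req["url"]).hostname or ""
        if is_javascript(req["url"], req.get("content_type")) and not is_first_party(
            host, publisher_domain
        ):
            urls.append(req["url"])
    return urls
```

Under this matching rule `www.publisher.com` is first-party for `publisher.com`, while `notpublisher.com` is not, because the suffix must start at a dot boundary. An eTLD+1 rule would additionally need a public suffix list, which this sketch does not include.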
|
|
||
| --- | ||
|
|
||
| ## Heuristic Filter | ||
|
|
||
| The following origin categories are excluded silently. The terminal summary reports what was filtered and why so operators can manually add entries if needed. | ||
|
|
||
| | Category | Excluded origins | | ||
| |---|---| | ||
| | Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` | | ||
| | Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` | | ||
| | Font services | `fonts.googleapis.com`, `fonts.gstatic.com` | | ||
| | Social embeds | `platform.twitter.com`, `connect.facebook.net` | | ||
|
|
||
| **`googletagmanager.com` is not filtered** — GTM is ad tech and should be proxied. | ||
|
|
||
|
Collaborator
🤔 thinking — Twitter rebranded to X. The domain may already be […]
||
| Everything else surfaces for operator review. | ||
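The exclusion table and the "Filtered" line of the terminal summary could be sketched like this. The origins and categories are copied from the table; the matching rule (listed origin or any subdomain of it) and the function names are assumptions made for illustration, since the spec does not state how hosts are compared.

```python
from __future__ import annotations

from collections import Counter
from urllib.parse import urlsplit

# Origins and categories copied from the exclusion table in the spec.
EXCLUDED_ORIGINS = {
    "cdnjs.cloudflare.com": "Framework CDNs",
    "ajax.googleapis.com": "Framework CDNs",
    "cdn.jsdelivr.net": "Framework CDNs",
    "unpkg.com": "Framework CDNs",
    "sentry.io": "Error tracking",
    "bugsnag.com": "Error tracking",
    "rollbar.com": "Error tracking",
    "fonts.googleapis.com": "Font services",
    "fonts.gstatic.com": "Font services",
    "platform.twitter.com": "Social embeds",
    "connect.facebook.net": "Social embeds",
}


def excluded_origin(url: str) -> str | None:
    """Return the listed origin that covers this URL's host, or None.

    Assumption: a host matches if it equals a listed origin or is a subdomain of it.
    """
    host = (urlsplit(url).hostname or "").lower()
    for origin in EXCLUDED_ORIGINS:
        if host == origin or host.endswith("." + origin):
            return origin
    return None


def apply_heuristic_filter(urls: list[str]) -> tuple[list[str], Counter]:
    """Split URLs into surfaced assets and a per-origin count of filtered ones."""
    surfaced, filtered = [], Counter()
    for url in urls:
        origin = excluded_origin(url)
        if origin is None:
            surfaced.append(url)
        else:
            filtered[origin] += 1
    return surfaced, filtered


def format_filtered(filtered: Counter) -> str:
    """Render the summary line, most frequently filtered origin first."""
    detail = ", ".join(f"{origin} ×{count}" for origin, count in filtered.most_common())
    return f"Filtered: {sum(filtered.values())} ({detail})"
```

Because `googletagmanager.com` is absent from the mapping, GTM requests pass through to the surfaced list, as the spec requires.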
|
|
||
| --- | ||
|
|
||
| ## Asset Entry Generation | ||
|
|
||
| | Field | Derivation | | ||
| |---|---| | ||
| | `slug` | `{publisher_prefix}:{asset_stem}` — see slug algorithm below | | ||
| | `path` | `/{publisher_prefix}/{asset_stem}.js`, or wildcard variant if versioned path detected | | ||
|
Collaborator
❓ question — The […]
The table says […] Can we pick one formula and update the examples so they match it exactly?
Collaborator
🔧 wrench — Path derivation algorithm contradicts the examples
This table says […] But the example on line 119 shows […] Since […]
Fix: Align the algorithm description with the examples (or vice versa).
||
| | `origin_url` | Full captured URL, with wildcard substitution applied if versioned | | ||
|
Collaborator
❓ question — This reads as though we would persist the captured URL more or less verbatim. That seems likely to commit cache-busters, per-session tokens, consent params, or other transient query values into […]
Could we define the normalization rules here: whether fragments are dropped, which query params are preserved or stripped, and whether wildcarding happens before or after query normalization?
||
| | `ttl_sec` | Omitted — proxy defaults to 1800 (wildcard) or 3600 (fixed) | | ||
|
Collaborator
🤔 thinking — The Proxy spec defines […]
||
| | `inject_in_head` | `true` if URL appeared in head script list from DOM evaluation, else `false` | | ||
|
Collaborator
❓ question — How should […]
The current rule makes sense for static markup, but some loaders briefly insert or move script tags and then remove them. In those cases the network sweep may see the request even though the later DOM query does not.
Do we want the source of truth here to be the final DOM snapshot only, or should the implementation treat some dynamic head insertions as […]
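For the `origin_url` row of the table, one possible normalization pass is sketched below. The spec does not define normalization rules, so everything here is an assumption made to show the shape of the decision: the fragment is dropped, the host is lowercased, and query parameters on a hypothetical deny-list (`TRANSIENT_PARAMS`) are stripped while all others are preserved in order.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list of transient query parameters; the spec does not define one.
TRANSIENT_PARAMS = {"cb", "cachebuster", "_", "ts", "sid"}


def normalize_origin_url(url: str) -> str:
    """Drop the fragment and strip deny-listed query parameters from a captured URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRANSIENT_PARAMS
    ]
    # An empty final element removes the fragment; an empty query removes the "?".
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))
```

A deny-list keeps parameters that select a resource (for example a container ID) unless they are explicitly listed. An allow-list would be stricter but needs per-vendor knowledge. Whether wildcarding runs before or after this step remains open in the spec.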
||
|
|
||
| ### Slug algorithm | ||
|
|
||
| ``` | ||
| publisher_prefix = first_8_chars(base62(sha256(publisher.domain + origin_url))) | ||
| asset_stem = filename_without_extension(origin_url) | ||
| slug = "{publisher_prefix}:{asset_stem}" | ||
| ``` | ||
|
|
||
| **Rationale:** Fully opaque and hash-derived — no human naming required, no ambiguity for cryptic vendor filenames. The KV metadata (`origin_url`, `content_type`, `asset_slug`) serves as the lookup table. Operators can query `js-asset:{slug}` in the KV store to retrieve full provenance. The terminal summary also prints slug → origin_url at generation time. | ||
|
|
||
| **Important:** This algorithm must produce identical output to the Proxy's KV key derivation. Engineering should implement this as a shared utility (e.g., a small JS/TS helper in the skill, or a standalone `scripts/` utility) rather than duplicating the logic. | ||
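A minimal sketch of the slug pseudocode above, written in Python for illustration only. Two details are assumptions because the spec leaves them open: the inputs are concatenated with no separator, exactly as `publisher.domain + origin_url` reads, and the base62 alphabet is digits, then uppercase, then lowercase. A different choice for either produces different slugs, so the Proxy and the Auditor would have to agree on both.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit

# Assumed alphabet order (0-9, A-Z, a-z); base62 has no single standard ordering.
BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62(data: bytes) -> str:
    """Encode bytes as a big-endian integer in base 62 (leading zero bytes are not preserved)."""
    number = int.from_bytes(data, "big")
    if number == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(digits))


def publisher_prefix(publisher_domain: str, origin_url: str) -> str:
    """First 8 base62 characters of sha256(publisher_domain + origin_url), no separator (assumption)."""
    digest = hashlib.sha256((publisher_domain + origin_url).encode("utf-8")).digest()
    return base62(digest)[:8]


def asset_stem(origin_url: str) -> str:
    """Filename of the URL path without its extension, e.g. 'prebid-load'."""
    filename = posixpath.basename(urlsplit(origin_url).path)
    return posixpath.splitext(filename)[0]


def make_slug(publisher_domain: str, origin_url: str) -> str:
    return f"{publisher_prefix(publisher_domain, origin_url)}:{asset_stem(origin_url)}"
```

The prefix depends on `origin_url`, so hashing the wildcarded form versus the captured form also changes the result. The spec does not say which is intended.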
|
|
||
| ### Wildcard detection | ||
|
|
||
| Path segments matching either pattern are replaced with `*`: | ||
| - Semver: `\d+\.\d+[\.\d-]*` (e.g., `1.19.8-hcskhn`) | ||
|
Collaborator
❓ question — Slug algorithm has two ambiguities
1. Concatenation separator: […]
2. base62 character set: base62 is not standardized — different implementations use […]
||
| - Hash-like: `[a-f0-9]{6,}` or `[A-Za-z0-9]{8,}` between path separators | ||
|
Collaborator
❓ question — The hash-like wildcard heuristic looks broad enough to overmatch stable paths
Would it make sense to tighten this to a higher-entropy pattern or otherwise define a more hash-specific rule so engineering does not have to guess where the false-positive boundary should be?
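To make the overmatch concern concrete, here is a small sketch that applies the two patterns exactly as written in the spec to each path segment. How the patterns are anchored is an assumption: the semver pattern is matched as a prefix (so that the spec's own example `1.19.8-hcskhn` qualifies), the hash-like patterns must match a whole segment, and the final segment (the filename) is left untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

SEMVER = re.compile(r"\d+\.\d+[\.\d-]*")    # pattern as written in the spec
HEX_HASH = re.compile(r"[a-f0-9]{6,}")      # pattern as written in the spec
ALNUM_HASH = re.compile(r"[A-Za-z0-9]{8,}")  # pattern as written in the spec


def is_versioned(segment: str) -> bool:
    """True if a path segment looks like a version or hash under the spec's patterns."""
    return bool(
        SEMVER.match(segment)            # prefix match (assumption)
        or HEX_HASH.fullmatch(segment)   # whole segment (assumption)
        or ALNUM_HASH.fullmatch(segment)
    )


def wildcard_url(url: str) -> str:
    """Replace versioned directory segments with '*', leaving the filename as-is."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    directories = ["*" if seg and is_versioned(seg) else seg for seg in segments[:-1]]
    path = "/".join(directories + segments[-1:])
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

With these rules the spec's raven example becomes `https://raven-static.vendor.io/prod/*/raven.js`, as intended. Stable directory names such as `analytics` (nine alphanumeric characters) or `facade` (six characters that all happen to be hex digits) are also replaced with `*`, which is the false-positive case the comment above describes.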
||
|
|
||
| The original URL is preserved as a comment above the generated entry so operators can verify the wildcard substitution is correct. | ||
|
|
||
| --- | ||
|
|
||
| ## Init Mode Output | ||
|
|
||
| ### `js-assets.toml` (written to repo root) | ||
|
|
||
| ```toml | ||
| # Generated by /audit-js-assets on 2026-04-01 | ||
|
Collaborator
🤔 thinking — Hash-like regex […]
The hex pattern […] Consider requiring mixed character classes (must contain both letters and digits), a higher minimum (12+), or excluding common dictionary words.
||
| # Publisher: publisher.com | ||
| # Source URL: https://www.publisher.com | ||
|
|
||
| [[js_assets]] | ||
| # https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js | ||
| slug = "aB3kR7mN:prebid-load" | ||
| path = "/sdk/aB3kR7mN.js" | ||
| origin_url = "https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js" | ||
| inject_in_head = true | ||
|
|
||
| [[js_assets]] | ||
| # https://raven-static.vendor.io/prod/1.19.8-hcskhn/raven.js (wildcard detected) | ||
| slug = "xQ9pL2wY:raven" | ||
| path = "/raven-static/*" | ||
| origin_url = "https://raven-static.vendor.io/prod/*/raven.js" | ||
| inject_in_head = false | ||
| ``` | ||
|
|
||
| ### Terminal summary | ||
|
|
||
| ``` | ||
| JS Asset Audit — publisher.com | ||
| ──────────────────────────────── | ||
| Detected: 8 third-party JS requests | ||
| Filtered: 3 (cdnjs.cloudflare.com ×2, sentry.io ×1) | ||
| Surfaced: 5 assets → js-assets.toml | ||
|
|
||
| aB3kR7mN inject_in_head=true web.prebidwrapper.com/.../prebid-load.js | ||
| xQ9pL2wY inject_in_head=false raven-static.vendor.io/prod/*/raven.js [wildcard] | ||
| zM4nK8vP inject_in_head=true googletagmanager.com/gtm.js | ||
| ... | ||
|
|
||
| Review inject_in_head values and commit js-assets.toml when ready. | ||
| Diff mode: /audit-js-assets <url> --diff | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
|
Collaborator
👍 praise — Diff mode design is excellent
Appending new entries as commented-out TOML blocks keeps the file valid, requires explicit operator action, and provides full provenance. The "never auto-remove" policy for missing assets is the right default for a monitoring tool.
||
| ## Diff Mode Output | ||
|
|
||
| Compares sweep results against the existing `js-assets.toml`. | ||
|
|
||
| | Condition | Behavior | | ||
| |---|---| | ||
| | Asset in sweep, not in file | **New** — appended to `js-assets.toml` as a commented-out block | | ||
| | Asset in file, not in sweep | **Missing** — flagged in terminal summary with `⚠`. Never auto-removed. | | ||
| | Asset in both | **Confirmed** — listed as present | | ||
|
|
||
| New entries are appended as TOML comments so the file stays valid and nothing is activated without the operator explicitly uncommenting. | ||
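The three diff outcomes and the commented-out block could be sketched as follows. The comparison key is an assumption: this version compares `origin_url` strings, which only works if the sweep applies the same wildcard substitution as the stored entries before comparing. The function names are hypothetical.

```python
def classify(sweep_urls: list[str], file_urls: list[str]) -> dict[str, list[str]]:
    """Bucket origin URLs into new, missing, and confirmed using set comparison."""
    sweep, existing = set(sweep_urls), set(file_urls)
    return {
        "new": sorted(sweep - existing),        # in sweep, not in file
        "missing": sorted(existing - sweep),    # in file, not in sweep; never auto-removed
        "confirmed": sorted(sweep & existing),  # in both
    }


def as_commented_block(entry_toml: str, date: str) -> str:
    """Prefix every line of a generated entry with '# ' so TOML parsers ignore it."""
    header = (
        f"# --- NEW (detected by /audit-js-assets --diff on {date}, "
        "uncomment to activate) ---"
    )
    body = ["# " + line if line else "#" for line in entry_toml.splitlines()]
    return "\n".join([header] + body)
```

Only the `new` bucket results in a file write. `missing` entries are reported in the terminal summary and the file is left unchanged for them.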
|
|
||
| ### `js-assets.toml` (new entry appended as comment) | ||
|
|
||
| ```toml | ||
| # --- NEW (detected by /audit-js-assets --diff on 2026-04-01, uncomment to activate) --- | ||
| # [[js_assets]] | ||
| # # https://googletagmanager.com/gtm.js | ||
| # slug = "zM4nK8vP:gtm" | ||
| # path = "/sdk/zM4nK8vP.js" | ||
| # origin_url = "https://googletagmanager.com/gtm.js" | ||
| # inject_in_head = true | ||
| ``` | ||
|
|
||
| ### Terminal summary (diff mode) | ||
|
|
||
| ``` | ||
| JS Asset Audit (diff) — publisher.com | ||
| ──────────────────────────────── | ||
| Confirmed: 4 assets still present on page | ||
| New: 1 asset detected (appended as comment to js-assets.toml) | ||
| Missing: 1 asset no longer seen on page ⚠ | ||
|
|
||
| NEW zM4nK8vP googletagmanager.com/gtm.js → review in js-assets.toml | ||
| MISSING xQ9pL2wY raven-static.vendor.io/... → may have been removed or renamed | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Implementation | ||
|
|
||
| The Auditor is a Claude Code skill file. No compiled code. | ||
|
|
||
| **Skill location:** `.claude/skills/audit-js-assets.md` | ||
|
|
||
| **MCP tools used:** | ||
| - `mcp__chrome-devtools__new_page` — open browser tab | ||
| - `mcp__chrome-devtools__navigate_page` — load publisher URL | ||
|
Collaborator
❓ question — Skill location: […]
The spec places the skill at […] Is there a reason to introduce a new directory? If not, […]
||
| - `mcp__chrome-devtools__wait_for` — settle after page load | ||
| - `mcp__chrome-devtools__list_network_requests` — capture JS requests | ||
| - `mcp__chrome-devtools__evaluate_script` — detect head-loaded scripts via DOM query | ||
| - `mcp__chrome-devtools__close_page` — clean up tab | ||
|
|
||
| **File tools used:** | ||
| - `Read` — read `trusted-server.toml` (publisher domain) and existing `js-assets.toml` (diff mode) | ||
| - `Write` — write generated/updated `js-assets.toml` | ||
|
|
||
| --- | ||
|
|
||
| ## Delivery Order | ||
|
|
||
| The Auditor should be delivered **after Proxy Phase 1** (so `js-assets.toml` schema is defined) and **before Proxy Phase 2** (so engineering has real populated entries to test the cache pipeline against actual vendor origins). | ||
|
|
||
| See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md). | ||
|
|
||
| --- | ||
|
|
||
| ## Verification | ||
|
|
||
| - Run `/audit-js-assets https://www.publisher.com` against a known test publisher page with identified third-party JS | ||
| - Verify generated entries match actual third-party JS observed on the page (cross-check in browser DevTools) | ||
| - Verify `inject_in_head = true` only for scripts that appear in `<head>` (not `<body>`) | ||
| - Verify wildcard detection fires for versioned path segments and not for stable paths | ||
| - Verify GTM (`googletagmanager.com`) is captured and not filtered | ||
| - Verify framework CDNs (`cdnjs.cloudflare.com` etc.) are filtered with reason in summary | ||
| - Run `--diff` against an unchanged page → all entries confirmed, no new/missing | ||
| - Run `--diff` after adding a new vendor script to the page → appears as `NEW` in summary | ||
| - Run `--diff` after removing a script → appears as `MISSING ⚠` in summary, file unchanged | ||
⛏ nitpick — Relative link to Proxy spec may be broken at merge time
`[JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md)` assumes the Proxy spec exists in the same directory. If this PR merges first, the link is dead. Same issue on line 207.