
Commit 355034e

Merge remote-tracking branch 'origin/main' into wasteful-macaw
2 parents 3e96340 + 93b10a5 commit 355034e

53 files changed

Lines changed: 6983 additions & 1260 deletions

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
---
name: add-httparchive-metric-report
description: Add new metrics to HTTPArchive reports config. USE FOR adding performance metrics, adoption/percentage metrics, or custom metric analysis from crawl data. Chooses timeseries vs histogram based on data type.
---

# Adding Metrics to HTTPArchive Reports

## Documentation

**See [reports.md](../../../reports.md)** for the complete guide, including:
- Architecture and processing details
- Quick Decision Guide table
- Required SQL patterns checklist
- SQL pattern reference (adoption, percentiles, binning)
- Complete examples
- Troubleshooting

## Quick Start

1. Open `includes/reports.js`, find `config._metrics` (line ~42)
2. Choose type: **Timeseries** (adoption/percentiles) or **Histogram** (distributions)
3. Add the metric with the required patterns: `date`, `is_root_page`, `${params.lens.sql}`, `${params.devRankFilter}`, `${ctx.ref('crawl', 'pages')}`, `GROUP BY client`
4. Run `get_errors` to verify

## Key Rules

- **Boolean/adoption metrics**: Timeseries ONLY (a histogram is meaningless for two states)
- **Continuous metrics**: Both histogram + timeseries
- **Use safe functions**: `SAFE_DIVIDE()`, `SAFE.BOOL()` for custom metrics
- **Filter zeros**: Add `AND metric > 0` before percentile calculations

## Minimal Example

```javascript
metricName: {
  SQL: [
    {
      type: 'timeseries', // or 'histogram'
      query: DataformTemplateBuilder.create((ctx, params) => `
SELECT client, /* your calculations */
FROM ${ctx.ref('crawl', 'pages')}
WHERE date = '${params.date}' AND is_root_page
${params.lens.sql} ${params.devRankFilter}
GROUP BY client ORDER BY client
`)
    }
  ]
}
```
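
Before wiring a new metric into `reports.js`, the embedded SQL can be sanity-checked on its own with a `bq` dry run. A minimal sketch, assuming a hypothetical `is_https` column and an arbitrary crawl date:

```bash
# Dry run validates syntax and estimates bytes scanned without executing.
# `is_https` is a hypothetical column used for illustration only.
bq query --use_legacy_sql=false --dry_run \
"SELECT
  client,
  SAFE_DIVIDE(COUNTIF(is_https), COUNT(0)) * 100 AS percent
FROM \`httparchive.crawl.pages\`
WHERE date = '2024-06-01' AND is_root_page
GROUP BY client
ORDER BY client"
```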

See [reports.md](../../../reports.md) for complete patterns and examples.
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
---
name: optimize-model-compute
description: Optimize BigQuery compute costs by assigning Dataform actions to slot reservations. USE FOR managing which models use reserved slots vs on-demand pricing, updating reservation assignments, or analyzing cost vs priority tradeoffs for data pipelines.
---

# Optimize Model Compute (BigQuery Reservations)

## Purpose

Automatically assign Dataform actions to BigQuery slot reservations based on priority and cost optimization strategy. Routes high-priority workloads to reserved slots while using on-demand pricing for low-priority tasks.

## When to Use

- Assigning new models/actions to appropriate compute tiers (reserved vs on-demand)
- Rebalancing reservation assignments based on priority changes
- Optimizing costs by moving low-priority workloads to on-demand
- Ensuring critical pipelines get guaranteed compute resources

## Configuration File

Reservations are configured in `definitions/_reservations.js`:

```javascript
const { autoAssignActions } = require('@masthead-data/dataform-package')

const RESERVATION_CONFIG = [
  {
    tag: 'reservation', // Human-readable identifier
    reservation: 'projects/.../reservations/...', // BigQuery reservation path
    actions: [ // Models assigned to this tier
      'httparchive.crawl.pages',
      'httparchive.f1.pages_latest'
    ]
  },
  {
    tag: 'on_demand',
    reservation: 'none', // On-demand pricing
    actions: [
      'httparchive.sample_data.pages_10k'
    ]
  }
]

autoAssignActions(RESERVATION_CONFIG)
```
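
To cross-check the configured reservation paths against what actually exists, the `bq` CLI can list reservations and their assignments. A sketch, assuming US-located reservations administered from the `httparchive` project:

```bash
# List slot reservations visible to the admin project
bq ls --reservation --project_id=httparchive --location=US

# List which projects or folders are assigned to each reservation
bq ls --reservation_assignment --project_id=httparchive --location=US
```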

## Implementation Steps

### Step 1: Source Configuration

**TODO**: _User will provide details on how to determine which models should use reserved vs on-demand compute_

### Step 2: Update Configuration

1. Open `definitions/_reservations.js`
2. Add or move actions between reservation tiers:
   - **Reserved slots** (`reservation: 'projects/...'`): Critical, high-priority, SLA-sensitive workloads
   - **On-demand** (`reservation: 'none'`): Low-priority, ad-hoc, or experimental workloads

### Step 3: Verify Changes

```bash
# Check syntax
dataform compile

# Validate no duplicate assignments: print any quoted action name listed more than once
grep -oE "'[^']+'" definitions/_reservations.js | sort | uniq -d
```

## Decision Criteria

| Factor | Reserved Slots | On-Demand |
|--------|----------------|-----------|
| **Priority** | High, SLA-bound | Low, flexible |
| **Frequency** | Regular, scheduled | Ad-hoc, occasional |
| **Cost Pattern** | Predictable usage | Variable, sporadic |
| **Impact** | Critical pipelines | Experimental, samples |

## Key Notes

- Each action should appear in only ONE reservation config
- The file name starts with `_` so it runs first in the Dataform queue
- Changes take effect on the next Dataform workflow run
- The package handles assignment globally (no per-file edits needed)

## Package Reference

Using `@masthead-data/dataform-package` (see [package.json](../../../package.json))
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
---
name: optimize-storage-costs
description: Optimize BigQuery storage costs by identifying and removing dead-end and unused tables. USE FOR analyzing storage waste, reviewing tables with no consumption, cleaning up unused datasets, or implementing storage cost reduction strategies.
---

# Optimize Storage Costs (Dead-end and Unused Tables)

## Purpose

Identify and remove BigQuery tables that contribute to storage costs but have no active consumption, based on Masthead Data lineage analysis.

## Table Categories

Masthead Data identifies these tables via lineage analysis, which only sees visible pipeline references, so modification timestamps are critical:

| Type | Definition | Indicators | Watch for |
|------|------------|------------|-----------|
| **Dead-end** | Regularly updated, no downstream consumption | Updated but never read in 30+ days | External writers outside the lineage graph (manual jobs, independent pipelines) |
| **Unused** | No upstream or downstream activity | No reads/writes in 30+ days | Recent `lastModifiedTime` despite the "Unused" flag suggests an external writer—**do not drop without verification** |

### Key Signal

If a table is flagged `Unused` **and** has a recent modification timestamp, something outside Masthead's visibility is writing to it. This always warrants investigation before dropping.

## When to Use

- Reducing storage costs when budget is constrained
- Cleaning up abandoned tables and pipelines
- Implementing regular storage hygiene
- Investigating sudden storage cost increases

## Prerequisites

- Masthead Data agent v0.2.7+ installed (for accurate lineage)
- Access to the Masthead insights dataset: `masthead-prod.httparchive.insights`
- BigQuery permissions to query insights and drop tables

## Implementation Steps

### Step 1: Query Storage Waste

```bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  subtype,
  project_id,
  target_resource,
  SAFE.STRING(operations[0].resource_type) AS resource_type,
  SAFE.INT64(overview.num_bytes) / POW(1024, 4) AS total_tib,
  SAFE.FLOAT64(overview.cost_30d) AS cost_usd_30d,
  SAFE.FLOAT64(overview.savings_30d) AS savings_usd_30d
FROM \`masthead-prod.httparchive.insights\`
WHERE category = 'Cost'
  AND subtype IN ('Dead end table', 'Unused table')
  AND overview.num_bytes IS NOT NULL
  AND SAFE.FLOAT64(overview.savings_30d) > 10
  AND target_resource NOT LIKE '%analytics_%' -- Filter out low-impact GA intraday tables
ORDER BY savings_usd_30d DESC" > storage_waste.csv
```

**Note:** Sorting by `savings_usd_30d` instead of `total_tib` prioritizes high-impact targets for review.

**Alternative: Use Masthead UI**
- Navigate to the [Dictionary page](https://app.mastheadata.com/dictionary?tab=Tables&deadEnd=true)
- Filter by `Dead-end` or `Unused` labels
- Export the table list for review
67+
### Step 2: Review and Decide
68+
69+
Review `storage_waste.csv` and add a `status` column with values:
70+
- `keep` - Table is needed
71+
- `to drop` - Safe to remove
72+
- `investigate` - Needs further analysis
73+
74+
**Review criteria:**
75+
- Is this a backup or archive table? (consider alternative storage)
76+
- Is there a downstream dependency not captured in lineage?
77+
- Is this table part of an active experiment or migration?
78+
- **For repo-managed projects:** Search the codebase (e.g., `grep` for table name in model definitions, scripts) to confirm ownership. Table naming can be misleading (e.g., `cwv_tech_*` may seem like current outputs but could be legacy).
79+
- **Check for disabled producers:** If a Dataform `publish()` has `disabled: true` but the underlying BigQuery table still exists and has recent modifications, either the table is abandoned or an external process took over—both warrant investigation.
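
A minimal codebase search for the criteria above, assuming a Dataform-style layout with `definitions/` and `scripts/` directories and a hypothetical table name:

```bash
# Hypothetical table name; substitute each candidate from storage_waste.csv
TABLE="cwv_tech_adoption"

# Find any model definition or script that references the table
grep -rn --include='*.js' --include='*.sqlx' --include='*.sql' \
  "$TABLE" definitions/ scripts/ || echo "No references found"
```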

### Step 3: Drop Approved Tables

```bash
# Generate DROP statements (target_resource is column 3; status is the last column)
awk -F',' '$NF=="to drop" {
  print "bq rm -f -t " $3
}' storage_waste.csv > drop_tables.sh

# Review generated commands
cat drop_tables.sh

# Execute (after review!)
bash drop_tables.sh
```

**Safe mode (preview first):**
```bash
# Prefix each command with echo to print it without executing
sed 's/^bq rm/echo bq rm/' drop_tables.sh > drop_tables_preview.sh
bash drop_tables_preview.sh
```

### Step 4: Verify Savings

After 24-48 hours, check the storage reduction in Masthead:
- [Storage Cost Insights](https://app.mastheadata.com/costs?tab=Storage+costs)
- Compare before/after storage size and costs
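
Storage can also be spot-checked directly in BigQuery via the `INFORMATION_SCHEMA.TABLE_STORAGE` view. A sketch, assuming the data lives in the US multi-region (adjust the region qualifier otherwise):

```bash
# Sum logical storage per dataset to compare against a pre-cleanup baseline
bq query --use_legacy_sql=false \
"SELECT
  table_schema AS dataset,
  ROUND(SUM(total_logical_bytes) / POW(1024, 4), 2) AS logical_tib
FROM \`YOUR_PROJECT.region-us.INFORMATION_SCHEMA.TABLE_STORAGE\`
GROUP BY dataset
ORDER BY logical_tib DESC"
```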

## Decision Framework

| Monthly Savings | Action | Recency Check |
|-----------------|--------|---------------|
| < $10 | Consider keeping (low ROI) | Skip if `lastModifiedTime` > 12 months old (hygiene only) |
| $10-$100 | Review and drop if unused | Check modification date; recent writes require owner verification |
| $100-$1000 | Priority review, likely drop | Mandatory verification if modified in last 30 days |
| > $1000 | Immediate investigation required | Always verify external writer before any action |

## Key Notes

- **Dead-end tables** may indicate pipeline issues - investigate before dropping
- **Unused tables with recent modifications** are the highest-priority investigate cases. The gap between Masthead's "no lineage" and actual writes means an external dependency exists.
- Tables can be restored from time travel (7 days) or fail-safe (7 days after time travel); see the sketch below
- Consider archiving to Cloud Storage if compliance requires retention
- Coordinate with data teams before dropping shared datasets
- Wait 14 days after storage billing model changes before dropping tables
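
If a drop turns out to be a mistake, a table can be recovered within the time-travel window using a snapshot decorator on `bq cp`. A minimal sketch with hypothetical dataset and table names:

```bash
# Copy the table as it existed one hour ago (-3600000 ms) to a new name
bq cp mydataset.mytable@-3600000 mydataset.mytable_restored
```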

## Related Optimizations

- **Storage billing model**: Switch between Logical/Physical pricing (see docs)
- **Table expiration**: Set automatic expiration for temporary tables
- **Partitioning**: Use partitioned tables with expiration policies

## Documentation

- [Masthead Storage Costs](https://docs.mastheadata.com/cost-insights/storage-costs)
- [BigQuery Storage Pricing](https://cloud.google.com/bigquery/pricing#storage)

.github/linters/eslint.config.mjs

Lines changed: 2 additions & 1 deletion
@@ -22,7 +22,8 @@ export default [
       ctx: 'readonly',
       constants: 'readonly',
       reports: 'readonly',
-      reservations: 'readonly'
+      reservations: 'readonly',
+      descriptions: 'readonly'
     }
   },
   rules: {

.github/linters/zizmor.yaml

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ rules:
   unpinned-uses:
     ignore:
       - ci.yaml
+      - infra.yaml

.github/workflows/ci.yaml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ jobs:
           persist-credentials: false

       - name: Lint Code Base
-        uses: super-linter/super-linter/slim@v8.3.0
+        uses: super-linter/super-linter/slim@v8.5.0
         env:
           DEFAULT_BRANCH: main
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

.github/workflows/infra.yaml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
name: Terraform Apply on Changes

on:
  push:
    branches:
      - main

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read

jobs:
  terraform:
    name: Terraform
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v6
        with:
          fetch-depth: 0
          persist-credentials: false

      - name: Authenticate with Google Cloud
        uses: google-github-actions/auth@v3
        with:
          project_id: "httparchive"
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v4

      - name: Initialize Terraform
        run: |
          make tf_init
          make tf_lint

      - name: Apply Terraform
        run: make tf_apply

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
-**/node_modules/
 .DS_Store
+.vscode/
+**/node_modules/
 **/.terraform/
 .env

Makefile

Lines changed: 8 additions & 2 deletions
@@ -2,8 +2,14 @@ clean:
 	rm -rf ./infra/bigquery-export/node_modules
 	rm -rf ./infra/dataform-service/node_modules

+tf_init:
+	cd infra && terraform init -upgrade
+
+tf_lint:
+	cd infra && terraform fmt -check && terraform validate
+
 tf_plan:
-	cd infra && terraform init -upgrade && terraform plan
+	cd infra && terraform plan

 tf_apply:
-	cd infra && terraform init && terraform apply -auto-approve
+	cd infra && terraform apply -auto-approve
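
With init and lint split into their own targets, a local check before pushing to `main` (which triggers the apply workflow) might look like:

```bash
# One-time init, then lint and plan locally; CI runs tf_apply on merge to main
make tf_init
make tf_lint
make tf_plan
```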

SECURITY.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Security Policy

## Reporting a Vulnerability

Please report any suspected security issues to `team@httparchive.org`. We currently do not participate in a bug bounty programme.
