
Commit 355034e

Merge remote-tracking branch 'origin/main' into wasteful-macaw
2 parents 3e96340 + 93b10a5 commit 355034e

53 files changed

Lines changed: 6983 additions & 1260 deletions

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
---
name: add-httparchive-metric-report
description: Add new metrics to HTTPArchive reports config. USE FOR adding performance metrics, adoption/percentage metrics, or custom metric analysis from crawl data. Chooses timeseries vs histogram based on data type.
---

# Adding Metrics to HTTPArchive Reports

## Documentation

**See [reports.md](../../../reports.md)** for the complete guide, including:
- Architecture and processing details
- Quick Decision Guide table
- Required SQL patterns checklist
- SQL pattern reference (adoption, percentiles, binning)
- Complete examples
- Troubleshooting

## Quick Start

1. Open `includes/reports.js`, find `config._metrics` (line ~42)
2. Choose type: **Timeseries** (adoption/percentiles) or **Histogram** (distributions)
3. Add the metric with the required patterns: `date`, `is_root_page`, `${params.lens.sql}`, `${params.devRankFilter}`, `${ctx.ref('crawl', 'pages')}`, `GROUP BY client`
4. Run `get_errors` to verify

## Key Rules

- **Boolean/adoption metrics**: Timeseries ONLY (a histogram is meaningless for two states)
- **Continuous metrics**: Both histogram + timeseries
- **Use safe functions**: `SAFE_DIVIDE()`, `SAFE.BOOL()` for custom metrics
- **Filter zeros**: Add `AND metric > 0` before percentile calculations

## Minimal Example

```javascript
metricName: {
  SQL: [
    {
      type: 'timeseries', // or 'histogram'
      query: DataformTemplateBuilder.create((ctx, params) => `
SELECT client, /* your calculations */
FROM ${ctx.ref('crawl', 'pages')}
WHERE date = '${params.date}' AND is_root_page
${params.lens.sql} ${params.devRankFilter}
GROUP BY client ORDER BY client
`)
    }
  ]
}
```
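
Before wiring a new metric into `reports.js`, the embedded SQL can be sanity-checked on its own with a `bq` dry run. A minimal sketch, assuming a hypothetical `is_https` column and an arbitrary crawl date:

```bash
# Dry run validates syntax and estimates bytes scanned without executing.
# `is_https` is a hypothetical column used for illustration only.
bq query --use_legacy_sql=false --dry_run \
"SELECT
  client,
  SAFE_DIVIDE(COUNTIF(is_https), COUNT(0)) * 100 AS percent
FROM \`httparchive.crawl.pages\`
WHERE date = '2024-06-01' AND is_root_page
GROUP BY client
ORDER BY client"
```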

See [reports.md](../../../reports.md) for complete patterns and examples.
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
---
name: optimize-model-compute
description: Optimize BigQuery compute costs by assigning Dataform actions to slot reservations. USE FOR managing which models use reserved slots vs on-demand pricing, updating reservation assignments, or analyzing cost vs priority tradeoffs for data pipelines.
---

# Optimize Model Compute (BigQuery Reservations)

## Purpose

Automatically assign Dataform actions to BigQuery slot reservations based on priority and cost optimization strategy. Routes high-priority workloads to reserved slots while using on-demand pricing for low-priority tasks.

## When to Use

- Assigning new models/actions to appropriate compute tiers (reserved vs on-demand)
- Rebalancing reservation assignments based on priority changes
- Optimizing costs by moving low-priority workloads to on-demand
- Ensuring critical pipelines get guaranteed compute resources

## Configuration File

Reservations are configured in `definitions/_reservations.js`:

```javascript
const { autoAssignActions } = require('@masthead-data/dataform-package')

const RESERVATION_CONFIG = [
  {
    tag: 'reservation', // Human-readable identifier
    reservation: 'projects/.../reservations/...', // BigQuery reservation path
    actions: [ // Models assigned to this tier
      'httparchive.crawl.pages',
      'httparchive.f1.pages_latest'
    ]
  },
  {
    tag: 'on_demand',
    reservation: 'none', // On-demand pricing
    actions: [
      'httparchive.sample_data.pages_10k'
    ]
  }
]

autoAssignActions(RESERVATION_CONFIG)
```
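
To cross-check the configured reservation paths against what actually exists, the `bq` CLI can list reservations and their assignments. A sketch, assuming US-located reservations administered from the `httparchive` project:

```bash
# List slot reservations visible to the admin project
bq ls --reservation --project_id=httparchive --location=US

# List which projects or folders are assigned to each reservation
bq ls --reservation_assignment --project_id=httparchive --location=US
```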

## Implementation Steps

### Step 1: Source Configuration

**TODO**: _User will provide details on how to determine which models should use reserved vs on-demand compute_

### Step 2: Update Configuration

1. Open `definitions/_reservations.js`
2. Add or move actions between reservation tiers:
   - **Reserved slots** (`reservation: 'projects/...'`): Critical, high-priority, SLA-sensitive workloads
   - **On-demand** (`reservation: 'none'`): Low-priority, ad-hoc, or experimental workloads

### Step 3: Verify Changes

```bash
# Check syntax
dataform compile

# Validate no duplicate assignments: print any quoted action name listed more than once
grep -oE "'[^']+'" definitions/_reservations.js | sort | uniq -d
```

## Decision Criteria

| Factor | Reserved Slots | On-Demand |
|--------|----------------|-----------|
| **Priority** | High, SLA-bound | Low, flexible |
| **Frequency** | Regular, scheduled | Ad-hoc, occasional |
| **Cost Pattern** | Predictable usage | Variable, sporadic |
| **Impact** | Critical pipelines | Experimental, samples |

## Key Notes

- Each action should appear in only ONE reservation config
- The file name starts with `_` so it runs first in the Dataform queue
- Changes take effect on the next Dataform workflow run
- The package handles assignment globally (no per-file edits needed)

## Package Reference

Using `@masthead-data/dataform-package` (see [package.json](../../../package.json))
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
---
name: optimize-storage-costs
description: Optimize BigQuery storage costs by identifying and removing dead-end and unused tables. USE FOR analyzing storage waste, reviewing tables with no consumption, cleaning up unused datasets, or implementing storage cost reduction strategies.
---

# Optimize Storage Costs (Dead-end and Unused Tables)

## Purpose

Identify and remove BigQuery tables that contribute to storage costs but have no active consumption, based on Masthead Data lineage analysis.

## Table Categories

Masthead Data identifies these tables via lineage analysis, which only sees visible pipeline references, so modification timestamps are critical:

| Type | Definition | Indicators | Watch for |
|------|------------|------------|-----------|
| **Dead-end** | Regularly updated, no downstream consumption | Updated but never read in 30+ days | External writers outside the lineage graph (manual jobs, independent pipelines) |
| **Unused** | No upstream or downstream activity | No reads/writes in 30+ days | Recent `lastModifiedTime` despite the "Unused" flag suggests an external writer—**do not drop without verification** |

### Key Signal

If a table is flagged `Unused` **and** has a recent modification timestamp, something outside Masthead's visibility is writing to it. This always warrants investigation before dropping.

## When to Use

- Reducing storage costs when budget is constrained
- Cleaning up abandoned tables and pipelines
- Implementing regular storage hygiene
- Investigating sudden storage cost increases

## Prerequisites

- Masthead Data agent v0.2.7+ installed (for accurate lineage)
- Access to the Masthead insights dataset: `masthead-prod.httparchive.insights`
- BigQuery permissions to query insights and drop tables

## Implementation Steps

### Step 1: Query Storage Waste

```bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  subtype,
  project_id,
  target_resource,
  SAFE.STRING(operations[0].resource_type) AS resource_type,
  SAFE.INT64(overview.num_bytes) / POW(1024, 4) AS total_tib,
  SAFE.FLOAT64(overview.cost_30d) AS cost_usd_30d,
  SAFE.FLOAT64(overview.savings_30d) AS savings_usd_30d
FROM \`masthead-prod.httparchive.insights\`
WHERE category = 'Cost'
  AND subtype IN ('Dead end table', 'Unused table')
  AND overview.num_bytes IS NOT NULL
  AND SAFE.FLOAT64(overview.savings_30d) > 10
  AND target_resource NOT LIKE '%analytics_%' -- Filter out low-impact GA intraday tables
ORDER BY savings_usd_30d DESC" > storage_waste.csv
```

**Note:** Sorting by `savings_usd_30d` instead of `total_tib` prioritizes high-impact targets for review.

**Alternative: Use Masthead UI**
- Navigate to the [Dictionary page](https://app.mastheadata.com/dictionary?tab=Tables&deadEnd=true)
- Filter by `Dead-end` or `Unused` labels
- Export the table list for review
67+
### Step 2: Review and Decide
68+
69+
Review `storage_waste.csv` and add a `status` column with values:
70+
- `keep` - Table is needed
71+
- `to drop` - Safe to remove
72+
- `investigate` - Needs further analysis
73+
74+
**Review criteria:**
75+
- Is this a backup or archive table? (consider alternative storage)
76+
- Is there a downstream dependency not captured in lineage?
77+
- Is this table part of an active experiment or migration?
78+
- **For repo-managed projects:** Search the codebase (e.g., `grep` for table name in model definitions, scripts) to confirm ownership. Table naming can be misleading (e.g., `cwv_tech_*` may seem like current outputs but could be legacy).
79+
- **Check for disabled producers:** If a Dataform `publish()` has `disabled: true` but the underlying BigQuery table still exists and has recent modifications, either the table is abandoned or an external process took over—both warrant investigation.
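
A minimal codebase search for the criteria above, assuming a Dataform-style layout with `definitions/` and `scripts/` directories and a hypothetical table name:

```bash
# Hypothetical table name; substitute each candidate from storage_waste.csv
TABLE="cwv_tech_adoption"

# Find any model definition or script that references the table
grep -rn --include='*.js' --include='*.sqlx' --include='*.sql' \
  "$TABLE" definitions/ scripts/ || echo "No references found"
```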

### Step 3: Drop Approved Tables

```bash
# Generate DROP statements (target_resource is column 3; status is the last column)
awk -F',' '$NF=="to drop" {
  print "bq rm -f -t " $3
}' storage_waste.csv > drop_tables.sh

# Review generated commands
cat drop_tables.sh

# Execute (after review!)
bash drop_tables.sh
```

**Safe mode (preview first):**
```bash
# Prefix each command with echo to print it without executing
sed 's/^bq rm/echo bq rm/' drop_tables.sh > drop_tables_preview.sh
bash drop_tables_preview.sh
```

### Step 4: Verify Savings

After 24-48 hours, check the storage reduction in Masthead:
- [Storage Cost Insights](https://app.mastheadata.com/costs?tab=Storage+costs)
- Compare before/after storage size and costs
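
Storage can also be spot-checked directly in BigQuery via the `INFORMATION_SCHEMA.TABLE_STORAGE` view. A sketch, assuming the data lives in the US multi-region (adjust the region qualifier otherwise):

```bash
# Sum logical storage per dataset to compare against a pre-cleanup baseline
bq query --use_legacy_sql=false \
"SELECT
  table_schema AS dataset,
  ROUND(SUM(total_logical_bytes) / POW(1024, 4), 2) AS logical_tib
FROM \`YOUR_PROJECT.region-us.INFORMATION_SCHEMA.TABLE_STORAGE\`
GROUP BY dataset
ORDER BY logical_tib DESC"
```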

## Decision Framework

| Monthly Savings | Action | Recency Check |
|-----------------|--------|---------------|
| < $10 | Consider keeping (low ROI) | Skip if `lastModifiedTime` > 12 months old (hygiene only) |
| $10-$100 | Review and drop if unused | Check modification date; recent writes require owner verification |
| $100-$1000 | Priority review, likely drop | Mandatory verification if modified in last 30 days |
| > $1000 | Immediate investigation required | Always verify external writer before any action |

## Key Notes

- **Dead-end tables** may indicate pipeline issues - investigate before dropping
- **Unused tables with recent modifications** are the highest-priority investigate cases. The gap between Masthead's "no lineage" and actual writes means an external dependency exists.
- Tables can be restored from time travel (7 days) or fail-safe (7 days after time travel); see the sketch below
- Consider archiving to Cloud Storage if compliance requires retention
- Coordinate with data teams before dropping shared datasets
- Wait 14 days after storage billing model changes before dropping tables
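
If a drop turns out to be a mistake, a table can be recovered within the time-travel window using a snapshot decorator on `bq cp`. A minimal sketch with hypothetical dataset and table names:

```bash
# Copy the table as it existed one hour ago (-3600000 ms) to a new name
bq cp mydataset.mytable@-3600000 mydataset.mytable_restored
```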

## Related Optimizations

- **Storage billing model**: Switch between Logical/Physical pricing (see docs)
- **Table expiration**: Set automatic expiration for temporary tables
- **Partitioning**: Use partitioned tables with expiration policies

## Documentation

- [Masthead Storage Costs](https://docs.mastheadata.com/cost-insights/storage-costs)
- [BigQuery Storage Pricing](https://cloud.google.com/bigquery/pricing#storage)

.github/linters/eslint.config.mjs

Lines changed: 2 additions & 1 deletion
@@ -22,7 +22,8 @@ export default [
       ctx: 'readonly',
       constants: 'readonly',
       reports: 'readonly',
-      reservations: 'readonly'
+      reservations: 'readonly',
+      descriptions: 'readonly'
     }
   },
   rules: {

.github/linters/zizmor.yaml

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ rules:
   unpinned-uses:
     ignore:
       - ci.yaml
+      - infra.yaml

.github/workflows/ci.yaml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ jobs:
           persist-credentials: false

       - name: Lint Code Base
-        uses: super-linter/super-linter/slim@v8.3.0
+        uses: super-linter/super-linter/slim@v8.5.0
         env:
           DEFAULT_BRANCH: main
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

.github/workflows/infra.yaml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
name: Terraform Apply on Changes

on:
  push:
    branches:
      - main

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read

jobs:
  terraform:
    name: Terraform
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v6
        with:
          fetch-depth: 0
          persist-credentials: false

      - name: Authenticate with Google Cloud
        uses: google-github-actions/auth@v3
        with:
          project_id: "httparchive"
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v4

      - name: Initialize Terraform
        run: |
          make tf_init
          make tf_lint

      - name: Apply Terraform
        run: make tf_apply

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
-**/node_modules/
 .DS_Store
+.vscode/
+**/node_modules/
 **/.terraform/
 .env

Makefile

Lines changed: 8 additions & 2 deletions
@@ -2,8 +2,14 @@ clean:
 	rm -rf ./infra/bigquery-export/node_modules
 	rm -rf ./infra/dataform-service/node_modules

+tf_init:
+	cd infra && terraform init -upgrade
+
+tf_lint:
+	cd infra && terraform fmt -check && terraform validate
+
 tf_plan:
-	cd infra && terraform init -upgrade && terraform plan
+	cd infra && terraform plan

 tf_apply:
-	cd infra && terraform init && terraform apply -auto-approve
+	cd infra && terraform apply -auto-approve
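
With init and lint split into their own targets, a local check before pushing to `main` (which triggers the apply workflow) might look like:

```bash
# One-time init, then lint and plan locally; CI runs tf_apply on merge to main
make tf_init
make tf_lint
make tf_plan
```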

SECURITY.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Security Policy

## Reporting a Vulnerability

Please report any suspected security issues to `team@httparchive.org`. We currently do not participate in a bug bounty programme.
