Skip to content

Latest commit

 

History

History
405 lines (322 loc) · 21.4 KB

File metadata and controls

405 lines (322 loc) · 21.4 KB

Here is the complete, unabbreviated document with all the strict technical guardrails restored. Copy the block below and save it directly over your existing file.

This is the exact version you need to hand to your VS agents so they stop guessing and actually fix the code.

Markdown


# TGC Analytics Policy

## Purpose
Clarify the intent, scope, and domain for each type and level of online and app analytics used by True Good Craft.

This document exists to:
* Define what analytics means across TGC.
* Establish consistent language.
* Separate useful analytics from unnecessary collection.
* Support product, marketing, and operational decisions without drifting into excessive or unjustified data collection.

## Goal
The goal of analytics is to provide TGC with useful and actionable information for improving products, outreach, distribution, reporting, and operations.

Analytics should be:
* Useful
* Intentional
* Minimal
* Privacy-aware
* Justified by operational or product value

TGC does **not** collect unnecessary personal information just because it is available. Information should only be collected when it helps improve the product, understand usage, evaluate outreach, or support a defined operational need.

## Core Principles
* **Collect with intent:** Every field or signal should have a reason to exist.
* **Prefer minimal collection:** If less data can achieve the same result, use less data.
* **Respect user agency:** Users should not be profiled beyond what is operationally necessary.
* **Separate layers clearly:** Host, page, app, and user-supplied data are not the same thing and must not be confused.
* **Avoid unnecessary PII:** Personal data should not be gathered unless it is explicitly required for a product, service, or user relationship.
* **Use analytics to improve decisions:** The point is better product judgment, not vanity metrics.
* **Privacy-first by default:** Privacy-preserving operation is preferred over deeper collection.
* **Aggregate-only where possible:** If daily or grouped totals are enough, event-level storage should be avoided.
* **No user identifiers without explicit justification:** Install IDs, device IDs, account IDs, cookies, fingerprinting, or similar identifiers should not be introduced casually.
* **No raw IP retention without explicit justification:** IP data should not be stored or repurposed as a uniqueness mechanism unless a system explicitly requires it and the policy says so.
* **No behavioral profiling by default:** TGC analytics should not drift into cross-day user linking or user profiling unless that collection is explicitly justified and approved.
* **Fail-soft and non-blocking:** Analytics collection must not break the product or degrade primary user flows.

## Types of Analytics
* Pageviews
* Source
* Links Followed
* Actions
* Usage
* Versions
* Heartbeats
* Traffic Snapshots
* Aggregate Counters
* User-Provided Data

## Note on Identifiers
TGC policy does **not** treat a hashed value as automatically anonymous or automatically acceptable. 

A hashed identifier is still an identifier if it can be used to distinguish, recognize, correlate, or track a person, device, browser, install, or repeat actor over time. Because of that, **hashed IDs are prohibited by default** unless a product-specific policy explicitly approves them for a narrowly defined purpose.

The following are **not allowed by default**:
* Hashed IP uniqueness systems.
* Hashed device IDs.
* Hashed install IDs.
* Hashed account IDs used for silent analytics correlation.
* Any identifier used to create cross-day linking or behavioral profiling.

If a product ever requires an identifier, the document for that product must explicitly state:
* Why it is needed.
* What exact value is used.
* What it is used for.
* What it is **not** used for.
* Retention expectations.
* Why aggregate-only measurement is insufficient.

## Levels of Analytics
* Page Level
* Host Level
* App Level
* User Level
* Internal

---

## Level Definitions

### Page Level
Analytics built into webpage code. This captures signals from actual page execution in a browser or real page load. This is the closest thing to a confirmed physical page run, because the page code itself executes.

Typical examples:
* Pageviews
* Page-level actions
* Tracked link visits
* Referral source capture
* Client-side session signals
* Page heartbeats

### Host Level
Analytics from the hosting or edge layer, such as Cloudflare or similar platform reporting. This reflects **all observed traffic**, including:
* Human traffic
* Bots
* Crawlers
* Probes
* Malformed requests
* Blocked traffic

Host-level analytics are useful for traffic volume, infrastructure visibility, and broad traffic pattern awareness, but they are not the same as confirmed human usage.

### App Level
Analytics generated by code running inside an application itself. This is similar to page-level analytics in that it reflects code execution inside the product, but applies to software/app usage rather than public webpage loads.

Typical examples:
* App starts
* Version reports
* Feature usage
* Workflow actions
* Update checks
* Usage heartbeats
* Install or runtime diagnostics

### User Level
Information knowingly provided by the user or tied directly to the user relationship.

Examples may include:
* Email address
* Account identity
* Billing information
* User-submitted profile information
* Support-linked identifiers

This level requires the greatest restraint and should only be used where there is a direct product or business need.

### Internal
Analytics, logs, reporting, or instrumentation that exist strictly for internal TGC system use.

Examples:
* Admin diagnostics
* Internal dashboards
* System health signals
* Private operational reporting
* Internal automation traces

---

## Glossary

* **Analytics:** Structured collection of useful signals intended to support product, operational, and marketing decisions.
* **Signal:** A single meaningful recorded event, measurement, or observed state.
* **Pageview:** A recorded visit or load of a webpage.
* **Source:** The origin of traffic, visit, or discovery path. Examples may include referrer, campaign tag, direct entry, or tracked link origin.
* **Tracked Link:** A link intentionally structured to preserve source or campaign attribution. Tracked links are useful for measuring which post, page, channel, or campaign produced a visit or downstream action.
* **Counted Intent Route:** A route or endpoint intentionally designed to count a meaningful action, such as an explicit download initiation, update check, or similar operator-defined event. This is different from a passive read route. A counted intent route exists to represent deliberate action.
* **Non-Counted Public Read:** A public route used to read or hydrate information without incrementing an action counter. This distinction is important because public reads and actual user intent are not the same thing.
* **Link Followed:** A recorded event showing that a user activated or followed a link.
* **Action:** A meaningful user or system event beyond a passive view. Examples include button clicks, form submissions, downloads, purchases, or feature use.
* **Hashed / Anonymized ID:** A non-plain-text identifier used to recognize repeat patterns or continuity without storing raw personal identity directly.
* **Usage:** Observed interaction with a page, product, feature, or workflow over time.
* **Version:** A reported software or site version associated with an event, client, or runtime state.
* **Heartbeat:** A recurring signal that indicates an active session, live page, running app instance, or continuing use state.
* **Physical Run:** A practical term for a real execution of code on a real page or in a real app instance, as opposed to only server-side or edge-observed traffic.
* **Attribution:** The process of connecting a visit, action, or conversion to a source, campaign, post, or tracked path.
* **PII:** Personally identifiable information. This includes data that directly identifies a person or can reasonably be tied back to them.
* **Telemetry:** Operational or product instrumentation emitted by software or systems for measurement, monitoring, reporting, or analysis.

---

## Working Policy Direction

TGC analytics should default toward:
* Operational usefulness.
* Minimal necessary collection.
* Clear separation between host traffic and confirmed product/page execution.
* Cautious treatment of user-linked data.
* Explicit naming of what each analytics layer does and does not mean.
* Aggregate-only counting where full event logs are not needed.
* Privacy-first implementation.

## Company-Wide Baseline Policy

Unless a product-specific policy explicitly states otherwise, TGC analytics follows this baseline:
* Prefer aggregate counts over event logs.
* Avoid raw personal identifiers.
* Avoid raw IP retention.
* Avoid uniqueness systems built from IPs or hidden identifiers.
* Avoid behavioral profiling and cross-day identity linking.
* Keep analytics non-blocking and fail-soft.
* Distinguish passive reads from counted intent.
* Distinguish host-observed traffic from code-executed page/app signals.
* Only collect user-level data when there is a direct business or product reason.
* Keep implementation, public wording, and source-of-truth documents aligned.

---

## Governance Rules

### 1. Policy Precedence
Company-wide analytics policy applies by default. If a product needs to differ from this baseline, that difference must be explicit in the product’s own documentation or source of truth. Silence does not imply permission.

### 2. Minimum Collection Rule
If a question can be answered with aggregate counts, then user-level or event-level collection should not be introduced.

### 3. Counted Intent vs Public Read Rule
TGC must distinguish between:
* **Counted intent routes:** explicit actions that are meant to represent user intent.
* **Non-counted public reads:** public reads, hydration, manifest fetches, or raw asset access that are not equivalent to deliberate action.

Examples from current TGC systems include a counted download-initiation route, a non-counted public manifest read route, and raw asset delivery that does not itself imply counted intent.

### 4. Host Traffic Is Not Human Usage Rule
Host-level traffic includes bots, probes, crawlers, malformed requests, and blocked traffic. Host-level numbers are useful for infrastructure visibility and broad traffic awareness, but they must not be casually described as human usage, app usage, or confirmed user engagement.

### 5. Physical Execution Rule
Page-level and app-level analytics are closer to confirmed execution because code actually ran on a real page or in a real app instance. Even so, page/app execution still does not justify unnecessary user identification.

### 6. No Hidden Identity Systems Rule
TGC must not silently introduce persistent or semi-persistent identity systems for analytics purposes without explicit approval. This includes cookies, install IDs, device IDs, fingerprinting, account-linking for analytics, hashed uniqueness systems, or similar approaches when their real purpose is repeat-user recognition.

### 7. User-Level Data Restraint Rule
User-level data requires the highest restraint. User-provided information such as email, billing details, contact details, or support-linked data should only be collected when there is a direct product, service, support, compliance, or commercial need.

### 8. Fail-Soft Analytics Rule
Analytics must never become a dependency that breaks the primary user path. When analytics fail, the product or page should continue functioning normally wherever possible.

### 9. Honest Wording Rule
Public copy, internal docs, reporting language, and implementation should not contradict each other. If a site uses website analytics but an application does not, that distinction must be stated clearly.

### 10. Product Scope Rule
Not every TGC property needs every analytics layer. Each product or site should explicitly declare which analytics levels it uses. 

*For example: BUS Core site may use multiple layers, including page, host, app-adjacent release/reporting surfaces, and internal analytics. Star Map may intentionally remain page-level only.* That is acceptable as long as the declared scope is explicit and truthful.

---

## Governance Matrix

| Level | What it covers | Allowed by default | Not allowed by default | Notes |
| :--- | :--- | :--- | :--- | :--- |
| **Host Level** | Hosting/edge/platform traffic | requests, broad traffic summaries, bot-aware traffic patterns, edge errors | user linkage, identity building, claiming host traffic equals humans | includes all observed traffic, not just users |
| **Page Level** | Website code execution | pageviews, source attribution, tracked link visits, page actions, page heartbeats | email capture by stealth, fingerprinting, hidden persistent identifiers | closer to physical page execution |
| **App Level** | Application code execution | app starts, version reports, feature actions, usage heartbeats, update checks | install IDs or device IDs unless explicitly approved | similar to page level but in product runtime |
| **User Level** | User-provided or directly user-linked data | account/contact/billing/support data only when operationally needed | speculative collection, analytics-only PII expansion | highest restraint level |
| **Internal** | Private TGC ops and diagnostics | admin reporting, dashboards, diagnostics, automation traces | externalized user profiling | internal system use only |

---

## Open Items

* Define approved fields for each analytics level in a stricter field list.
* Define prohibited collection examples in a sharper appendix.
* Define dev/test exclusion standards company-wide.
* Define retention expectations per layer.
* Define standard wording across README, SOT, telemetry docs, privacy pages, and reporting.

---

# Appendix A: Active Property Registry & Contract

This section is the canonical source of truth for analytics declarations across all True Good Craft public properties. It sits strictly between the top-level TGC Analytics Policy and the technical implementation layers (Lighthouse and Agent Smith).

## 1. Shared Event Taxonomy
All properties must use these canonical names for cross-site comparable actions:
* `page_view`
* `outbound_click`
* `contact_click`
* `service_interest`

Any event not on this list must be declared as a site-specific Layer 5 Extension event.

## 2. Property Declarations

### A. BUS Core
* **Site Key:** `buscore`
* **Support Class:** `legacy_hybrid`
* **Analytics Levels:** Host Level, Page Level, selected App/Internal
* **Capability Layers:**
  * Layer 1 (Registry): Yes
  * Layer 2 (Event): Yes
  * Layer 3 (Traffic): Yes
  * Layer 4 (Identity): Yes
  * Layer 5 (Extension): Minimal
* **Report Expectations:** This is the richest site by design. It is expected to expose richer report sections, including full Traffic and Identity blocks.

### B. Star Map Generator
* **Site Key:** `star_map_generator`
* **Production Host:** `starmap.truegoodcraft.ca`
* **Support Class:** `event_only`
* **Analytics Levels:** Page Level
* **Capability Layers:**
  * Layer 1 (Registry): Yes
  * Layer 2 (Event): Yes
  * Layer 3 (Traffic): No
  * Layer 4 (Identity): No
  * Layer 5 (Extension): Yes
* **Extension Events:** `preview_generated`, `high_res_requested`, `payment_click`, `download_completed`, `error_preview`, `error_high_res`.
* **Report Expectations:** Requests and visits are unsupported by design. Identity remains null. Expected useful output includes `page_view`, extension events, top paths, and attribution lists (source/campaign/content/referrer).

### C. True Good Craft (TGC Site)
* **Site Key:** `tgc_site`
* **Support Class:** `event_only`
* **Capability Layers:**
  * Layer 1 (Registry): Yes
  * Layer 2 (Event): Yes
  * Layer 3 (Traffic): No
  * Layer 4 (Identity): No
  * Layer 5 (Extension): No active extensions currently
* **Report Expectations:** Event telemetry only. No traffic layer or identity layer.

## 3. Global Filter Rules
* **Production Default:** `production_only` defaults to `true`.
* **Suppression Contract:** `dev_mode` is the canonical cross-site suppression standard. It suppresses Lighthouse first-party telemetry and Cloudflare beacon logic.
* **Null Honesty:** Unsupported metrics must remain null or omitted by rule. Normalization does NOT mean manufacturing parity. Do not fake identity or traffic data to make reports look uniform.

## 4. Report View Families
Lighthouse is the single source of truth for reporting data. The `GET /report` endpoint supports four strict view families.
* **`view=site`**: The normalized, single-site reporting shape. Requires a valid `site_key`. Returns `{ view: "site", generated_at, scope, summary, traffic, events, identity, health }`.
* **`view=fleet`**: The normalized fleet-wide summary shape. Returns `{ view: "fleet", generated_at, sites }`.
* **`view=source_health`**: The telemetry-integrity shape, focused on dropped/invalid rates. Returns `{ view: "source_health", generated_at, sites }`.
* **Legacy Mode (Bare `/report`)**: The older, BUS Core-centric shape. Activated only when `view` is omitted or blank. Explicit `view=legacy` is strictly invalid.

## 5. Normative JSON Examples
The top-level schema for `view=site` is identical across all properties. The difference between a `legacy_hybrid` and an `event_only` site is dictated by `null` values and boolean flags inside the payload, not by a different JSON structure.

**Normative `view=site` Payload (Event-Only Example):**
```json
{
  "view": "site",
  "generated_at": "2026-04-10T12:00:00.000Z",
  "scope": {
    "site_key": "tgc_site",
    "support_class": "event_only",
    "production_only": true,
    "section_availability": {
      "summary": true,
      "traffic": false,
      "identity": false
    }
  },
  "summary": {
    "accepted_events_7d": 5,
    "pageviews_7d": null,
    "traffic_requests_7d": null,
    "traffic_visits_7d": null,
    "has_recent_signal": true
  },
  "events": {
    "accepted_events": 5,
    "unique_paths": 2,
    "by_event_name": [{ "event_name": "page_view", "events": 5 }],
    "top_paths": [{ "path": "/", "events": 5 }],
    "top_sources": [{ "source": "search", "events": 5 }],
    "top_campaigns": [],
    "top_contents": []
  },
  "identity": null,
  "health": {
    "included_events": 5,
    "excluded_non_production_host": 0,
    "cloudflare_traffic_enabled": false
  }
}

6. Field-Shape Table

Downstream consumers MUST strictly adhere to these TypeScript definitions when parsing Lighthouse output.

Object.Field Exact Data Type Definition
scope.support_class string The authoritative class (legacy_hybrid, event_only, etc.).
summary.accepted_events_7d number Total accepted events over 7 days.
summary.traffic_requests_7d number null
health.included_events number Count of events surviving the production filter.
health.cloudflare_traffic_enabled boolean True if the CF traffic layer is active.
events.by_event_name Array<{ event_name: string; events: number }> Array of event names and their counts.
events.top_paths Array<{ path: string; events: number }> Array of paths and their counts.
events.top_contents Array<{ utm_content: string; events: number }> Array of contents and their counts.

7. Filter Semantics

  • Production Filter: When scope.production_only = true, all event breakdown queries (accepted_events, unique_paths, by_event_name, top_paths, etc.) append a strict host-matching SQL filter.
  • Excluded Host Metric: health.excluded_non_production_host is computed independently. It counts events where accepted = 1 but the URL host does not match the production string. It is not a simple inverse of included_events.
  • Empty States under Filtering: If a site has accepted events, but all of them are filtered out by production_only, the numeric fields (like accepted_events) return 0. The breakdown lists (like top_paths) return [] (empty arrays). They do not return null.

8. Null / Empty / Unsupported Rules

Consumers must respect the strict semantic differences between absent data and filtered data:

  • null (Unavailable): The capability layer is disabled by design (e.g., traffic_requests_7d for an event_only site).
  • 0 (No Activity): The layer is supported, and the metric is requested, but the resulting count is zero.
  • [] (Empty Scope): The breakdown is supported, but no rows survived the current filter (e.g., all hits were from a staging host while production_only=true).

9. Consumer Obligations for Agent Smith

Agent Smith is a read-only formatting bridge. It MUST adhere to the following hard rules:

  • Trust Lighthouse Scope: Smith MUST NOT derive support_class locally from the site_key. It must read and obey scope.support_class sent by Lighthouse.
  • Correct Array Parsing: Smith MUST parse by_event_name as an array of objects ([{event_name, events}]), not as a flat object map.
  • Correct Count Keys: Smith MUST read the events key for numeric counts in attribution lists (e.g., top_paths, top_sources). It MUST NOT look for count or pageviews.
  • Correct Summary Keys: Smith MUST parse traffic_requests_7d and traffic_visits_7d. It MUST NOT look for legacy keys like requests_7d.
  • Legible Filtering: Smith MUST explicitly state when Report scope is production-only and render the excluded_non_production_host value so operators know when data is filtered versus missing.
  • No Fake Parity: Smith MUST NOT silence or fake empty arrays, and it MUST NOT label an explicit [] array as "unavailable".