From 2acb95f2b96a13e99e348259515beaec8e76a29a Mon Sep 17 00:00:00 2001
From: snowfox1003 <snowfox1003@gmail.com>
Date: Wed, 27 May 2026 04:41:58 -0400
Subject: [PATCH 1/2] docs: enhance contributing and onboarding documentation
 with new tutorial for building collectors

---
 CONTRIBUTING.md                       |   2 +
 README.md                             |   2 +-
 core/_version.py                      |   2 +-
 core/collectors/README.md             |   2 +-
 docs/How_to_add_a_collector.md        |   4 +
 docs/Onboarding.md                    |   9 +-
 docs/README.md                        |   1 +
 docs/Tutorial_building_a_collector.md | 482 ++++++++++++++++++++++++++
 8 files changed, 497 insertions(+), 7 deletions(-)
 create mode 100644 docs/Tutorial_building_a_collector.md
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index f864be2..1309c62 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -4,6 +4,8 @@ This document describes how to contribute to the project, with emphasis on the *
 
 ## Creating a new collector
 
+**Start here:** [docs/Tutorial_building_a_collector.md](docs/Tutorial_building_a_collector.md) — step-by-step walkthrough (scaffolding, `AbstractCollector` hooks, testing, YAML/Celery, deployment) with a worked `heartbeat_demo` example.
+
 Use the **`startcollector`** management command to generate a new Django app with the usual collector layout (stub `models.py`, `services.py`, `AbstractCollector` + `BaseCollectorCommand`, `tests/` package, `migrations/0001_initial.py`, and `schedule_snippet.yaml`). Run it from the **repository root** so the new package sits next to the other apps.
 
 ```bash
diff --git a/README.md b/README.md
index f017cde..2c4ec56 100644
--- a/README.md
+++ b/README.md
@@ -77,7 +77,7 @@ If you see `relation "cppa_user_tracker_githubaccount" does not exist` (or simil
 python manage.py run_scheduled_collectors --schedule daily --group github
 ```
 
-7. To **add a new collector app** (boilerplate, management command, and schedule snippet template), use **`python manage.py startcollector <name>`** and follow **[CONTRIBUTING.md](CONTRIBUTING.md#creating-a-new-collector)**.
+7. To **add a new collector app**, read **[docs/Tutorial_building_a_collector.md](docs/Tutorial_building_a_collector.md)** (walkthrough), then run **`python manage.py startcollector <name>`** and work through **[CONTRIBUTING.md](CONTRIBUTING.md#creating-a-new-collector)** (checklist).
 
 For local development you can start the dev server: `python manage.py runserver`.
 
diff --git a/core/_version.py b/core/_version.py
index 3405f10..023bb7b 100644
--- a/core/_version.py
+++ b/core/_version.py
@@ -1,2 +1,2 @@
 # file generated by setuptools-scm; do not edit
-version = "0.1.1.dev566+g359d75ad4.d20260526"
+version = "0.1.1.dev2+g0dd532eaf.d20260527"
diff --git a/core/collectors/README.md b/core/collectors/README.md
index d20a4a8..d09768d 100644
--- a/core/collectors/README.md
+++ b/core/collectors/README.md
@@ -12,7 +12,7 @@ Collector orchestration shared by every `run_*` management command.
 
 ## Usage
 
-New collectors should subclass `AbstractCollector` and wire a management command through `BaseCollectorCommand`. See [How to add a collector](../../docs/How_to_add_a_collector.md) and the parent [core README](../README.md).
+New collectors should subclass `AbstractCollector` and wire a management command through `BaseCollectorCommand`. See [Tutorial: building a collector](../../docs/Tutorial_building_a_collector.md) (walkthrough), [How to add a collector](../../docs/How_to_add_a_collector.md) (checklist), and the parent [core README](../README.md).
 
 ## Tests
 
diff --git a/docs/How_to_add_a_collector.md b/docs/How_to_add_a_collector.md
index c23c966..ed3e08e 100644
--- a/docs/How_to_add_a_collector.md
+++ b/docs/How_to_add_a_collector.md
@@ -1,7 +1,11 @@
 # How to add a collector
 
+**Tutorial (start here):** [Tutorial_building_a_collector.md](Tutorial_building_a_collector.md) — end-to-end walkthrough with design decisions for scaffolding, `AbstractCollector` hooks, testing, Celery scheduling, and deployment.
+
 **Preferred:** scaffold a new app from the repo root with **`python manage.py startcollector <app_label>`** (see [CONTRIBUTING.md — Creating a new collector](../CONTRIBUTING.md#creating-a-new-collector)), then follow the manual steps there (`INSTALLED_APPS`, YAML schedule, migrations, **cross-app-dependencies.md**).
 
+**Layout note:** `startcollector` places the collector class and `BaseCollectorCommand` in **`management/commands/run_<app>.py`**. Split into a separate **`collectors.py`** when the command grows (see the tutorial §2.5). Section 4 below uses a `collectors.py` layout as an alternate copy-paste skeleton, not the default scaffold output.
+
 If you used **`startcollector`**, section 1 below is mostly satisfied (you still add the app to **`INSTALLED_APPS`**). Otherwise, this checklist assumes you already have a Django app (or are creating one) with a `management/commands/run_<your_app>.py` entry point. For a high-level diagram and GitHub pipeline notes, see the **Architecture** section in [Development_guideline.md](Development_guideline.md).
 
 ## 1. App and command
diff --git a/docs/Onboarding.md b/docs/Onboarding.md
index 0054ab8..6bfb584 100644
--- a/docs/Onboarding.md
+++ b/docs/Onboarding.md
@@ -25,10 +25,11 @@ For setup steps (venv, migrate, tests), start with the root **[README.md](../REA
 | 3 | [Architecture_data_flow.md](Architecture_data_flow.md) | Sources → collectors → DB / workspace → Pinecone (diagrams). |
 | 4 | [Workflow.md](Workflow.md) | YAML schedules, Celery Beat, execution order. |
 | 5 | [CONTRIBUTING.md](../CONTRIBUTING.md) | Service-layer rule for DB writes. |
-| 6 | [Workspace.md](Workspace.md) | Where files land under `WORKSPACE_DIR`. |
-| 7 | [Schema.md](Schema.md) — § Overview + diagrams for your area | Cross-app tables (identity, GitHub, Boost libraries). |
-| 8 | [Service_API.md](Service_API.md) + `service_api/<app>.md` | Callable surface for writes you must use. |
-| 9 | [operations/README.md](operations/README.md) | Shared I/O (GitHub, etc.), not the same as services. |
+| 6 | [Tutorial_building_a_collector.md](Tutorial_building_a_collector.md) | **First collector:** scaffold → hooks → tests → schedule → deploy. |
+| 7 | [Workspace.md](Workspace.md) | Where files land under `WORKSPACE_DIR`. |
+| 8 | [Schema.md](Schema.md) — § Overview + diagrams for your area | Cross-app tables (identity, GitHub, Boost libraries). |
+| 9 | [Service_API.md](Service_API.md) + `service_api/<app>.md` | Callable surface for writes you must use. |
+| 10 | [operations/README.md](operations/README.md) | Shared I/O (GitHub, etc.), not the same as services. |
 
 Deep dives when you touch an area: **[Docker.md](Docker.md)**, **[Deployment.md](Deployment.md)**, per-app notes under **`docs/service_api/`** and **`docs/operations/`**.
 
diff --git a/docs/README.md b/docs/README.md
index a7dba3f..93e413d 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -10,6 +10,7 @@ Documentation is organized **by topic**, not by app. Each doc covers one cross-c
 | **Architecture overview** | [Architecture_overview.md](Architecture_overview.md) | **Start here for system design:** all 15 domain apps + `core`, persistence, coupling, links to app READMEs and service API. |
 | **Workflow** | [Workflow.md](Workflow.md) | Main application workflow, execution order, and project details. |
 | **Architecture (data flow)** | [Architecture_data_flow.md](Architecture_data_flow.md) | Data flow (sources → collectors → DB / workspace → Pinecone), orchestration diagram, per-app component map. |
+| **Tutorial: building a collector** | [Tutorial_building_a_collector.md](Tutorial_building_a_collector.md) | End-to-end walkthrough: `startcollector`, hooks, tests, YAML/Celery, deploy. |
 | **Cross-app dependencies** | [cross-app-dependencies.md](cross-app-dependencies.md) | FK/import matrix, import-linter contracts, regeneration via `list_cross_app_imports.py`. |
 | **CODEOWNERS / reviews** | [CODEOWNERS_and_branch_protection.md](CODEOWNERS_and_branch_protection.md) | CODEOWNERS behavior, enabling branch protection, verification checklist. |
 | **Onboarding walkthroughs** | [onboarding/](onboarding/README.md) | 1:1 session runbooks (Leo, Jonathan) and session logs. |
diff --git a/docs/Tutorial_building_a_collector.md b/docs/Tutorial_building_a_collector.md
new file mode 100644
index 0000000..5f60b6f
--- /dev/null
+++ b/docs/Tutorial_building_a_collector.md
@@ -0,0 +1,482 @@
+# Tutorial: building a collector from scratch
+
+This tutorial walks through creating a new data collector end to end: scaffolding with **`startcollector`**, **`AbstractCollector`** hooks, tests, YAML/Celery scheduling, and production readiness. It is the narrative companion to the checklist in [How_to_add_a_collector.md](How_to_add_a_collector.md).
+
+**Worked example:** a fictional app **`heartbeat_demo`**. You create it locally with `startcollector`; it is **not** committed to the repository. For real in-repo patterns, see [cppa_user_tracker](../cppa_user_tracker/) (minimal inline collector) and [wg21_paper_tracker](../wg21_paper_tracker/) (split `collectors.py` + rich CLI).
+
+---
+
+## 0. Prerequisites and outcomes
+
+### Before you start
+
+1. Complete project setup from the root [README.md](../README.md) (venv, `DATABASE_URL`, `migrate`).
+2. Skim [Onboarding.md](Onboarding.md) §1 for its five core ideas, including: collectors are management commands; writes go through **`services.py`**; scheduling is YAML-driven.
+3. Optional: [Architecture_data_flow.md](Architecture_data_flow.md) for where your collector fits in the pipeline.
+
+### What you will be able to do
+
+After this tutorial you can:
+
+- Scaffold a collector app with `python manage.py startcollector <app_label>`
+- Explain why **`validate_config`**, **`collect`**, and **`sync_pinecone`** exist and who calls them
+- Run and test the collector locally with pytest (PostgreSQL)
+- Register the command in **`config/boost_collector_schedule.yaml`** with **`enabled: false`** until ready
+- Know what happens when Celery Beat and deployment run your collector
+
+---
+
+## 1. Scaffolding with `startcollector`
+
+Run all commands from the **repository root** so the new app sits next to the other Django apps.
+
+### Preview, then create
+
+```bash
+python manage.py startcollector heartbeat_demo --dry-run
+python manage.py startcollector heartbeat_demo
+```
+
+`--dry-run` prints planned paths and a preview of `schedule_snippet.yaml` without writing files.
+
+### What gets generated (14 files)
+
+`startcollector` writes a full Django app package. There is no separate template directory; file bodies are built in [core/management/commands/startcollector.py](../core/management/commands/startcollector.py).
+
+```text
+heartbeat_demo/
+  __init__.py
+  admin.py
+  apps.py
+  models.py
+  services.py
+  views.py
+  schedule_snippet.yaml
+  migrations/
+    __init__.py
+    0001_initial.py
+  management/
+    __init__.py
+    commands/
+      __init__.py
+      run_heartbeat_demo.py
+  tests/
+    __init__.py
+    test_run_heartbeat_demo_command.py
+```
+
+```mermaid
+flowchart TB
+  subgraph scaffold [startcollector output]
+    models[models.py RunState stub]
+    services[services.py record_run]
+    runCmd["management/commands/run_heartbeat_demo.py"]
+    mig[migrations/0001_initial.py]
+    snippet[schedule_snippet.yaml]
+    test[tests/test_run_heartbeat_demo_command.py]
+  end
+  subgraph manual [You do manually]
+    installed[INSTALLED_APPS]
+    yaml[boost_collector_schedule.yaml]
+    migrate[migrate]
+  end
+  scaffold --> manual
+```
+
+| File | Purpose |
+|------|---------|
+| `apps.py` | `HeartbeatDemoConfig` with `name = "heartbeat_demo"` |
+| `models.py` | Stub `HeartbeatDemoRunState` (`source_key`, `run_count`, `updated_at`) |
+| `services.py` | Stub `record_run()` — **all DB writes** for this app should stay here |
+| `management/commands/run_heartbeat_demo.py` | `HeartbeatDemoCollector` + `Command(BaseCollectorCommand)` |
+| `migrations/0001_initial.py` | Hand-written initial migration matching the stub model |
+| `schedule_snippet.yaml` | Commented YAML to paste into the shared schedule file |
+| `tests/test_run_heartbeat_demo_command.py` | Smoke test via `call_command` |
+
+**Not generated:** `collectors.py`, `INSTALLED_APPS` entry, or edits to `config/boost_collector_schedule.yaml`.
+
+### Design decisions (why the scaffold looks like this)
+
+| Decision | Why |
+|----------|-----|
+| One Django app per collector domain | Shared PostgreSQL, isolated `services.py`, clear ownership in CODEOWNERS |
+| Command name `run_{app_label}` | Must match YAML `command:` and what Celery/`call_command` invoke |
+| Stub run-state model + `record_run()` | Proves migrations and the service-layer write path before real domain logic |
+| Hand-written `0001_initial.py` | Matches `models.py` on day one without requiring `makemigrations` first |
+| `schedule_snippet.yaml` not auto-merged | Schedule changes are reviewed in PR; avoids Beat calling a missing command |
+| Collector class inside `run_*.py` initially | Smallest first PR; split into `collectors.py` when the command grows (§2.5) |
+
+### Manual steps after scaffold
+
+1. Add **`"heartbeat_demo"`** to `INSTALLED_APPS` in [config/settings.py](../config/settings.py) (keep **alphabetical** order with other project apps).
+2. Run **`python manage.py migrate`**.
+3. Run **`python manage.py run_heartbeat_demo`** — expect success output and one row in `heartbeat_demo_heartbeatdemorunstate` (table name from Django’s model naming).
+
+The generated collector follows this shape (abbreviated):
+
+```python
+class HeartbeatDemoCollector(AbstractCollector):
+    @property
+    def name(self) -> str:
+        return "heartbeat_demo"
+
+    def validate_config(self) -> None:
+        if not self.source_key or not self.source_key.strip():
+            raise ValueError("source_key must not be empty")
+
+    def collect(self) -> None:
+        _, created = services.record_run(source_key=self.source_key.strip())
+        # ...
+
+class Command(BaseCollectorCommand):
+    def get_collector(self, **_options: Any) -> AbstractCollector:
+        return HeartbeatDemoCollector(stdout=self.stdout, style=self.style)
+```
+
+Note: the scaffold stores `source_key` on the collector but does **not** wire `--source-key` on the command; you add that in §3.1. The gap is intentional, so that wiring the flag serves as practice for your first edit.
+
+Further manual steps (from [CONTRIBUTING.md](../CONTRIBUTING.md#creating-a-new-collector)):
+
+4. Merge `schedule_snippet.yaml` into [config/boost_collector_schedule.yaml](../config/boost_collector_schedule.yaml) (§5).
+5. If the app imports other apps or adds cross-app FKs, update [cross-app-dependencies.md](cross-app-dependencies.md).
+6. When `services.py` grows, run **`python scripts/generate_service_docs.py`** and commit `docs/service_api/` updates.
+
+---
+
+## 2. AbstractCollector hooks and lifecycle
+
+Collectors are not auto-discovered as Python classes. Django finds **`management/commands/run_*.py`** under installed apps. Your collector class is built by **`BaseCollectorCommand.get_collector()`** and executed in a fixed two-phase lifecycle.
+
+### Sequence
+
+```mermaid
+sequenceDiagram
+  participant Cmd as BaseCollectorCommand
+  participant Col as AbstractCollector
+  Cmd->>Cmd: get_collector(options)
+  Cmd->>Col: run()
+  Col->>Col: validate_config()
+  Col->>Col: collect()
+  Cmd->>Col: sync_pinecone()
+```
+
+Implementation: [core/collectors/command_base.py](../core/collectors/command_base.py) `handle()` calls `get_collector`, then `_run_collector_phase(collector, collector.run)`, then `_run_collector_phase(collector, collector.sync_pinecone)`.
+
+`AbstractCollector.run()` is **concrete** — do not override it. It runs `validate_config()` then `collect()` ([core/collectors/base_collector.py](../core/collectors/base_collector.py)).
+
+### Hook reference
+
+| Hook | Who calls it | Responsibility | Tutorial example |
+|------|----------------|----------------|------------------|
+| `name` | `handle_error` logging | Stable slug for metrics/alerts | `"heartbeat_demo"` |
+| `validate_config()` | `run()` before I/O | Fast checks: env, CLI, empty keys | Reject empty `source_key` |
+| `collect()` | `run()` | Orchestration; delegate DB to `services.py` | `services.record_run(...)` |
+| `run()` | Command | Template: validate → collect | Do not override |
+| `handle_error(exc)` | Command on non-`CommandError` | Log with `classify_failure` | Default is enough for most apps |
+| `sync_pinecone()` | Command after `run` | Post-run vector sync; default no-op | See §2.4 |
+
+### Error taxonomy (important for reviews)
+
+| Exception | Behavior |
+|-----------|----------|
+| `django.core.management.base.CommandError` | Logged by command with `failure_category=command`; **not** passed to `handle_error`; re-raised |
+| Any other `Exception` | `collector.handle_error(exc)` then re-raised (non-zero exit for scheduler) |
+
+During each phase the command sets **`collector._error_phase`** to `"run"` or `"sync_pinecone"` and clears it in `finally` (even if logging fails).
+
+**When to use `CommandError`:** invalid CLI combinations, missing required env vars that should read as “user/config error” (§3.3).
+
+**When to let other exceptions propagate:** API failures, DB errors — `handle_error` classifies them via [core/errors.py](../core/errors.py) `classify_failure()` into `CollectorFailureCategory` values for structured logs.
+
+Override `handle_error` only when the default classifier does not match your domain (see [How_to_add_a_collector.md](How_to_add_a_collector.md#3-shared-abstractions-recommended)).
+
+### Optional: `sync_pinecone()`
+
+Default is a no-op. Collectors that index after fetch override it, often by calling another management command, e.g. `cppa_slack_tracker` → `run_cppa_pinecone_sync`. The tutorial stub does not implement this.
+
+### When to extract `collectors.py`
+
+Keep collector + command in **`run_heartbeat_demo.py`** while the command stays small (rough guide: under ~80 lines of collector logic).
+
+Split when:
+
+- CLI parsing and collector logic make `run_*.py` hard to navigate
+- You want unit tests on the collector without loading the Django command
+
+**Production example:** [wg21_paper_tracker/collectors.py](../wg21_paper_tracker/collectors.py) + thin [run_wg21_paper_tracker.py](../wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py).
+
+**Minimal in-repo example (inline, like scaffold):** [cppa_user_tracker/management/commands/run_cppa_user_tracker.py](../cppa_user_tracker/management/commands/run_cppa_user_tracker.py).
+
+### Legacy: `CollectorBase`
+
+Older collectors implement only `run()`. New work should use **`AbstractCollector`**. See [Core_public_API.md](Core_public_API.md#collectors).
+
+---
+
+## 3. Evolving the worked example
+
+These three edits turn the stub into a realistic collector shape. Apply them on your local `heartbeat_demo` app (do not commit the app unless you are shipping a real feature).
+
+### 3.1 Wire `--source-key` on the command
+
+The scaffold collector already accepts `source_key` in `__init__`. Add CLI wiring on `Command`:
+
+```python
+# heartbeat_demo/management/commands/run_heartbeat_demo.py
+
+class Command(BaseCollectorCommand):
+    help = "Run the heartbeat_demo collector."
+
+    def add_arguments(self, parser) -> None:
+        parser.add_argument(
+            "--source-key",
+            default="default",
+            help="Logical source key for HeartbeatDemoRunState.",
+        )
+
+    def get_collector(self, **options: Any) -> AbstractCollector:
+        return HeartbeatDemoCollector(
+            stdout=self.stdout,
+            style=self.style,
+            source_key=options["source_key"],
+        )
+```
+
+Run: `python manage.py run_heartbeat_demo --source-key=prod`.
+
+### 3.2 Service layer
+
+Rename or extend `record_run` as your domain grows. **Rule:** all creates/updates/deletes for `heartbeat_demo` models go through **`heartbeat_demo/services.py`** — not from `collect()` directly. See [CONTRIBUTING.md](../CONTRIBUTING.md#service-layer-single-place-for-writes).
+
+When you add public service functions, regenerate API docs:
+
+```bash
+python scripts/generate_service_docs.py
+# or one app: python scripts/generate_service_docs.py --app heartbeat_demo
+```
+
+### 3.3 Validation and `CommandError`
+
+To distinguish config errors from runtime errors, raise `CommandError` for missing configuration. Example: require an env var for a hypothetical API:
+
+```python
+import os
+from django.core.management.base import CommandError
+
+def validate_config(self) -> None:
+    if not self.source_key or not self.source_key.strip():
+        raise ValueError("source_key must not be empty")
+    if not os.environ.get("HEARTBEAT_DEMO_API_KEY"):
+        raise CommandError(
+            "HEARTBEAT_DEMO_API_KEY is not set. "
+            "Add it to .env and document it in .env.example."
+        )
+```
+
+Document new variables in **`.env.example`** and, if needed, **`docs/operations/`**.
+
+### Anti-patterns
+
+- Calling `HeartbeatDemoRunState.objects.create(...)` from `collect()` instead of `services.py`
+- Importing another tracker app’s models without updating [cross-app-dependencies.md](cross-app-dependencies.md)
+- Setting **`enabled: true`** in YAML before the app is merged and migrated
+
+---
+
+## 4. Testing
+
+The project uses **pytest + pytest-django** with **PostgreSQL only** (`config.test_settings`). See [README.md](../README.md#running-tests).
+
+### Test layers
+
+| Layer | What to test | How |
+|-------|----------------|-----|
+| Service | `record_run` create/increment | `@pytest.mark.django_db`, assert on ORM |
+| Command integration | Full `call_command` path | Scaffold smoke test (below) |
+| Collector unit | `validate_config` / `collect` with mocks | `@patch` on `services.record_run` |
+| Scheduler (advanced) | YAML loading / strict mode | [boost_collector_runner/tests/test_schedule_config.py](../boost_collector_runner/tests/test_schedule_config.py) |
+
+### Scaffold smoke test (already generated)
+
+```python
+@pytest.mark.django_db
+def test_run_heartbeat_demo_writes_success():
+    out = StringIO()
+    call_command("run_heartbeat_demo", stdout=out, verbosity=0)
+    assert "completed" in out.getvalue().lower()
+```
+
+### Expand: service test
+
+```python
+# heartbeat_demo/tests/test_services.py
+import pytest
+from heartbeat_demo.models import HeartbeatDemoRunState
+from heartbeat_demo.services import record_run
+
+
+@pytest.mark.django_db
+def test_record_run_creates_and_increments():
+    row1, created1 = record_run(source_key="alpha")
+    assert created1 is True
+    assert row1.run_count == 1
+
+    row2, created2 = record_run(source_key="alpha")
+    assert created2 is False
+    assert row2.id == row1.id
+    assert row2.run_count == 2
+```
+
+### Expand: collector unit test (mock services)
+
+Pattern from [wg21_paper_tracker/tests/test_collectors.py](../wg21_paper_tracker/tests/test_collectors.py):
+
+```python
+from unittest.mock import patch
+import pytest
+from heartbeat_demo.management.commands.run_heartbeat_demo import HeartbeatDemoCollector
+
+
+def test_validate_config_rejects_empty_source_key():
+    collector = HeartbeatDemoCollector(stdout=None, style=None, source_key="  ")
+    with pytest.raises(ValueError):
+        collector.validate_config()
+
+
+@patch("heartbeat_demo.services.record_run")
+def test_collect_calls_service(mock_record_run):
+    mock_record_run.return_value = (None, True)
+    collector = HeartbeatDemoCollector(stdout=None, style=None, source_key="k1")
+    collector.collect()
+    mock_record_run.assert_called_once_with(source_key="k1")
+```
+
+### Run tests locally
+
+```bash
+docker compose -f docker-compose.test.yml up -d
+export DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/postgres
+export SECRET_KEY=for-local-only
+export DJANGO_SETTINGS_MODULE=config.test_settings
+
+uv run pytest heartbeat_demo/tests/ -v
+uv run pytest   # full suite before PR
+uv run pyright  # typing, matches CI
+```
+
+### CI
+
+- **Lint job:** [scripts/validate_collector_scaffold.py](../scripts/validate_collector_scaffold.py) recreates a throwaway app under `.test_artifacts/`, then runs ruff and scoped pyright.
+- **Test job:** full `pytest` with **90%** coverage gate (`.github/workflows/actions.yml`).
+
+---
+
+## 5. Celery scheduling
+
+Scheduling is **YAML-driven** — no Python change to add a Beat entry. Full reference: [Workflow.md §2](Workflow.md#2-boost-collector-runner-and-yaml-schedule).
+
+### Author checklist
+
+1. Open **`heartbeat_demo/schedule_snippet.yaml`** from the scaffold.
+2. Paste a task under the right **`groups.<name>.tasks`** in [config/boost_collector_schedule.yaml](../config/boost_collector_schedule.yaml) (pick a group that matches runtime dependencies, e.g. `github` for GitHub-related work).
+3. Use **`enabled: false`** until the app is on `develop`/`main`, in `INSTALLED_APPS`, and migrated.
+4. Flip **`enabled: true`** per team policy once the app is merged, installed, and migrated.
+
+Example entry:
+
+```yaml
+groups:
+  github:
+    default_time: "00:05"
+    tasks:
+      # ... existing tasks ...
+      - command: run_heartbeat_demo
+        schedule: daily
+        enabled: false
+        args: ["--source-key", "default"]
+```
+
+The **`command`** value must match the Django management command name (filename `run_heartbeat_demo.py` → command `run_heartbeat_demo`).
+
+### How Beat runs your collector
+
+```mermaid
+flowchart LR
+  beat[CeleryBeat] --> task[run_scheduled_collectors_task]
+  task --> mgmt[run_scheduled_collectors]
+  mgmt --> cmd[run_heartbeat_demo]
+```
+
+- [config/settings.py](../config/settings.py) sets `CELERY_BEAT_SCHEDULE` from YAML via `get_beat_schedule()`.
+- [boost_collector_runner/tasks.py](../boost_collector_runner/tasks.py) `@shared_task` `run_scheduled_collectors_task` calls `call_command("run_scheduled_collectors", ...)`.
+- Within one batch, collectors in a group run **sequentially**; different groups can run in parallel on workers.
+
+### Local verification
+
+```bash
+# Run one collector by hand
+python manage.py run_heartbeat_demo --source-key=default
+
+# Run a scheduled group (same as Beat would batch)
+python manage.py run_scheduled_collectors --schedule daily --group github
+
+# Optional: trigger Celery task directly (worker must be running)
+# See docs/Celery_test.md
+```
+
+Docker: start **`celery_worker`** and **`celery_beat`** per [Docker.md](Docker.md).
+
+**Strict mode:** With `DEBUG=False` or `BOOST_COLLECTOR_SCHEDULE_STRICT=True`, invalid YAML fails at settings import so Beat does not start with an empty schedule.
+
+---
+
+## 6. Deployment and production readiness
+
+Short path; details in [Deployment.md](Deployment.md) and [GCP_Production_Checklist.md](GCP_Production_Checklist.md).
+
+| Step | Action |
+|------|--------|
+| PR merged | Set `enabled: true` in YAML when ready for production runs |
+| Deploy | `docker compose` with `web`, `celery_worker`, `celery_beat`, Redis |
+| Health | With `HEALTH_ENFORCE_COLLECTOR_FRESHNESS=true`, `GET /health/` returns **503** until the daily YAML groups have successful runs |
+| Secrets | New env vars in `.env.example`; ops notes under `docs/operations/` if using external APIs |
+
+Production-scale references:
+
+- **`github_activity_tracker`** — full fetch, workspace, Pinecone pipeline
+- **`wg21_paper_tracker`** — split `collectors.py`, rich CLI, service-backed pipeline
+
+---
+
+## 7. Checklist and further reading
+
+### Copy-paste checklist
+
+- [ ] `python manage.py startcollector <app_label>` from repo root
+- [ ] Add app to `INSTALLED_APPS` (alphabetical)
+- [ ] `python manage.py migrate`
+- [ ] Implement real logic in `services.py`; keep `collect()` thin
+- [ ] Subclass `AbstractCollector` + `BaseCollectorCommand` (or split `collectors.py` when large)
+- [ ] `validate_config` for fast checks; `CommandError` for bad config
+- [ ] Tests: service + command (+ collector unit tests with mocks)
+- [ ] Paste the schedule entry into `config/boost_collector_schedule.yaml` with `enabled: false`
+- [ ] Update `cross-app-dependencies.md` if importing other apps
+- [ ] `.env.example` + ops docs for new secrets
+- [ ] `generate_service_docs.py` when adding public service functions
+- [ ] `uv run pytest` and `uv run pyright` before PR
+- [ ] Enable task in YAML after deploy/migrate
+- [ ] Verify `/health/` and optional `run_scheduled_collectors` on staging
+
+### Further reading
+
+| Topic | Doc |
+|-------|-----|
+| Checklist / contracts | [How_to_add_a_collector.md](How_to_add_a_collector.md) |
+| `startcollector` + service layer | [CONTRIBUTING.md](../CONTRIBUTING.md#creating-a-new-collector) |
+| YAML schedules | [Workflow.md](Workflow.md) |
+| Core collector API | [Core_public_API.md](Core_public_API.md) |
+| Data flow | [Architecture_data_flow.md](Architecture_data_flow.md) |
+| Cross-app imports | [cross-app-dependencies.md](cross-app-dependencies.md) |
+| Deploy / GCP | [Deployment.md](Deployment.md), [GCP_Production_Checklist.md](GCP_Production_Checklist.md) |
+| Celery manual test | [Celery_test.md](Celery_test.md) |
+| Service function reference | [Service_API.md](Service_API.md) |
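
Note on section 2 of the new tutorial (commentary only, not part of the patch): the two-phase lifecycle and the error taxonomy described there can be sketched in plain Python without Django. This is a simplified illustration under stated assumptions, not the code in `core/collectors/`. `SketchCollector`, `DemoCollector`, `run_phase`, `handle`, and the local `CommandError` class are hypothetical stand-ins, and the real `handle_error` is described as logging via `classify_failure` rather than appending to a list.

```python
"""Plain-Python sketch of the collector lifecycle (no Django required)."""
from abc import ABC, abstractmethod
from typing import Callable, List, Optional


class CommandError(Exception):
    """Hypothetical stand-in for django.core.management.base.CommandError."""


class SketchCollector(ABC):
    """Hypothetical stand-in for AbstractCollector."""

    def __init__(self) -> None:
        self.events: List[str] = []  # records what ran, for illustration only
        self._error_phase: Optional[str] = None

    @abstractmethod
    def validate_config(self) -> None: ...

    @abstractmethod
    def collect(self) -> None: ...

    def run(self) -> None:
        # Template method: fast config checks first, then the real work.
        self.validate_config()
        self.collect()

    def sync_pinecone(self) -> None:
        # Default is a no-op; subclasses may override it.
        pass

    def handle_error(self, exc: Exception) -> None:
        # The tutorial describes structured logging here; this sketch just records.
        self.events.append(f"handle_error:{self._error_phase}:{type(exc).__name__}")


def run_phase(collector: SketchCollector, phase_name: str, phase: Callable[[], None]) -> None:
    """Run one phase, routing errors as in the tutorial's error taxonomy."""
    collector._error_phase = phase_name
    try:
        phase()
    except CommandError:
        raise  # config/CLI errors bypass handle_error and are re-raised
    except Exception as exc:
        collector.handle_error(exc)
        raise  # re-raise so a scheduler would see a failure
    finally:
        collector._error_phase = None  # always cleared


def handle(collector: SketchCollector) -> None:
    """Two phases in fixed order: run, then sync_pinecone."""
    run_phase(collector, "run", collector.run)
    run_phase(collector, "sync_pinecone", collector.sync_pinecone)


class DemoCollector(SketchCollector):
    """Hypothetical collector used only to exercise the sketch."""

    def __init__(self, api_key: str, fail: bool = False) -> None:
        super().__init__()
        self.api_key = api_key
        self.fail = fail

    def validate_config(self) -> None:
        if not self.api_key:
            raise CommandError("api_key is not set")
        self.events.append("validate_config")

    def collect(self) -> None:
        if self.fail:
            raise RuntimeError("upstream API failed")
        self.events.append("collect")

    def sync_pinecone(self) -> None:
        self.events.append("sync_pinecone")
```

Calling `handle(DemoCollector(api_key="k"))` runs the three hooks in order. Passing `fail=True` routes the `RuntimeError` through `handle_error` before it is re-raised. An empty `api_key` raises `CommandError` without touching `handle_error`.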

From 9d2a2d0799458319979e8969cbf132cb4b6c6beb Mon Sep 17 00:00:00 2001
From: snowfox1003 <snowfox1003@gmail.com>
Date: Wed, 27 May 2026 15:25:20 -0400
Subject: [PATCH 2/2] docs: remove legacy `CollectorBase` reference from
 collector tutorial

---
 docs/Tutorial_building_a_collector.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/docs/Tutorial_building_a_collector.md b/docs/Tutorial_building_a_collector.md
index 5f60b6f..df88d62 100644
--- a/docs/Tutorial_building_a_collector.md
+++ b/docs/Tutorial_building_a_collector.md
@@ -207,10 +207,6 @@ Split when:
 
 **Minimal in-repo example (inline, like scaffold):** [cppa_user_tracker/management/commands/run_cppa_user_tracker.py](../cppa_user_tracker/management/commands/run_cppa_user_tracker.py).
 
-### Legacy: `CollectorBase`
-
-Older collectors implement only `run()`. New work should use **`AbstractCollector`**. See [Core_public_API.md](Core_public_API.md#collectors).
-
 ---
 
 ## 3. Evolving the worked example