From 2acb95f2b96a13e99e348259515beaec8e76a29a Mon Sep 17 00:00:00 2001 From: snowfox1003 Date: Wed, 27 May 2026 04:41:58 -0400 Subject: [PATCH 1/2] docs: enhance contributing and onboarding documentation with new tutorial for building collectors --- CONTRIBUTING.md | 2 + README.md | 2 +- core/_version.py | 2 +- core/collectors/README.md | 2 +- docs/How_to_add_a_collector.md | 4 + docs/Onboarding.md | 9 +- docs/README.md | 1 + docs/Tutorial_building_a_collector.md | 482 ++++++++++++++++++++++++++ 8 files changed, 497 insertions(+), 7 deletions(-) create mode 100644 docs/Tutorial_building_a_collector.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f864be2..1309c62 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -4,6 +4,8 @@ This document describes how to contribute to the project, with emphasis on the * ## Creating a new collector +**Start here:** [docs/Tutorial_building_a_collector.md](docs/Tutorial_building_a_collector.md) — step-by-step walkthrough (scaffolding, `AbstractCollector` hooks, testing, YAML/Celery, deployment) with a worked `heartbeat_demo` example. + Use the **`startcollector`** management command to generate a new Django app with the usual collector layout (stub `models.py`, `services.py`, `AbstractCollector` + `BaseCollectorCommand`, `tests/` package, `migrations/0001_initial.py`, and `schedule_snippet.yaml`). Run it from the **repository root** so the new package sits next to the other apps. ```bash diff --git a/README.md b/README.md index f017cde..2c4ec56 100644 --- a/README.md +++ b/README.md @@ -77,7 +77,7 @@ If you see `relation "cppa_user_tracker_githubaccount" does not exist` (or simil python manage.py run_scheduled_collectors --schedule daily --group github ``` -7. To **add a new collector app** (boilerplate, management command, and schedule snippet template), use **`python manage.py startcollector `** and follow **[CONTRIBUTING.md](CONTRIBUTING.md#creating-a-new-collector)**. +7. 
To **add a new collector app**, follow **[docs/Tutorial_building_a_collector.md](docs/Tutorial_building_a_collector.md)** (walkthrough), then **`python manage.py startcollector `** and **[CONTRIBUTING.md](CONTRIBUTING.md#creating-a-new-collector)** (checklist). For local development you can start the dev server: `python manage.py runserver`. diff --git a/core/_version.py b/core/_version.py index 3405f10..023bb7b 100644 --- a/core/_version.py +++ b/core/_version.py @@ -1,2 +1,2 @@ # file generated by setuptools-scm; do not edit -version = "0.1.1.dev566+g359d75ad4.d20260526" +version = "0.1.1.dev2+g0dd532eaf.d20260527" diff --git a/core/collectors/README.md b/core/collectors/README.md index d20a4a8..d09768d 100644 --- a/core/collectors/README.md +++ b/core/collectors/README.md @@ -12,7 +12,7 @@ Collector orchestration shared by every `run_*` management command. ## Usage -New collectors should subclass `AbstractCollector` and wire a management command through `BaseCollectorCommand`. See [How to add a collector](../../docs/How_to_add_a_collector.md) and the parent [core README](../README.md). +New collectors should subclass `AbstractCollector` and wire a management command through `BaseCollectorCommand`. See [Tutorial: building a collector](../../docs/Tutorial_building_a_collector.md) (walkthrough), [How to add a collector](../../docs/How_to_add_a_collector.md) (checklist), and the parent [core README](../README.md). ## Tests diff --git a/docs/How_to_add_a_collector.md b/docs/How_to_add_a_collector.md index c23c966..ed3e08e 100644 --- a/docs/How_to_add_a_collector.md +++ b/docs/How_to_add_a_collector.md @@ -1,7 +1,11 @@ # How to add a collector +**Tutorial (start here):** [Tutorial_building_a_collector.md](Tutorial_building_a_collector.md) — end-to-end walkthrough with design decisions for scaffolding, `AbstractCollector` hooks, testing, Celery scheduling, and deployment. 
+ **Preferred:** scaffold a new app from the repo root with **`python manage.py startcollector `** (see [CONTRIBUTING.md — Creating a new collector](../CONTRIBUTING.md#creating-a-new-collector)), then follow the manual steps there (`INSTALLED_APPS`, YAML schedule, migrations, **cross-app-dependencies.md**). +**Layout note:** `startcollector` places the collector class and `BaseCollectorCommand` in **`management/commands/run_.py`**. Split into a separate **`collectors.py`** when the command grows (see the tutorial §2.5). Section 4 below uses a `collectors.py` layout as an alternate copy-paste skeleton, not the default scaffold output. + If you used **`startcollector`**, section 1 below is mostly satisfied (you still add the app to **`INSTALLED_APPS`**). Otherwise, this checklist assumes you already have a Django app (or are creating one) with a `management/commands/run_.py` entry point. For a high-level diagram and GitHub pipeline notes, see the **Architecture** section in [Development_guideline.md](Development_guideline.md). ## 1. App and command diff --git a/docs/Onboarding.md b/docs/Onboarding.md index 0054ab8..6bfb584 100644 --- a/docs/Onboarding.md +++ b/docs/Onboarding.md @@ -25,10 +25,11 @@ For setup steps (venv, migrate, tests), start with the root **[README.md](../REA | 3 | [Architecture_data_flow.md](Architecture_data_flow.md) | Sources → collectors → DB / workspace → Pinecone (diagrams). | | 4 | [Workflow.md](Workflow.md) | YAML schedules, Celery Beat, execution order. | | 5 | [CONTRIBUTING.md](../CONTRIBUTING.md) | Service-layer rule for DB writes. | -| 6 | [Workspace.md](Workspace.md) | Where files land under `WORKSPACE_DIR`. | -| 7 | [Schema.md](Schema.md) — § Overview + diagrams for your area | Cross-app tables (identity, GitHub, Boost libraries). | -| 8 | [Service_API.md](Service_API.md) + `service_api/.md` | Callable surface for writes you must use. 
| -| 9 | [operations/README.md](operations/README.md) | Shared I/O (GitHub, etc.), not the same as services. | +| 6 | [Tutorial_building_a_collector.md](Tutorial_building_a_collector.md) | **First collector:** scaffold → hooks → tests → schedule → deploy. | +| 7 | [Workspace.md](Workspace.md) | Where files land under `WORKSPACE_DIR`. | +| 8 | [Schema.md](Schema.md) — § Overview + diagrams for your area | Cross-app tables (identity, GitHub, Boost libraries). | +| 9 | [Service_API.md](Service_API.md) + `service_api/.md` | Callable surface for writes you must use. | +| 10 | [operations/README.md](operations/README.md) | Shared I/O (GitHub, etc.), not the same as services. | Deep dives when you touch an area: **[Docker.md](Docker.md)**, **[Deployment.md](Deployment.md)**, per-app notes under **`docs/service_api/`** and **`docs/operations/`**. diff --git a/docs/README.md b/docs/README.md index a7dba3f..93e413d 100644 --- a/docs/README.md +++ b/docs/README.md @@ -10,6 +10,7 @@ Documentation is organized **by topic**, not by app. Each doc covers one cross-c | **Architecture overview** | [Architecture_overview.md](Architecture_overview.md) | **Start here for system design:** all 15 domain apps + `core`, persistence, coupling, links to app READMEs and service API. | | **Workflow** | [Workflow.md](Workflow.md) | Main application workflow, execution order, and project details. | | **Architecture (data flow)** | [Architecture_data_flow.md](Architecture_data_flow.md) | Data flow (sources → collectors → DB / workspace → Pinecone), orchestration diagram, per-app component map. | +| **Tutorial: building a collector** | [Tutorial_building_a_collector.md](Tutorial_building_a_collector.md) | End-to-end walkthrough: `startcollector`, hooks, tests, YAML/Celery, deploy. | | **Cross-app dependencies** | [cross-app-dependencies.md](cross-app-dependencies.md) | FK/import matrix, import-linter contracts, regeneration via `list_cross_app_imports.py`. 
| | **CODEOWNERS / reviews** | [CODEOWNERS_and_branch_protection.md](CODEOWNERS_and_branch_protection.md) | CODEOWNERS behavior, enabling branch protection, verification checklist. | | **Onboarding walkthroughs** | [onboarding/](onboarding/README.md) | 1:1 session runbooks (Leo, Jonathan) and session logs. | diff --git a/docs/Tutorial_building_a_collector.md b/docs/Tutorial_building_a_collector.md new file mode 100644 index 0000000..5f60b6f --- /dev/null +++ b/docs/Tutorial_building_a_collector.md @@ -0,0 +1,482 @@ +# Tutorial: building a collector from scratch + +This tutorial walks through creating a new data collector end to end: scaffolding with **`startcollector`**, **`AbstractCollector`** hooks, tests, YAML/Celery scheduling, and production readiness. It is the narrative companion to the checklist in [How_to_add_a_collector.md](How_to_add_a_collector.md). + +**Worked example:** a fictional app **`heartbeat_demo`**. You create it locally with `startcollector`; it is **not** committed to the repository. For real in-repo patterns, see [cppa_user_tracker](../cppa_user_tracker/) (minimal inline collector) and [wg21_paper_tracker](../wg21_paper_tracker/) (split `collectors.py` + rich CLI). + +--- + +## 0. Prerequisites and outcomes + +### Before you start + +1. Complete project setup from the root [README.md](../README.md) (venv, `DATABASE_URL`, `migrate`). +2. Skim [Onboarding.md](Onboarding.md) §1 — five ideas: collectors are management commands; writes go through **`services.py`**; scheduling is YAML-driven. +3. Optional: [Architecture_data_flow.md](Architecture_data_flow.md) for where your collector fits in the pipeline. 
+ +### What you will be able to do + +After this tutorial you can: + +- Scaffold a collector app with `python manage.py startcollector ` +- Explain why **`validate_config`**, **`collect`**, and **`sync_pinecone`** exist and who calls them +- Run and test the collector locally with pytest (PostgreSQL) +- Register the command in **`config/boost_collector_schedule.yaml`** with **`enabled: false`** until ready +- Know what happens when Celery Beat and deployment run your collector + +--- + +## 1. Scaffolding with `startcollector` + +Run all commands from the **repository root** so the new app sits next to the other Django apps. + +### Preview, then create + +```bash +python manage.py startcollector heartbeat_demo --dry-run +python manage.py startcollector heartbeat_demo +``` + +`--dry-run` prints planned paths and a preview of `schedule_snippet.yaml` without writing files. + +### What gets generated (14 files) + +`startcollector` writes a full Django app package. There is no separate template directory; file bodies are built in [core/management/commands/startcollector.py](../core/management/commands/startcollector.py). 
+ +```text +heartbeat_demo/ + __init__.py + admin.py + apps.py + models.py + services.py + views.py + schedule_snippet.yaml + migrations/ + __init__.py + 0001_initial.py + management/ + __init__.py + commands/ + __init__.py + run_heartbeat_demo.py + tests/ + __init__.py + test_run_heartbeat_demo_command.py +``` + +```mermaid +flowchart TB + subgraph scaffold [startcollector output] + models[models.py RunState stub] + services[services.py record_run] + runCmd["management/commands/run_heartbeat_demo.py"] + mig[migrations/0001_initial.py] + snippet[schedule_snippet.yaml] + test[tests/test_run_heartbeat_demo_command.py] + end + subgraph manual [You do manually] + installed[INSTALLED_APPS] + yaml[boost_collector_schedule.yaml] + migrate[migrate] + end + scaffold --> manual +``` + +| File | Purpose | +|------|---------| +| `apps.py` | `HeartbeatDemoConfig` with `name = "heartbeat_demo"` | +| `models.py` | Stub `HeartbeatDemoRunState` (`source_key`, `run_count`, `updated_at`) | +| `services.py` | Stub `record_run()` — **all DB writes** for this app should stay here | +| `management/commands/run_heartbeat_demo.py` | `HeartbeatDemoCollector` + `Command(BaseCollectorCommand)` | +| `migrations/0001_initial.py` | Hand-written initial migration matching the stub model | +| `schedule_snippet.yaml` | Commented YAML to paste into the shared schedule file | +| `tests/test_run_heartbeat_demo_command.py` | Smoke test via `call_command` | + +**Not generated:** `collectors.py`, `INSTALLED_APPS` entry, or edits to `config/boost_collector_schedule.yaml`. 
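
To make the stub's role concrete, here is a framework-free sketch of the write path. It assumes `record_run()` behaves like get-or-create plus increment and returns `(row, created)`, which is how the command and tests later in this tutorial use it. The `RunState` dataclass and the in-memory `_ROWS` dict are hypothetical stand-ins for the Django model and its table, not the generated code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RunState:
    """Hypothetical stand-in for HeartbeatDemoRunState (same three fields)."""

    source_key: str
    run_count: int = 0
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Stands in for the database table; keyed by source_key.
_ROWS: dict[str, RunState] = {}


def record_run(source_key: str) -> tuple[RunState, bool]:
    """Create the row for source_key if it is missing, then bump its counter."""
    created = source_key not in _ROWS
    row = _ROWS.setdefault(source_key, RunState(source_key=source_key))
    row.run_count += 1
    row.updated_at = datetime.now(timezone.utc)
    return row, created


row, created = record_run("alpha")
print(created, row.run_count)  # True 1
row, created = record_run("alpha")
print(created, row.run_count)  # False 2
```

Callers such as `collect()` only ever see `record_run()`, which is the point of keeping writes in `services.py`: the storage details can change without touching the collector.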
+ +### Design decisions (why the scaffold looks like this) + +| Decision | Why | +|----------|-----| +| One Django app per collector domain | Shared PostgreSQL, isolated `services.py`, clear ownership in CODEOWNERS | +| Command name `run_{app_label}` | Must match YAML `command:` and what Celery/`call_command` invoke | +| Stub run-state model + `record_run()` | Proves migrations and the service-layer write path before real domain logic | +| Hand-written `0001_initial.py` | Matches `models.py` on day one without requiring `makemigrations` first | +| `schedule_snippet.yaml` not auto-merged | Schedule changes are reviewed in PR; avoids Beat calling a missing command | +| Collector class inside `run_*.py` initially | Smallest first PR; split into `collectors.py` when the command grows (§2.5) | + +### Manual steps after scaffold + +1. Add **`"heartbeat_demo"`** to `INSTALLED_APPS` in [config/settings.py](../config/settings.py) (keep **alphabetical** order with other project apps). +2. Run **`python manage.py migrate`**. +3. Run **`python manage.py run_heartbeat_demo`** — expect success output and one row in `heartbeat_demo_heartbeatdemorunstate` (table name from Django’s model naming). + +The generated collector follows this shape (abbreviated): + +```python +class HeartbeatDemoCollector(AbstractCollector): + @property + def name(self) -> str: + return "heartbeat_demo" + + def validate_config(self) -> None: + if not self.source_key or not self.source_key.strip(): + raise ValueError("source_key must not be empty") + + def collect(self) -> None: + _, created = services.record_run(source_key=self.source_key.strip()) + # ... + +class Command(BaseCollectorCommand): + def get_collector(self, **_options: Any) -> AbstractCollector: + return HeartbeatDemoCollector(stdout=self.stdout, style=self.style) +``` + +Note: the scaffold stores `source_key` on the collector but does **not** wire `--source-key` on the command until you add it (§3.1). 
That is intentional practice for your first edit. + +Further manual steps (from [CONTRIBUTING.md](../CONTRIBUTING.md#creating-a-new-collector)): + +4. Merge `schedule_snippet.yaml` into [config/boost_collector_schedule.yaml](../config/boost_collector_schedule.yaml) (§5). +5. If the app imports other apps or adds cross-app FKs, update [cross-app-dependencies.md](cross-app-dependencies.md). +6. When `services.py` grows, run **`python scripts/generate_service_docs.py`** and commit `docs/service_api/` updates. + +--- + +## 2. AbstractCollector hooks and lifecycle + +Collectors are not auto-discovered as Python classes. Django finds **`management/commands/run_*.py`** under installed apps. Your collector class is built by **`BaseCollectorCommand.get_collector()`** and executed in a fixed two-phase lifecycle. + +### Sequence + +```mermaid +sequenceDiagram + participant Cmd as BaseCollectorCommand + participant Col as AbstractCollector + Cmd->>Col: get_collector(options) + Cmd->>Col: run() + Col->>Col: validate_config() + Col->>Col: collect() + Cmd->>Col: sync_pinecone() +``` + +Implementation: [core/collectors/command_base.py](../core/collectors/command_base.py) `handle()` calls `get_collector`, then `_run_collector_phase(collector, collector.run)`, then `_run_collector_phase(collector, collector.sync_pinecone)`. + +`AbstractCollector.run()` is **concrete** — do not override it. It runs `validate_config()` then `collect()` ([core/collectors/base_collector.py](../core/collectors/base_collector.py)). 
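
The call order described above can be sketched in plain Python. This is a simplified stand-in rather than the code in `core/collectors/`: `SketchCollector`, `DemoCollector`, `handle`, and the `calls` list are hypothetical names that exist only to make the order observable, and error handling is left out.

```python
from abc import ABC, abstractmethod


class SketchCollector(ABC):
    """Hypothetical stand-in for AbstractCollector (not the real class)."""

    @abstractmethod
    def validate_config(self) -> None: ...

    @abstractmethod
    def collect(self) -> None: ...

    def run(self) -> None:
        # Concrete template method: validate first, then collect.
        self.validate_config()
        self.collect()

    def sync_pinecone(self) -> None:
        # Default is a no-op; subclasses override it when they index data.
        pass


class DemoCollector(SketchCollector):
    def __init__(self, source_key: str) -> None:
        self.source_key = source_key
        self.calls: list[str] = []  # records hook order for illustration

    def validate_config(self) -> None:
        self.calls.append("validate_config")
        if not self.source_key.strip():
            raise ValueError("source_key must not be empty")

    def collect(self) -> None:
        self.calls.append("collect")

    def sync_pinecone(self) -> None:
        self.calls.append("sync_pinecone")


def handle(collector: SketchCollector) -> None:
    """Hypothetical stand-in for the command's two phases."""
    collector.run()            # phase 1: validate_config() then collect()
    collector.sync_pinecone()  # phase 2: runs only if phase 1 did not raise


demo = DemoCollector(source_key="default")
handle(demo)
print(demo.calls)  # ['validate_config', 'collect', 'sync_pinecone']
```

Because `run()` owns the ordering, a subclass cannot accidentally collect before validating, which is one reason to leave `run()` alone and implement only the hooks.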
+ +### Hook reference + +| Hook | Who calls it | Responsibility | Tutorial example | +|------|----------------|----------------|------------------| +| `name` | `handle_error` logging | Stable slug for metrics/alerts | `"heartbeat_demo"` | +| `validate_config()` | `run()` before I/O | Fast checks: env, CLI, empty keys | Reject empty `source_key` | +| `collect()` | `run()` | Orchestration; delegate DB to `services.py` | `services.record_run(...)` | +| `run()` | Command | Template: validate → collect | Do not override | +| `handle_error(exc)` | Command on non-`CommandError` | Log with `classify_failure` | Default is enough for most apps | +| `sync_pinecone()` | Command after `run` | Post-run vector sync; default no-op | See §2.4 | + +### Error taxonomy (important for reviews) + +| Exception | Behavior | +|-----------|----------| +| `django.core.management.base.CommandError` | Logged by command with `failure_category=command`; **not** passed to `handle_error`; re-raised | +| Any other `Exception` | `collector.handle_error(exc)` then re-raised (non-zero exit for scheduler) | + +During each phase the command sets **`collector._error_phase`** to `"run"` or `"sync_pinecone"` and clears it in `finally` (even if logging fails). + +**When to use `CommandError`:** invalid CLI combinations, missing required env vars that should read as “user/config error” (§3.3). + +**When to let other exceptions propagate:** API failures, DB errors — `handle_error` classifies them via [core/errors.py](../core/errors.py) `classify_failure()` into `CollectorFailureCategory` values for structured logs. + +Override `handle_error` only when the default classifier does not match your domain (see [How_to_add_a_collector.md](How_to_add_a_collector.md#3-shared-abstractions-recommended)). + +### Optional: `sync_pinecone()` + +Default is a no-op. Collectors that index after fetch override it, often by calling another management command, e.g. `cppa_slack_tracker` → `run_cppa_pinecone_sync`. 
The tutorial stub does not implement this. + +### When to extract `collectors.py` + +Keep collector + command in **`run_heartbeat_demo.py`** while the command stays small (rough guide: under ~80 lines of collector logic). + +Split when: + +- CLI parsing and collector logic make `run_*.py` hard to navigate +- You want unit tests on the collector without loading the Django command + +**Production example:** [wg21_paper_tracker/collectors.py](../wg21_paper_tracker/collectors.py) + thin [run_wg21_paper_tracker.py](../wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py). + +**Minimal in-repo example (inline, like scaffold):** [cppa_user_tracker/management/commands/run_cppa_user_tracker.py](../cppa_user_tracker/management/commands/run_cppa_user_tracker.py). + +### Legacy: `CollectorBase` + +Older collectors implement only `run()`. New work should use **`AbstractCollector`**. See [Core_public_API.md](Core_public_API.md#collectors). + +--- + +## 3. Evolving the worked example + +These three edits turn the stub into a realistic collector shape. Apply them on your local `heartbeat_demo` app (do not commit the app unless you are shipping a real feature). + +### 3.1 Wire `--source-key` on the command + +The scaffold collector already accepts `source_key` in `__init__`. Add CLI wiring on `Command`: + +```python +# heartbeat_demo/management/commands/run_heartbeat_demo.py + +class Command(BaseCollectorCommand): + help = "Run the heartbeat_demo collector." + + def add_arguments(self, parser) -> None: + parser.add_argument( + "--source-key", + default="default", + help="Logical source key for HeartbeatDemoRunState.", + ) + + def get_collector(self, **options: Any) -> AbstractCollector: + return HeartbeatDemoCollector( + stdout=self.stdout, + style=self.style, + source_key=options["source_key"], + ) +``` + +Run: `python manage.py run_heartbeat_demo --source-key=prod`. + +### 3.2 Service layer + +Rename or extend `record_run` as your domain grows. 
**Rule:** all creates/updates/deletes for `heartbeat_demo` models go through **`heartbeat_demo/services.py`** — not from `collect()` directly. See [CONTRIBUTING.md](../CONTRIBUTING.md#service-layer-single-place-for-writes). + +When you add public service functions, regenerate API docs: + +```bash +python scripts/generate_service_docs.py +# or one app: python scripts/generate_service_docs.py --app heartbeat_demo +``` + +### 3.3 Validation and `CommandError` + +Demonstrate config errors vs runtime errors. Example: require an env var for a hypothetical API: + +```python +import os +from django.core.management.base import CommandError + +def validate_config(self) -> None: + if not self.source_key or not self.source_key.strip(): + raise ValueError("source_key must not be empty") + if not os.environ.get("HEARTBEAT_DEMO_API_KEY"): + raise CommandError( + "HEARTBEAT_DEMO_API_KEY is not set. " + "Add it to .env and document it in .env.example." + ) +``` + +Document new variables in **`.env.example`** and, if needed, **`docs/operations/`**. + +### Anti-patterns + +- Calling `HeartbeatDemoRunState.objects.create(...)` from `collect()` instead of `services.py` +- Importing another tracker app’s models without updating [cross-app-dependencies.md](cross-app-dependencies.md) +- Setting **`enabled: true`** in YAML before the app is merged and migrated + +--- + +## 4. Testing + +The project uses **pytest + pytest-django** with **PostgreSQL only** (`config.test_settings`). See [README.md](../README.md#running-tests). 
+ +### Test layers + +| Layer | What to test | How | +|-------|----------------|-----| +| Service | `record_run` create/increment | `@pytest.mark.django_db`, assert on ORM | +| Command integration | Full `call_command` path | Scaffold smoke test (below) | +| Collector unit | `validate_config` / `collect` with mocks | `@patch` on `services.record_run` | +| Scheduler (advanced) | YAML loading / strict mode | [boost_collector_runner/tests/test_schedule_config.py](../boost_collector_runner/tests/test_schedule_config.py) | + +### Scaffold smoke test (already generated) + +```python +@pytest.mark.django_db +def test_run_heartbeat_demo_writes_success(): + out = StringIO() + call_command("run_heartbeat_demo", stdout=out, verbosity=0) + assert "completed" in out.getvalue().lower() +``` + +### Expand: service test + +```python +# heartbeat_demo/tests/test_services.py +import pytest +from heartbeat_demo.models import HeartbeatDemoRunState +from heartbeat_demo.services import record_run + + +@pytest.mark.django_db +def test_record_run_creates_and_increments(): + row1, created1 = record_run(source_key="alpha") + assert created1 is True + assert row1.run_count == 1 + + row2, created2 = record_run(source_key="alpha") + assert created2 is False + assert row2.id == row1.id + assert row2.run_count == 2 +``` + +### Expand: collector unit test (mock services) + +Pattern from [wg21_paper_tracker/tests/test_collectors.py](../wg21_paper_tracker/tests/test_collectors.py): + +```python +from unittest.mock import patch +import pytest +from heartbeat_demo.management.commands.run_heartbeat_demo import HeartbeatDemoCollector + + +def test_validate_config_rejects_empty_source_key(): + collector = HeartbeatDemoCollector(stdout=None, style=None, source_key=" ") + with pytest.raises(ValueError): + collector.validate_config() + + +@patch("heartbeat_demo.services.record_run") +def test_collect_calls_service(mock_record_run): + mock_record_run.return_value = (None, True) + collector = 
HeartbeatDemoCollector(stdout=None, style=None, source_key="k1") + collector.collect() + mock_record_run.assert_called_once_with(source_key="k1") +``` + +### Run tests locally + +```bash +docker compose -f docker-compose.test.yml up -d +export DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/postgres +export SECRET_KEY=for-local-only +export DJANGO_SETTINGS_MODULE=config.test_settings + +uv run pytest heartbeat_demo/tests/ -v +uv run pytest # full suite before PR +uv run pyright # typing, matches CI +``` + +### CI + +- **Lint job:** [scripts/validate_collector_scaffold.py](../scripts/validate_collector_scaffold.py) recreates a throwaway app under `.test_artifacts/`, then runs ruff and scoped pyright. +- **Test job:** full `pytest` with **90%** coverage gate (`.github/workflows/actions.yml`). + +--- + +## 5. Celery scheduling + +Scheduling is **YAML-driven** — no Python change to add a Beat entry. Full reference: [Workflow.md §2](Workflow.md#2-boost-collector-runner-and-yaml-schedule). + +### Author checklist + +1. Open **`heartbeat_demo/schedule_snippet.yaml`** from the scaffold. +2. Paste a task under the right **`groups..tasks`** in [config/boost_collector_schedule.yaml](../config/boost_collector_schedule.yaml) (pick a group that matches runtime dependencies, e.g. `github` for GitHub-related work). +3. Use **`enabled: false`** until the app is on `develop`/`main`, in `INSTALLED_APPS`, and migrated. +4. After merge, flip **`enabled: true`** per team policy (same PR or follow-up). + +Example entry: + +```yaml +groups: + github: + default_time: "00:05" + tasks: + # ... existing tasks ... + - command: run_heartbeat_demo + schedule: daily + enabled: false + args: ["--source-key", "default"] +``` + +The **`command`** value must match the Django management command name (filename `run_heartbeat_demo.py` → command `run_heartbeat_demo`). 
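
A quick way to picture the name-matching rule: Django names a management command after its module file, so a schedule entry is only runnable if a file with that stem exists. The sketch below checks an already-parsed schedule (a plain dict shaped like the YAML example above) against a list of command files. `command_name` and `find_unknown_commands` are hypothetical helpers written for this illustration and are not part of the project's schedule loader.

```python
from pathlib import PurePosixPath


def command_name(path: str) -> str:
    """Return the command name Django derives from a command module path."""
    return PurePosixPath(path).stem


def find_unknown_commands(schedule: dict, command_files: list[str]) -> list[str]:
    """Return scheduled command names that have no matching command file."""
    known = {command_name(p) for p in command_files}
    unknown = []
    for group in schedule.get("groups", {}).values():
        for task in group.get("tasks", []):
            if task["command"] not in known:
                unknown.append(task["command"])
    return unknown


schedule = {
    "groups": {
        "github": {
            "default_time": "00:05",
            "tasks": [
                {"command": "run_heartbeat_demo", "schedule": "daily", "enabled": False},
                # Deliberate typo to show what a mismatch looks like.
                {"command": "run_heartbeat", "schedule": "daily", "enabled": False},
            ],
        }
    }
}
files = ["heartbeat_demo/management/commands/run_heartbeat_demo.py"]

print(find_unknown_commands(schedule, files))  # ['run_heartbeat']
```

A mismatch like the second entry would only surface when the scheduler tries to invoke the command, which is another reason to keep new entries at `enabled: false` until the app is merged and migrated.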
+ +### How Beat runs your collector + +```mermaid +flowchart LR + beat[CeleryBeat] --> task[run_scheduled_collectors_task] + task --> mgmt[run_scheduled_collectors] + mgmt --> cmd[run_heartbeat_demo] +``` + +- [config/settings.py](../config/settings.py) sets `CELERY_BEAT_SCHEDULE` from YAML via `get_beat_schedule()`. +- [boost_collector_runner/tasks.py](../boost_collector_runner/tasks.py) `@shared_task` `run_scheduled_collectors_task` calls `call_command("run_scheduled_collectors", ...)`. +- Within one batch, collectors in a group run **sequentially**; different groups can run in parallel on workers. + +### Local verification + +```bash +# Run one collector by hand +python manage.py run_heartbeat_demo --source-key=default + +# Run a scheduled group (same as Beat would batch) +python manage.py run_scheduled_collectors --schedule daily --group github + +# Optional: trigger Celery task directly (worker must be running) +# See docs/Celery_test.md +``` + +Docker: start **`celery_worker`** and **`celery_beat`** per [Docker.md](Docker.md). + +**Strict mode:** With `DEBUG=False` or `BOOST_COLLECTOR_SCHEDULE_STRICT=True`, invalid YAML fails at settings import so Beat does not start with an empty schedule. + +--- + +## 6. Deployment and production readiness + +Short path; details in [Deployment.md](Deployment.md) and [GCP_Production_Checklist.md](GCP_Production_Checklist.md). 
+ +| Step | Action | +|------|--------| +| PR merged | Set `enabled: true` in YAML when ready for production runs | +| Deploy | `docker compose` with `web`, `celery_worker`, `celery_beat`, Redis | +| Health | `GET /health/` returns **503** while `HEALTH_ENFORCE_COLLECTOR_FRESHNESS=true` until daily YAML groups have successful runs | +| Secrets | New env vars in `.env.example`; ops notes under `docs/operations/` if using external APIs | + +Production-scale references: + +- **`github_activity_tracker`** — full fetch, workspace, Pinecone pipeline +- **`wg21_paper_tracker`** — split `collectors.py`, rich CLI, service-backed pipeline + +--- + +## 7. Checklist and further reading + +### Copy-paste checklist + +- [ ] `python manage.py startcollector ` from repo root +- [ ] Add app to `INSTALLED_APPS` (alphabetical) +- [ ] `python manage.py migrate` +- [ ] Implement real logic in `services.py`; keep `collect()` thin +- [ ] Subclass `AbstractCollector` + `BaseCollectorCommand` (or split `collectors.py` when large) +- [ ] `validate_config` for fast checks; `CommandError` for bad config +- [ ] Tests: service + command (+ collector unit tests with mocks) +- [ ] Paste schedule entry with `enabled: false`; merge to `config/boost_collector_schedule.yaml` +- [ ] Update `cross-app-dependencies.md` if importing other apps +- [ ] `.env.example` + ops docs for new secrets +- [ ] `generate_service_docs.py` when adding public service functions +- [ ] `uv run pytest` and `uv run pyright` before PR +- [ ] Enable task in YAML after deploy/migrate +- [ ] Verify `/health/` and optional `run_scheduled_collectors` on staging + +### Further reading + +| Topic | Doc | +|-------|-----| +| Checklist / contracts | [How_to_add_a_collector.md](How_to_add_a_collector.md) | +| `startcollector` + service layer | [CONTRIBUTING.md](../CONTRIBUTING.md#creating-a-new-collector) | +| YAML schedules | [Workflow.md](Workflow.md) | +| Core collector API | [Core_public_API.md](Core_public_API.md) | +| Data flow 
| [Architecture_data_flow.md](Architecture_data_flow.md) | +| Cross-app imports | [cross-app-dependencies.md](cross-app-dependencies.md) | +| Deploy / GCP | [Deployment.md](Deployment.md), [GCP_Production_Checklist.md](GCP_Production_Checklist.md) | +| Celery manual test | [Celery_test.md](Celery_test.md) | +| Service function reference | [Service_API.md](Service_API.md) | From 9d2a2d0799458319979e8969cbf132cb4b6c6beb Mon Sep 17 00:00:00 2001 From: snowfox1003 Date: Wed, 27 May 2026 15:25:20 -0400 Subject: [PATCH 2/2] docs: remove legacy `CollectorBase` reference from collector tutorial --- docs/Tutorial_building_a_collector.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/docs/Tutorial_building_a_collector.md b/docs/Tutorial_building_a_collector.md index 5f60b6f..df88d62 100644 --- a/docs/Tutorial_building_a_collector.md +++ b/docs/Tutorial_building_a_collector.md @@ -207,10 +207,6 @@ Split when: **Minimal in-repo example (inline, like scaffold):** [cppa_user_tracker/management/commands/run_cppa_user_tracker.py](../cppa_user_tracker/management/commands/run_cppa_user_tracker.py). -### Legacy: `CollectorBase` - -Older collectors implement only `run()`. New work should use **`AbstractCollector`**. See [Core_public_API.md](Core_public_API.md#collectors). - --- ## 3. Evolving the worked example