diff --git a/CHANGELOG.md b/CHANGELOG.md index a8a85ad..b671dda 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Documentation + +- Document **`pandoc`** as a **system** dependency (not installed by `pip` / `pypandoc` alone); centralize install steps for macOS, Debian/Ubuntu, and Windows in the root README and link from contributor setup docs. + ### Changed - **core.collectors:** Removed deprecated `CollectorBase` and `DjangoCommandCollector`; the supported collector contract is **`AbstractCollector`** + **`BaseCollectorCommand`** (see docs). diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 0de1bb9..54017e9 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -103,6 +103,7 @@ Reference tables in `docs/service_api/*.md` are produced by **[`scripts/generate ### Testing - **Running tests:** From the project root, install dev deps (`pip install -r requirements-dev.lock` or `uv pip install -r requirements-dev.lock`), start the test database (`docker compose -f docker-compose.test.yml up -d`), set `DATABASE_URL` (and `SECRET_KEY` for the process) as in [README.md](README.md#running-tests), then run `python -m pytest`. Tests **always use PostgreSQL** (`config.test_settings`); there is no SQLite fallback. +- **`boost_library_docs_tracker` / pandoc:** If you work on the docs collector or run tests that hit real HTML→Markdown conversion, install the **`pandoc`** system binary per [README — System dependencies](README.md#system-dependencies) (`pypandoc` from pip is not enough). - See [README.md](README.md#running-tests) and [docs/Development_guideline.md](docs/Development_guideline.md#testing-workflow) for full commands and options. - **Unit tests for `services.py`:** Call the service functions and assert on the database (or mocks) as needed. - **Other tests:** Prefer service functions when setting up data. 
If you must create models directly for tests, keep it in test code (e.g. fixtures or test helpers) and avoid doing the same in production code. diff --git a/README.md b/README.md index 2c4ec56..cc88ff3 100644 --- a/README.md +++ b/README.md @@ -27,12 +27,29 @@ Authoritative names, examples, and comments live in **[`.env.example`](.env.exam - Python 3.13.x (`requires-python` in `pyproject.toml`; CI and Docker use 3.13) - Django (version in `requirements.txt`) - PostgreSQL database access -- **pandoc** — required by `boost_library_docs_tracker` for HTML→Markdown conversion (`pypandoc` calls the `pandoc` binary at runtime): - - macOS: `brew install pandoc` - - Debian/Ubuntu: `sudo apt-get install pandoc` - - Windows: `winget install JohnMacFarlane.Pandoc` or download from [pandoc.org](https://pandoc.org/installing.html) +- **pandoc** (system binary) — required for `boost_library_docs_tracker` HTML→Markdown; `pip` installs `pypandoc` only. See [System dependencies](#system-dependencies). - Environment variables for database URL and API keys (e.g. via `.env`) +### System dependencies + +Some tools must be installed with your **operating system** or package manager; they are not shipped as Python wheels in this repo’s requirements files. + +#### pandoc + +The **`pandoc`** executable must be on your **`PATH`** when you run **`run_boost_library_docs_tracker`** or otherwise exercise real HTML→Markdown conversion. The **`pypandoc`** package in **`requirements.txt`** is only a wrapper around that binary — **`pip install -r requirements.txt` does not install pandoc.** + +**Verify:** `pandoc --version` + +| Platform | Install | +| --- | --- | +| macOS | `brew install pandoc` | +| Debian / Ubuntu | `sudo apt-get install -y pandoc` | +| Windows | `winget install JohnMacFarlane.Pandoc`, or installers from [pandoc.org/installing.html](https://pandoc.org/installing.html) | + +**When you need it:** Running or debugging the Boost library docs collector without mocks. 
Many unit tests mock conversion and do not require pandoc on your machine. + +**CI:** The GitHub Actions **`test`** job installs pandoc on **`ubuntu-latest`**. The **`lint`** and **`pyright`** jobs do not install it. Developers on macOS or Windows should install pandoc locally if they run integration-style tests or the real collector. + ### Initial setup 1. Clone the repository: @@ -52,15 +69,17 @@ venv\Scripts\activate source venv/bin/activate ``` -3. Install dependencies: +3. Install system dependencies for the parts of the project you use (see [System dependencies](#system-dependencies)). If you will run **`run_boost_library_docs_tracker`**, install **`pandoc`** for your OS and confirm **`pandoc --version`**. + +4. Install dependencies: ```bash pip install -r requirements.txt ``` -4. Configure environment variables (e.g. copy `.env.example` to `.env` and set database URL and API credentials). +5. Configure environment variables (e.g. copy `.env.example` to `.env` and set database URL and API credentials). -5. Create and run migrations (required before any command that uses the database): +6. Create and run migrations (required before any command that uses the database): ```bash python manage.py makemigrations @@ -71,13 +90,13 @@ Each project app has a `migrations/` package; if you previously saw "No changes If you see `relation "cppa_user_tracker_githubaccount" does not exist` (or similar), the database tables are missing — run the two commands above. -6. Run a single app command or the full workflow to confirm the project works: +7. Run a single app command or the full workflow to confirm the project works: ```bash python manage.py run_scheduled_collectors --schedule daily --group github ``` -7. To **add a new collector app**, follow **[docs/Tutorial_building_a_collector.md](docs/Tutorial_building_a_collector.md)** (walkthrough), then **`python manage.py startcollector `** and **[CONTRIBUTING.md](CONTRIBUTING.md#creating-a-new-collector)** (checklist). +8. 
To **add a new collector app**, follow **[docs/Tutorial_building_a_collector.md](docs/Tutorial_building_a_collector.md)** (walkthrough), then **`python manage.py startcollector `** and **[CONTRIBUTING.md](CONTRIBUTING.md#creating-a-new-collector)** (checklist). For local development you can start the dev server: `python manage.py runserver`. @@ -149,7 +168,7 @@ python -m pytest --tb=short --cov=. --cov-report=term-missing --cov-fail-under=9 Coverage writes a local **`.coverage`** file (binary data used by `coverage.py`; safe to delete). It is listed in `.gitignore`. -**CI:** [`.github/workflows/actions.yml`](.github/workflows/actions.yml) runs three jobs on pushes/PRs (see the workflow for triggers): **`lint`** (pre-commit on all files), **`pyright`** (static analysis from `pyrightconfig.json`), and **`test`** (pytest with Postgres, `DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5432/postgres`, `DJANGO_SETTINGS_MODULE=config.test_settings`, coverage, and `--cov-fail-under=90`). +**CI:** [`.github/workflows/actions.yml`](.github/workflows/actions.yml) runs three jobs on pushes/PRs (see the workflow for triggers): **`lint`** (pre-commit on all files), **`pyright`** (static analysis from `pyrightconfig.json`), and **`test`** (pytest with Postgres, `DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5432/postgres`, `DJANGO_SETTINGS_MODULE=config.test_settings`, coverage, and `--cov-fail-under=90`). The **`test`** job installs **`pandoc`** via apt on Ubuntu; on macOS or Windows, install pandoc yourself if you run the full suite or docs-tracker paths that invoke real conversion (see [System dependencies](#system-dependencies)). 6. Run a subset of tests (e.g. 
one app or one file): diff --git a/boost_library_docs_tracker/README.md b/boost_library_docs_tracker/README.md index 1a8571a..097f7d8 100644 --- a/boost_library_docs_tracker/README.md +++ b/boost_library_docs_tracker/README.md @@ -2,7 +2,7 @@ ## Overview -Fetches and converts **Boost library documentation** (HTML and related sources) into Markdown for storage and downstream search (Pinecone, etc.). Requires a working **`pandoc`** binary on the host (see root [README](../README.md#quick-start)). +Fetches and converts **Boost library documentation** (HTML and related sources) into Markdown for storage and downstream search (Pinecone, etc.). Requires a working **`pandoc`** binary on the host (see root [README](../README.md#system-dependencies)). ## Data workflow @@ -41,7 +41,7 @@ After DB + workspace writes, the collector can call **`run_cppa_pinecone_sync`** - Run the tracker: `python manage.py run_boost_library_docs_tracker --help`. - Service-layer overview: [docs/service_api/boost_library_docs_tracker.md](../docs/service_api/boost_library_docs_tracker.md). -- Confirm `pandoc` is on `PATH` before debugging conversion failures. +- Confirm `pandoc` is on `PATH` before debugging conversion failures (see [README — System dependencies](../README.md#system-dependencies)). ## Main command: `run_boost_library_docs_tracker` diff --git a/boost_library_docs_tracker/html_to_md.py b/boost_library_docs_tracker/html_to_md.py index 97862bc..a80007b 100644 --- a/boost_library_docs_tracker/html_to_md.py +++ b/boost_library_docs_tracker/html_to_md.py @@ -124,8 +124,9 @@ def _pandoc_convert(html: str) -> str: break raise RuntimeError( - "pandoc is not available. Install it with `brew install pandoc` " - "or `pip install pypandoc` followed by `pypandoc.download_pandoc()`." + "pandoc is not available. Install the pandoc binary for your OS " + "(see README.md#system-dependencies).
`pip install pypandoc` does not install pandoc; you may " + "use `pypandoc.download_pandoc()` as a fallback if you cannot use a system package." ) diff --git a/docs/Development_guideline.md b/docs/Development_guideline.md index 80d059d..c5e1523 100644 --- a/docs/Development_guideline.md +++ b/docs/Development_guideline.md @@ -90,10 +90,11 @@ Use these steps to get the Django project running on your machine. 1. Clone the repository and open the project root (where `manage.py` lives). 2. Create a virtual environment (e.g. `python -m venv .venv`) and activate it. -3. Install dependencies (e.g. `pip install -r requirements.txt`). -4. Copy the sample env file (e.g. `.env.example`) to `.env` and fill in values for database URL, credentials, and any API keys (e.g. via `django-environ` or `python-decouple`). -5. Ensure the database is reachable. Run migrations: `python manage.py migrate`. -6. Run a single app command (e.g. `python manage.py run_boost_github_activity_tracker`) or a YAML batch (e.g. `python manage.py run_scheduled_collectors --schedule default --group `) to confirm the project works. To test the YAML-driven path as Beat does, use `python manage.py run_scheduled_collectors --schedule default --group ` for a group batch, or `python manage.py run_scheduled_collectors --schedule interval --interval-minutes ` for an interval batch (see `config/boost_collector_schedule.yaml` or the checked-in `config/boost_collector_schedule.yaml.example`). +3. Install system dependencies you need. For **`boost_library_docs_tracker`** / **`run_boost_library_docs_tracker`**, install the **`pandoc`** binary for your OS (see [README — System dependencies](../README.md#system-dependencies)); `pip` only installs the `pypandoc` wrapper. +4. Install dependencies (e.g. `pip install -r requirements.txt`). +5. Copy the sample env file (e.g. `.env.example`) to `.env` and fill in values for database URL, credentials, and any API keys (e.g. via `django-environ` or `python-decouple`). +6. 
Ensure the database is reachable. Run migrations: `python manage.py migrate`. +7. Run a single app command (e.g. `python manage.py run_boost_github_activity_tracker`) or a YAML batch (e.g. `python manage.py run_scheduled_collectors --schedule default --group `) to confirm the project works. To test the YAML-driven path as Beat does, use `python manage.py run_scheduled_collectors --schedule default --group ` for a group batch, or `python manage.py run_scheduled_collectors --schedule interval --interval-minutes ` for an interval batch (see `config/boost_collector_schedule.yaml` or the checked-in `config/boost_collector_schedule.yaml.example`). ## Testing workflow diff --git a/requirements.in b/requirements.in index b89105a..2b00de7 100644 --- a/requirements.in +++ b/requirements.in @@ -22,6 +22,7 @@ gunicorn>=22.0,<24 beautifulsoup4>=4.12,<5 # CVE-2026-41066 / GHSA-vfmq-68hx-4jfw: fixed in lxml 6.1.0; prior 5.x and 6.0.x remain affected. lxml>=6.1.0,<7 +# Wraps the pandoc CLI; install the pandoc binary separately (see README.md#system-dependencies). pypandoc>=1.11,<2 # --- cppa_pinecone_sync (vector index; chunking is in-tree) ---
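
Review note: the docs added in this change tell readers to confirm that `pandoc` is on `PATH` (`pandoc --version`) before running `run_boost_library_docs_tracker`. A minimal sketch of the same check in Python follows, using only the standard library. The helper names `find_pandoc`, `pandoc_version`, and `describe_pandoc` are hypothetical and are not part of this repository or of `pypandoc`; the sketch only illustrates the manual verification step and is not how `_pandoc_convert` detects the binary.

```python
import shutil
import subprocess
from typing import Optional


def find_pandoc(executable: str = "pandoc") -> Optional[str]:
    """Return the absolute path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


def pandoc_version(path: str) -> str:
    """Return the first line printed by `<path> --version` (empty string if none)."""
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the binary exits non-zero
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


def describe_pandoc(executable: str = "pandoc") -> str:
    """Summarize whether the binary is available, mirroring the README's verify step."""
    path = find_pandoc(executable)
    if path is None:
        # Hypothetical wording; the real error text lives in html_to_md.py.
        return f"{executable} not found on PATH; see README.md#system-dependencies"
    return f"{path}: {pandoc_version(path)}"


if __name__ == "__main__":
    print(describe_pandoc())
```

Run as a script, this prints either the resolved path with the version line or the "not found" hint, which is roughly what a contributor would otherwise check by hand before debugging conversion failures.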