4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -9,6 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Documentation

- Document **`pandoc`** as a **system** dependency (not installed by `pip` / `pypandoc` alone); centralize install steps for macOS, Debian/Ubuntu, and Windows in the root README and link from contributor setup docs.

### Changed

- **core.collectors:** Removed deprecated `CollectorBase` and `DjangoCommandCollector`; the supported collector contract is **`AbstractCollector`** + **`BaseCollectorCommand`** (see docs).
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -103,6 +103,7 @@ Reference tables in `docs/service_api/*.md` are produced by **[`scripts/generate
### Testing

- **Running tests:** From the project root, install dev deps (`pip install -r requirements-dev.lock` or `uv pip install -r requirements-dev.lock`), start the test database (`docker compose -f docker-compose.test.yml up -d`), set `DATABASE_URL` (and `SECRET_KEY` for the process) as in [README.md](README.md#running-tests), then run `python -m pytest`. Tests **always use PostgreSQL** (`config.test_settings`); there is no SQLite fallback.
- **`boost_library_docs_tracker` / pandoc:** If you work on the docs collector or run tests that hit real HTML→Markdown conversion, install the **`pandoc`** system binary per [README — System dependencies](README.md#system-dependencies) (`pypandoc` from pip is not enough).
- See [README.md](README.md#running-tests) and [docs/Development_guideline.md](docs/Development_guideline.md#testing-workflow) for full commands and options.
- **Unit tests for `services.py`:** Call the service functions and assert on the database (or mocks) as needed.
- **Other tests:** Prefer service functions when setting up data. If you must create models directly for tests, keep it in test code (e.g. fixtures or test helpers) and avoid doing the same in production code.
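
One way to keep real-conversion tests from failing on machines without pandoc is to skip them when the binary is missing. The sketch below is an illustration only, not code from this repository: it uses the standard library's `unittest.skipUnless` (pytest also collects `unittest.TestCase` classes), and the names `PANDOC_AVAILABLE`, `requires_pandoc`, and `RealConversionTest` are hypothetical.

```python
import shutil
import subprocess
import unittest

# True only when a `pandoc` executable can be found on PATH.
PANDOC_AVAILABLE = shutil.which("pandoc") is not None

# Hypothetical reusable decorator: skips the decorated test or class
# when the pandoc binary is missing.
requires_pandoc = unittest.skipUnless(
    PANDOC_AVAILABLE, "pandoc binary not found on PATH"
)


@requires_pandoc
class RealConversionTest(unittest.TestCase):
    """Stand-in for tests that exercise real HTML-to-Markdown conversion."""

    def test_pandoc_runs(self) -> None:
        # Smoke check only: the binary starts and exits cleanly.
        completed = subprocess.run(
            ["pandoc", "--version"], capture_output=True, text=True
        )
        self.assertEqual(completed.returncode, 0)
```

With this shape, a contributor without pandoc sees the class reported as skipped instead of failed, while CI (which installs pandoc) still runs it.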
39 changes: 29 additions & 10 deletions README.md
@@ -27,12 +27,29 @@ Authoritative names, examples, and comments live in **[`.env.example`](.env.exam
- Python 3.13.x (`requires-python` in `pyproject.toml`; CI and Docker use 3.13)
- Django (version in `requirements.txt`)
- PostgreSQL database access
- **pandoc** — required by `boost_library_docs_tracker` for HTML→Markdown conversion (`pypandoc` calls the `pandoc` binary at runtime):
- macOS: `brew install pandoc`
- Debian/Ubuntu: `sudo apt-get install pandoc`
- Windows: `winget install JohnMacFarlane.Pandoc` or download from [pandoc.org](https://pandoc.org/installing.html)
- **pandoc** (system binary) — required for `boost_library_docs_tracker` HTML→Markdown; `pip` installs `pypandoc` only. See [System dependencies](#system-dependencies).
- Environment variables for database URL and API keys (e.g. via `.env`)

### System dependencies

Some tools must be installed with your **operating system** or package manager; they are not shipped as Python wheels in this repo’s requirements files.

#### pandoc

The **`pandoc`** executable must be on your **`PATH`** when you run **`run_boost_library_docs_tracker`** or otherwise exercise real HTML→Markdown conversion. The **`pypandoc`** package in **`requirements.txt`** is only a wrapper around that binary — **`pip install -r requirements.txt` does not install pandoc.**

**Verify:** `pandoc --version`

| Platform | Install |
| --- | --- |
| macOS | `brew install pandoc` |
| Debian / Ubuntu | `sudo apt-get install -y pandoc` |
| Windows | `winget install JohnMacFarlane.Pandoc`, or installers from [pandoc.org/installing.html](https://pandoc.org/installing.html) |

**When you need it:** Running or debugging the Boost library docs collector without mocks. Many unit tests mock conversion and do not require pandoc on your machine.

**CI:** The GitHub Actions **`test`** job installs pandoc on **`ubuntu-latest`**. The **`lint`** and **`pyright`** jobs do not install it. Developers on macOS or Windows should install pandoc locally if they run integration-style tests or the real collector.
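
The `pandoc --version` check described in this section can also be scripted as a preflight step. The following is a minimal sketch using only the standard library, not part of this repository: the helper names `find_pandoc` and `pandoc_version`, and the injectable `which` parameter, are hypothetical and exist only for illustration.

```python
import shutil
import subprocess
from typing import Callable, Optional


def find_pandoc(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the pandoc executable, or None if it is not on PATH.

    `which` is injectable (an assumption of this sketch) so the lookup can be
    exercised on machines where pandoc is not installed.
    """
    return which("pandoc")


def pandoc_version(executable: str) -> str:
    """Return the first line printed by `<executable> --version`."""
    completed = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.splitlines()[0]
```

A setup script could call `find_pandoc()` first and print the install table's guidance when it returns `None`, and only then call `pandoc_version(path)` to report what was found.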

### Initial setup

1. Clone the repository:
@@ -52,15 +69,17 @@ venv\Scripts\activate
source venv/bin/activate
```

3. Install dependencies:
3. Install system dependencies for the parts of the project you use (see [System dependencies](#system-dependencies)). If you will run **`run_boost_library_docs_tracker`**, install **`pandoc`** for your OS and confirm **`pandoc --version`**.

4. Install dependencies:

```bash
pip install -r requirements.txt
```

4. Configure environment variables (e.g. copy `.env.example` to `.env` and set database URL and API credentials).
5. Configure environment variables (e.g. copy `.env.example` to `.env` and set database URL and API credentials).

5. Create and run migrations (required before any command that uses the database):
6. Create and run migrations (required before any command that uses the database):

```bash
python manage.py makemigrations
@@ -71,13 +90,13 @@ Each project app has a `migrations/` package; if you previously saw "No changes

If you see `relation "cppa_user_tracker_githubaccount" does not exist` (or similar), the database tables are missing — run the two commands above.

6. Run a single app command or the full workflow to confirm the project works:
7. Run a single app command or the full workflow to confirm the project works:

```bash
python manage.py run_scheduled_collectors --schedule daily --group github
```

7. To **add a new collector app**, follow **[docs/Tutorial_building_a_collector.md](docs/Tutorial_building_a_collector.md)** (walkthrough), then **`python manage.py startcollector <name>`** and **[CONTRIBUTING.md](CONTRIBUTING.md#creating-a-new-collector)** (checklist).
8. To **add a new collector app**, follow **[docs/Tutorial_building_a_collector.md](docs/Tutorial_building_a_collector.md)** (walkthrough), then **`python manage.py startcollector <name>`** and **[CONTRIBUTING.md](CONTRIBUTING.md#creating-a-new-collector)** (checklist).

For local development you can start the dev server: `python manage.py runserver`.

@@ -149,7 +168,7 @@ python -m pytest --tb=short --cov=. --cov-report=term-missing --cov-fail-under=9

Coverage writes a local **`.coverage`** file (binary data used by `coverage.py`; safe to delete). It is listed in `.gitignore`.

**CI:** [`.github/workflows/actions.yml`](.github/workflows/actions.yml) runs three jobs on pushes/PRs (see the workflow for triggers): **`lint`** (pre-commit on all files), **`pyright`** (static analysis from `pyrightconfig.json`), and **`test`** (pytest with Postgres, `DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5432/postgres`, `DJANGO_SETTINGS_MODULE=config.test_settings`, coverage, and `--cov-fail-under=90`).
**CI:** [`.github/workflows/actions.yml`](.github/workflows/actions.yml) runs three jobs on pushes/PRs (see the workflow for triggers): **`lint`** (pre-commit on all files), **`pyright`** (static analysis from `pyrightconfig.json`), and **`test`** (pytest with Postgres, `DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5432/postgres`, `DJANGO_SETTINGS_MODULE=config.test_settings`, coverage, and `--cov-fail-under=90`). The **`test`** job installs **`pandoc`** via apt on Ubuntu; on macOS or Windows, install pandoc yourself if you run the full suite or docs-tracker paths that invoke real conversion (see [System dependencies](#system-dependencies)).

6. Run a subset of tests (e.g. one app or one file):

4 changes: 2 additions & 2 deletions boost_library_docs_tracker/README.md
@@ -2,7 +2,7 @@

## Overview

Fetches and converts **Boost library documentation** (HTML and related sources) into Markdown for storage and downstream search (Pinecone, etc.). Requires a working **`pandoc`** binary on the host (see root [README](../README.md#quick-start)).
Fetches and converts **Boost library documentation** (HTML and related sources) into Markdown for storage and downstream search (Pinecone, etc.). Requires a working **`pandoc`** binary on the host (see root [README](../README.md#system-dependencies)).

## Data workflow

@@ -41,7 +41,7 @@ After DB + workspace writes, the collector can call **`run_cppa_pinecone_sync`**

- Run the tracker: `python manage.py run_boost_library_docs_tracker --help`.
- Service-layer overview: [docs/service_api/boost_library_docs_tracker.md](../docs/service_api/boost_library_docs_tracker.md).
- Confirm `pandoc` is on `PATH` before debugging conversion failures.
- Confirm `pandoc` is on `PATH` before debugging conversion failures (see [README — System dependencies](../README.md#system-dependencies)).

## Main command: `run_boost_library_docs_tracker`

5 changes: 3 additions & 2 deletions boost_library_docs_tracker/html_to_md.py
@@ -124,8 +124,9 @@ def _pandoc_convert(html: str) -> str:
break

raise RuntimeError(
"pandoc is not available. Install it with `brew install pandoc` "
"or `pip install pypandoc` followed by `pypandoc.download_pandoc()`."
"pandoc is not available. Install the pandoc binary for your OS (see "
"README.md#system-dependencies). `pip install pypandoc` does not install pandoc; you may "
"use `pypandoc.download_pandoc()` as a fallback if you cannot use a system package."
)
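
The hunk above shows only the tail of `_pandoc_convert`, so its detection logic is not visible here. As a rough sketch of one way such a check could be structured (an assumption, not the repository's implementation): look for the `pandoc` binary on `PATH`, pipe the HTML through it with `subprocess`, and raise a `RuntimeError` with install guidance when the binary is missing. The function name `convert_html_to_markdown`, the `which` parameter, the `INSTALL_HINT` constant, and the choice of `gfm` as the output format are all hypothetical.

```python
import shutil
import subprocess
from typing import Callable, Optional

# Hypothetical message, modeled on the wording in the diff above.
INSTALL_HINT = (
    "pandoc is not available. Install the pandoc binary for your OS; "
    "`pip install pypandoc` does not install pandoc."
)


def convert_html_to_markdown(
    html: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Convert an HTML string to Markdown by calling the pandoc CLI.

    Illustrative only: this bypasses pypandoc and talks to the binary
    directly, so a missing binary surfaces as one clear error.
    """
    executable = which("pandoc")
    if executable is None:
        raise RuntimeError(INSTALL_HINT)
    completed = subprocess.run(
        # With no input file given, pandoc reads the document from stdin.
        [executable, "--from=html", "--to=gfm"],
        input=html,
        capture_output=True,
        encoding="utf-8",
        check=True,  # surface pandoc failures as CalledProcessError
    )
    return completed.stdout
```

Failing early with a single actionable message keeps a missing system dependency from looking like a conversion bug.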


9 changes: 5 additions & 4 deletions docs/Development_guideline.md
@@ -90,10 +90,11 @@ Use these steps to get the Django project running on your machine.

1. Clone the repository and open the project root (where `manage.py` lives).
2. Create a virtual environment (e.g. `python -m venv .venv`) and activate it.
3. Install dependencies (e.g. `pip install -r requirements.txt`).
4. Copy the sample env file (e.g. `.env.example`) to `.env` and fill in values for database URL, credentials, and any API keys (e.g. via `django-environ` or `python-decouple`).
5. Ensure the database is reachable. Run migrations: `python manage.py migrate`.
6. Run a single app command (e.g. `python manage.py run_boost_github_activity_tracker`) or a YAML batch (e.g. `python manage.py run_scheduled_collectors --schedule default --group <group_id>`) to confirm the project works. To test the YAML-driven path as Beat does, use `python manage.py run_scheduled_collectors --schedule default --group <group_id>` for a group batch, or `python manage.py run_scheduled_collectors --schedule interval --interval-minutes <n>` for an interval batch (see `config/boost_collector_schedule.yaml` or the checked-in `config/boost_collector_schedule.yaml.example`).
3. Install system dependencies you need. For **`boost_library_docs_tracker`** / **`run_boost_library_docs_tracker`**, install the **`pandoc`** binary for your OS (see [README — System dependencies](../README.md#system-dependencies)); `pip` only installs the `pypandoc` wrapper.
4. Install dependencies (e.g. `pip install -r requirements.txt`).
5. Copy the sample env file (e.g. `.env.example`) to `.env` and fill in values for database URL, credentials, and any API keys (e.g. via `django-environ` or `python-decouple`).
6. Ensure the database is reachable. Run migrations: `python manage.py migrate`.
7. Run a single app command (e.g. `python manage.py run_boost_github_activity_tracker`) or a YAML batch (e.g. `python manage.py run_scheduled_collectors --schedule default --group <group_id>`) to confirm the project works. To test the YAML-driven path as Beat does, use `python manage.py run_scheduled_collectors --schedule default --group <group_id>` for a group batch, or `python manage.py run_scheduled_collectors --schedule interval --interval-minutes <n>` for an interval batch (see `config/boost_collector_schedule.yaml` or the checked-in `config/boost_collector_schedule.yaml.example`).

## Testing workflow

1 change: 1 addition & 0 deletions requirements.in
@@ -22,6 +22,7 @@ gunicorn>=22.0,<24
beautifulsoup4>=4.12,<5
# CVE-2026-41066 / GHSA-vfmq-68hx-4jfw: fixed in lxml 6.1.0; prior 5.x and 6.0.x remain affected.
lxml>=6.1.0,<7
# Wraps the pandoc CLI; install the pandoc binary separately (see README.md#system-dependencies).
pypandoc>=1.11,<2

# --- cppa_pinecone_sync (vector index; chunking is in-tree) ---