docs: update README

mettta · mettta · commit fd4c80c7f0bd · 2025-11-20T23:41:21.000+01:00
diff --git a/README.md b/README.md
@@ -0,0 +1,98 @@
+# html2pdf4doc_python
+
+## Project Overview
+
+html2pdf4doc_python is the Python wrapper and CLI for the HTML2PDF4Doc JavaScript core. It drives Chrome through Chromedriver to render HTML exports into PDFs.
+The Python CLI exists to provide a stable automation interface for CI pipelines and batch PDF generation.
+This repository focuses strictly on the Python-side automation layer; the rendering logic remains in the JS core.
+
+## Features
+
+- `print` command runs headless Chrome via Selenium to turn one or more HTML files into PDFs in a single invocation.
+- `get_driver` command pre-downloads and caches a matching Chromedriver, making CI setups reproducible.
+- Strict mode (enabled via `--strict`, default for the fuzzer) validates the PDF page count with `pypdf` and fails on mismatch.
+- Configurable Chromedriver cache directory (`--cache-dir`), page-load timeout (`--page-load-timeout`), and an explicit `--chromedriver` path for air-gapped systems.
+- Built-in debug switch (`--debug`) that writes Chromedriver logs to `/tmp/chromedriver.log`.
+
+## Installation
+
+1. Install Google Chrome (or Chrome for Testing) on the machine that will run the CLI.
+2. Install the package from PyPI:
+   ```bash
+   pip install html2pdf4doc
+   ```
+   Python 3.8+ is required; dependencies (`selenium`, `webdriver-manager`, `requests`, `pypdf`) are pulled automatically.
+3. For local development, clone this repository and install the development dependencies:
+   ```bash
+   git clone https://github.com/strictdoc-project/html2pdf4doc_python.git
+   cd html2pdf4doc_python
+   pip install -r requirements.development.txt
+   ```
+   Alternatively, run `invoke bootstrap` to initialize the git submodules and install the development dependencies in one step.
+
+## Usage
+
+- Print a single HTML file to PDF with strict validation:
+  ```bash
+  html2pdf4doc print --strict docs/index.html docs/output/index.pdf
+  ```
+- Print multiple files in one go by pairing each HTML with its PDF destination:
+  ```bash
+  html2pdf4doc print \
+      docs/book.html docs/output/book.pdf \
+      docs/appendix.html docs/output/appendix.pdf
+  ```
+- Pin a specific Chromedriver or cache directory (useful in CI or offline environments):
+  ```bash
+  html2pdf4doc get_driver --cache-dir /tmp/html2pdf_cache
+  html2pdf4doc print --chromedriver /tmp/html2pdf_cache/123/chromedriver docs/index.html docs/output/index.pdf
+  ```
+- Raise the page-load timeout and enable debug logging when a document needs longer to load:
+  ```bash
+  html2pdf4doc print --page-load-timeout 120 --debug docs/index.html docs/output/index.pdf
+  ```
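+
+The invocations above can also be assembled from Python, for example in a CI script. The following is a minimal sketch and not part of this package: `build_print_command` and `print_documents` are hypothetical helper names, and the sketch assumes the flags shown above may precede the HTML/PDF pairs, as they do in the examples.
+
+```python
+import subprocess
+from typing import List, Optional, Sequence, Tuple
+
+
+def build_print_command(
+    pairs: Sequence[Tuple[str, str]],
+    strict: bool = False,
+    page_load_timeout: Optional[int] = None,
+    chromedriver: Optional[str] = None,
+) -> List[str]:
+    """Assemble the argument list for `html2pdf4doc print` (hypothetical helper)."""
+    if not pairs:
+        raise ValueError("at least one (html, pdf) pair is required")
+    command = ["html2pdf4doc", "print"]
+    if strict:
+        command.append("--strict")
+    if page_load_timeout is not None:
+        command += ["--page-load-timeout", str(page_load_timeout)]
+    if chromedriver is not None:
+        command += ["--chromedriver", chromedriver]
+    # Each HTML input is followed directly by its PDF destination.
+    for html_path, pdf_path in pairs:
+        command += [html_path, pdf_path]
+    return command
+
+
+def print_documents(pairs: Sequence[Tuple[str, str]], **options) -> None:
+    # check=True raises CalledProcessError when the CLI exits with a non-zero status.
+    subprocess.run(build_print_command(pairs, **options), check=True)
+```
+
+Keeping command construction separate from execution makes the argument list easy to log or unit-test without launching Chrome.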
+
+## Testing Workflow
+
+### Integration tests
+
+- `invoke test` (alias for `invoke test_integration`) runs the `tests/integration` suite through LLVM’s `lit` harness. Each test folder contains `test.itest` (lit recipe), HTML fixtures, and optional Python assertions (e.g., verifying PDF text via `pypdf`).
+- Tests download or reuse Chromedriver by calling `invoke get_chrome_driver` under the hood, so ensure Chrome is installed locally.
+
+### Fuzz testing
+
+- `invoke test_fuzz` (or `invoke test-fuzz`) runs `html2pdf4doc/html2pdf4doc_fuzzer.py` against whichever HTML fixture you point it to. The default task configuration targets one of the StrictDoc exports under a path like `tests/fuzz/<fixture_name>/`, but you can edit the task or call the script directly to fuzz any other HTML bundle.
+- Pass `--long` to run 200 iterations instead of the default 20. Failures copy the entire fuzz fixture tree plus the mutated HTML/PDF into `output/<relative_fixture_path>/`, so you can inspect the exact inputs that triggered issues.
+
+## Fuzz Fixture & Mutation Questions
+
+- **Fixture selection.** The fuzzer expects two CLI arguments: the path to a single HTML file (`input_file`) and the path to the root directory that contains the file plus all required assets (`root_path`). Out of the box, the `invoke test_fuzz` task points to one of the StrictDoc exports in `tests/fuzz/<fixture_name>/`, but you can drop in any other HTML bundle by adjusting the task arguments or calling the script directly. Each run processes exactly one fixture; to cover several, run the command repeatedly or wrap the fuzzer with a loop. The output folder names include the relative root, so multiple fixtures can coexist.
+
+- **Mutations and how to change them.** `html2pdf4doc/html2pdf4doc_fuzzer.py` parses the HTML with `lxml`, collects all `<p>` and `<td>` elements, and performs up to 25 mutation steps; each step picks a random node and replaces its `.text` with sentences from `Faker`. The mutator intentionally stays within well-formed DOM changes so that it stresses realistic content variations rather than intentional corruption. To extend the mutation types, modify the `mutate_and_print()` function by changing the XPath, touching attributes, inserting or removing nodes, or composing several mutation strategies. Keep the DOM valid and serialize it back to HTML via `etree.tostring()` before printing.
+
+- **Error handling/checks.** The fuzzer always calls the CLI with `--strict`, so `html2pdf4doc.py` validates that the PDF page count reported by Chromedriver matches what `pypdf` sees. Any mismatch or Chromedriver failure triggers a `RuntimeError` that bubbles up as a non-zero exit status. On exceptions, the fuzzer copies both the original fixture tree and the timestamped mutated files into `output/...`, making it clear which mutation failed. If you add new mutation types or fixtures, ensure they respect these invariants or extend the validation logic accordingly.
+
+- **Single vs multiple fixtures.** Nothing ties the fuzzer to a single fixture, but the current CLI accepts only one `(HTML, root)` pair per run. To fuzz multiple documents automatically, add a small driver script (Python or Bash) that iterates over your fixtures and calls `html2pdf4doc_fuzzer.py` once per fixture; the output structure is already namespaced by the fixture's relative path, so no further changes are required.
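+
+The text-replacement mutation described above can be sketched as follows. This is an illustration only, not the code in `mutate_and_print()`: it swaps in the standard library's `xml.etree.ElementTree` for `lxml` and a fixed sentence list for `Faker` so that it runs without extra dependencies, and it therefore expects well-formed (XML-parsable) markup. The name `mutate_text_nodes` is hypothetical.
+
+```python
+import random
+import xml.etree.ElementTree as ET
+from typing import Optional
+
+# Stand-in for Faker-generated sentences (assumption made for this sketch).
+SENTENCES = [
+    "Short sentence.",
+    "A somewhat longer sentence that may wrap onto a second line.",
+    "Requirement text with several clauses, a list of terms, and a trailing remark.",
+]
+
+
+def mutate_text_nodes(markup: str, iterations: int = 25, seed: Optional[int] = None) -> str:
+    """Replace the text of randomly chosen <p>/<td> elements and re-serialize."""
+    rng = random.Random(seed)
+    root = ET.fromstring(markup)
+    # Collect every <p> and <td> element as a mutation candidate.
+    candidates = [el for el in root.iter() if el.tag in ("p", "td")]
+    if not candidates:
+        return markup
+    for _ in range(iterations):
+        node = rng.choice(candidates)
+        count = rng.randint(1, 3)
+        # Only the element's text is replaced, so the tree stays well-formed.
+        node.text = " ".join(rng.choice(SENTENCES) for _ in range(count))
+    return ET.tostring(root, encoding="unicode")
+```
+
+Passing a fixed `seed` makes a failing mutation sequence reproducible, which is a choice of this sketch rather than a documented feature of the fuzzer.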
+
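+A driver that covers several fixtures could look like the sketch below. It is hypothetical and not shipped with the repository: the helper names are illustrative, and it assumes the fuzzer script takes `input_file` followed by `root_path` as positional arguments, in that order.
+
+```python
+import subprocess
+import sys
+from typing import Iterable, List, Tuple
+
+FUZZER_SCRIPT = "html2pdf4doc/html2pdf4doc_fuzzer.py"
+
+
+def build_fuzz_command(input_file: str, root_path: str) -> List[str]:
+    # Assumption: positional order is input_file first, then root_path.
+    return [sys.executable, FUZZER_SCRIPT, input_file, root_path]
+
+
+def fuzz_all(fixtures: Iterable[Tuple[str, str]]) -> List[str]:
+    """Run the fuzzer once per fixture and return the input files whose run failed."""
+    failed = []
+    for input_file, root_path in fixtures:
+        result = subprocess.run(build_fuzz_command(input_file, root_path))
+        # A non-zero exit status signals a failed fuzz run for this fixture.
+        if result.returncode != 0:
+            failed.append(input_file)
+    return failed
+```
+
+Collecting failures instead of stopping at the first one lets a single CI job report every fixture that broke.
+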
+## Development Tasks
+
+- `invoke bootstrap`: initialize git submodules and install development dependencies.
+- `invoke build`: rebuild the JS core (`submodules/html2pdf`) and copy the minified bundle into `html2pdf4doc/html2pdf4doc_js/`.
+- `invoke lint`: run formatting (`ruff format`), lint (`ruff check`), and type checks (`mypy`) across the Python sources.
+- `invoke test` / `invoke test_integration`: execute the integration suite via `lit`.
+- `invoke test_fuzz [--long]`: run the HTML fuzzing harness for 20 (or 200) iterations.
+- `invoke clean_itest_artifacts`: scrub generated integration-test outputs via `git clean`.
+- `invoke package` / `invoke release`: build wheels/sdists and optionally upload to (test) PyPI after running `twine check`.
+
+## Troubleshooting
+
+- **Chromedriver not found.** Run `html2pdf4doc get_driver` (optionally with `--cache-dir`) to download a matching driver, then pass `--chromedriver <path>` to `print`.
+- **Chrome/driver mismatch on macOS CI.** The CLI automatically prefers “Chrome for Testing,” but you can override this by pointing `--chromedriver` to the desired binary.
+- **Page load timeouts.** Increase `--page-load-timeout` (default 60s) for large documents or slow machines.
+- **Need logs from Chromedriver.** Add `--debug` to the `print` command and inspect `/tmp/chromedriver.log`.
+- **Network-restricted environments.** Pre-download Chromedriver (e.g., on a machine with network access) and copy the cache directory into the offline environment; the CLI never downloads a driver when `--chromedriver` is provided.
+
+## Contributing & License
+
+- Contributions are welcome via issues and pull requests. Please run `invoke lint test` before submitting and keep the JS submodule in sync (use `invoke build` if you touch the bundle).
+- The project is distributed under the Apache License 2.0 (see `LICENSE`). Any new files should include the appropriate headers when applicable.