| 1 | +# html2pdf4doc_python |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +html2pdf4doc_python is the Python wrapper/CLI for the HTML2PDF4Doc JavaScript core that drives Chrome/Chromedriver to render HTML exports into PDFs. |
| 6 | +The Python CLI exists to provide a stable automation interface for CI pipelines and batch PDF generation. |
| 7 | +This repository focuses strictly on the Python-side automation layer; the rendering logic remains in the JS core. |
| 8 | + |
| 9 | +## Features |
| 10 | + |
| 11 | +- `print` command runs headless Chrome via Selenium to turn one or more HTML files into PDFs in a single invocation. |
| 12 | +- `get_driver` command pre-downloads and caches a matching Chromedriver, making CI setups reproducible. |
| 13 | +- Strict mode (enabled via `--strict`, default for the fuzzer) validates the PDF page count with `pypdf` and fails on mismatch. |
| 14 | +- Configurable Chromedriver cache dir, timeout, and explicit `--chromedriver` path for air-gapped systems. |
| 15 | +- Built-in debug switch (`--debug`) that writes ChromeDriver logs to `/tmp/chromedriver.log`. |
| 16 | + |
| 17 | +## Installation |
| 18 | + |
| 19 | +1. Install Google Chrome (or Chrome for Testing) on the machine that will run the CLI. |
| 20 | +2. Install the package from PyPI: |
| 21 | + ```bash |
| 22 | + pip install html2pdf4doc |
| 23 | + ``` |
| 24 | + Python 3.8+ is required; dependencies (`selenium`, `webdriver-manager`, `requests`, `pypdf`) are pulled automatically. |
| 25 | +3. For local development, clone this repository and install the dev extras: |
| 26 | + ```bash |
| 27 | + git clone https://github.com/strictdoc-project/html2pdf4doc_python.git |
| 28 | + cd html2pdf4doc_python |
| 29 | + pip install -r requirements.development.txt |
| 30 | + ``` |
| 31 | + or simply run `invoke bootstrap` to initialize submodules and dev dependencies. |
| 32 | + |
| 33 | +## Usage |
| 34 | + |
| 35 | +- Print a single HTML file to PDF with strict validation: |
| 36 | + ```bash |
| 37 | + html2pdf4doc print --strict docs/index.html docs/output/index.pdf |
| 38 | + ``` |
| 39 | +- Print multiple files in one go by pairing each HTML with its PDF destination: |
| 40 | + ```bash |
| 41 | + html2pdf4doc print \ |
| 42 | + docs/book.html docs/output/book.pdf \ |
| 43 | + docs/appendix.html docs/output/appendix.pdf |
| 44 | + ``` |
| 45 | +- Pin a specific Chromedriver or cache directory (useful in CI or offline environments): |
| 46 | + ```bash |
| 47 | + html2pdf4doc get_driver --cache-dir /tmp/html2pdf_cache |
| 48 | + html2pdf4doc print --chromedriver /tmp/html2pdf_cache/123/chromedriver docs/index.html docs/output/index.pdf |
| 49 | + ``` |
| 50 | +- Tune timeouts/debugging when HTML needs longer to load: |
| 51 | + ```bash |
| 52 | + html2pdf4doc print --page-load-timeout 120 --debug docs/index.html docs/output/index.pdf |
| 53 | + ``` |
| 54 | + |
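The pairing rule used by `print` (each HTML input is followed by its PDF destination) can also be assembled programmatically, for example in a CI script. The following is a minimal sketch, not part of the package: `build_print_command` is a hypothetical helper, and it assumes that `--strict` may be combined with several pairs in the same way the single-file example uses it.

```python
from typing import List, Sequence, Tuple


def build_print_command(
    pairs: Sequence[Tuple[str, str]], strict: bool = False
) -> List[str]:
    """Build the argv for one `html2pdf4doc print` call from (HTML, PDF) pairs.

    Hypothetical helper for illustration; it only assembles the argument list.
    """
    if not pairs:
        raise ValueError("at least one (HTML, PDF) pair is required")
    command = ["html2pdf4doc", "print"]
    if strict:
        # Assumption: --strict goes right after the subcommand, as in the
        # single-file example above.
        command.append("--strict")
    for html_path, pdf_path in pairs:
        # Each HTML input is immediately followed by its PDF destination.
        command.extend([html_path, pdf_path])
    return command


command = build_print_command(
    [
        ("docs/book.html", "docs/output/book.pdf"),
        ("docs/appendix.html", "docs/output/appendix.pdf"),
    ],
    strict=True,
)
print(" ".join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would then invoke the CLI once for all pairs and raise if it exits with a non-zero status.
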
| 55 | +## Testing Workflow |
| 56 | + |
| 57 | +### Integration tests |
| 58 | + |
| 59 | +- `invoke test` (alias for `invoke test_integration`) runs the `tests/integration` suite through LLVM’s `lit` harness. Each test folder contains `test.itest` (lit recipe), HTML fixtures, and optional Python assertions (e.g., verifying PDF text via `pypdf`). |
| 60 | +- Tests download or reuse Chromedriver by calling `invoke get_chrome_driver` under the hood, so ensure Chrome is installed locally. |
| 61 | + |
| 62 | +### Fuzz testing |
| 63 | + |
| 64 | +- `invoke test_fuzz` (or `invoke test-fuzz`) runs `html2pdf4doc/html2pdf4doc_fuzzer.py` against whichever HTML fixture you point it to. The default task configuration targets one of the StrictDoc exports under a path like `tests/fuzz/<fixture_name>/`, but you can edit the task or call the script directly to fuzz any other HTML bundle. |
| 65 | +- Pass `--long` to run 200 iterations instead of the default 20. Failures copy the entire fuzz fixture tree plus the mutated HTML/PDF into `output/<relative_fixture_path>/`, so you can inspect the exact inputs that triggered issues. |
| 66 | + |
| 67 | +## Fuzz Fixture & Mutation Questions |
| 68 | + |
| 69 | +- **Fixture selection.** The fuzzer expects two CLI arguments: the path to a single HTML file (`input_file`) and the path to the root directory that contains the file plus all required assets (`root_path`). Out of the box, the `invoke test_fuzz` task points to one of the StrictDoc exports in `tests/fuzz/<fixture_name>/`, but you can drop in any other HTML bundle by adjusting the task arguments or calling the script directly. Each run processes exactly one fixture; to cover several, run the command repeatedly or wrap the fuzzer with a loop. The output folder names include the relative root, so multiple fixtures can coexist. |
| 70 | + |
| 71 | +- **Mutations and how to change them.** `html2pdf4doc/html2pdf4doc_fuzzer.py` parses the HTML with `lxml`, collects all `<p>` and `<td>` elements, and performs up to 25 iterations, each of which picks a random node and replaces its `.text` with sentences from `Faker`. The mutator deliberately limits itself to well-formed DOM changes, so it stresses realistic content variations rather than deliberate corruption. To extend the mutation types, modify the `mutate_and_print()` function: change the XPath, touch attributes, insert or remove nodes, or compose several mutation strategies. Keep the DOM valid and serialize it back to HTML via `etree.tostring()` before printing. |
| 72 | + |
| 73 | +- **Error handling/checks.** The fuzzer always calls the CLI with `--strict`, so `html2pdf4doc.py` validates that the PDF page count reported by Chromedriver matches what `pypdf` sees. Any mismatch or Chromedriver failure triggers a `RuntimeError` that bubbles up as a non-zero exit status. On exceptions, the fuzzer copies both the original fixture tree and the timestamped mutated files into `output/...`, making it clear which mutation failed. If you add new mutation types or fixtures, ensure they respect these invariants or extend the validation logic accordingly. |
| 74 | + |
| 75 | +- **Single vs. multiple fixtures.** Nothing restricts you to a single fixture overall, but the fuzzer currently accepts only one `(HTML, root)` pair per run. To fuzz multiple documents automatically, add a small driver (Python/Bash) that iterates over your fixtures and calls `html2pdf4doc_fuzzer.py` for each; the output structure is already namespaced, so nothing else is required. |
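
The text-replacement mutation described in this section can be illustrated with a small standalone sketch. It is not the fuzzer's actual code: it swaps in the standard library's `xml.etree.ElementTree` for `lxml` and a fixed sentence pool for `Faker`, so it only handles well-formed (XHTML-like) input. The function name `mutate_text_nodes`, the `mutations` and `seed` parameters, and the sample document are all assumptions made for illustration.

```python
import random
import xml.etree.ElementTree as ET

# Stand-in for Faker (assumption for this sketch): a fixed pool of sentences.
SENTENCES = [
    "Short replacement text.",
    "A considerably longer replacement sentence that may push content onto another page.",
    "Medium-length filler sentence.",
]


def mutate_text_nodes(html: str, mutations: int = 25, seed: int = 0) -> str:
    """Replace the text of randomly chosen <p> and <td> elements.

    The element tree itself is never restructured, so the result stays
    well-formed; only the leading text of the chosen nodes changes.
    """
    rng = random.Random(seed)
    root = ET.fromstring(html)
    candidates = [el for el in root.iter() if el.tag in ("p", "td")]
    if not candidates:
        return html
    for _ in range(mutations):
        node = rng.choice(candidates)
        node.text = rng.choice(SENTENCES)
    # Serialize the mutated tree back to markup before handing it to a printer.
    return ET.tostring(root, encoding="unicode")


sample = (
    "<html><body><p>Original paragraph</p>"
    "<table><tr><td>Original cell</td></tr></table></body></html>"
)
mutated = mutate_text_nodes(sample, mutations=5, seed=1)
print(mutated)
```

Seeding the random generator makes a failing mutation reproducible, which is a design choice of this sketch rather than something the source states about the fuzzer.
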
| 76 | + |
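A driver for fuzzing several fixtures could look roughly like the sketch below. It assumes the fuzzer takes its two positional arguments in the order described above (`input_file`, then `root_path`) and is started from the repository root; the fixture paths and the helper names `build_fuzzer_commands` and `run_all` are hypothetical.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

FUZZER_SCRIPT = "html2pdf4doc/html2pdf4doc_fuzzer.py"

# Hypothetical fixtures: (input_file, root_path) pairs.
FIXTURES = [
    ("tests/fuzz/fixture_a/index.html", "tests/fuzz/fixture_a"),
    ("tests/fuzz/fixture_b/index.html", "tests/fuzz/fixture_b"),
]


def build_fuzzer_commands(
    fixtures: Sequence[Tuple[str, str]],
) -> List[List[str]]:
    """Return one fuzzer command line per (input_file, root_path) pair."""
    return [
        # Assumption: positional order is input_file followed by root_path.
        [sys.executable, FUZZER_SCRIPT, input_file, root_path]
        for input_file, root_path in fixtures
    ]


def run_all(fixtures: Sequence[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Run the fuzzer once per fixture and collect (input_file, exit code)."""
    results = []
    for command in build_fuzzer_commands(fixtures):
        # check=False keeps the loop going so every fixture gets its turn.
        completed = subprocess.run(command, check=False)
        results.append((command[2], completed.returncode))
    return results


for command in build_fuzzer_commands(FIXTURES):
    print(" ".join(command))
```

Calling `run_all(FIXTURES)` from the repository root would execute the runs one after another; any entry with a non-zero exit code marks a fixture whose artifacts are worth inspecting under `output/`.
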
| 77 | +## Development Tasks |
| 78 | + |
| 79 | +- `invoke bootstrap`: initialize git submodules and install development dependencies. |
| 80 | +- `invoke build`: rebuild the JS core (`submodules/html2pdf`) and copy the minified bundle into `html2pdf4doc/html2pdf4doc_js/`. |
| 81 | +- `invoke lint`: run formatting (`ruff format`), lint (`ruff check`), and type checks (`mypy`) across the Python sources. |
| 82 | +- `invoke test` / `invoke test_integration`: execute the integration suite via `lit`. |
| 83 | +- `invoke test_fuzz [--long]`: run the HTML fuzzing harness for 20 (or 200) iterations. |
| 84 | +- `invoke clean_itest_artifacts`: scrub generated integration-test outputs via `git clean`. |
| 85 | +- `invoke package` / `invoke release`: build wheels/sdists and optionally upload to (test) PyPI after running `twine check`. |
| 86 | + |
| 87 | +## Troubleshooting |
| 88 | + |
| 89 | +- **Chromedriver not found.** Run `html2pdf4doc get_driver` (optionally with `--cache-dir`) to download a matching driver, then pass `--chromedriver <path>` to `print`. |
| 90 | +- **Chrome/driver mismatch on macOS CI.** The CLI automatically prefers “Chrome for Testing,” but you can override this by pointing `--chromedriver` to the desired binary. |
| 91 | +- **Page load timeouts.** Increase `--page-load-timeout` (default 60s) for large documents or slow machines. |
| 92 | +- **Need logs from ChromeDriver.** Add `--debug` to the `print` command and inspect `/tmp/chromedriver.log`. |
| 93 | +- **Network-restricted environments.** Pre-download Chromedriver (e.g., on a whitelisted machine) and copy the cache directory into the offline environment; the CLI never auto-downloads if `--chromedriver` is provided. |
| 94 | + |
| 95 | +## Contributing & License |
| 96 | + |
| 97 | +- Contributions are welcome via issues and pull requests. Please run `invoke lint test` before submitting and keep the JS submodule in sync (use `invoke build` if you touch the bundle). |
| 98 | +- The project is distributed under the Apache License 2.0 (see `LICENSE`). Any new files should include the appropriate headers when applicable. |