Commit fd4c80c

docs: update README
1 parent ebd6a9e commit fd4c80c

1 file changed

Lines changed: 98 additions & 0 deletions

README.md

@@ -0,0 +1,98 @@
1+
# html2pdf4doc_python
2+
3+
## Project Overview
4+
5+
html2pdf4doc_python is the Python wrapper and CLI for the HTML2PDF4Doc JavaScript core. It drives Chrome through Chromedriver to render HTML exports into PDFs.
6+
The Python CLI exists to provide a stable automation interface for CI pipelines and batch PDF generation.
7+
This repository focuses strictly on the Python-side automation layer; the rendering logic remains in the JS core.
8+
9+
## Features
10+
11+
- `print` command runs headless Chrome via Selenium to turn one or more HTML files into PDFs in a single invocation.
12+
- `get_driver` command pre-downloads and caches a matching Chromedriver, making CI setups reproducible.
13+
- Strict mode (enabled via `--strict`, default for the fuzzer) validates the PDF page count with `pypdf` and fails on mismatch.
14+
- Configurable Chromedriver cache dir, timeout, and explicit `--chromedriver` path for air-gapped systems.
15+
- Built-in debug log switch (`--debug`) that writes Chromedriver logs to `/tmp/chromedriver.log`.
16+
17+
## Installation
18+
19+
1. Install Google Chrome (or Chrome for Testing) on the machine that will run the CLI.
20+
2. Install the package from PyPI:
21+
```bash
pip install html2pdf4doc
```
24+
Python 3.8+ is required; dependencies (`selenium`, `webdriver-manager`, `requests`, `pypdf`) are pulled automatically.
25+
3. For local development, clone this repository and install the dev extras:
26+
```bash
git clone https://github.com/strictdoc-project/html2pdf4doc_python.git
cd html2pdf4doc_python
pip install -r requirements.development.txt
```
31+
Alternatively, run `invoke bootstrap` to initialize the submodules and install the dev dependencies.
32+
33+
## Usage
34+
35+
- Print a single HTML file to PDF with strict validation:
36+
```bash
html2pdf4doc print --strict docs/index.html docs/output/index.pdf
```
39+
- Print multiple files in one go by pairing each HTML with its PDF destination:
40+
```bash
html2pdf4doc print \
docs/book.html docs/output/book.pdf \
docs/appendix.html docs/output/appendix.pdf
```
45+
- Pin a specific Chromedriver or cache directory (useful in CI or offline environments):
46+
```bash
html2pdf4doc get_driver --cache-dir /tmp/html2pdf_cache
html2pdf4doc print --chromedriver /tmp/html2pdf_cache/123/chromedriver docs/index.html docs/output/index.pdf
```
50+
- Tune timeouts/debugging when HTML needs longer to load:
51+
```bash
html2pdf4doc print --page-load-timeout 120 --debug docs/index.html docs/output/index.pdf
```
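
For CI scripts that assemble the `print` invocation programmatically, the pairing of HTML inputs with PDF outputs shown above could be sketched in Python as follows. This is a minimal sketch, not part of the package: `build_print_command` and `run_print` are hypothetical helper names, and the only CLI flags used are the ones documented in this README (`--strict`, `--page-load-timeout`).

```python
import subprocess


def build_print_command(pairs, strict=False, page_load_timeout=None):
    """Build the argv list for `html2pdf4doc print` (hypothetical helper).

    `pairs` is a sequence of (html_path, pdf_path) tuples; each HTML input
    is followed directly by its PDF destination, as in the examples above.
    """
    command = ["html2pdf4doc", "print"]
    if strict:
        command.append("--strict")
    if page_load_timeout is not None:
        command += ["--page-load-timeout", str(page_load_timeout)]
    for html_path, pdf_path in pairs:
        command += [str(html_path), str(pdf_path)]
    return command


def run_print(pairs, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_print_command(pairs, **options), check=True)


# Example: the same two-document batch as in the shell snippet above.
command = build_print_command(
    [
        ("docs/book.html", "docs/output/book.pdf"),
        ("docs/appendix.html", "docs/output/appendix.pdf"),
    ],
    strict=True,
)
```

Keeping command construction separate from execution makes the argument order easy to check in a unit test without launching Chrome.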
54+
55+
## Testing Workflow
56+
57+
### Integration tests
58+
59+
- `invoke test` (alias for `invoke test_integration`) runs the `tests/integration` suite through LLVM’s `lit` harness. Each test folder contains `test.itest` (lit recipe), HTML fixtures, and optional Python assertions (e.g., verifying PDF text via `pypdf`).
60+
- Tests download or reuse Chromedriver by calling `invoke get_chrome_driver` under the hood, so ensure Chrome is installed locally.
61+
62+
### Fuzz testing
63+
64+
- `invoke test_fuzz` (or `invoke test-fuzz`) runs `html2pdf4doc/html2pdf4doc_fuzzer.py` against whichever HTML fixture you point it to. The default task configuration targets one of the StrictDoc exports under a path like `tests/fuzz/<fixture_name>/`, but you can edit the task or call the script directly to fuzz any other HTML bundle.
65+
- Pass `--long` to run 200 iterations instead of the default 20. Failures copy the entire fuzz fixture tree plus the mutated HTML/PDF into `output/<relative_fixture_path>/`, so you can inspect the exact inputs that triggered issues.
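
Calling the fuzzer script directly for several HTML bundles could look roughly like the sketch below. It rests on assumptions: the script is taken to accept the HTML file and its root directory as two positional arguments in that order, the fixture paths are placeholders, and `build_fuzz_command` and `fuzz_all` are hypothetical names rather than part of the repository.

```python
import subprocess
import sys

# Script path as given in this README.
FUZZER_SCRIPT = "html2pdf4doc/html2pdf4doc_fuzzer.py"


def build_fuzz_command(input_file, root_path, python=sys.executable):
    """Build the argv for one fuzzer run (argument order is an assumption)."""
    return [python, FUZZER_SCRIPT, str(input_file), str(root_path)]


def fuzz_all(fixtures, dry_run=False):
    """Run the fuzzer once per (input_file, root_path) pair.

    Returns the input files whose run exited with a non-zero status.
    With dry_run=True the commands are only printed, not executed.
    """
    failed = []
    for input_file, root_path in fixtures:
        command = build_fuzz_command(input_file, root_path)
        if dry_run:
            print(" ".join(command))
            continue
        if subprocess.run(command).returncode != 0:
            failed.append(str(input_file))
    return failed


# Placeholder fixtures; substitute the bundles you actually want to fuzz.
fixtures = [
    ("tests/fuzz/fixture_a/index.html", "tests/fuzz/fixture_a"),
    ("tests/fuzz/fixture_b/index.html", "tests/fuzz/fixture_b"),
]
failed = fuzz_all(fixtures, dry_run=True)
```

Drop `dry_run=True` to execute the runs; a non-empty return value can then be turned into a failing CI exit code.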
66+
67+
## Fuzz Fixture & Mutation Questions
68+
69+
- **Fixture selection.** The fuzzer expects two CLI arguments: the path to a single HTML file (`input_file`) and the path to the root directory that contains the file plus all required assets (`root_path`). Out of the box, the `invoke test_fuzz` task points to one of the StrictDoc exports in `tests/fuzz/<fixture_name>/`, but you can drop in any other HTML bundle by adjusting the task arguments or calling the script directly. Each run processes exactly one fixture; to cover several, run the command repeatedly or wrap the fuzzer with a loop. The output folder names include the relative root, so multiple fixtures can coexist.
70+
71+
- **Mutations and how to change them.** `html2pdf4doc/html2pdf4doc_fuzzer.py` parses the HTML with `lxml`, collects all `<p>` and `<td>` elements, and performs up to 25 mutation iterations; in each one it picks a random node and replaces its `.text` with sentences from `Faker`. The mutator intentionally limits itself to well-formed DOM changes, so it stresses realistic content variations rather than deliberate corruption. To extend the mutation types, modify the `mutate_and_print()` function: change the XPath, touch attributes, insert or remove nodes, or compose several mutation strategies. Keep the DOM valid and serialize it back to HTML via `etree.tostring()` before printing.
72+
73+
- **Error handling/checks.** The fuzzer always calls the CLI with `--strict`, so `html2pdf4doc.py` validates that the PDF page count reported by Chromedriver matches what `pypdf` sees. Any mismatch or Chromedriver failure triggers a `RuntimeError` that bubbles up as a non-zero exit status. On exceptions, the fuzzer copies both the original fixture tree and the timestamped mutated files into `output/...`, making it clear which mutation failed. If you add new mutation types or fixtures, ensure they respect these invariants or extend the validation logic accordingly.
74+
75+
- **Single vs multiple fixtures.** There is no hard limit of “one fixture forever,” but the current CLI accepts only one `(HTML, root)` pair per run. To fuzz multiple documents automatically, add a tiny driver (Python/Bash) that iterates over your fixtures and calls `html2pdf4doc_fuzzer.py` for each; the output structure is already namespaced, so nothing else is required.
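
The text-replacement mutation described in this section can be illustrated with a dependency-free sketch. It is not the code of `mutate_and_print()`: the standard library's `xml.etree.ElementTree` stands in for `lxml`, a fixed sentence list stands in for `Faker`, and the way the number of mutations is chosen (a random count between 1 and 25) is an assumption made for illustration.

```python
import random
import xml.etree.ElementTree as ET

# Stand-in for Faker-generated sentences (assumption, keeps the sketch stdlib-only).
SAMPLE_SENTENCES = [
    "Short replacement text.",
    "A considerably longer sentence that may push content onto another page.",
    "Requirement text with several clauses, commas, and a trailing note.",
]


def mutate_text_nodes(root, rng, max_mutations=25):
    """Replace the text of randomly chosen <p> and <td> elements in place.

    Returns the number of mutation steps performed. The tree structure is
    never changed, so the document stays well-formed.
    """
    candidates = [el for el in root.iter() if el.tag in ("p", "td")]
    if not candidates:
        return 0
    steps = rng.randint(1, max_mutations)
    for _ in range(steps):
        node = rng.choice(candidates)
        node.text = rng.choice(SAMPLE_SENTENCES)
    return steps


html = "<html><body><p>Intro</p><table><tr><td>Cell</td></tr></table></body></html>"
root = ET.fromstring(html)
mutated = mutate_text_nodes(root, random.Random(0))

# Serialize the mutated tree back to markup before handing it to the printer.
serialized = ET.tostring(root, encoding="unicode")
```

Passing a seeded `random.Random` makes a failing mutation sequence reproducible, which helps when inspecting the files copied to `output/...`.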
76+
77+
## Development Tasks
78+
79+
- `invoke bootstrap`: initialize git submodules and install development dependencies.
80+
- `invoke build`: rebuild the JS core (`submodules/html2pdf`) and copy the minified bundle into `html2pdf4doc/html2pdf4doc_js/`.
81+
- `invoke lint`: run formatting (`ruff format`), lint (`ruff check`), and type checks (`mypy`) across the Python sources.
82+
- `invoke test` / `invoke test_integration`: execute the integration suite via `lit`.
83+
- `invoke test_fuzz [--long]`: run the HTML fuzzing harness for 20 (or 200) iterations.
84+
- `invoke clean_itest_artifacts`: scrub generated integration-test outputs via `git clean`.
85+
- `invoke package` / `invoke release`: build wheels and sdists, run `twine check`, and optionally upload to PyPI or TestPyPI.
86+
87+
## Troubleshooting
88+
89+
- **Chromedriver not found.** Run `html2pdf4doc get_driver` (optionally with `--cache-dir`) to download a matching driver, then pass `--chromedriver <path>` to `print`.
90+
- **Chrome/driver mismatch on macOS CI.** The CLI automatically prefers “Chrome for Testing,” but you can override by pointing `--chromedriver` to the desired binary.
91+
- **Page load timeouts.** Increase `--page-load-timeout` (default 60s) for large documents or slow machines.
92+
- **Need logs from ChromeDriver.** Add `--debug` to the `print` command and inspect `/tmp/chromedriver.log`.
93+
- **Network-restricted environments.** Pre-download Chromedriver (e.g., on a whitelisted machine) and copy the cache directory into the offline environment; the CLI never auto-downloads if `--chromedriver` is provided.
94+
95+
## Contributing & License
96+
97+
- Contributions are welcome via issues and pull requests. Please run `invoke lint test` before submitting and keep the JS submodule in sync (use `invoke build` if you touch the bundle).
98+
- The project is distributed under the Apache License 2.0 (see `LICENSE`). Any new files should include the appropriate headers when applicable.