| 1 | | -# Documentation Index |
| 2 | | - |
| 3 | | -- [Getting Started](getting-started.md) |
| 4 | | -- [API Reference](api-reference.md) |
| 5 | | -- [Selector Reference](selectors.md) |
| 6 | | -- [Performance Guide](performance.md) |
| 7 | | -- [Malformed HTML Guidance](malformed-html-guidance.md) |
| 8 | | -- [Conformance Notes](conformance.md) |
| 9 | | -- [Architecture](architecture.md) |
| 10 | | -- [Troubleshooting](troubleshooting.md) |
| 1 | +# htmlparser Manual |
| 2 | + |
| 3 | +This is the single source of truth for library usage, behavior contracts, performance workflow, and implementation notes. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | + |
| 7 | +- [Requirements](#requirements) |
| 8 | +- [Quick Start](#quick-start) |
| 9 | +- [Core API](#core-api) |
| 10 | +- [Selector Support](#selector-support) |
| 11 | +- [Mode Guidance](#mode-guidance) |
| 12 | +- [Performance and Benchmarks](#performance-and-benchmarks) |
| 13 | +- [Conformance Status](#conformance-status) |
| 14 | +- [Architecture](#architecture) |
| 15 | +- [Troubleshooting](#troubleshooting) |
| 16 | + |
| 17 | +## Requirements |
| 18 | + |
| 19 | +- Zig `0.15.2` |
| 20 | +- Mutable input buffers (`[]u8`) for parsing |
| 21 | + |
| 22 | +## Quick Start |
| 23 | + |
| 24 | +```zig |
| 25 | +const std = @import("std"); |
| 26 | +const html = @import("htmlparser"); |
| 27 | +const options: html.ParseOptions = .{}; |
| 28 | +const Document = options.GetDocument(); |
| 29 | +
|
| 30 | +test "basic parse + query" { |
| 31 | + var doc = Document.init(std.testing.allocator); |
| 32 | + defer doc.deinit(); |
| 33 | +
|
| 34 | + var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*; |
| 35 | + try doc.parse(&input, .{}); |
| 36 | +
|
| 37 | + const a = doc.queryOne("div#app > a.nav") orelse return error.TestUnexpectedResult; |
| 38 | + try std.testing.expectEqualStrings("/docs", a.getAttributeValue("href").?); |
| 39 | +} |
| 40 | +``` |
| 41 | + |
| 42 | +Canonical examples live in `examples/` and are verified by `zig build examples-check`. |
| 43 | + |
| 44 | +## Core API |
| 45 | + |
| 46 | +### `Document` factory and lifecycle |
| 47 | + |
| 48 | +- `const opts: ParseOptions = .{};` |
| 49 | +- `const Document = opts.GetDocument();` |
| 50 | +- `Document.init(allocator)` |
| 51 | +- `doc.deinit()` |
| 52 | +- `doc.clear()` |
| 53 | +- `doc.parse(input: []u8, comptime opts: ParseOptions)` |
| 54 | + |
| 55 | +### Query APIs |
| 56 | + |
| 57 | +- Compile-time selectors: |
| 58 | + - `doc.queryOne(comptime selector)` |
| 59 | + - `doc.queryAll(comptime selector)` |
| 60 | +- Runtime selectors: |
| 61 | + - `try doc.queryOneRuntime(selector)` |
| 62 | + - `try doc.queryAllRuntime(selector)` |
| 63 | +- Cached runtime selectors: |
| 64 | + - `doc.queryOneCached(&selector)` |
| 65 | + - `doc.queryAllCached(&selector)` |
| 66 | + - selector created via `try Selector.compileRuntime(allocator, source)` |
| 67 | +- Diagnostics: |
| 68 | + - `doc.queryOneDebug(comptime selector, report)` |
| 69 | + - `try doc.queryOneRuntimeDebug(selector, report)` |
| 70 | + |
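The three query families above differ mainly in when the selector string is parsed. The following is a minimal sketch, not a canonical example: it assumes that `queryOneRuntime` returns an optional node inside an error union, and that `Selector` is reachable as `html.Selector`. Neither detail is confirmed by this manual. An arena backs the compiled selector so the sketch does not rely on any cleanup call that is not listed here.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "comptime, runtime, and cached selectors" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<ul><li class='x'>one</li><li>two</li></ul>".*;
    try doc.parse(&input, .{});

    // Compile-time selector: a malformed string is rejected when building.
    _ = doc.queryOne("li.x") orelse return error.TestUnexpectedResult;

    // Runtime selector: parsed on each call, so it can fail at run time.
    _ = (try doc.queryOneRuntime("li.x")) orelse return error.TestUnexpectedResult;

    // Cached selector: compile once, then reuse across queries.
    // Assumption: `html.Selector` is the public path to `Selector`.
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();
    const selector = try html.Selector.compileRuntime(arena.allocator(), "li.x");
    _ = doc.queryOneCached(&selector) orelse return error.TestUnexpectedResult;
}
```

The cached form is the one to reach for when the same runtime selector is applied to many documents.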
| 71 | +### Node APIs |
| 72 | + |
| 73 | +- Navigation: |
| 74 | + - `tagName()` |
| 75 | + - `parentNode()` |
| 76 | + - `firstChild()` |
| 77 | + - `lastChild()` |
| 78 | + - `nextSibling()` |
| 79 | + - `prevSibling()` |
| 80 | + - `children()` (borrowed `[]const u32` index view) |
| 81 | +- Text: |
| 82 | + - `innerText(allocator)` (may return a borrowed or an allocated slice) |
| 83 | + - `innerTextWithOptions(allocator, TextOptions)` |
| 84 | + - `innerTextOwned(allocator)` (always allocated) |
| 85 | + - `innerTextOwnedWithOptions(allocator, TextOptions)` |
| 86 | +- Attributes: |
| 87 | + - `getAttributeValue(name)` |
| 88 | +- Scoped queries: |
| 89 | + - same query family as `Document` (`queryOne/queryAll`, runtime, cached, debug) |
| 90 | + |
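Navigation and scoped queries from the list above can be combined. This is a hedged sketch that assumes `parentNode()` returns an optional node; the manual does not state the return types of the navigation calls.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "walk to the parent, then query inside it" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
    try doc.parse(&input, .{});

    const a = doc.queryOne("a.nav") orelse return error.TestUnexpectedResult;

    // Assumption: parentNode() yields null when there is no parent.
    const parent = a.parentNode() orelse return error.TestUnexpectedResult;

    // Scoped query: the search starts at `parent`, not at the document.
    const again = parent.queryOne("a") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/docs", again.getAttributeValue("href").?);
}
```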
| 91 | +### Additional helpers |
| 92 | + |
| 93 | +- `doc.html()`, `doc.head()`, `doc.body()` |
| 94 | +- `doc.isOwned(slice)` to check whether a returned slice points into document source bytes |
| 95 | + |
| 96 | +### Options |
| 97 | + |
| 98 | +- `ParseOptions` |
| 99 | + - `eager_child_views: bool = true` |
| 100 | + - `drop_whitespace_text_nodes: bool = false` |
| 101 | +- `TextOptions` |
| 102 | + - `normalize_whitespace: bool = true` |
| 103 | + |
| 104 | +### Instrumentation wrappers |
| 105 | + |
| 106 | +- `parseWithHooks(doc, input, opts, hooks)` |
| 107 | +- `queryOneRuntimeWithHooks(doc, selector, hooks)` |
| 108 | +- `queryOneCachedWithHooks(doc, selector, hooks)` |
| 109 | +- `queryAllRuntimeWithHooks(doc, selector, hooks)` |
| 110 | +- `queryAllCachedWithHooks(doc, selector, hooks)` |
| 111 | + |
| 112 | +## Selector Support |
| 113 | + |
| 114 | +Supported selectors: |
| 115 | + |
| 116 | +- tag selectors and universal `*` |
| 117 | +- `#id`, `.class` |
| 118 | +- attributes: |
| 119 | + - `[a]`, `[a=v]`, `[a^=v]`, `[a$=v]`, `[a*=v]`, `[a~=v]`, `[a|=v]` |
| 120 | +- combinators: |
| 121 | + - descendant (`a b`) |
| 122 | + - child (`a > b`) |
| 123 | + - adjacent sibling (`a + b`) |
| 124 | + - general sibling (`a ~ b`) |
| 125 | +- grouping: `a, b, c` |
| 126 | +- pseudo-classes: |
| 127 | + - `:first-child` |
| 128 | + - `:last-child` |
| 129 | + - `:nth-child(An+B)` with `odd/even` and forms like `3n+1`, `+3n-2`, `-n+6` |
| 130 | + - `:not(...)` (simple selector payload) |
| 131 | + |
| 132 | +Compilation modes: |
| 133 | + |
| 134 | +- comptime selectors fail at compile time when invalid |
| 135 | +- runtime selectors return `error.InvalidSelector` |
| 136 | + |
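For readers unfamiliar with the `An+B` microsyntax accepted by `:nth-child`, the rule is that a 1-based sibling index matches when it equals `A*n + B` for some integer `n >= 0`. The sketch below illustrates that rule with the standard library only. The `matchesNth` helper is hypothetical and is not how `htmlparser` implements matching.

```zig
const std = @import("std");

/// Returns true when the 1-based sibling `index` equals `a*n + b`
/// for some integer n >= 0. Hypothetical helper for illustration.
fn matchesNth(a: i32, b: i32, index: i32) bool {
    const diff = index - b;
    // A == 0 selects exactly one position.
    if (a == 0) return diff == 0;
    // Positive step: index must sit at or after B, on the step grid.
    if (a > 0) return diff >= 0 and @rem(diff, a) == 0;
    // Negative step: index must sit at or before B, on the step grid.
    return diff <= 0 and @rem(-diff, -a) == 0;
}

test "An+B forms listed above" {
    // `odd` is shorthand for 2n+1.
    try std.testing.expect(matchesNth(2, 1, 1));
    try std.testing.expect(!matchesNth(2, 1, 2));
    // 3n+1 selects 1, 4, 7, ...
    try std.testing.expect(matchesNth(3, 1, 4));
    // -n+6 selects the first six children only.
    try std.testing.expect(matchesNth(-1, 6, 6));
    try std.testing.expect(!matchesNth(-1, 6, 7));
}
```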
| 137 | +## Mode Guidance |
| 138 | + |
| 139 | +`htmlparser` is permissive by design. Choose parse options based on how each target site's markup behaves: |
| 140 | + |
| 141 | +| Mode | Parse Options | Best For | Tradeoffs | |
| 142 | +|---|---|---|---| |
| 143 | +| `strictest` | `.eager_child_views = true`, `.drop_whitespace_text_nodes = false` | Maximum traversal predictability and text fidelity | More parse-time work | |
| 144 | +| `fastest` | `.eager_child_views = false`, `.drop_whitespace_text_nodes = true` | Throughput-first scraping | Whitespace-only text nodes dropped; child views built lazily | |
| 145 | + |
| 146 | +Fallback playbook: |
| 147 | + |
| 148 | +1. Start with `fastest` for bulk workloads. |
| 149 | +2. Switch problematic domains to `strictest` if text/navigation assumptions fail. |
| 150 | +3. Use `queryOneRuntimeDebug` and inspect `QueryDebugReport` before changing selectors. |
| 151 | + |
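The two modes in the table can be written down as `ParseOptions` constants. This is a sketch under assumptions: the constant names are illustrative, and passing the same options to both `GetDocument()` and `parse` is a guess about intended usage rather than a documented requirement.

```zig
const std = @import("std");
const html = @import("htmlparser");

// Field names come from the Options list; the constant names are illustrative.
const strictest: html.ParseOptions = .{
    .eager_child_views = true,
    .drop_whitespace_text_nodes = false,
};
const fastest: html.ParseOptions = .{
    .eager_child_views = false,
    .drop_whitespace_text_nodes = true,
};

// Assumption: the options given to GetDocument() and parse() should agree.
const FastDocument = fastest.GetDocument();

test "parse with the fastest mode" {
    var doc = FastDocument.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<p> a </p>\n<p> b </p>".*;
    try doc.parse(&input, fastest);

    _ = doc.queryOne("p") orelse return error.TestUnexpectedResult;
}
```

Switching a problematic domain to `strictest` then means swapping the constant in one place.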
| 152 | +## Performance and Benchmarks |
| 153 | + |
| 154 | +Run benchmarks: |
| 155 | + |
| 156 | +```bash |
| 157 | +zig build bench-compare |
| 158 | +zig build tools -- run-benchmarks --profile quick |
| 159 | +zig build tools -- run-benchmarks --profile stable |
| 160 | +``` |
| 161 | + |
| 162 | +Artifacts: |
| 163 | + |
| 164 | +- `bench/results/latest.md` |
| 165 | +- `bench/results/latest.json` |
| 166 | + |
| 167 | +Notes: |
| 168 | + |
| 169 | +- parse comparisons include `strlen`, `lexbor`, and parse-only `lol-html` |
| 170 | +- query parse/match/cached sections benchmark `htmlparser` |
| 171 | +- repeated runtime selector workloads should use cached selectors |
| 172 | + |
| 173 | +## Conformance Status |
| 174 | + |
| 175 | +Run conformance suites: |
| 176 | + |
| 177 | +```bash |
| 178 | +zig build conformance |
| 179 | +# or |
| 180 | +zig build tools -- run-external-suites --mode both |
| 181 | +``` |
| 182 | + |
| 183 | +Report artifact: `bench/results/external_suite_report.json` |
| 184 | + |
| 185 | +Tracked suites: |
| 186 | + |
| 187 | +- selector suites: `nwmatcher`, `qwery_contextual` |
| 188 | +- parser suite: html5lib tree-construction compatibility subset |
| 189 | + |
| 190 | +## Architecture |
| 191 | + |
| 192 | +Core modules: |
| 193 | + |
| 194 | +- `src/html/parser.zig`: permissive parse pipeline |
| 195 | +- `src/html/scanner.zig`: byte-scanning hot-path helpers |
| 196 | +- `src/html/tags.zig`: tag metadata and hash dispatch |
| 197 | +- `src/html/attr_inline.zig`: in-place attribute traversal/lazy materialization |
| 198 | +- `src/html/entities.zig`: entity decode utilities |
| 199 | +- `src/selector/runtime.zig`, `src/selector/compile_time.zig`: selector parsing |
| 200 | +- `src/selector/matcher.zig`: selector matching/combinator traversal |
| 201 | + |
| 202 | +Data model highlights: |
| 203 | + |
| 204 | +- `Document` owns source bytes and node/index storage |
| 205 | +- nodes are contiguous and linked by indexes for traversal |
| 206 | +- attributes are traversed directly from source spans (no heap attr objects) |
| 207 | + |
| 208 | +## Troubleshooting |
| 209 | + |
| 210 | +### Query returns nothing |
| 211 | + |
| 212 | +- validate selector syntax (`queryOneRuntime` returns `error.InvalidSelector` for a malformed selector) |
| 213 | +- check query scope (`Document` vs scoped `Node`) |
| 214 | +- use `queryOneRuntimeDebug` + `QueryDebugReport` for near-miss reasons |
| 215 | + |
| 216 | +### Unexpected `innerText` |
| 217 | + |
| 218 | +- default `innerText` normalizes whitespace |
| 219 | +- use `innerTextWithOptions(..., .{ .normalize_whitespace = false })` for raw spacing |
| 220 | +- use `innerTextOwned(...)` when you always require allocated output |
| 221 | +- use `doc.isOwned(slice)` to check whether a returned slice is borrowed from the document source or separately allocated |
| 222 | + |
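The text options above can be compared side by side. A hedged sketch: it assumes the text getters return an error union around a byte slice, which this manual does not spell out. An arena is used so that borrowed and allocated results can be treated the same way on cleanup.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "compare innerText variants" {
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();
    const scratch = arena.allocator();

    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<p>Hello   <b>big</b>\n world</p>".*;
    try doc.parse(&input, .{});
    const p = doc.queryOne("p") orelse return error.TestUnexpectedResult;

    // Assumption: each getter returns `![]const u8` or similar.
    const normalized = try p.innerText(scratch);
    const raw = try p.innerTextWithOptions(scratch, .{ .normalize_whitespace = false });
    const owned = try p.innerTextOwned(scratch);

    // isOwned reports whether a slice points into the document source bytes.
    std.debug.print("normalized: '{s}' (in document: {})\n", .{ normalized, doc.isOwned(normalized) });
    std.debug.print("raw:        '{s}' (in document: {})\n", .{ raw, doc.isOwned(raw) });
    std.debug.print("owned:      '{s}' (in document: {})\n", .{ owned, doc.isOwned(owned) });
}
```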
| 223 | +### Runtime iterator invalidation |
| 224 | + |
| 225 | +`queryAllRuntime` iterators are invalidated by newer `queryAllRuntime` calls on the same `Document`. |
| 226 | + |
| 227 | +### Input buffer changed |
| 228 | + |
| 229 | +Expected behavior: parsing and lazy decode paths mutate source bytes in place. |
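
When the original bytes are still needed after parsing, one option is to parse a copy. A minimal sketch using `std.mem.Allocator.dupe`; the copy is declared before the document so that it is released only after `doc.deinit()` has run.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "parse a scratch copy and keep the original" {
    const allocator = std.testing.allocator;
    const original = "<p title='a &amp; b'>x &lt; y</p>";

    // Mutable copy for the parser; freed after the document is torn down.
    const scratch = try allocator.dupe(u8, original);
    defer allocator.free(scratch);

    var doc = Document.init(allocator);
    defer doc.deinit();
    try doc.parse(scratch, .{});

    _ = doc.queryOne("p") orelse return error.TestUnexpectedResult;
    // `original` is untouched; `scratch` may now differ from it.
}
```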