Commit 525668f

Combine docs and refresh README
1 parent 678d85c commit 525668f

10 files changed

Lines changed: 244 additions & 454 deletions

README.md

Lines changed: 15 additions & 4 deletions
@@ -47,6 +47,17 @@ Then import in Zig code:
4747
const html = @import("htmlparser");
4848
```
4949

50+
## Documentation
51+
52+
The full manual now lives in one file:
53+
54+
- [`docs/README.md`](docs/README.md)
55+
- Core API: [`docs/README.md#core-api`](docs/README.md#core-api)
56+
- Selector grammar: [`docs/README.md#selector-support`](docs/README.md#selector-support)
57+
- Parse mode guidance: [`docs/README.md#mode-guidance`](docs/README.md#mode-guidance)
58+
- Performance workflow: [`docs/README.md#performance-and-benchmarks`](docs/README.md#performance-and-benchmarks)
59+
- Conformance notes: [`docs/README.md#conformance-status`](docs/README.md#conformance-status)
60+
5061
## Quick Start (Test-Backed)
5162

5263
This snippet matches `examples/basic_parse_query.zig`.
@@ -137,7 +148,7 @@ Two bundles are used by the benchmark harness and conformance runner:
137148

138149
`children()` returns a borrowed `[]const u32` index slice into the document's node array.
139150

140-
See `docs/malformed-html-guidance.md` for a mode matrix and fallback workflow on malformed pages.
151+
See `docs/README.md#mode-guidance` for the mode matrix and fallback workflow on malformed pages.
141152

142153
## Selector Support (v1)
143154

@@ -151,7 +162,7 @@ Supported (intentionally limited scope):
151162
- pseudo-classes: `:first-child`, `:last-child`, `:nth-child(An+B)` (includes `odd`/`even`)
152163
- `:not(...)` (simple selectors only)
153164

154-
See `docs/selectors.md` for the exact grammar and constraints.
165+
See `docs/README.md#selector-support` for the exact selector grammar and constraints.
155166

156167
## Design Contract
157168

@@ -235,7 +246,7 @@ Runs known-good external selector and parser suites (both `strictest` and `faste
235246
zig build conformance
236247
```
237248

238-
See `docs/conformance.md` for what’s covered and what’s intentionally out of scope.
249+
See `docs/README.md#conformance-status` for what’s covered and what’s intentionally out of scope.
239250

240251
## Migration Notes
241252

@@ -246,7 +257,7 @@ See `docs/conformance.md` for what’s covered and what’s intentionally out of
246257

247258
## Documentation
248259

249-
See `docs/README.md`.
260+
See `docs/README.md` for the full manual.
250261

251262
## License
252263

docs/README.md

Lines changed: 229 additions & 10 deletions
@@ -1,10 +1,229 @@
1-
# Documentation Index
2-
3-
- [Getting Started](getting-started.md)
4-
- [API Reference](api-reference.md)
5-
- [Selector Reference](selectors.md)
6-
- [Performance Guide](performance.md)
7-
- [Malformed HTML Guidance](malformed-html-guidance.md)
8-
- [Conformance Notes](conformance.md)
9-
- [Architecture](architecture.md)
10-
- [Troubleshooting](troubleshooting.md)
1+
# htmlparser Manual
2+
3+
This is the single source of truth for library usage, behavior contracts, performance workflow, and implementation notes.
4+
5+
## Table of Contents
6+
7+
- [Requirements](#requirements)
8+
- [Quick Start](#quick-start)
9+
- [Core API](#core-api)
10+
- [Selector Support](#selector-support)
11+
- [Mode Guidance](#mode-guidance)
12+
- [Performance and Benchmarks](#performance-and-benchmarks)
13+
- [Conformance Status](#conformance-status)
14+
- [Architecture](#architecture)
15+
- [Troubleshooting](#troubleshooting)
16+
17+
## Requirements
18+
19+
- Zig `0.15.2`
20+
- Mutable input buffers (`[]u8`) for parsing
21+
22+
## Quick Start
23+
24+
```zig
25+
const std = @import("std");
26+
const html = @import("htmlparser");
27+
const options: html.ParseOptions = .{};
28+
const Document = options.GetDocument();
29+
30+
test "basic parse + query" {
31+
var doc = Document.init(std.testing.allocator);
32+
defer doc.deinit();
33+
34+
var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
35+
try doc.parse(&input, .{});
36+
37+
const a = doc.queryOne("div#app > a.nav") orelse return error.TestUnexpectedResult;
38+
try std.testing.expectEqualStrings("/docs", a.getAttributeValue("href").?);
39+
}
40+
```
41+
42+
Canonical examples live in `examples/` and are verified by `zig build examples-check`.
43+
44+
## Core API
45+
46+
### `Document` factory and lifecycle
47+
48+
- `const opts: ParseOptions = .{};`
49+
- `const Document = opts.GetDocument();`
50+
- `Document.init(allocator)`
51+
- `doc.deinit()`
52+
- `doc.clear()`
53+
- `doc.parse(input: []u8, comptime opts: ParseOptions)`
54+
55+
### Query APIs
56+
57+
- Compile-time selectors:
58+
- `doc.queryOne(comptime selector)`
59+
- `doc.queryAll(comptime selector)`
60+
- Runtime selectors:
61+
- `try doc.queryOneRuntime(selector)`
62+
- `try doc.queryAllRuntime(selector)`
63+
- Cached runtime selectors:
64+
- `doc.queryOneCached(&selector)`
65+
- `doc.queryAllCached(&selector)`
66+
- selector created via `try Selector.compileRuntime(allocator, source)`
67+
- Diagnostics:
68+
- `doc.queryOneDebug(comptime selector, report)`
69+
- `try doc.queryOneRuntimeDebug(selector, report)`
70+
71+
### Node APIs
72+
73+
- Navigation:
74+
- `tagName()`
75+
- `parentNode()`
76+
- `firstChild()`
77+
- `lastChild()`
78+
- `nextSibling()`
79+
- `prevSibling()`
80+
- `children()` (borrowed `[]const u32` index view)
81+
- Text:
82+
- `innerText(allocator)` (may return borrowed or allocated)
83+
- `innerTextWithOptions(allocator, TextOptions)`
84+
- `innerTextOwned(allocator)` (always allocated)
85+
- `innerTextOwnedWithOptions(allocator, TextOptions)`
86+
- Attributes:
87+
- `getAttributeValue(name)`
88+
- Scoped queries:
89+
- same query family as `Document` (`queryOne/queryAll`, runtime, cached, debug)
90+
91+
### Additional helpers
92+
93+
- `doc.html()`, `doc.head()`, `doc.body()`
94+
- `doc.isOwned(slice)` to check whether a returned slice points into document source bytes
95+
96+
### Options
97+
98+
- `ParseOptions`
99+
- `eager_child_views: bool = true`
100+
- `drop_whitespace_text_nodes: bool = false`
101+
- `TextOptions`
102+
- `normalize_whitespace: bool = true`
103+
104+
### Instrumentation wrappers
105+
106+
- `parseWithHooks(doc, input, opts, hooks)`
107+
- `queryOneRuntimeWithHooks(doc, selector, hooks)`
108+
- `queryOneCachedWithHooks(doc, selector, hooks)`
109+
- `queryAllRuntimeWithHooks(doc, selector, hooks)`
110+
- `queryAllCachedWithHooks(doc, selector, hooks)`
111+
112+
## Selector Support
113+
114+
Supported selectors:
115+
116+
- tag selectors and universal `*`
117+
- `#id`, `.class`
118+
- attributes:
119+
- `[a]`, `[a=v]`, `[a^=v]`, `[a$=v]`, `[a*=v]`, `[a~=v]`, `[a|=v]`
120+
- combinators:
121+
- descendant (`a b`)
122+
- child (`a > b`)
123+
- adjacent sibling (`a + b`)
124+
- general sibling (`a ~ b`)
125+
- grouping: `a, b, c`
126+
- pseudo-classes:
127+
- `:first-child`
128+
- `:last-child`
129+
- `:nth-child(An+B)` with `odd/even` and forms like `3n+1`, `+3n-2`, `-n+6`
130+
- `:not(...)` (simple selector payload)
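
To make the `An+B` forms listed above concrete, here is a minimal sketch of the usual CSS rule: a 1-based sibling position matches when it equals `A*n + B` for some integer `n >= 0`. The function name `matchesNth` is hypothetical and is not part of `htmlparser`; the library's own matcher may be implemented differently.

```zig
const std = @import("std");

/// Hypothetical helper (not an htmlparser API): reports whether the 1-based
/// sibling position `index` equals a*n + b for some integer n >= 0.
fn matchesNth(a: i32, b: i32, index: i32) bool {
    const diff = index - b;
    // With a == 0 the formula selects exactly one position.
    if (a == 0) return diff == 0;
    // Normalize the sign so the divisor is positive; n = num / den must be
    // a non-negative integer.
    const num = if (a < 0) -diff else diff;
    const den = if (a < 0) -a else a;
    return num >= 0 and @rem(num, den) == 0;
}

test "An+B forms from the selector list" {
    // `odd` is 2n+1, `even` is 2n.
    try std.testing.expect(matchesNth(2, 1, 1));
    try std.testing.expect(!matchesNth(2, 1, 2));
    try std.testing.expect(matchesNth(2, 0, 2));
    // `3n+1` selects positions 1, 4, 7, ...
    try std.testing.expect(matchesNth(3, 1, 4));
    // `-n+6` selects the first six siblings only.
    try std.testing.expect(matchesNth(-1, 6, 6));
    try std.testing.expect(!matchesNth(-1, 6, 7));
}
```

Under this rule `+3n-2` selects the same positions as `3n+1`, since both produce 1, 4, 7, and so on.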
131+
132+
Compilation modes:
133+
134+
- comptime selectors fail at compile time when invalid
135+
- runtime selectors return `error.InvalidSelector`
136+
137+
## Mode Guidance
138+
139+
`htmlparser` is permissive by design. Choose parse options per site behavior:
140+
141+
| Mode | Parse Options | Best For | Tradeoffs |
142+
|---|---|---|---|
143+
| `strictest` | `.eager_child_views = true`, `.drop_whitespace_text_nodes = false` | Maximum traversal predictability and text fidelity | More parse-time work |
144+
| `fastest` | `.eager_child_views = false`, `.drop_whitespace_text_nodes = true` | Throughput-first scraping | Whitespace-only text nodes dropped; child views built lazily |
145+
146+
Fallback playbook:
147+
148+
1. Start with `fastest` for bulk workloads.
149+
2. Switch problematic domains to `strictest` if text/navigation assumptions fail.
150+
3. Use `queryOneRuntimeDebug` and inspect `QueryDebugReport` before changing selectors.
151+
152+
## Performance and Benchmarks
153+
154+
Run benchmarks:
155+
156+
```bash
157+
zig build bench-compare
158+
zig build tools -- run-benchmarks --profile quick
159+
zig build tools -- run-benchmarks --profile stable
160+
```
161+
162+
Artifacts:
163+
164+
- `bench/results/latest.md`
165+
- `bench/results/latest.json`
166+
167+
Notes:
168+
169+
- parse comparisons include `strlen`, `lexbor`, and parse-only `lol-html`
170+
- query parse/match/cached sections benchmark `htmlparser`
171+
- repeated runtime selector workloads should use cached selectors
172+
173+
## Conformance Status
174+
175+
Run conformance suites:
176+
177+
```bash
178+
zig build conformance
179+
# or
180+
zig build tools -- run-external-suites --mode both
181+
```
182+
183+
Report artifact: `bench/results/external_suite_report.json`
184+
185+
Tracked suites:
186+
187+
- selector suites: `nwmatcher`, `qwery_contextual`
188+
- parser suite: html5lib tree-construction compatibility subset
189+
190+
## Architecture
191+
192+
Core modules:
193+
194+
- `src/html/parser.zig`: permissive parse pipeline
195+
- `src/html/scanner.zig`: byte-scanning hot-path helpers
196+
- `src/html/tags.zig`: tag metadata and hash dispatch
197+
- `src/html/attr_inline.zig`: in-place attribute traversal/lazy materialization
198+
- `src/html/entities.zig`: entity decode utilities
199+
- `src/selector/runtime.zig`, `src/selector/compile_time.zig`: selector parsing
200+
- `src/selector/matcher.zig`: selector matching/combinator traversal
201+
202+
Data model highlights:
203+
204+
- `Document` owns source bytes and node/index storage
205+
- nodes are contiguous and linked by indexes for traversal
206+
- attributes are traversed directly from source spans (no heap attr objects)
207+
208+
## Troubleshooting
209+
210+
### Query returns nothing
211+
212+
- validate selector syntax (`queryOneRuntime` returns `error.InvalidSelector`)
213+
- check query scope (`Document` vs scoped `Node`)
214+
- use `queryOneRuntimeDebug` + `QueryDebugReport` for near-miss reasons
215+
216+
### Unexpected `innerText`
217+
218+
- default `innerText` normalizes whitespace
219+
- use `innerTextWithOptions(..., .{ .normalize_whitespace = false })` for raw spacing
220+
- use `innerTextOwned(...)` when you always require allocated output
221+
- use `doc.isOwned(slice)` to check borrowed vs allocated
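
The manual does not spell out what whitespace normalization does. As an illustration only, the sketch below assumes it means collapsing each run of ASCII whitespace into a single space and trimming both ends. `normalizeWhitespace` is a hypothetical helper, not an `htmlparser` function, and the library's actual behavior may differ.

```zig
const std = @import("std");

/// Hypothetical helper (not an htmlparser API): copies `src` into `out`,
/// collapsing runs of ASCII whitespace to one space and trimming both ends.
/// `out` must be at least as long as `src`.
fn normalizeWhitespace(src: []const u8, out: []u8) []u8 {
    std.debug.assert(out.len >= src.len);
    var n: usize = 0;
    var pending_space = false;
    for (src) |c| {
        if (std.ascii.isWhitespace(c)) {
            // Remember a separator only after some text has been written,
            // which drops leading whitespace.
            pending_space = n > 0;
            continue;
        }
        if (pending_space) {
            out[n] = ' ';
            n += 1;
            pending_space = false;
        }
        out[n] = c;
        n += 1;
    }
    // A trailing pending space is never written, which trims the end.
    return out[0..n];
}

test "collapse and trim" {
    var buf: [32]u8 = undefined;
    const out = normalizeWhitespace("  Docs \n\t home  ", &buf);
    try std.testing.expectEqualStrings("Docs home", out);
}
```

If the text you get back looks like the input to this function rather than its output, normalization is probably disabled for that call.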
222+
223+
### Runtime iterator invalidation
224+
225+
`queryAllRuntime` iterators are invalidated by newer `queryAllRuntime` calls on the same `Document`.
226+
227+
### Input buffer changed
228+
229+
Expected behavior: parsing and lazy decode paths mutate source bytes in place.

docs/api-reference.md

Lines changed: 0 additions & 96 deletions
This file was deleted.
