Commit fc040b3

Address html parser report
1 parent bbd59b9 commit fc040b3

14 files changed

Lines changed: 914 additions & 15 deletions

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -6,6 +6,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

 ## [Unreleased]

+### Compatibility
+
+- Impact: Non-breaking
+- Migration: Not required
+- Downstream scope: Small
+
 ### Added

 - Initial OSS release documentation set.

README.md

Lines changed: 21 additions & 0 deletions
@@ -27,6 +27,8 @@ Typical HTML parsers optimize for browser-like behavior, strict correctness, or
 - **Navigation:** `parentNode`, `firstChild`, `lastChild`, `nextSibling`, `prevSibling`, `children` (element-only).
 - **In-place attributes:** attribute values are materialized/decoded lazily and cached in-place.
 - **Configurable parse work:** eager/lazy child views and optional whitespace-text dropping.
+- **Opt-in diagnostics:** `queryOneDebug` / `queryOneRuntimeDebug` expose near-miss reasons without changing hot-path APIs.
+- **Opt-in instrumentation:** compile-time hook wrappers for parse/query timing and node-count stats.

 Target Zig version: `0.15.2`.

@@ -96,6 +98,16 @@ const sel = try html.Selector.compileRuntime(arena.allocator(), "a[href^=https]"
 const node = doc.queryOneCompiled(&sel);
 ```

+### Debug query diagnostics
+
+```zig
+var report: html.QueryDebugReport = .{};
+const node = try doc.queryOneRuntimeDebug("a[href^=https]", &report);
+if (node == null) {
+    // Inspect report.visited_elements and report.near_misses
+}
+```
+
 ## Parse Option Recipes

 Two bundles are used by the benchmark harness and conformance runner:
@@ -112,6 +124,8 @@ Two bundles are used by the benchmark harness and conformance runner:

 `children()` returns a borrowed `[]const u32` index slice into the document's node array.

+See `docs/malformed-html-guidance.md` for a mode matrix and fallback workflow on malformed pages.
+
 ## Selector Support (v1)

 Supported (intentionally limited scope):
@@ -163,6 +177,13 @@ zig build conformance

 See `docs/conformance.md` for what’s covered and what’s intentionally out of scope.

+## Migration Notes
+
+- `CHANGELOG.md` now includes compatibility labels in `Unreleased`:
+  - `Impact: Breaking|Non-breaking`
+  - `Migration: Required|Not required`
+  - `Downstream scope: Small|Medium|Large`
+
 ## Documentation

 See `docs/README.md`.

bench/README.md

Lines changed: 2 additions & 2 deletions
@@ -64,5 +64,5 @@ The benchmark output also includes a hard gate table:

 - `PASS/FAIL: ours-fastest > lol-html` per fixture
 - strict-mode regression checks against baseline:
-  - parse throughput not worse than -3% per fixture
-  - query parse/match/compiled not worse than -2%
+  - parse/query throughput not worse than -1%
+  - each failing case is automatically rerun 3 times and judged by rerun median
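
One way to read the revised gate, as a minimal sketch: a case passes when its throughput is at least 99% of baseline, and a case that fails its first run is judged on the median of three reruns. The names `max_regression`, `medianOf3`, `withinGate`, and `judge` are hypothetical and do not come from the benchmark harness, which may compute the gate differently.

```zig
const std = @import("std");

/// Assumed threshold: throughput may drop by at most 1% relative to baseline.
const max_regression: f64 = 0.01;

/// Median of exactly three samples, computed with min/max only.
fn medianOf3(a: f64, b: f64, c: f64) f64 {
    return @max(@min(a, b), @min(@max(a, b), c));
}

/// True when `throughput` stays within the allowed regression of `baseline`.
fn withinGate(baseline: f64, throughput: f64) bool {
    return throughput >= baseline * (1.0 - max_regression);
}

/// A passing first run is accepted as-is; a failing one is judged
/// by the median of three reruns.
fn judge(baseline: f64, first_run: f64, reruns: [3]f64) bool {
    if (withinGate(baseline, first_run)) return true;
    return withinGate(baseline, medianOf3(reruns[0], reruns[1], reruns[2]));
}

pub fn main() void {
    // Illustrative numbers only: a noisy first run followed by three reruns.
    const pass = judge(1000.0, 985.0, .{ 992.0, 987.0, 995.0 });
    std.debug.print("gate: {s}\n", .{if (pass) "PASS" else "FAIL"});
}
```

Judging by the rerun median rather than the best rerun keeps a single lucky sample from masking a real regression.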

docs/README.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 - [API Reference](api-reference.md)
 - [Selector Reference](selectors.md)
 - [Performance Guide](performance.md)
+- [Malformed HTML Guidance](malformed-html-guidance.md)
 - [Conformance Notes](conformance.md)
 - [Architecture](architecture.md)
 - [Troubleshooting](troubleshooting.md)

docs/api-reference.md

Lines changed: 14 additions & 0 deletions
@@ -22,6 +22,8 @@ Query entrypoints:
 - `try doc.queryAllRuntime(selector)`
 - `doc.queryOneCompiled(&selector)`
 - `doc.queryAllCompiled(&selector)`
+- `doc.queryOneDebug(comptime selector, report)`
+- `try doc.queryOneRuntimeDebug(selector, report)`

 Helpers:

@@ -56,6 +58,8 @@ Scoped query entrypoints:
 - `try queryAllRuntime(selector)`
 - `queryOneCompiled(&selector)`
 - `queryAllCompiled(&selector)`
+- `queryOneDebug(comptime selector, report)`
+- `try queryOneRuntimeDebug(selector, report)`

 ## `Selector`

@@ -72,6 +76,16 @@ Scoped query entrypoints:

 - `normalize_whitespace: bool = true`

+## Debug/Instrumentation Types
+
+- `QueryDebugReport` and related debug enums (`DebugFailureKind`, `NearMiss`)
+- Wrapper helpers:
+  - `parseWithHooks(doc, input, opts, hooks)`
+  - `queryOneRuntimeWithHooks(doc, selector, hooks)`
+  - `queryOneCompiledWithHooks(doc, selector, hooks)`
+  - `queryAllRuntimeWithHooks(doc, selector, hooks)`
+  - `queryAllCompiledWithHooks(doc, selector, hooks)`
+
 ## Lifetime and Safety Notes

 - Nodes borrow from their owning `Document`.

docs/malformed-html-guidance.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# Malformed HTML Guidance
+
+This parser is permissive by design. For unreliable web HTML, choose parse options based on what your pipeline needs to optimize.
+
+## Mode Matrix
+
+| Mode | Parse Options | Best For | Tradeoffs |
+|---|---|---|---|
+| `strictest` | `.eager_child_views = true`, `.drop_whitespace_text_nodes = false` | Maximum traversal predictability and content fidelity | Higher parse-time work and memory traffic |
+| `fastest` | `.eager_child_views = false`, `.drop_whitespace_text_nodes = true` | Throughput-first extraction and selector-heavy scraping | Whitespace-only text nodes are dropped; child views materialize lazily on first `children()` |
+
+## Practical Fallback Playbook
+
+1. Start in `fastest` for bulk scraping.
+2. If a target site shows unstable text extraction or navigation assumptions, switch that site to `strictest`.
+3. Keep selectors robust:
+4. Prefer anchored selectors (`#id`, stable attrs) over deep sibling chains.
+5. Avoid dependence on whitespace-only text nodes.
+6. Use `queryOneRuntimeDebug` to inspect non-match reasons before changing selectors.
+
+## Example
+
+```zig
+try doc.parse(&input, .{
+    .eager_child_views = false,
+    .drop_whitespace_text_nodes = true,
+});
+
+var report: html.QueryDebugReport = .{};
+const node = try doc.queryOneRuntimeDebug("article .title > a", &report);
+if (node == null) {
+    // Inspect report.near_misses and report.visited_elements
+}
+```
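
The first two playbook steps (default to `fastest`, switch individual sites to `strictest`) could be wired up roughly as follows. This is a sketch under assumptions: `ParseOptions` here is a local stand-in that carries only the two fields from the mode matrix, and `Mode`, `strict_sites`, `modeForSite`, and `optionsFor` are hypothetical names that are not part of the library.

```zig
const std = @import("std");

/// Local stand-in for the library's parse options (assumption):
/// only the two fields named in the mode matrix.
const ParseOptions = struct {
    eager_child_views: bool,
    drop_whitespace_text_nodes: bool,
};

/// The two modes from the matrix.
const Mode = enum { fastest, strictest };

/// Maps a mode to the option bundle listed in the matrix.
fn optionsFor(mode: Mode) ParseOptions {
    return switch (mode) {
        .fastest => .{ .eager_child_views = false, .drop_whitespace_text_nodes = true },
        .strictest => .{ .eager_child_views = true, .drop_whitespace_text_nodes = false },
    };
}

/// Hypothetical list of hosts that showed unstable extraction in `fastest`.
const strict_sites = [_][]const u8{"legacy.example.com"};

/// Defaults to `fastest`; listed hosts fall back to `strictest`.
fn modeForSite(host: []const u8) Mode {
    for (strict_sites) |site| {
        if (std.mem.eql(u8, site, host)) return .strictest;
    }
    return .fastest;
}

pub fn main() void {
    const mode = modeForSite("legacy.example.com");
    const opts = optionsFor(mode);
    std.debug.print("{s}: eager={} drop_ws={}\n", .{
        @tagName(mode),
        opts.eager_child_views,
        opts.drop_whitespace_text_nodes,
    });
}
```

If the real parse call requires compile-time options, as `parseWithHooks` in this commit does, the runtime `Mode` would need to be dispatched through a `switch` with one parse call per branch rather than passed as a runtime value.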

docs/performance.md

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ Output artifacts:
 - Parse throughput is benchmarked against `strlen`, `lexbor`, and parse-only `lol-html`.
 - Query parse/match sections are measured on `htmlparser` only.
 - For repeated runtime selector workloads, prefer `Selector.compileRuntime` + `query*Compiled` APIs.
+- For malformed pages and fallback strategy guidance, see `docs/malformed-html-guidance.md`.

 ## Methodology

docs/troubleshooting.md

Lines changed: 5 additions & 0 deletions
@@ -7,6 +7,7 @@ Checklist:
 - Verify selector syntax (`queryOneRuntime` can surface `error.InvalidSelector`).
 - Selectors are matched case-insensitively against tag/attribute names by default.
 - Confirm you are querying the expected scope (`Document` vs `Node` scoped queries).
+- For selector diagnosis details (visited count + near misses), use `queryOneRuntimeDebug(...)` and inspect `QueryDebugReport`.

 ## Missing or Unexpected `innerText`

@@ -25,6 +26,10 @@ Checklist:

 This is expected. Parsing and lazy attr/entity decode mutate the source buffer in place.

+## Malformed Page Behavior
+
+Use `docs/malformed-html-guidance.md` for mode selection (`strictest` vs `fastest`) and fallback workflow.
+
 ## Example Drift Policy

 All user-facing snippets must be backed by code under `examples/` and verified via:

src/debug/instrumentation.zig

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
+const std = @import("std");
+const ast = @import("../selector/ast.zig");
+const ParseOptions = @import("../html/document.zig").ParseOptions;
+
+pub const QueryInstrumentationKind = enum(u8) {
+    one_runtime,
+    one_compiled,
+    all_runtime,
+    all_compiled,
+};
+
+pub const ParseInstrumentationStats = struct {
+    elapsed_ns: u64,
+    input_len: usize,
+    node_count: usize,
+};
+
+pub const QueryInstrumentationStats = struct {
+    elapsed_ns: u64,
+    selector_len: usize,
+    kind: QueryInstrumentationKind,
+    matched: ?bool = null,
+};
+
+fn elapsedNs(start: i128, finish: i128) u64 {
+    if (finish <= start) return 0;
+    return @intCast(finish - start);
+}
+
+fn HookDeclType(comptime H: type) type {
+    return switch (@typeInfo(H)) {
+        .pointer => |p| p.child,
+        else => H,
+    };
+}
+
+fn matchedFromValue(value: anytype) ?bool {
+    return switch (@typeInfo(@TypeOf(value))) {
+        .optional => value != null,
+        else => true,
+    };
+}
+
+pub fn parseWithHooks(doc: anytype, input: []u8, comptime opts: ParseOptions, hooks: anytype) !void {
+    if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onParseStart")) {
+        hooks.onParseStart(input.len);
+    }
+
+    const start = std.time.nanoTimestamp();
+    try doc.parse(input, opts);
+    const stats: ParseInstrumentationStats = .{
+        .elapsed_ns = elapsedNs(start, std.time.nanoTimestamp()),
+        .input_len = input.len,
+        .node_count = doc.nodes.items.len,
+    };
+
+    if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onParseEnd")) {
+        hooks.onParseEnd(stats);
+    }
+}
+
+pub fn queryOneRuntimeWithHooks(doc: anytype, selector: []const u8, hooks: anytype) @TypeOf(doc.queryOneRuntime(selector)) {
+    if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryStart")) {
+        hooks.onQueryStart(.one_runtime, selector.len);
+    }
+
+    const start = std.time.nanoTimestamp();
+    const out = doc.queryOneRuntime(selector);
+    if (out) |value| {
+        if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryEnd")) {
+            hooks.onQueryEnd(QueryInstrumentationStats{
+                .elapsed_ns = elapsedNs(start, std.time.nanoTimestamp()),
+                .selector_len = selector.len,
+                .kind = .one_runtime,
+                .matched = matchedFromValue(value),
+            });
+        }
+        return value;
+    } else |err| {
+        if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryEnd")) {
+            hooks.onQueryEnd(QueryInstrumentationStats{
+                .elapsed_ns = elapsedNs(start, std.time.nanoTimestamp()),
+                .selector_len = selector.len,
+                .kind = .one_runtime,
+                .matched = null,
+            });
+        }
+        return err;
+    }
+}
+
+pub fn queryOneCompiledWithHooks(doc: anytype, sel: *const ast.Selector, hooks: anytype) @TypeOf(doc.queryOneCompiled(sel)) {
+    if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryStart")) {
+        hooks.onQueryStart(.one_compiled, sel.source.len);
+    }
+
+    const start = std.time.nanoTimestamp();
+    const value = doc.queryOneCompiled(sel);
+    if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryEnd")) {
+        hooks.onQueryEnd(QueryInstrumentationStats{
+            .elapsed_ns = elapsedNs(start, std.time.nanoTimestamp()),
+            .selector_len = sel.source.len,
+            .kind = .one_compiled,
+            .matched = matchedFromValue(value),
+        });
+    }
+    return value;
+}
+
+pub fn queryAllRuntimeWithHooks(doc: anytype, selector: []const u8, hooks: anytype) @TypeOf(doc.queryAllRuntime(selector)) {
+    if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryStart")) {
+        hooks.onQueryStart(.all_runtime, selector.len);
+    }
+
+    const start = std.time.nanoTimestamp();
+    const out = doc.queryAllRuntime(selector);
+    if (out) |iter| {
+        if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryEnd")) {
+            hooks.onQueryEnd(QueryInstrumentationStats{
+                .elapsed_ns = elapsedNs(start, std.time.nanoTimestamp()),
+                .selector_len = selector.len,
+                .kind = .all_runtime,
+                .matched = null,
+            });
+        }
+        return iter;
+    } else |err| {
+        if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryEnd")) {
+            hooks.onQueryEnd(QueryInstrumentationStats{
+                .elapsed_ns = elapsedNs(start, std.time.nanoTimestamp()),
+                .selector_len = selector.len,
+                .kind = .all_runtime,
+                .matched = null,
+            });
+        }
+        return err;
+    }
+}
+
+pub fn queryAllCompiledWithHooks(doc: anytype, sel: *const ast.Selector, hooks: anytype) @TypeOf(doc.queryAllCompiled(sel)) {
+    if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryStart")) {
+        hooks.onQueryStart(.all_compiled, sel.source.len);
+    }
+
+    const start = std.time.nanoTimestamp();
+    const iter = doc.queryAllCompiled(sel);
+    if (comptime @hasDecl(HookDeclType(@TypeOf(hooks)), "onQueryEnd")) {
+        hooks.onQueryEnd(QueryInstrumentationStats{
+            .elapsed_ns = elapsedNs(start, std.time.nanoTimestamp()),
+            .selector_len = sel.source.len,
+            .kind = .all_compiled,
+            .matched = null,
+        });
+    }
+    return iter;
+}
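
The wrappers above call a hook only when the hook type declares it, checked at compile time with `@hasDecl`. The following self-contained sketch isolates that pattern so it can be tried without the library. `HookDeclType` mirrors the helper in the diff; `fakeParse`, `parseWithOptionalHooks`, and `EndOnly` are hypothetical stand-ins, and the end hook here receives a plain length rather than a stats struct.

```zig
const std = @import("std");

/// Mirrors `HookDeclType` from the diff: look through a pointer
/// so that `&hooks` and `hooks` resolve to the same struct type.
fn HookDeclType(comptime H: type) type {
    return switch (@typeInfo(H)) {
        .pointer => |p| p.child,
        else => H,
    };
}

/// Hypothetical stand-in for a parse call; it only reports the input length.
fn fakeParse(input: []const u8) usize {
    return input.len;
}

/// Calls `onParseStart` / `onParseEnd` only when the hook type declares them.
/// A missing hook costs nothing: the branch is removed at compile time.
fn parseWithOptionalHooks(input: []const u8, hooks: anytype) usize {
    const H = HookDeclType(@TypeOf(hooks));
    if (comptime @hasDecl(H, "onParseStart")) hooks.onParseStart(input.len);
    const n = fakeParse(input);
    if (comptime @hasDecl(H, "onParseEnd")) hooks.onParseEnd(n);
    return n;
}

/// Hook set that implements only the end callback.
const EndOnly = struct {
    calls: usize = 0,
    last_len: usize = 0,

    pub fn onParseEnd(self: *EndOnly, len: usize) void {
        self.calls += 1;
        self.last_len = len;
    }
};

pub fn main() void {
    var hooks: EndOnly = .{};
    _ = parseWithOptionalHooks("<p>hi</p>", &hooks);
    std.debug.print("calls={d} last_len={d}\n", .{ hooks.calls, hooks.last_len });
}
```

Passing `&hooks` lets the callbacks mutate the hook state, which is why the helper unwraps the pointer type before calling `@hasDecl`.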
