This is the canonical manual for usage, API, selector behavior, performance workflow, conformance expectations, and internals.
- Requirements
- Quick Start
- Core API
- Non-Destructive Parsing
- Selector Support
- Mode Guidance
- Performance and Benchmarks
- Latest Benchmark Snapshot
- Conformance Status
- Architecture
- Troubleshooting
- Zig `0.16.0-dev.3013+abd131e33`
- Mutable input buffers (`[]u8`) for destructive parsing
- `[]const u8` inputs are supported when `ParseOptions.non_destructive = true`
const std = @import("std");
const html = @import("html");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();
test "basic parse + query" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();
    var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
    try doc.parse(&input, .{});
    const a = doc.queryOne("div#app > a.nav") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/docs", a.getAttributeValue("href").?);
}

Source examples:
- `examples/basic_parse_query.zig`
- `examples/query_time_decode.zig`

All examples are verified by running `zig build examples-check`.
- `const opts: ParseOptions = .{};`
- `const Document = opts.GetDocument();`
- `Document.init(allocator)`
- `doc.deinit()`
- `doc.clear()`
- `doc.parse(input, comptime opts: ParseOptions)`
  - destructive mode accepts mutable input and parses it in place
  - non-destructive mode accepts mutable or read-only input and parses a private shadow copy
  - maximum parseable input size is controlled at build time with `-Dintlen`
- Compile-time selectors:
  - `doc.queryOne(comptime selector)`
  - `doc.queryAll(comptime selector)`
- Runtime selectors:
  - `try doc.queryOneRuntime(selector)`
  - `try doc.queryAllRuntime(selector)`
- Cached runtime selectors:
  - `doc.queryOneCached(selector)`
  - `doc.queryAllCached(selector)`
  - selector created via `try Selector.compileRuntime(allocator, source)`
- Diagnostics:
  - `doc.queryOneDebug(comptime selector)`
  - `doc.queryOneRuntimeDebug(selector)`
  - both return `{ node, report, err }`
- Navigation:
  - `tagName()`
  - `parentNode()`
  - `firstChild()`
  - `lastChild()`
  - `nextSibling()`
  - `prevSibling()`
  - `children()` (iterator of wrapped child nodes; `collect(allocator)` returns an owned `[]Node`)
- Text:
  - `innerText(allocator)` (borrowed or allocated depending on shape)
  - `innerTextWithOptions(allocator, TextOptions)`
  - `innerTextOwned(allocator)` (always allocated)
  - `innerTextOwnedWithOptions(allocator, TextOptions)`
- Attributes:
  - `getAttributeValue(name)`
- Scoped queries:
  - same query family as `Document` (`queryOne`/`queryAll`, runtime, cached, debug)

- `doc.html()`, `doc.head()`, `doc.body()`
- `doc.isOwned(slice)` to check whether a slice points into document source bytes
- `ParseOptions`
  - `drop_whitespace_text_nodes: bool = true`
  - `non_destructive: bool = false`
- build option: `-Dintlen=u16|u32|u64|usize`
  - controls the integer width used for source spans and node indexes
  - too-small widths fail fast with `error.InputTooLarge`
- `TextOptions`
  - `normalize_whitespace: bool = true`
- parse/query work split:
  - parse keeps raw text and attribute spans in-place
  - entity decode and whitespace normalization are applied by query-time APIs (`getAttributeValue`, `innerText*`, selector attribute predicates)
- destructive parsing is the default because the parser and lazy decode paths mutate source bytes in place for throughput
- non-destructive parsing pays for one private writable shadow copy per parse so the in-place parser core does not need a separate slow path
- nodes are stored in one contiguous array and linked by indexes rather than pointers to keep traversal cache-friendly and make `-Dintlen` effective
- attribute storage stays span-based instead of building heap objects so parse cost scales with actual queries, not attribute count
- query-time decoding keeps parse throughput high by avoiding eager entity decode and whitespace normalization for bytes that may never be read
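
The index-linked node layout described above can be illustrated with a small standalone sketch. This is not the library's actual node type: `Index`, `none`, `Node`, and `countChildren` are hypothetical names, and `Index` only stands in for whatever width `-Dintlen` selects.

```zig
const std = @import("std");

// Hypothetical index width; stands in for the width chosen by -Dintlen.
const Index = u16;
// Sentinel meaning "no node" (an assumption made for this sketch).
const none: Index = std.math.maxInt(Index);

// Hypothetical node layout: links are indexes into one array, not pointers.
const Node = struct {
    parent: Index = none,
    first_child: Index = none,
    next_sibling: Index = none,
};

// Count direct children by following index links through the array.
fn countChildren(nodes: []const Node, parent: Index) usize {
    var count: usize = 0;
    var cur = nodes[parent].first_child;
    while (cur != none) : (cur = nodes[cur].next_sibling) count += 1;
    return count;
}

test "index-linked nodes" {
    // <div><a/><b/></div> laid out contiguously: 0 = div, 1 = a, 2 = b.
    const nodes = [_]Node{
        .{ .first_child = 1 },
        .{ .parent = 0, .next_sibling = 2 },
        .{ .parent = 0 },
    };
    try std.testing.expectEqual(@as(usize, 2), countChildren(&nodes, 0));
    try std.testing.expectEqual(@as(usize, 0), countChildren(&nodes, 1));
    // Three Index-sized links per node, so a narrower Index shrinks the array.
    try std.testing.expectEqual(@as(usize, 3 * @sizeOf(Index)), @sizeOf(Node));
}
```

Because every link is an `Index`, changing that one type resizes all links at once, which is the property the `-Dintlen` bullet relies on.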
Use non-destructive mode when the caller bytes must remain unchanged.
const html_bytes = "<div id='x' data-v='a&b'> hi & bye </div>";
try doc.parse(html_bytes, .{ .non_destructive = true });

Behavior:
- the default destructive path is unchanged and still parses caller memory directly
- non-destructive mode allocates one writable shadow buffer per `parse` call
- lazy decode and normalization mutate only the shadow buffer, never the caller bytes
- `Document.writeHtml` and `Document.format` return the exact original source bytes in non-destructive mode
- node-level formatting still serializes from parsed state rather than replaying original source slices
Use cases:
- parsing file-backed memory maps
- preserving original bytes for hashing, diffing, or cache keys
- running parser queries without allowing in-place mutation of shared buffers
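
A minimal sketch of checking the byte-preservation behavior described above, using only calls that appear in this manual. It assumes the `Document` type is derived from the same `ParseOptions` value that is passed to `parse`, and that `parse` accepts a pointer to a mutable array in non-destructive mode; the test name and the `snapshot` variable are illustrative.

```zig
const std = @import("std");
const html = @import("html");

// Assumption: the Document type and the parse call share one options value.
const options: html.ParseOptions = .{ .non_destructive = true };
const Document = options.GetDocument();

test "non-destructive parse leaves caller bytes untouched" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<div id='x' data-v='a&b'> hi & bye </div>".*;
    // Arrays copy by value, so this is an independent snapshot of the bytes.
    const snapshot = input;

    try doc.parse(&input, options);

    // Trigger the query-time decode path; it should only touch the shadow copy.
    const div = doc.queryOne("div#x") orelse return error.TestUnexpectedResult;
    _ = div.getAttributeValue("data-v");

    try std.testing.expectEqualStrings(&snapshot, &input);
}
```

The same comparison against a destructive parse is not expected to hold, since that path is documented to mutate source bytes in place.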
- `parseWithHooks(doc, input, opts, hooks)`
- `queryOneRuntimeWithHooks(doc, selector, hooks)`
- `queryOneCachedWithHooks(doc, selector, hooks)`
- `queryAllRuntimeWithHooks(doc, selector, hooks)`
- `queryAllCachedWithHooks(doc, selector, hooks)`
Supported selectors:
- tag selectors and universal `*`
- `#id`, `.class`
- attributes: `[a]`, `[a=v]`, `[a^=v]`, `[a$=v]`, `[a*=v]`, `[a~=v]`, `[a|=v]`
- combinators:
  - descendant (`a b`)
  - child (`a > b`)
  - adjacent sibling (`a + b`)
  - general sibling (`a ~ b`)
- grouping: `a, b, c`
- pseudo-classes:
  - `:first-child`
  - `:last-child`
  - `:nth-child(An+B)` with `odd`/`even` and forms like `3n+1`, `+3n-2`, `-n+6`
  - `:not(...)` (simple selector payload)
- parser guardrails:
  - multiple `#id` predicates in one compound (for example `#a#b`) are rejected as invalid
Compilation modes:
- comptime selectors fail at compile time when invalid
- runtime selectors return `error.InvalidSelector`
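
A short sketch exercising a few of the selector forms listed above, plus the runtime error path. The markup and test name are made up for illustration; the calls are the ones documented in this manual, and the expectation that `#a#b` surfaces as `error.InvalidSelector` at runtime follows from the guardrail and compilation-mode notes rather than from a verified run.

```zig
const std = @import("std");
const html = @import("html");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "selector forms and invalid runtime selector" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<ul id='menu'><li class='item'><a href='/a'>A</a></li><li class='item'><a href='/b'>B</a></li></ul>".*;
    try doc.parse(&input, .{});

    // Child combinator with :first-child.
    const first = doc.queryOne("ul#menu > li:first-child > a") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/a", first.getAttributeValue("href").?);

    // Adjacent sibling combinator reaches the second list item.
    const second = doc.queryOne("li.item + li.item > a[href]") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/b", second.getAttributeValue("href").?);

    // Two #id predicates in one compound are documented as invalid.
    try std.testing.expectError(error.InvalidSelector, doc.queryOneRuntime("#a#b"));
}
```

The same invalid selector passed to `queryOne` would be rejected at compile time instead, so it cannot appear in a test that is expected to build.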
html is permissive by design. Choose parse options by workload:
| Mode | Parse Options | Best For | Tradeoffs |
|---|---|---|---|
| `strictest` | `.drop_whitespace_text_nodes = false` | traversal predictability and text fidelity | keeps whitespace-only text nodes |
| `fastest` | `.drop_whitespace_text_nodes = true` | throughput-first scraping | whitespace-only text nodes dropped |
| `non-destructive` | `.non_destructive = true` plus either profile above | preserving input bytes, memory maps, exact whole-document formatting | one full input copy per parse |
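
The profiles in the table can be written down as named option values. This is a sketch: the constant names `strictest`, `fastest`, and `preserving` are illustrative and not part of the library, and it assumes the `Document` type is derived from the same options value passed to `parse`.

```zig
const std = @import("std");
const html = @import("html");

// Illustrative profile constants mirroring the table rows.
const strictest: html.ParseOptions = .{ .drop_whitespace_text_nodes = false };
const fastest: html.ParseOptions = .{ .drop_whitespace_text_nodes = true };
// Non-destructive combined with the strictest profile.
const preserving: html.ParseOptions = .{
    .drop_whitespace_text_nodes = false,
    .non_destructive = true,
};

test "parse with the fastest profile" {
    const Document = fastest.GetDocument();
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<p> <b>x</b> </p>".*;
    try doc.parse(&input, fastest);

    // Element structure is queryable regardless of whitespace handling.
    try std.testing.expect(doc.queryOne("p > b") != null);
}
```

Swapping `fastest` for `strictest` or `preserving` in both places switches the profile without touching the query code.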
Fallback playbook:
- Start with `fastest` for bulk workloads.
- Move unstable domains to `strictest`.
- Use `queryOneRuntimeDebug` and `QueryDebugReport` before changing selectors.
Run benchmarks:
zig build bench-compare
zig build tools -- run-benchmarks --profile quick
zig build tools -- run-benchmarks --profile stable

Artifacts:
- `bench/results/latest.md`
- `bench/results/latest.json`
Benchmark policy:
- parse comparisons include `strlen`, `lexbor`, and parse-only `lol-html`
- query parse/match/cached sections benchmark `html`
- repeated runtime selector workloads should use cached selectors
Warning: throughput numbers are not conformance claims. This parser is permissive by design; see Conformance Status.
Source: bench/results/latest.json (stable profile).
| Fixture | ours | lol-html | lexbor |
|---|---|---|---|
| `rust-lang.html` | 2257.75 | 1470.03 | 216.04 |
| `wiki-html.html` | 1785.43 | 1166.84 | 256.33 |
| `mdn-html.html` | 3010.28 | 1792.72 | 393.27 |
| `w3-html52.html` | 992.86 | 728.29 | 188.38 |
| `hn.html` | 1530.30 | 855.35 | 210.08 |
| `python-org.html` | 2035.85 | 1280.26 | 270.66 |
| `kernel-org.html` | 1976.02 | 1282.78 | 277.88 |
| `gnu-org.html` | 2406.24 | 1401.98 | 302.45 |
| `ziglang-org.html` | 2017.94 | 1220.23 | 279.36 |
| `ziglang-doc-master.html` | 1378.79 | 998.22 | 218.19 |
| `wikipedia-unicode-list.html` | 1609.85 | 1039.85 | 217.28 |
| `whatwg-html-spec.html` | 1323.46 | 851.84 | 215.36 |
| `synthetic-forms.html` | 1379.57 | 728.61 | 181.08 |
| `synthetic-table-grid.html` | 1081.59 | 678.24 | 162.10 |
| `synthetic-list-nested.html` | 1184.73 | 618.23 | 155.30 |
| `synthetic-comments-doctype.html` | 1796.81 | 885.58 | 213.86 |
| `synthetic-template-rich.html` | 894.29 | 449.64 | 137.91 |
| `synthetic-whitespace-noise.html` | 1433.10 | 1004.43 | 179.22 |
| `synthetic-news-feed.html` | 1140.39 | 615.10 | 150.06 |
| `synthetic-ecommerce.html` | 1050.46 | 598.30 | 154.63 |
| `synthetic-forum-thread.html` | 1182.01 | 606.82 | 154.11 |

| Case | ours ops/s | ours ns/op |
|---|---|---|
| `attr-heavy-button` | 156194.25 | 6402.28 |
| `attr-heavy-nav` | 83862.90 | 11924.22 |

| Case | ours ops/s | ours ns/op |
|---|---|---|
| `attr-heavy-button` | 157255.50 | 6359.08 |
| `attr-heavy-nav` | 112678.84 | 8874.78 |

| Selector case | Ops/s | ns/op |
|---|---|---|
| `simple` | 9998220.32 | 100.02 |
| `complex` | 4751822.53 | 210.45 |
| `grouped` | 6086328.27 | 164.30 |

For full per-parser, per-fixture tables and gate output:
- `bench/results/latest.md`
- `bench/results/latest.json`
Run conformance suites:
zig build conformance
# or
zig build tools -- run-external-suites --mode both

Artifact: `bench/results/external_suite_report.json`
Tracked suites:
- selector suites: `nwmatcher`, `qwery_contextual`
- parser suites:
  - html5lib tree-construction subset
  - WHATWG HTML parsing corpus (via WPT `html/syntax/parsing/html5lib_*.html`)
Fetched suite repos are cached under bench/.cache/suites/ (gitignored).
Core modules:
- `src/html/parser.zig`: permissive parse pipeline
- `src/html/scanner.zig`: byte-scanning hot-path helpers
- `src/html/tags.zig`: tag metadata and hash dispatch
- `src/html/attr_inline.zig`: in-place attribute traversal/lazy materialization
- `src/html/entities.zig`: entity decode utilities
- `src/selector/runtime.zig`, `src/selector/compile_time.zig`: selector parsing
- `src/selector/matcher.zig`: selector matching/combinator traversal
Data model highlights:
- `Document` always owns node/index storage and may either borrow caller bytes directly or own a shadow parse buffer
- nodes are contiguous and linked by indexes for traversal
- attributes are traversed directly from source spans (no heap attribute objects)
- the build-time `-Dintlen` option widens or shrinks those spans and indexes uniformly
- destructive mode is the performance baseline; non-destructive mode exists as an opt-in isolation boundary
- validate selector syntax (`queryOneRuntime` can return `error.InvalidSelector`)
- check scope (`Document` vs scoped `Node`)
- use `queryOneRuntimeDebug` and inspect `QueryDebugReport`

- default `innerText` normalizes whitespace
- use `innerTextWithOptions(..., .{ .normalize_whitespace = false })` for raw spacing
- use `innerTextOwned(...)` when output must always be allocated
- use `doc.isOwned(slice)` to check borrowed vs allocated
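
The text options above can be combined as in the following sketch. It assumes the `innerText*` calls return an error union over a byte slice and that allocated results are released with the same allocator; neither detail is spelled out in this manual, so treat the `try` and `free` calls as assumptions.

```zig
const std = @import("std");
const html = @import("html");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "raw spacing and borrowed vs allocated text" {
    const alloc = std.testing.allocator;
    var doc = Document.init(alloc);
    defer doc.deinit();

    var input = "<p id='t'>  hello   world  </p>".*;
    try doc.parse(&input, .{});
    const p = doc.queryOne("p#t") orelse return error.TestUnexpectedResult;

    // Raw spacing: the result may be borrowed from document bytes or allocated.
    const raw = try p.innerTextWithOptions(alloc, .{ .normalize_whitespace = false });
    // Free only when the slice does not point into document source bytes.
    defer if (!doc.isOwned(raw)) alloc.free(raw);

    // Always allocated, so the caller always frees it.
    const owned = try p.innerTextOwned(alloc);
    defer alloc.free(owned);

    try std.testing.expect(!doc.isOwned(owned));
}
```

Checking `doc.isOwned` before freeing keeps one code path correct for both the borrowed and the allocated shape of `innerText` results.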
queryAllRuntime iterators are invalidated by newer queryAllRuntime calls on the same Document.
Expected: parse and lazy decode paths mutate source bytes in place.
If the bytes must not change, parse with .non_destructive = true.