html Documentation

This is the canonical manual for usage, API, selector behavior, performance workflow, conformance expectations, and internals.

Requirements
Quick Start
Core API
Non-Destructive Parsing
Selector Support
Mode Guidance
Performance and Benchmarks
Latest Benchmark Snapshot
Conformance Status
Architecture
Troubleshooting

Requirements

Zig 0.16.0-dev.3013+abd131e33
Mutable input buffers ([]u8) for destructive parsing
[]const u8 inputs are supported when ParseOptions.non_destructive = true

Quick Start

const std = @import("std");
const html = @import("html");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "basic parse + query" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
    try doc.parse(&input, .{});

    const a = doc.queryOne("div#app > a.nav") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/docs", a.getAttributeValue("href").?);
}

Source examples:

examples/basic_parse_query.zig
examples/query_time_decode.zig

All examples are verified by running zig build examples-check

Core API

`Document` factory and lifecycle

const opts: ParseOptions = .{};
const Document = opts.GetDocument();
Document.init(allocator)
doc.deinit()
doc.clear()
doc.parse(input, comptime opts: ParseOptions)
destructive mode accepts mutable input and parses it in place
non-destructive mode accepts mutable or read-only input and parses a private shadow copy
maximum parseable input size is controlled at build time with -Dintlen

Query APIs

Compile-time selectors:
- doc.queryOne(comptime selector)
- doc.queryAll(comptime selector)
Runtime selectors:
- try doc.queryOneRuntime(selector)
- try doc.queryAllRuntime(selector)
Cached runtime selectors:
- doc.queryOneCached(selector)
- doc.queryAllCached(selector)
- selector created via try Selector.compileRuntime(allocator, source)
Diagnostics:
- doc.queryOneDebug(comptime selector)
- doc.queryOneRuntimeDebug(selector)
- both return { node, report, err }

Node APIs

Navigation:
- tagName()
- parentNode()
- firstChild()
- lastChild()
- nextSibling()
- prevSibling()
- children() (iterator of wrapped child nodes; collect(allocator) returns an owned []Node)
Text:
- innerText(allocator) (borrowed or allocated depending on shape)
- innerTextWithOptions(allocator, TextOptions)
- innerTextOwned(allocator) (always allocated)
- innerTextOwnedWithOptions(allocator, TextOptions)
Attributes:
- getAttributeValue(name)
Scoped queries:
- same query family as Document (queryOne/queryAll, runtime, cached, debug)

Helpers

doc.html(), doc.head(), doc.body()
doc.isOwned(slice) to check whether a slice points into document source bytes

Parse/Text options

ParseOptions
- drop_whitespace_text_nodes: bool = true
- non_destructive: bool = false
build option:
- -Dintlen=u16|u32|u64|usize
- controls the integer width used for source spans and node indexes
- too-small widths fail fast with error.InputTooLarge
TextOptions
- normalize_whitespace: bool = true
parse/query work split:
- parse keeps raw text and attribute spans in-place
- entity decode and whitespace normalization are applied by query-time APIs (getAttributeValue, innerText*, selector attribute predicates)

Design Notes

destructive parsing is the default because the parser and lazy decode paths mutate source bytes in place for throughput
non-destructive parsing pays for one private writable shadow copy per parse so the in-place parser core does not need a separate slow path
nodes are stored in one contiguous array and linked by indexes rather than pointers to keep traversal cache-friendly and make -Dintlen effective
attribute storage stays span-based instead of building heap objects so parse cost scales with actual queries, not attribute count
query-time decoding keeps parse throughput high by avoiding eager entity decode and whitespace normalization for bytes that may never be read

Non-Destructive Parsing

Use non-destructive mode when the caller bytes must remain unchanged.

const html_bytes = "<div id='x' data-v='a&amp;b'> hi &amp; bye </div>";
try doc.parse(html_bytes, .{ .non_destructive = true });

Behavior:

the default destructive path is unchanged and still parses caller memory directly
non-destructive mode allocates one writable shadow buffer per parse call
lazy decode and normalization mutate only the shadow buffer, never the caller bytes
Document.writeHtml and Document.format return the exact original source bytes in non-destructive mode
node-level formatting still serializes from parsed state rather than replaying original source slices

Use cases:

parsing file-backed memory maps
preserving original bytes for hashing, diffing, or cache keys
running parser queries without allowing in-place mutation of shared buffers

Instrumentation wrappers

parseWithHooks(doc, input, opts, hooks)
queryOneRuntimeWithHooks(doc, selector, hooks)
queryOneCachedWithHooks(doc, selector, hooks)
queryAllRuntimeWithHooks(doc, selector, hooks)
queryAllCachedWithHooks(doc, selector, hooks)

Selector Support

Supported selectors:

tag selectors and universal *
#id, .class
attributes:
- [a], [a=v], [a^=v], [a$=v], [a*=v], [a~=v], [a|=v]
combinators:
- descendant (a b)
- child (a > b)
- adjacent sibling (a + b)
- general sibling (a ~ b)
grouping: a, b, c
pseudo-classes:
- :first-child
- :last-child
- :nth-child(An+B) with odd/even and forms like 3n+1, +3n-2, -n+6
- :not(...) (simple selector payload)
parser guardrails:
- multiple #id predicates in one compound (for example #a#b) are rejected as invalid

Compilation modes:

comptime selectors fail at compile time when invalid
runtime selectors return error.InvalidSelector

Mode Guidance

html is permissive by design. Choose parse options by workload:

Mode	Parse Options	Best For	Tradeoffs
`strictest`	`.drop_whitespace_text_nodes = false`	traversal predictability and text fidelity	keeps whitespace-only text nodes
`fastest`	`.drop_whitespace_text_nodes = true`	throughput-first scraping	whitespace-only text nodes dropped
`non-destructive`	`.non_destructive = true` plus either profile above	preserving input bytes, memory maps, exact whole-document formatting	one full input copy per parse

Fallback playbook:

Start with fastest for bulk workloads.
Move unstable domains to strictest.
Use queryOneRuntimeDebug and QueryDebugReport before changing selectors.

Performance and Benchmarks

Run benchmarks:

zig build bench-compare
zig build tools -- run-benchmarks --profile quick
zig build tools -- run-benchmarks --profile stable

Artifacts:

bench/results/latest.md
bench/results/latest.json

Benchmark policy:

parse comparisons include strlen, lexbor, and parse-only lol-html
query parse/match/cached sections benchmark html
repeated runtime selector workloads should use cached selectors

Latest Benchmark Snapshot

Warning: throughput numbers are not conformance claims. This parser is permissive by design; see Conformance Status.

Source: bench/results/latest.json (stable profile).

Parse Throughput Comparison (MB/s)

Fixture	ours	lol-html	lexbor
`rust-lang.html`	2257.75	1470.03	216.04
`wiki-html.html`	1785.43	1166.84	256.33
`mdn-html.html`	3010.28	1792.72	393.27
`w3-html52.html`	992.86	728.29	188.38
`hn.html`	1530.30	855.35	210.08
`python-org.html`	2035.85	1280.26	270.66
`kernel-org.html`	1976.02	1282.78	277.88
`gnu-org.html`	2406.24	1401.98	302.45
`ziglang-org.html`	2017.94	1220.23	279.36
`ziglang-doc-master.html`	1378.79	998.22	218.19
`wikipedia-unicode-list.html`	1609.85	1039.85	217.28
`whatwg-html-spec.html`	1323.46	851.84	215.36
`synthetic-forms.html`	1379.57	728.61	181.08
`synthetic-table-grid.html`	1081.59	678.24	162.10
`synthetic-list-nested.html`	1184.73	618.23	155.30
`synthetic-comments-doctype.html`	1796.81	885.58	213.86
`synthetic-template-rich.html`	894.29	449.64	137.91
`synthetic-whitespace-noise.html`	1433.10	1004.43	179.22
`synthetic-news-feed.html`	1140.39	615.10	150.06
`synthetic-ecommerce.html`	1050.46	598.30	154.63
`synthetic-forum-thread.html`	1182.01	606.82	154.11

Query Match Throughput (ours)

Case	ours ops/s	ours ns/op
`attr-heavy-button`	156194.25	6402.28
`attr-heavy-nav`	83862.90	11924.22

Cached Query Throughput (ours)

Case	ours ops/s	ours ns/op
`attr-heavy-button`	157255.50	6359.08
`attr-heavy-nav`	112678.84	8874.78

Query Parse Throughput (ours)

Selector case	Ops/s	ns/op
`simple`	9998220.32	100.02
`complex`	4751822.53	210.45
`grouped`	6086328.27	164.30

For full per-parser, per-fixture tables and gate output:

bench/results/latest.md
bench/results/latest.json

Conformance Status

Run conformance suites:

zig build conformance
# or
zig build tools -- run-external-suites --mode both

Artifact: bench/results/external_suite_report.json

Tracked suites:

selector suites: nwmatcher, qwery_contextual
parser suites:
- html5lib tree-construction subset
- WHATWG HTML parsing corpus (via WPT html/syntax/parsing/html5lib_*.html)

Fetched suite repos are cached under bench/.cache/suites/ (gitignored).

Architecture

Core modules:

src/html/parser.zig: permissive parse pipeline
src/html/scanner.zig: byte-scanning hot-path helpers
src/html/tags.zig: tag metadata and hash dispatch
src/html/attr_inline.zig: in-place attribute traversal/lazy materialization
src/html/entities.zig: entity decode utilities
src/selector/runtime.zig, src/selector/compile_time.zig: selector parsing
src/selector/matcher.zig: selector matching/combinator traversal

Data model highlights:

Document always owns node/index storage and may either borrow caller bytes directly or own a shadow parse buffer
nodes are contiguous and linked by indexes for traversal
attributes are traversed directly from source spans (no heap attribute objects)
the build-time -Dintlen option widens or shrinks those spans and indexes uniformly
destructive mode is the performance baseline; non-destructive mode exists as an opt-in isolation boundary

Troubleshooting

Query returns nothing

validate selector syntax (queryOneRuntime can return error.InvalidSelector)
check scope (Document vs scoped Node)
use queryOneRuntimeDebug and inspect QueryDebugReport

Unexpected `innerText`

default innerText normalizes whitespace
use innerTextWithOptions(..., .{ .normalize_whitespace = false }) for raw spacing
use innerTextOwned(...) when output must always be allocated
use doc.isOwned(slice) to check borrowed vs allocated

Runtime iterator invalidation

queryAllRuntime iterators are invalidated by newer queryAllRuntime calls on the same Document.

Input buffer changed

Expected: parse and lazy decode paths mutate source bytes in place.

If the bytes must not change, parse with .non_destructive = true.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

html Documentation

Table of Contents

Requirements

Quick Start

Core API

`Document` factory and lifecycle

Query APIs

Node APIs

Helpers

Parse/Text options

Design Notes

Non-Destructive Parsing

Instrumentation wrappers

Selector Support

Mode Guidance

Performance and Benchmarks

Latest Benchmark Snapshot

Parse Throughput Comparison (MB/s)

Query Match Throughput (ours)

Cached Query Throughput (ours)

Query Parse Throughput (ours)

Conformance Status

Architecture

Troubleshooting

Query returns nothing

Unexpected `innerText`

Runtime iterator invalidation

Input buffer changed

FilesExpand file tree

DOCUMENTATION.md

Latest commit

History

DOCUMENTATION.md

File metadata and controls

html Documentation

Table of Contents

Requirements

Quick Start

Core API

Document factory and lifecycle

Query APIs

Node APIs

Helpers

Parse/Text options

Design Notes

Non-Destructive Parsing

Instrumentation wrappers

Selector Support

Mode Guidance

Performance and Benchmarks

Latest Benchmark Snapshot

Parse Throughput Comparison (MB/s)

Query Match Throughput (ours)

Cached Query Throughput (ours)

Query Parse Throughput (ours)

Conformance Status

Architecture

Troubleshooting

Query returns nothing

Unexpected innerText

Runtime iterator invalidation

Input buffer changed

`Document` factory and lifecycle

Unexpected `innerText`