A fast Node.js HTML parsing add-on built in Rust via napi-rs.
Powered by the zero-copy tl HTML parser and htmd for Markdown conversion, with all batch operations running in parallel courtesy of Rayon.
| Factor | Detail |
|---|---|
| Zero-copy parsing | tl parses HTML without allocating a new string for every node; it borrows slices of the original input |
| Native code | The entire parse/query/convert path runs as compiled Rust with opt-level = 3 and full LTO; no interpreter overhead |
| No serialisation boundary | N-API is called directly; there is no JSON or msgpack round-trip between JS and native code |
| Automatic parallelism | All batch* functions use Rayon's work-stealing thread pool, scaling to every available CPU core with zero boilerplate |
| Stripped release binary | The .node binary has debug symbols stripped and dead code eliminated; the shared library is ~1.4 MB |
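The zero-copy row can be pictured with a small JavaScript analogy. This is only a conceptual sketch, not how tl is implemented: `scanTagSpans` is a hypothetical helper that stores offsets into the input instead of copying substrings, and its naive scan ignores comments and `>` characters inside attribute values.

```javascript
// Conceptual sketch only: record each tag as [start, end) offsets into the
// source string rather than copying its text into a new string.
function scanTagSpans(html) {
  const spans = [];
  let from = 0;
  while (true) {
    const start = html.indexOf("<", from);
    if (start === -1) break;
    const close = html.indexOf(">", start);
    if (close === -1) break; // unterminated tag: stop scanning
    spans.push({ start, end: close + 1 });
    from = close + 1;
  }
  return spans;
}

const source = '<h1 id="main">Hello</h1>';
const tagSpans = scanTagSpans(source);
// Text is materialised only when someone actually asks for it.
console.log(tagSpans.map((s) => source.slice(s.start, s.end)));
// [ '<h1 id="main">', '</h1>' ]
```

Each span costs two numbers regardless of how long the tag is, which is the property the table row describes.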
Rule of thumb: parsing a typical HTML page takes well under 100 µs in a single-threaded call.
Batch-parsing 1 000 pages on an 8-core machine takes roughly the same wall-clock time as parsing 125 of them sequentially.
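The batch figure is plain division under ideal scaling: 1 000 pages spread over 8 workers is 125 pages per worker. The sketch below spells that out. `estimateBatchMicros` is a hypothetical helper, the 100 µs per page is an assumed round number, and the estimate ignores scheduling overhead, file I/O, and uneven page sizes.

```javascript
// Idealised estimate of batch wall-clock time, assuming perfect scaling
// across cores. Hypothetical helper, not part of web-parser.
function estimateBatchMicros(pageCount, perPageMicros, cores) {
  // Each core handles an equal share; the busiest core sets the wall-clock.
  const pagesPerCore = Math.ceil(pageCount / cores);
  return pagesPerCore * perPageMicros;
}

// 1 000 pages at an assumed 100 µs each on 8 cores...
const parallelMicros = estimateBatchMicros(1000, 100, 8);
// ...versus 125 pages parsed one after another on a single core.
const sequentialMicros = estimateBatchMicros(125, 100, 1);
console.log(parallelMicros, sequentialMicros); // 12500 12500
```

Real batches will land somewhat above this bound, since the work is rarely split perfectly evenly.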
Measured on macOS / Intel i7-9750H (12 logical CPUs) with Node.js v22.15.0.
Single-document tests use a real-world 74.5 KB HTML page; batch tests process 200 copies (~4.2 MB total).
Each result is the median of multiple timed iterations after a warm-up pass.
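A minimal sketch of that measurement style (warm-up, repeated timing, median) is shown below. It is an illustration rather than the project's actual benchmark script: `median`, `benchMedian`, and the default `warmup` / `iterations` values are assumptions, and the workload is a stand-in.

```javascript
// Median of a list of numbers (does not mutate the input).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Time `fn` after a warm-up phase and report the median latency.
// `warmup` and `iterations` are illustrative defaults.
function benchMedian(fn, { warmup = 5, iterations = 21 } = {}) {
  for (let i = 0; i < warmup; i++) fn(); // let the JIT and caches settle
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start); // milliseconds
  }
  const medianMs = median(samples);
  return { medianMs, opsPerSec: medianMs > 0 ? 1000 / medianMs : Infinity };
}

// Stand-in workload; with the add-on loaded this could be () => wp.parse(html).
const timing = benchMedian(() => JSON.parse('{"a":[1,2,3]}'));
console.log(timing.medianMs >= 0); // true
```

The median is less sensitive than the mean to the occasional slow iteration caused by garbage collection or other processes.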
| Scenario | web-parser | htmlparser2 | node-html-parser |
|---|---|---|---|
| Parse single doc | 7,800 ops/s (133 µs) | 634 ops/s (12.3× slower) | 424 ops/s (18.4× slower) |
| querySelector | 2,200 ops/s (441 µs) | 651 ops/s (3.3× slower) | 425 ops/s (5.1× slower) |
| querySelectorAll | 1,300 ops/s (745 µs) | 540 ops/s (2.4× slower) | 314 ops/s (4.1× slower) |
| Extract links | 2,000 ops/s (490 µs) | 558 ops/s (3.5× slower) | 302 ops/s (6.5× slower) |
| Batch parse 200 docs | 14 ops/s (68.8 ms) | 9 ops/s (1.6× slower) | 5 ops/s (3.0× slower) |
| Batch parse files 200 | 110 ops/s (9.0 ms) | 56 ops/s (1.9× slower) | 26 ops/s (4.3× slower) |

HTML to Markdown conversion:

| Scenario | web-parser | turndown |
|---|---|---|
| Single doc | 224 ops/s (4.3 ms) | 65 ops/s (3.5× slower) |
| Batch 200 docs | 13 ops/s (74.6 ms) | 1 ops/s (12.1× slower) |
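The "× slower" figures appear to be ratios of throughput. A quick way to recompute one from the ops/s columns is sketched here; `timesSlower` is a hypothetical helper. Because the published ops/s values are rounded, recomputed ratios can differ slightly from the table, most visibly in the low-throughput batch rows.

```javascript
// How many times slower `otherOps` is than `baselineOps`, to one decimal.
function timesSlower(baselineOps, otherOps) {
  return Math.round((baselineOps / otherOps) * 10) / 10;
}

// "Parse single doc" row: web-parser vs htmlparser2 and node-html-parser.
console.log(timesSlower(7800, 634)); // 12.3
console.log(timesSlower(7800, 424)); // 18.4
```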
Run the benchmarks yourself:

npm run benchmark

Requirements:

- Node.js ≥ 18
- Rust ≥ 1.70 (stable) and Cargo (only needed to build from source)
# 1. Clone / enter the project
cd web-parser
# 2. Compile the native release binary
cargo build --release
# 3. Copy the .dylib / .so / .dll to web_parser.node
node scripts/copy-binding.mjs
# 4. Run the demo script
node demo.mjs

Pre-built binaries can be distributed by packaging web_parser.node alongside index.js and index.d.ts.
No Rust toolchain is required at runtime, only at build time.
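Step 3 of the build copies a platform-specific shared library to web_parser.node. The sketch below shows the file-name mapping such a step depends on, assuming Cargo's default naming for a `cdylib` crate and assuming the crate is called `web_parser`. `nativeLibName` and `releaseLibPath` are hypothetical helpers; the real scripts/copy-binding.mjs may work differently.

```javascript
// Map a Node.js platform string to the shared-library file name Cargo emits
// for a cdylib crate. The crate name "web_parser" is an assumption.
function nativeLibName(platform, crate = "web_parser") {
  switch (platform) {
    case "darwin":
      return `lib${crate}.dylib`;
    case "win32":
      return `${crate}.dll`; // Windows builds carry no "lib" prefix
    default:
      return `lib${crate}.so`; // Linux and most other Unix-like targets
  }
}

// Where a build would place the library, relative to the project root.
function releaseLibPath(platform, profile = "release") {
  return `target/${profile}/${nativeLibName(platform)}`;
}

console.log(releaseLibPath("darwin")); // target/release/libweb_parser.dylib
console.log(releaseLibPath("win32", "debug")); // target/debug/web_parser.dll
```

In a real script the first argument would typically be `process.platform`.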
import { createRequire } from "node:module";
const wp = createRequire(import.meta.url)("./index.js");
// ── Parse a string ──────────────────────────────
const doc = wp.parse('<h1 id="main">Hello</h1><p>World</p>');
console.log(doc.title()); // → null
console.log(doc.querySelector("h1").textContent); // → 'Hello'
// ── Parse a file ────────────────────────────────
const page = wp.fromFile("/path/to/page.html");
console.log(page.title()); // → 'My Page'
console.log(page.stats());
// → { tagCount: 42, wordCount: 320, linkCount: 12, imageCount: 3, headingCount: 5, charCount: 1840 }
// ── HTML → Markdown ─────────────────────────────
console.log(
  wp.toMarkdown("<h2>Section</h2><p>Hello <strong>world</strong></p>"),
);
// → ## Section
//
// Hello **world**

parse(html: string): HtmlDocument
parseHtml(html: string): HtmlDocument // alias
fromFile(filePath: string): HtmlDocument
parseFile(filePath: string): HtmlDocument // alias

| Method | Returns | Description |
|---|---|---|
| `source()` | `string` | Raw HTML string this document was built from |
| `title()` | `string \| null` | Content of `<title>`, if present |
| `querySelector(sel)` | `Element \| null` | First element matching the CSS selector |
| `querySelectorAll(sel)` | `Element[]` | All elements matching the CSS selector |
| `getByTag(tag)` | `Element[]` | All elements with the given tag name |
| `textContent()` | `string` | Plain text of the document body |
| `links()` | `LinkInfo[]` | All `<a href>` links |
| `images()` | `ImageInfo[]` | All `<img>` tags |
| `headings()` | `HeadingInfo[]` | All h1–h6 headings in document order |
| `meta()` | `MetaInfo[]` | All `<meta>` tags |
| `stats()` | `DocumentStats` | Tag count, word count, link count, etc. |
| `toMarkdown()` | `string` | Convert the full document to Markdown |
| `toMarkdownWithOptions(skipTags)` | `string` | Convert to Markdown, skipping listed tags (e.g. `"script,style,nav"`) |
| `snapshot()` | `HtmlDocumentData` | Fully serialisable data snapshot (JSON-safe) |
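As one illustration of working with these return values, the sketch below turns a list of HeadingInfo-shaped objects (`level`, `text`, `id`) into an indented Markdown outline. `buildOutline` is a hypothetical helper, and the sample array stands in for what `doc.headings()` would return.

```javascript
// Build an indented Markdown outline from HeadingInfo-shaped objects
// ({ level, text, id }). Sample data stands in for doc.headings().
function buildOutline(headings) {
  return headings
    .map(({ level, text, id }) => {
      const indent = "  ".repeat(Math.max(0, level - 1));
      // Link to the heading only when it has an id to anchor on.
      const label = id ? `[${text}](#${id})` : text;
      return `${indent}- ${label}`;
    })
    .join("\n");
}

const sampleHeadings = [
  { level: 1, text: "Guide", id: "guide" },
  { level: 2, text: "Install", id: null },
  { level: 2, text: "Usage", id: "usage" },
];
console.log(buildOutline(sampleHeadings));
// - [Guide](#guide)
//   - Install
//   - [Usage](#usage)
```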
interface Element {
tagName: string;
textContent: string;
innerHTML: string;
outerHTML: string;
attributes: Record<string, string>;
}

// Parse
batchParse(htmlList: string[]): BatchParseResult[]
batchParseFiles(paths: string[]): BatchParseResult[]
// Markdown conversion
batchHtmlToMarkdown(htmlList: string[]): BatchMarkdownResult[]
batchFilesToMarkdown(paths: string[]): BatchMarkdownResult[]

Every result carries an index matching the original array position, a document or markdown value on success, and an error string on failure, so one bad file never aborts the whole batch.
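Because each result carries its `index`, failures can be traced back to the input that caused them. A minimal sketch of that bookkeeping follows; `partitionResults` is a hypothetical helper, and the sample array is hand-written data in the documented result shape rather than real output.

```javascript
// Split batch results into successes and failures, using `index` to look up
// the input each result came from.
function partitionResults(results, inputs) {
  const ok = [];
  const failed = [];
  for (const r of results) {
    if (r.error) {
      failed.push({ input: inputs[r.index], error: r.error });
    } else {
      ok.push({ input: inputs[r.index], document: r.document });
    }
  }
  return { ok, failed };
}

// Hand-written stand-in for what batchParseFiles(paths) might return.
const paths = ["a.html", "missing.html"];
const sampleResults = [
  { index: 0, document: { title: "A" }, error: null },
  { index: 1, document: null, error: "No such file" },
];
const { ok, failed } = partitionResults(sampleResults, paths);
console.log(ok.length, failed[0].input); // 1 missing.html
```

The same pattern applies to the Markdown batch functions by reading `markdown` instead of `document`.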
interface BatchParseResult {
index: number;
document: HtmlDocumentData | null;
error: string | null;
}
interface BatchMarkdownResult {
index: number;
markdown: string | null;
error: string | null;
}

Example: parse 500 files at once:
import { readdirSync } from "node:fs";
import { createRequire } from "node:module";
import { resolve } from "node:path";
// require() is not defined in ES modules, so create one for the native wrapper
const wp = createRequire(import.meta.url)("./index.js");
const files = readdirSync("./html-archive")
  .filter((f) => f.endsWith(".html"))
  .map((f) => resolve("./html-archive", f));
const results = wp.batchParseFiles(files); // uses all CPU cores
const titles = results.filter((r) => !r.error).map((r) => r.document.title);

extractTitle(html: string): string | null // parse just the <title>
extractLinks(html: string): string[] // all href values
extractText(html: string, selector: string): string | null

These avoid constructing a full HtmlDocument object and are ideal for pipeline stages where only one piece of information is needed.
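A typical follow-up stage for `extractLinks` is cleaning the returned href strings before crawling or indexing them. The sketch below is one possible approach, not part of web-parser: `normalizeLinks` is a hypothetical helper built on the standard `URL` class, and the sample array stands in for real `extractLinks(html)` output.

```javascript
// Resolve href strings against a base URL, keep only http(s) links, and
// drop fragments and duplicates. Input mimics the string[] that
// extractLinks(html) is documented to return.
function normalizeLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, baseUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // "/docs" and "/docs#intro" are the same page
    seen.add(url.href);
  }
  return [...seen];
}

const sampleHrefs = [
  "/docs",
  "/docs#intro",
  "mailto:a@example.com",
  "https://example.com/docs",
];
console.log(normalizeLinks(sampleHrefs, "https://example.com/"));
// [ 'https://example.com/docs' ]
```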
pipeline(inputs: string[]): Pipeline
class Pipeline {
parse(): this // treat inputs as HTML strings
parseFiles(): this // treat inputs as file paths
filterErrors(): this // drop failed results
toMarkdown(): this // convert to markdown
map(fn): Pipeline // apply a custom transform
run(): unknown // execute and return final result
}

Example:
const results = wp
.pipeline(filePaths)
.parseFiles() // read & parse in parallel
.filterErrors() // skip unreadable files
.run();
results.forEach((r) => {
console.log(r.document.title, r.document.stats.wordCount);
});

toMarkdown(html: string): string
htmlToMarkdown(html: string): string // alias
batchToMarkdown(htmlList: string[]): BatchMarkdownResult[]
batchHtmlToMarkdown(htmlList: string[]): BatchMarkdownResult[] // alias
batchFilesToMarkdown(paths: string[]): BatchMarkdownResult[]

On the HtmlDocument class you can also call:
doc.toMarkdown();
doc.toMarkdownWithOptions("script,style,nav,footer"); // strip noisy tags first

interface DocumentStats {
tagCount: number;
linkCount: number;
imageCount: number;
headingCount: number;
wordCount: number;
charCount: number;
}
interface LinkInfo {
href: string;
text: string;
title: string | null;
rel: string | null;
}
interface ImageInfo {
src: string;
alt: string | null;
title: string | null;
width: string | null;
height: string | null;
}
interface HeadingInfo {
level: number;
text: string;
id: string | null;
}
interface MetaInfo {
name: string | null;
property: string | null;
content: string | null;
charset: string | null;
httpEquiv: string | null;
}

web-parser/
├── src/lib.rs             # Rust implementation (napi-rs bindings)
├── Cargo.toml             # tl, htmd, rayon, napi, tokio
├── build.rs               # napi-build setup
├── index.js               # JS wrapper + Pipeline class
├── index.d.ts             # TypeScript type definitions
├── package.json
├── scripts/
│   └── copy-binding.mjs   # Copies target/release/*.dylib → web_parser.node
├── sample_files/
│   ├── page1.html         # Systems programming blog post
│   ├── page2.html         # napi-rs tutorial
│   └── page3.html         # E-commerce product listing
└── demo.mjs               # End-to-end demo (11 scenarios)

The release profile is tuned for maximum throughput:
[profile.release]
lto = true # link-time optimisation across all crates
opt-level = 3 # full optimisation
codegen-units = 1 # single codegen unit (better inlining)
strip = true # strip debug symbols from the binary

For development / debugging, use cargo build (debug profile) and pass --debug to the copy script:
cargo build
node scripts/copy-binding.mjs --debug

License: MIT