web-parser

A fast Node.js HTML parsing add-on built in Rust via napi-rs.
Powered by the zero-copy tl HTML parser and htmd for Markdown conversion, with all batch operations running in parallel courtesy of Rayon.

Why is it fast?

| Factor | Detail |
| --- | --- |
| Zero-copy parsing | `tl` parses HTML without allocating a new string for every node; it borrows slices of the original input |
| Native code | The entire parse/query/convert path runs as compiled Rust with `opt-level = 3` and full LTO; no interpreter overhead |
| No serialisation boundary | N-API is called directly; there is no JSON or msgpack round-trip between JS and native code |
| Automatic parallelism | All `batch*` functions use Rayon's work-stealing thread pool, scaling to every available CPU core with zero boilerplate |
| Stripped release binary | The `.node` binary has debug symbols stripped and dead code eliminated; the shared library is ~1.4 MB |

Rule of thumb: parsing a typical HTML page takes on the order of 100 µs in a single-threaded call; the 74.5 KB page used in the benchmarks below parses in 133 µs.
With ideal scaling, batch-parsing 1,000 pages on an 8-core machine takes roughly the same wall-clock time as parsing 125 of them sequentially.
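
The batch estimate is plain division under an ideal-scaling assumption: if the work spreads evenly over the cores, the wall-clock cost is that of the busiest core's share. A minimal sketch of that arithmetic follows; the function name `sequentialEquivalent` is hypothetical, and real runs add scheduling and result-copy overhead that this ignores.

```javascript
// Ideal-scaling estimate: how many documents' worth of sequential work
// a parallel batch costs in wall-clock terms.
// Assumes equal-sized documents and no scheduling overhead.
function sequentialEquivalent(docCount, cores) {
  if (!Number.isInteger(docCount) || docCount < 0) {
    throw new RangeError("docCount must be a non-negative integer");
  }
  if (!Number.isInteger(cores) || cores < 1) {
    throw new RangeError("cores must be a positive integer");
  }
  // The busiest core handles ceil(docCount / cores) documents.
  return Math.ceil(docCount / cores);
}

console.log(sequentialEquivalent(1000, 8)); // 125
```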

Benchmarks

Measured on macOS / Intel i7-9750H (12 logical CPUs) with Node.js v22.15.0.
Single-document tests use a real-world 74.5 KB HTML page; batch tests process 200 copies (~4.2 MB total).

Each result is the median of multiple timed iterations after a warm-up pass.
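
For readers who want to reproduce the method on their own workloads, the "median after warm-up" procedure can be sketched as below. This is not the code in benchmark.mjs; the names `median` and `medianTimeMs` and the default iteration counts are assumptions chosen for illustration.

```javascript
// Median of a numeric array (does not mutate the input).
function median(values) {
  if (values.length === 0) throw new RangeError("median of empty array");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Run `fn` a few times untimed, then time it repeatedly and
// return the median duration in milliseconds.
function medianTimeMs(fn, { warmup = 5, iterations = 21 } = {}) {
  for (let i = 0; i < warmup; i++) fn(); // let the JIT and caches settle
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  return median(samples);
}

// Placeholder workload; swap in the call you want to measure.
const ms = medianTimeMs(() => JSON.parse('{"a":[1,2,3]}'));
console.log(`median: ${ms.toFixed(4)} ms`);
```

The median is less sensitive than the mean to the occasional slow iteration caused by garbage collection or OS scheduling.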

Parsing & querying

| Scenario | web-parser | htmlparser2 | node-html-parser |
| --- | --- | --- | --- |
| Parse single doc | 7,800 ops/s (133 µs) 🏆 | 634 ops/s (12.3× slower) | 424 ops/s (18.4× slower) |
| querySelector | 2,200 ops/s (441 µs) 🏆 | 651 ops/s (3.3× slower) | 425 ops/s (5.1× slower) |
| querySelectorAll | 1,300 ops/s (745 µs) 🏆 | 540 ops/s (2.4× slower) | 314 ops/s (4.1× slower) |
| Extract links | 2,000 ops/s (490 µs) 🏆 | 558 ops/s (3.5× slower) | 302 ops/s (6.5× slower) |
| Batch parse 200 docs | 14 ops/s (68.8 ms) 🏆 | 9 ops/s (1.6× slower) | 5 ops/s (3.0× slower) |
| Batch parse files 200 | 110 ops/s (9.0 ms) 🏆 | 56 ops/s (1.9× slower) | 26 ops/s (4.3× slower) |
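
The "× slower" figures are throughput ratios against web-parser. As a worked example, the first row can be recomputed as below; `slowdown` is a hypothetical helper name. Because the published ops/s values are themselves rounded, recomputing other rows this way may differ from the table in the last digit.

```javascript
// Ratio of two throughputs, formatted to one decimal place like the tables.
function slowdown(baselineOpsPerSec, otherOpsPerSec) {
  if (!(baselineOpsPerSec > 0) || !(otherOpsPerSec > 0)) {
    throw new RangeError("throughputs must be positive");
  }
  return (baselineOpsPerSec / otherOpsPerSec).toFixed(1);
}

console.log(slowdown(7800, 634)); // "12.3" (htmlparser2, parse single doc)
console.log(slowdown(7800, 424)); // "18.4" (node-html-parser, parse single doc)
```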

HTML → Markdown conversion

| Scenario | web-parser | turndown |
| --- | --- | --- |
| Single doc | 224 ops/s (4.3 ms) 🏆 | 65 ops/s (3.5× slower) |
| Batch 200 docs | 13 ops/s (74.6 ms) 🏆 | 1 ops/s (12.1× slower) |

Run the benchmarks yourself:

npm run benchmark

Requirements

Node.js ≥ 18
Rust ≥ 1.70 (stable) and Cargo (only needed to build from source)

Installation / Build

# 1. Clone / enter the project
cd web-parser

# 2. Compile the native release binary
cargo build --release

# 3. Copy the .dylib / .so / .dll to web_parser.node
node scripts/copy-binding.mjs

# 4. Run the demo script
node demo.mjs

Pre-built binaries can be distributed by packaging web_parser.node alongside index.js and index.d.ts.
No Rust toolchain is required at runtime — only at build time.

Quick Start

import { createRequire } from "node:module";
const wp = createRequire(import.meta.url)("./index.js");

// ── Parse a string ────────────────────────────────────────────────────────────
const doc = wp.parse('<h1 id="main">Hello</h1><p>World</p>');

console.log(doc.title()); // → null
console.log(doc.querySelector("h1").textContent); // → 'Hello'

// ── Parse a file ──────────────────────────────────────────────────────────────
const page = wp.fromFile("/path/to/page.html");
console.log(page.title()); // → 'My Page'
console.log(page.stats());
// → { tagCount: 42, wordCount: 320, linkCount: 12, imageCount: 3, headingCount: 5, charCount: 1840 }

// ── HTML → Markdown ───────────────────────────────────────────────────────────
console.log(
  wp.toMarkdown("<h2>Section</h2><p>Hello <strong>world</strong></p>"),
);
// → ## Section
//
//   Hello **world**

API Reference

Parsing

parse(html: string): HtmlDocument
parseHtml(html: string): HtmlDocument          // alias

fromFile(filePath: string): HtmlDocument
parseFile(filePath: string): HtmlDocument      // alias

`HtmlDocument`

| Method | Returns | Description |
| --- | --- | --- |
| `source()` | `string` | Raw HTML string this document was built from |
| `title()` | `string \| null` | Content of `<title>`, if present |
| `querySelector(sel)` | `Element \| null` | First element matching the CSS selector |
| `querySelectorAll(sel)` | `Element[]` | All elements matching the CSS selector |
| `getByTag(tag)` | `Element[]` | All elements with the given tag name |
| `textContent()` | `string` | Plain text of the document body |
| `links()` | `LinkInfo[]` | All `<a href>` links |
| `images()` | `ImageInfo[]` | All `<img>` tags |
| `headings()` | `HeadingInfo[]` | All h1–h6 headings in document order |
| `meta()` | `MetaInfo[]` | All `<meta>` tags |
| `stats()` | `DocumentStats` | Tag count, word count, link count, etc. |
| `toMarkdown()` | `string` | Convert the full document to Markdown |
| `toMarkdownWithOptions(skipTags)` | `string` | Convert to Markdown, skipping listed tags (e.g. `"script,style,nav"`) |
| `snapshot()` | `HtmlDocumentData` | Fully serialisable data snapshot (JSON-safe) |
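
As an illustration of how these methods combine, the sketch below builds a small summary from any object that exposes `title()`, `headings()`, `links()` and `stats()`. The helper name `describeDocument` and the stub object are hypothetical; the stub only stands in for a real `HtmlDocument` so the snippet runs without the native binding.

```javascript
// Build a compact summary from any object exposing the HtmlDocument
// methods title(), headings(), links() and stats().
function describeDocument(doc) {
  const stats = doc.stats();
  return {
    title: doc.title() ?? "(untitled)",
    headings: doc.headings().map((h) => `h${h.level}: ${h.text}`),
    linkCount: doc.links().length,
    wordCount: stats.wordCount,
  };
}

// Hypothetical stand-in; in real use, pass the result of parse() or fromFile().
const stubDoc = {
  title: () => null,
  headings: () => [{ level: 1, text: "Hello", id: "main" }],
  links: () => [],
  stats: () => ({
    tagCount: 2, linkCount: 0, imageCount: 0,
    headingCount: 1, wordCount: 2, charCount: 10,
  }),
};

console.log(describeDocument(stubDoc));
```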

`Element` shape

interface Element {
  tagName: string;
  textContent: string;
  innerHTML: string;
  outerHTML: string;
  attributes: Record<string, string>;
}
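
Since `attributes` is a plain record, pulling one attribute out of a `querySelectorAll` result is a map-and-filter. A minimal sketch, using hand-written sample objects in the `Element` shape (the helper name `collectAttribute` and the sample data are hypothetical):

```javascript
// Collect one attribute from a list of Element-shaped objects,
// skipping elements that do not carry it.
function collectAttribute(elements, name) {
  return elements
    .map((el) => el.attributes[name])
    .filter((value) => value !== undefined);
}

// Hypothetical sample data in the Element shape.
const sampleElements = [
  { tagName: "img", textContent: "", innerHTML: "", outerHTML: '<img src="a.png">', attributes: { src: "a.png" } },
  { tagName: "img", textContent: "", innerHTML: "", outerHTML: "<img>", attributes: {} },
];

console.log(collectAttribute(sampleElements, "src")); // [ 'a.png' ]
```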

Batch Operations (parallelised across all CPU cores)

// Parse
batchParse(htmlList: string[]): BatchParseResult[]
batchParseFiles(paths: string[]): BatchParseResult[]

// Markdown conversion
batchHtmlToMarkdown(htmlList: string[]): BatchMarkdownResult[]
batchFilesToMarkdown(paths: string[]): BatchMarkdownResult[]

Every result carries an index matching its position in the input array, a document (or markdown string) on success, and an error string on failure, so one bad file never aborts the whole batch.

interface BatchParseResult {
  index: number;
  document: HtmlDocumentData | null;
  error: string | null;
}

interface BatchMarkdownResult {
  index: number;
  markdown: string | null;
  error: string | null;
}
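
Because failures arrive as data rather than exceptions, a caller will usually want to split a batch into successes and failures and map each failure back to its input through `index`. One way to do that is sketched below; `partitionResults` and the sample results are hypothetical, and the error text is made up for the example.

```javascript
// Split batch results into successes and failures using the `error` field.
// `inputs` is the array originally passed to the batch call, so `index`
// can be mapped back to the offending path or HTML string.
function partitionResults(results, inputs) {
  const ok = [];
  const failed = [];
  for (const r of results) {
    if (r.error === null || r.error === undefined) ok.push(r);
    else failed.push({ input: inputs[r.index], error: r.error });
  }
  return { ok, failed };
}

// Hypothetical results in the BatchParseResult shape.
const inputs = ["a.html", "missing.html"];
const results = [
  { index: 0, document: { title: "A" }, error: null },
  { index: 1, document: null, error: "file not found" },
];

const { ok, failed } = partitionResults(results, inputs);
console.log(ok.length);  // 1
console.log(failed);     // [ { input: 'missing.html', error: 'file not found' } ]
```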

Example — parse 500 files at once:

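import { readdirSync } from "node:fs";
import { createRequire } from "node:module";
import { resolve } from "node:path";
// `require` is not defined in ES modules, so create one as in Quick Start
const wp = createRequire(import.meta.url)("./index.js");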

const files = readdirSync("./html-archive")
  .filter((f) => f.endsWith(".html"))
  .map((f) => resolve("./html-archive", f));

const results = wp.batchParseFiles(files); // uses all CPU cores
const titles = results.filter((r) => !r.error).map((r) => r.document.title);

Quick Helpers

extractTitle(html: string): string | null      // parse just the <title>
extractLinks(html: string): string[]           // all href values
extractText(html: string, selector: string): string | null

These avoid constructing a full HtmlDocument object and are ideal for pipeline stages where only one piece of information is needed.
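
A typical pipeline stage of this kind is link collection across many pages. The sketch below takes the extractor as a parameter so it does not depend on how the module is loaded; `uniqueLinks` is a hypothetical helper, and `fakeExtractLinks` is a crude regex stand-in used only so the snippet runs without the native binding (it is not how `extractLinks` works).

```javascript
// Gather the distinct hrefs across many pages, in first-seen order.
// `extractLinks` is any function with the signature (html) => string[].
function uniqueLinks(extractLinks, pages) {
  const seen = new Set();
  for (const html of pages) {
    for (const href of extractLinks(html)) seen.add(href);
  }
  return [...seen];
}

// Hypothetical stand-in for the real helper, for demonstration only.
const fakeExtractLinks = (html) =>
  [...html.matchAll(/href="([^"]*)"/g)].map((m) => m[1]);

console.log(
  uniqueLinks(fakeExtractLinks, [
    '<a href="/a">A</a><a href="/b">B</a>',
    '<a href="/a">A again</a>',
  ]),
); // [ '/a', '/b' ]
```

In real use, pass the module's own `extractLinks` as the first argument.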

Fluent Pipeline

pipeline(inputs: string[]): Pipeline

class Pipeline {
  parse(): this            // treat inputs as HTML strings
  parseFiles(): this       // treat inputs as file paths
  filterErrors(): this     // drop failed results
  toMarkdown(): this       // convert to markdown
  map(fn): Pipeline        // apply a custom transform
  run(): unknown           // execute and return final result
}

Example:

const results = wp
  .pipeline(filePaths)
  .parseFiles() // read & parse in parallel
  .filterErrors() // skip unreadable files
  .run();

results.forEach((r) => {
  console.log(r.document.title, r.document.stats.wordCount);
});
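
To make the fluent style concrete, here is a toy model of a deferred pipeline: each method queues a step and returns `this`, and nothing executes until `run()`. It is a sketch of the general pattern only, not the `Pipeline` class in index.js; `MiniPipeline` is a hypothetical name, and treating `map` as a per-item transform is an assumption.

```javascript
// A toy fluent pipeline: methods record steps, run() executes them in order.
class MiniPipeline {
  constructor(inputs) {
    this.inputs = inputs;
    this.steps = [];
  }
  // Queue a per-item transform.
  map(fn) {
    this.steps.push((items) => items.map(fn));
    return this;
  }
  // Queue a filter that drops items carrying an `error` value.
  filterErrors() {
    this.steps.push((items) => items.filter((item) => !item.error));
    return this;
  }
  // Execute the queued steps and return the final array.
  run() {
    return this.steps.reduce((items, step) => step(items), this.inputs);
  }
}

const lengths = new MiniPipeline([
  { index: 0, markdown: "# A", error: null },
  { index: 1, markdown: null, error: "unreadable" },
])
  .filterErrors()
  .map((r) => r.markdown.length)
  .run();

console.log(lengths); // [ 3 ]
```

Deferring execution until `run()` lets a real implementation hand the whole batch to native code in one call instead of crossing the JS/native boundary once per step.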

HTML → Markdown

toMarkdown(html: string): string
htmlToMarkdown(html: string): string               // alias
batchToMarkdown(htmlList: string[]): BatchMarkdownResult[]
batchHtmlToMarkdown(htmlList: string[]): BatchMarkdownResult[]   // alias
batchFilesToMarkdown(paths: string[]): BatchMarkdownResult[]

On the HtmlDocument class you can also call:

doc.toMarkdown();
doc.toMarkdownWithOptions("script,style,nav,footer"); // strip noisy tags first
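
The `skipTags` argument is a single comma-separated string. If the tag list comes from configuration as an array, a small helper can build it. The sketch below (hypothetical name `buildSkipTags`) trims, lower-cases and de-duplicates entries; whether the library itself tolerates spaces or mixed case is not documented here, so normalising up front is a conservative assumption.

```javascript
// Join tag names into a comma-separated string,
// trimming whitespace, lower-casing, and dropping blanks and duplicates.
function buildSkipTags(tags) {
  const cleaned = tags
    .map((t) => t.trim().toLowerCase())
    .filter((t) => t.length > 0);
  return [...new Set(cleaned)].join(",");
}

console.log(buildSkipTags(["script", " Style", "nav", "script"])); // script,style,nav
```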

Data Types

interface DocumentStats {
  tagCount: number;
  linkCount: number;
  imageCount: number;
  headingCount: number;
  wordCount: number;
  charCount: number;
}

interface LinkInfo {
  href: string;
  text: string;
  title: string | null;
  rel: string | null;
}
interface ImageInfo {
  src: string;
  alt: string | null;
  title: string | null;
  width: string | null;
  height: string | null;
}
interface HeadingInfo {
  level: number;
  text: string;
  id: string | null;
}
interface MetaInfo {
  name: string | null;
  property: string | null;
  content: string | null;
  charset: string | null;
  httpEquiv: string | null;
}
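
These shapes are plain data, so they are easy to post-process. For example, a `HeadingInfo[]` can be rendered as an indented outline. The helper name `headingOutline` and the sample array are hypothetical.

```javascript
// Render HeadingInfo[] as a plain-text outline, two spaces per level below h1.
function headingOutline(headings) {
  return headings
    .map((h) => `${"  ".repeat(Math.max(0, h.level - 1))}${h.text}`)
    .join("\n");
}

// Hypothetical data in the HeadingInfo shape.
const sampleHeadings = [
  { level: 1, text: "Guide", id: "guide" },
  { level: 2, text: "Install", id: null },
  { level: 3, text: "From source", id: null },
];

console.log(headingOutline(sampleHeadings));
// Guide
//   Install
//     From source
```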

Project Structure

web-parser/
├── src/lib.rs                  # Rust implementation (napi-rs bindings)
├── Cargo.toml                  # tl, htmd, rayon, napi, tokio
├── build.rs                    # napi-build setup
├── index.js                    # JS wrapper + Pipeline class
├── index.d.ts                  # TypeScript type definitions
├── package.json
├── scripts/
│   └── copy-binding.mjs        # Copies target/release/*.dylib → web_parser.node
├── sample_files/
│   ├── page1.html              # Systems programming blog post
│   ├── page2.html              # napi-rs tutorial
│   └── page3.html              # E-commerce product listing
└── demo.mjs                    # End-to-end demo (11 scenarios)

Build Profiles

The release profile is tuned for maximum throughput:

[profile.release]
lto           = true   # link-time optimisation across all crates
opt-level     = 3      # full optimisation
codegen-units = 1      # single codegen unit (better inlining)
strip         = true   # strip debug symbols from the binary

For development / debugging, use cargo build (debug profile) and pass --debug to the copy script:

cargo build
node scripts/copy-binding.mjs --debug
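
The copy step has to locate a differently named file on each platform, because Cargo emits `lib<name>.dylib` on macOS, `lib<name>.so` on Linux and `<name>.dll` on Windows. The sketch below shows that mapping; it is not the contents of scripts/copy-binding.mjs, the function name `cargoArtifact` is hypothetical, and the library name `web_parser` is an assumption inferred from web_parser.node.

```javascript
// Map a Node.js platform string to the file Cargo emits for a cdylib crate.
// "web_parser" as the library name is an assumption, not read from Cargo.toml.
function cargoArtifact(platform, { debug = false, libName = "web_parser" } = {}) {
  const fileNames = {
    darwin: `lib${libName}.dylib`,
    linux: `lib${libName}.so`,
    win32: `${libName}.dll`,
  };
  const fileName = fileNames[platform];
  if (!fileName) throw new Error(`unsupported platform: ${platform}`);
  return `target/${debug ? "debug" : "release"}/${fileName}`;
}

console.log(cargoArtifact("linux")); // target/release/libweb_parser.so
console.log(cargoArtifact("darwin", { debug: true })); // target/debug/libweb_parser.dylib
```

In a real script the first argument would come from `process.platform`, and the returned path would be copied to web_parser.node.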

License

MIT
