
jbdotjs/web-parser


web-parser

A fast Node.js HTML parsing add-on built in Rust via napi-rs.
Powered by the zero-copy tl HTML parser and htmd for Markdown conversion, with all batch operations running in parallel courtesy of Rayon.


Why is it fast?

| Factor | Detail |
| --- | --- |
| Zero-copy parsing | tl parses HTML without allocating a new string for every node; it borrows slices of the original input |
| Native code | The entire parse/query/convert path runs as compiled Rust with opt-level = 3 and full LTO; no interpreter overhead |
| No serialisation boundary | N-API is called directly; there is no JSON or msgpack round-trip between JS and native code |
| Automatic parallelism | All batch* functions use Rayon's work-stealing thread pool, scaling to every available CPU core with zero boilerplate |
| Stripped release binary | The .node binary has debug symbols stripped and dead code eliminated; the shared library is ~1.4 MB |

Rule of thumb: parsing a typical HTML page takes well under 100 µs in a single-threaded call.
Batch-parsing 1,000 pages on an 8-core machine takes roughly the same wall-clock time as parsing 125 of them sequentially.
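
The batch figure in the rule of thumb is ideal linear scaling: 1,000 pages spread over 8 cores leaves 125 pages per core. A minimal sketch of that arithmetic, assuming every page costs the same and that there is no scheduling or I/O overhead (the function name is illustrative and not part of web-parser):

```javascript
// Hypothetical helper, not part of web-parser: estimates ideal batch wall-clock time.
// Assumes uniform per-page cost and a perfectly even split across cores.
function estimateBatchMicros(pageCount, cores, perPageMicros) {
  const pagesPerCore = Math.ceil(pageCount / cores);
  return pagesPerCore * perPageMicros;
}

const sequential125 = 125 * 100; // 125 pages on one core at 100 µs each
const parallel1000 = estimateBatchMicros(1000, 8, 100); // 1,000 pages on 8 cores
console.log(sequential125, parallel1000);
```

Real batches will usually land somewhat above this estimate, since page sizes vary and results still have to be handed back to JavaScript.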


Benchmarks

Measured on macOS / Intel i7-9750H (12 logical CPUs) with Node.js v22.15.0.
Single-document tests use a real-world 74.5 KB HTML page; batch tests process 200 copies (~4.2 MB total).

Each result is the median of multiple timed iterations after a warm-up pass.
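
The measurement method described here (warm-up, repeated timing, median) can be sketched as follows. This is an illustrative harness, not the repository's benchmark script; the function names, iteration counts, and the JSON.parse workload are assumptions chosen so the snippet runs without the native binding.

```javascript
// Illustrative timing harness (an assumption, not the project's benchmark code).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function measureMedianMs(fn, iterations = 50, warmup = 5) {
  for (let i = 0; i < warmup; i++) fn(); // warm-up pass, timings discarded
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  return median(samples);
}

// Placeholder workload; swap in a call such as () => wp.parse(html) when the binding is built.
const ms = measureMedianMs(() => JSON.parse('{"a":[1,2,3]}'));
console.log(`median: ${ms.toFixed(4)} ms`);
```

The median is less sensitive than the mean to occasional slow iterations caused by garbage collection or other processes.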

Parsing & querying

| Scenario | web-parser | htmlparser2 | node-html-parser |
| --- | --- | --- | --- |
| Parse single doc | 7,800 ops/s (133 µs) 🏆 | 634 ops/s (12.3× slower) | 424 ops/s (18.4× slower) |
| querySelector | 2,200 ops/s (441 µs) 🏆 | 651 ops/s (3.3× slower) | 425 ops/s (5.1× slower) |
| querySelectorAll | 1,300 ops/s (745 µs) 🏆 | 540 ops/s (2.4× slower) | 314 ops/s (4.1× slower) |
| Extract links | 2,000 ops/s (490 µs) 🏆 | 558 ops/s (3.5× slower) | 302 ops/s (6.5× slower) |
| Batch parse 200 docs | 14 ops/s (68.8 ms) 🏆 | 9 ops/s (1.6× slower) | 5 ops/s (3.0× slower) |
| Batch parse files 200 | 110 ops/s (9.0 ms) 🏆 | 56 ops/s (1.9× slower) | 26 ops/s (4.3× slower) |

HTML → Markdown conversion

| Scenario | web-parser | turndown |
| --- | --- | --- |
| Single doc | 224 ops/s (4.3 ms) 🏆 | 65 ops/s (3.5× slower) |
| Batch 200 docs | 13 ops/s (74.6 ms) 🏆 | 1 ops/s (12.1× slower) |

Run the benchmarks yourself:

npm run benchmark

Requirements

  • Node.js ≥ 18
  • Rust ≥ 1.70 (stable) and Cargo (only needed to build from source)

Installation / Build

# 1. Clone / enter the project
cd web-parser

# 2. Compile the native release binary
cargo build --release

# 3. Copy the .dylib / .so / .dll to web_parser.node
node scripts/copy-binding.mjs

# 4. Run the demo script
node demo.mjs

Pre-built binaries can be distributed by packaging web_parser.node alongside index.js and index.d.ts.
No Rust toolchain is required at runtime, only at build time.


Quick Start

import { createRequire } from "node:module";
const wp = createRequire(import.meta.url)("./index.js");

// ── Parse a string ────────────────────────────────────────────────────────────
const doc = wp.parse('<h1 id="main">Hello</h1><p>World</p>');

console.log(doc.title()); // → null
console.log(doc.querySelector("h1").textContent); // → 'Hello'

// ── Parse a file ──────────────────────────────────────────────────────────────
const page = wp.fromFile("/path/to/page.html");
console.log(page.title()); // → 'My Page'
console.log(page.stats());
// → { tagCount: 42, wordCount: 320, linkCount: 12, imageCount: 3, headingCount: 5, charCount: 1840 }

// ── HTML → Markdown ───────────────────────────────────────────────────────────
console.log(
  wp.toMarkdown("<h2>Section</h2><p>Hello <strong>world</strong></p>"),
);
// → ## Section
//
//   Hello **world**

API Reference

Parsing

parse(html: string): HtmlDocument
parseHtml(html: string): HtmlDocument          // alias

fromFile(filePath: string): HtmlDocument
parseFile(filePath: string): HtmlDocument      // alias

HtmlDocument

| Method | Returns | Description |
| --- | --- | --- |
| source() | string | Raw HTML string this document was built from |
| title() | string \| null | Content of <title>, if present |
| querySelector(sel) | Element \| null | First element matching the CSS selector |
| querySelectorAll(sel) | Element[] | All elements matching the CSS selector |
| getByTag(tag) | Element[] | All elements with the given tag name |
| textContent() | string | Plain text of the document body |
| links() | LinkInfo[] | All <a href> links |
| images() | ImageInfo[] | All <img> tags |
| headings() | HeadingInfo[] | All h1–h6 headings in document order |
| meta() | MetaInfo[] | All <meta> tags |
| stats() | DocumentStats | Tag count, word count, link count, etc. |
| toMarkdown() | string | Convert the full document to Markdown |
| toMarkdownWithOptions(skipTags) | string | Convert to Markdown, skipping listed tags (e.g. "script,style,nav") |
| snapshot() | HtmlDocumentData | Fully serialisable data snapshot (JSON-safe) |

Element shape

interface Element {
  tagName: string;
  textContent: string;
  innerHTML: string;
  outerHTML: string;
  attributes: Record<string, string>;
}
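
Because each Element is a plain object with the fields listed in the interface, the array returned by querySelectorAll can be processed with ordinary array methods. A small sketch, assuming elements shaped as in that interface; the helper name and the sample data are made up for illustration, and the lower-casing guards against not knowing how tagName is cased:

```javascript
// Hypothetical helper: collect href values from Element-shaped objects,
// e.g. the array returned by doc.querySelectorAll("a").
function hrefsOf(elements) {
  return elements
    .filter((el) => el.tagName.toLowerCase() === "a" && "href" in el.attributes)
    .map((el) => el.attributes.href);
}

// Hand-written sample data standing in for real query results.
const sampleElements = [
  {
    tagName: "a",
    textContent: "Docs",
    innerHTML: "Docs",
    outerHTML: '<a href="/docs">Docs</a>',
    attributes: { href: "/docs" },
  },
  {
    tagName: "a",
    textContent: "Top",
    innerHTML: "Top",
    outerHTML: '<a name="top">Top</a>',
    attributes: { name: "top" },
  },
];
console.log(hrefsOf(sampleElements));
```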

Batch Operations (parallelised across all CPU cores)

// Parse
batchParse(htmlList: string[]): BatchParseResult[]
batchParseFiles(paths: string[]): BatchParseResult[]

// Markdown conversion
batchHtmlToMarkdown(htmlList: string[]): BatchMarkdownResult[]
batchFilesToMarkdown(paths: string[]): BatchMarkdownResult[]

Every result carries an index matching the original array position, a document / markdown on success, and an error string on failure, so one bad file never aborts the whole batch.

interface BatchParseResult {
  index: number;
  document: HtmlDocumentData | null;
  error: string | null;
}

interface BatchMarkdownResult {
  index: number;
  markdown: string | null;
  error: string | null;
}

Example (parse 500 files at once):

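import { readdirSync } from "node:fs";
import { createRequire } from "node:module";
import { resolve } from "node:path";
// require is not defined in ES modules, so create one as in Quick Start
const wp = createRequire(import.meta.url)("./index.js");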

const files = readdirSync("./html-archive")
  .filter((f) => f.endsWith(".html"))
  .map((f) => resolve("./html-archive", f));

const results = wp.batchParseFiles(files); // uses all CPU cores
const titles = results.filter((r) => !r.error).map((r) => r.document.title);
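
Since each result keeps its index, failures can be traced back to the input path that caused them instead of simply being dropped. A minimal sketch, assuming results shaped like BatchParseResult; the helper name and the hand-written sample data are illustrative only:

```javascript
// Hypothetical helper: split batch results into successes and failures,
// using each result's index to recover the originating path.
function partitionBatch(inputPaths, batchResults) {
  const ok = [];
  const failed = [];
  for (const r of batchResults) {
    if (r.error) failed.push({ path: inputPaths[r.index], error: r.error });
    else ok.push({ path: inputPaths[r.index], document: r.document });
  }
  return { ok, failed };
}

// Hand-written results standing in for wp.batchParseFiles(inputPaths).
const inputPaths = ["a.html", "missing.html"];
const batchResults = [
  { index: 0, document: { title: "A" }, error: null },
  { index: 1, document: null, error: "file not found" },
];
const { ok, failed } = partitionBatch(inputPaths, batchResults);
console.log(ok.length, failed[0].path, failed[0].error);
```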

Quick Helpers

extractTitle(html: string): string | null      // parse just the <title>
extractLinks(html: string): string[]           // all href values
extractText(html: string, selector: string): string | null

These avoid constructing a full HtmlDocument object and are ideal for pipeline stages where only one piece of information is needed.
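
A sketch of such a pipeline stage, using only the three helper signatures listed in this section. The module is passed in as a parameter so the stage can be exercised without the native binding; the stand-in object and its return values are invented for illustration.

```javascript
// Sketch of a pipeline stage built on the quick helpers.
// `wp` is injected so the stage does not depend on how the binding is loaded.
function summarizePage(wp, html) {
  return {
    title: wp.extractTitle(html), // string | null
    links: wp.extractLinks(html), // string[]
    firstHeading: wp.extractText(html, "h1"), // string | null
  };
}

// Stand-in with the same function names, for illustration only;
// in real use, pass the module loaded from ./index.js.
const fakeWp = {
  extractTitle: () => "Example",
  extractLinks: () => ["/a", "/b"],
  extractText: () => null,
};
const summary = summarizePage(fakeWp, "<html></html>");
console.log(summary);
```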


Fluent Pipeline

pipeline(inputs: string[]): Pipeline

class Pipeline {
  parse(): this            // treat inputs as HTML strings
  parseFiles(): this       // treat inputs as file paths
  filterErrors(): this     // drop failed results
  toMarkdown(): this       // convert to markdown
  map(fn): Pipeline        // apply a custom transform
  run(): unknown           // execute and return final result
}

Example:

const results = wp
  .pipeline(filePaths)
  .parseFiles() // read & parse in parallel
  .filterErrors() // skip unreadable files
  .run();

results.forEach((r) => {
  console.log(r.document.title, r.document.stats.wordCount);
});

HTML → Markdown

toMarkdown(html: string): string
htmlToMarkdown(html: string): string               // alias
batchToMarkdown(htmlList: string[]): BatchMarkdownResult[]
batchHtmlToMarkdown(htmlList: string[]): BatchMarkdownResult[]   // alias
batchFilesToMarkdown(paths: string[]): BatchMarkdownResult[]

On the HtmlDocument class you can also call:

doc.toMarkdown();
doc.toMarkdownWithOptions("script,style,nav,footer"); // strip noisy tags first

Data Types

interface DocumentStats {
  tagCount: number;
  linkCount: number;
  imageCount: number;
  headingCount: number;
  wordCount: number;
  charCount: number;
}

interface LinkInfo {
  href: string;
  text: string;
  title: string | null;
  rel: string | null;
}
interface ImageInfo {
  src: string;
  alt: string | null;
  title: string | null;
  width: string | null;
  height: string | null;
}
interface HeadingInfo {
  level: number;
  text: string;
  id: string | null;
}
interface MetaInfo {
  name: string | null;
  property: string | null;
  content: string | null;
  charset: string | null;
  httpEquiv: string | null;
}
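
These records are plain data, so they can feed ordinary JavaScript directly. As one example, a sketch that turns a HeadingInfo[] (the shape returned by headings()) into an indented outline; the helper name and sample headings are assumptions for illustration:

```javascript
// Hypothetical helper: render HeadingInfo[] as an indented plain-text outline,
// two spaces of indentation per heading level below h1.
function buildOutline(headings) {
  return headings
    .map((h) => `${"  ".repeat(Math.max(0, h.level - 1))}- ${h.text}`)
    .join("\n");
}

const outline = buildOutline([
  { level: 1, text: "Guide", id: "guide" },
  { level: 2, text: "Install", id: null },
  { level: 3, text: "From source", id: null },
]);
console.log(outline);
```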

Project Structure

web-parser/
├── src/lib.rs                  # Rust implementation (napi-rs bindings)
├── Cargo.toml                  # tl, htmd, rayon, napi, tokio
├── build.rs                    # napi-build setup
├── index.js                    # JS wrapper + Pipeline class
├── index.d.ts                  # TypeScript type definitions
├── package.json
├── scripts/
│   └── copy-binding.mjs        # Copies target/release/*.dylib → web_parser.node
├── sample_files/
│   ├── page1.html              # Systems programming blog post
│   ├── page2.html              # napi-rs tutorial
│   └── page3.html              # E-commerce product listing
└── demo.mjs                    # End-to-end demo (11 scenarios)

Build Profiles

The release profile is tuned for maximum throughput:

[profile.release]
lto           = true   # link-time optimisation across all crates
opt-level     = 3      # full optimisation
codegen-units = 1      # single codegen unit (better inlining)
strip         = true   # strip debug symbols from the binary

For development / debugging, use cargo build (debug profile) and pass --debug to the copy script:

cargo build
node scripts/copy-binding.mjs --debug

License

MIT

About

12x faster Node.js HTML parsing add-on built in Rust via napi-rs. 1200 files in 0.9 seconds 🔥
