nostrax is a production-grade, async web crawler and URL extraction toolkit for mapping, analyzing, and auditing websites at scale. Built on aiohttp and lxml, it pairs high-throughput concurrent crawling with security-hardened input handling for SEO, QA, security research, and infrastructure audit workflows in regulated environments.
It is designed to be embedded in CI pipelines, scheduled jobs, and internal tooling - with deterministic exit codes, structured output, standards-based logging, zero telemetry, and no outbound calls beyond the sites you explicitly target.
The name derives from "Nostos" - the Greek concept of a heroic return journey - reflecting the tool's purpose of mapping the digital landscape.
- Why nostrax
- Features
- Compatibility
- Installation
- Quick Start
- Common Workflows
- Configuration
- Python API
- CLI Reference
- Exit Codes
- Observability
- Architecture
- Security
- Versioning and Stability
- Support
- Development
- License
| Need | Solution |
|---|---|
| Crawl a large site quickly | Async I/O with connection pooling and DNS caching |
| Audit for broken links | --check-status reports HTTP status codes for every URL |
| Map a specific section of a site | --scope restricts crawling to a URL path prefix |
| Resume interrupted crawls | --cache-dir persists state to disk |
| Comply with robots.txt | --respect-robots with a standards-compliant parser |
| Respect rate limits | --rate-limit enforces minimum delay between requests |
| Integrate with CI/CD | JSON/CSV output, exit codes, silent mode |
| Generate stakeholder reports | Self-contained HTML reports with filterable tables |
| Crawl behind a proxy | Proxy and HTTP basic auth support |
- Async I/O via
aiohttpwith shared session, connection pooling, and DNS caching - Unified DFS / BFS engine sharing a bounded frontier and a worker pool, so
--strategy bfsruns at the same concurrency as--strategy dfs - Full-jitter retry for transient network failures, avoiding lockstep retry storms
- Separated timeouts: total, connect, and per-read budgets can be set independently
- Content-Type aware - skips non-HTML responses before downloading
- Response size limits prevent memory exhaustion (default 10 MB for pages, 1 MiB for robots.txt, 50 MiB for sitemaps)
- Per-host rate limiting:
--rate-limitapplies to each netloc independently, not globally - robots.txt support with standards-compliant URL matching and redirects disabled
- Scope control to restrict crawling to a URL path prefix
- Resume from disk via on-disk cache with atomic rewrites of the visited set and a kept-open append handle for results
- Graceful shutdown:
SIGINT(Ctrl+C) flushes the cache before exiting with code 130
- URL extraction from
<a>,<img>,<script>,<link>,<form>,<iframe>,<video>,<audio>,<source>, and<meta http-equiv="refresh">tags <img srcset>and<source srcset>are fully unpacked (every candidate URL is emitted)<base href>is honoured when resolving relative URLs on the page- lxml C-based parser with
SoupStrainerfor targeted parsing - URL normalization - strips fragments, trailing slashes, sorts query params, lowercases host
- Credential stripping prevents leaking userinfo in output
- Sitemap.xml parsing with sitemap index support, through
defusedxmlto refuse XML-bomb / external-DTD payloads at parse time
- Filter by domain (internal / external / all), protocol, regex include, regex exclude
--pattern/--excluderegexes run through theregexpackage with a per-URL timeout, bounding worst-case ReDoS exposure- Output formats: plain, JSON, CSV, self-contained HTML report
- Rich metadata per URL: source page, tag type, depth, response time, status
- Deduplication and sorting
- File output with path traversal protection
- HTTP basic authentication
- HTTP, HTTPS, SOCKS4, SOCKS5 proxy support, credentials redacted from logs
- Broken link checker via concurrent HEAD requests on the same aiohttp session the crawl used (one DNS/TLS handshake per host, not two)
- Progress bar via optional
tqdmdependency - Config file via
.nostraxrc(TOML format) - PyPI version check via
--check-update
| Component | Supported |
|---|---|
| Python | 3.10, 3.11, 3.12, 3.13 |
| Operating systems | Linux, macOS, Windows |
| Architectures | x86_64, arm64 |
| Runtime dependencies | aiohttp >= 3.9, beautifulsoup4 >= 4.12, lxml >= 5.0 |
| Optional dependencies | tqdm >= 4.60 (progress bar) |
| Network | Outbound HTTP/HTTPS to target hosts; HTTP/HTTPS/SOCKS4/SOCKS5 proxy support |
| Packaging | PEP 517 / PEP 621 (pyproject.toml), installs via pip or any PEP 517 frontend |
No telemetry is collected. The process makes no network calls to third parties beyond the target host(s) and, when explicitly requested, a single PyPI lookup via --check-update.
nostrax is not yet published to PyPI, so pip install nostrax and pipx install nostrax will not resolve. Install from source or directly from the Git repository using either pip (library or CLI) or pipx (isolated CLI).
pipx installs the nostrax command into an isolated virtual environment and puts it on your PATH, without polluting your system or project Python.
# from a local clone
git clone https://github.com/prodrom3/nostrax.git
cd nostrax
pipx install .
# or directly from GitHub
pipx install git+https://github.com/prodrom3/nostrax.git
# with the progress bar extra
pipx install "git+https://github.com/prodrom3/nostrax.git#egg=nostrax[progress]"
# upgrade later
pipx upgrade nostrax
# uninstall
pipx uninstall nostraxgit clone https://github.com/prodrom3/nostrax.git
cd nostrax
pip install . # library + CLI
pip install ".[progress]" # with progress bar support
pip install -e ".[dev]" # editable, with test toolingOr directly from GitHub:
pip install "git+https://github.com/prodrom3/nostrax.git"
pip install "git+https://github.com/prodrom3/nostrax.git@v2.0.0" # pin to a tagnostrax --version
nostrax --check-updateRequires Python 3.10 or newer.
Extract all links from a single page:
nostrax -t https://example.comCrawl internal links two levels deep and save to JSON:
nostrax -t https://example.com -d 2 --domain internal -f json -o results.jsonAudit a site for broken links and generate an HTML report:
nostrax -t https://example.com -d 1 --check-status -f html -o audit.htmlCrawl internal links with a sitemap seed, sorted, with response timings:
nostrax -t https://example.com -d 3 \
--domain internal --sitemap \
--metadata --sort \
-f html -o audit.htmlCheck every discovered URL's HTTP status:
nostrax -t https://example.com -d 2 --check-status -f json -o links.jsonRate-limited, robots-respecting crawl suitable for production targets:
nostrax -t https://example.com -d 3 \
--rate-limit 1.0 --retries 3 \
--respect-robots \
--user-agent "MyBot/1.0 (+https://mysite.com/bot)"Crawl only the /docs/ section of a site using BFS:
nostrax -t https://example.com -d 5 \
--strategy bfs --scope /docs/ \
-f csv -o docs-coverage.csvSave state to disk so interrupted crawls can resume:
nostrax -t https://example.com -d 5 --cache-dir .crawl_cache
# If interrupted (Ctrl+C, crash, timeout), re-run the same command to resume.nostrax -t https://internal.example.com \
--proxy http://corporate-proxy:8080 \
--auth username:passwordExtract all static assets, excluding images:
nostrax -t https://example.com --all-tags --exclude "\.(jpg|png|gif|svg)$"Persist common options in .nostraxrc (TOML format) in the current directory or home folder. CLI arguments override config values.
# .nostraxrc
depth = 2
rate_limit = 0.5
respect_robots = true
user_agent = "mybot/1.0"
max_concurrent = 20
scope = "/docs/"
retries = 3Bypass the config file with --no-config.
from nostrax import crawl
urls = crawl("https://example.com", depth=1)from nostrax import crawl
results = crawl("https://example.com", depth=1, include_metadata=True)
for r in results:
time_str = f"{r.response_time:.0f}ms" if r.response_time else "n/a"
print(f"{r.url} ({time_str}, from {r.source}, tag={r.tag})")from nostrax import crawl
urls = crawl(
"https://example.com",
depth=3,
strategy="bfs",
scope="/docs/",
cache_dir=".crawl_cache",
respect_robots=True,
rate_limit=0.5,
)import asyncio
from nostrax import crawl_async
urls = asyncio.run(crawl_async(
"https://example.com",
depth=2,
max_concurrent=20,
rate_limit=0.5,
retries=3,
respect_robots=True,
))from nostrax import normalize_url
assert normalize_url("https://Example.COM/page/") == "https://example.com/page"
assert normalize_url("https://example.com/page#top") == "https://example.com/page"
assert normalize_url("https://user:pass@example.com/") == "https://example.com/"from nostrax import crawl, NostraxError
try:
urls = crawl("https://example.com")
except NostraxError as e:
# Raised when the starting URL could not be fetched
# (network error, robots block, content-type mismatch, etc.)
print(f"Crawl failed: {e}")crawl and crawl_async accept a metrics= argument implementing the
MetricsSink Protocol. Pair this with your Prometheus / OpenTelemetry
/ Datadog client to stream crawl events:
from nostrax import crawl
from nostrax.metrics import MetricsSink
class PrometheusSink:
def on_page_fetched(self, url, depth, elapsed_ms, urls_found):
PAGES_FETCHED.inc()
FETCH_LATENCY_MS.observe(elapsed_ms)
def on_fetch_failed(self, url, depth):
PAGES_FAILED.inc()
def on_robots_blocked(self, url):
ROBOTS_BLOCKED.inc()
crawl("https://example.com", depth=2, metrics=PrometheusSink())Sink methods are called synchronously on the event-loop thread, so implementations must be cheap. Exceptions raised by a sink are isolated: they are logged as warnings and do not abort the crawl.
For Playwright-rendered pages, custom authentication flows, or a
caching fetcher, pass any callable that matches the Fetcher or
Extractor Protocol:
from nostrax import crawl
from nostrax.protocols import Fetcher
async def playwright_fetcher(session, url, *, timeout, max_response_size,
retries, proxy, connect_timeout, read_timeout):
# Render JS, return (html, elapsed_ms) or (None, elapsed_ms) on failure
...
urls = crawl("https://example.com", fetcher=playwright_fetcher)usage: nostrax [-h] [-V] [--check-update] -t TARGET [-s] [-d DEPTH]
[--all-tags] [--tags TAGS] [--domain {all,internal,external}]
[--protocol PROTOCOL] [--pattern PATTERN] [--exclude EXCLUDE]
[--sort] [-f {plain,json,csv,html}] [-o OUTPUT]
[--timeout TIMEOUT] [--connect-timeout SECS] [--read-timeout SECS]
[--user-agent USER_AGENT] [-v] [--no-dedup]
[--max-concurrent N] [--respect-robots] [--max-urls N]
[--rate-limit SECS] [--proxy URL] [--auth USER:PASS]
[--sitemap] [--check-status] [--metadata] [--progress]
[--retries N] [--scope PATH] [--strategy {dfs,bfs}]
[--cache-dir DIR] [--no-config]
| Flag | Description |
|---|---|
-V, --version |
Show version and exit |
--check-update |
Check PyPI for a newer version and exit |
-t, --target |
Target URL to extract from (required) |
-s, --silent |
Suppress all output (exit code only) |
-d, --depth |
Recursion depth for crawling (default: 0) |
--all-tags |
Extract URLs from all supported tags |
--tags |
Comma-separated list of HTML tags to extract from |
--domain |
Filter by domain: all, internal, or external |
--protocol |
Comma-separated protocols to keep |
--pattern |
Regex pattern to filter URLs (keep matches) |
--exclude |
Regex pattern to exclude URLs (remove matches) |
--sort |
Sort URLs alphabetically |
-f, --format |
Output format: plain, json, csv, or html |
-o, --output |
Write output to file instead of stdout |
--timeout |
Total request timeout in seconds (default: 10) |
--connect-timeout |
Per-connection timeout in seconds (defaults to --timeout) |
--read-timeout |
Per-socket-read timeout in seconds (defaults to --timeout) |
--user-agent |
Custom User-Agent string |
-v, --verbose |
Enable verbose logging |
--no-dedup |
Keep duplicate URLs |
--max-concurrent |
Max concurrent HTTP requests (default: 10) |
--respect-robots |
Check robots.txt before crawling |
--max-urls |
Stop crawling after this many URLs (default: 50000) |
--rate-limit |
Minimum seconds between requests per host (default: 0) |
--proxy |
Proxy URL (http, https, socks4, socks5) |
--auth |
HTTP basic auth as user:password |
--sitemap |
Also parse sitemap.xml for additional URLs |
--check-status |
Check HTTP status code of each discovered URL |
--metadata |
Include source page, tag type, and depth in output |
--progress |
Show a progress bar (requires tqdm) |
--retries |
Retry attempts for failed requests (default: 2) |
--scope |
Restrict crawling to a URL path prefix |
--strategy |
Crawl strategy: dfs or bfs (default: dfs) |
--cache-dir |
Directory to cache crawl state for resume support |
--no-config |
Ignore .nostraxrc config file |
JSON output with --metadata:
[
{
"url": "https://example.com/about",
"source": "https://example.com",
"tag": "a",
"depth": 0,
"response_time_ms": 142.3
}
]nostrax returns deterministic exit codes suitable for CI gates and orchestration:
| Code | Meaning |
|---|---|
0 |
Success - crawl completed and output was produced |
1 |
Failure - invalid input, network error, write failure, or no URLs discovered |
2 |
Argument parsing error (argparse default) |
Combine with --silent to suppress stdout and rely on the exit code alone in pipelines.
nostrax uses Python's standard logging module with module-level loggers under the nostrax.* namespace. No logging is configured by default when used as a library - callers control destinations, format, and levels.
import logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logging.getLogger("nostrax").setLevel(logging.DEBUG)From the CLI, use -v / --verbose to raise the default log level to DEBUG. Loggers are namespaced by module (nostrax.crawler, nostrax.robots, nostrax.validation, etc.) so operators can filter or route per-subsystem to JSON log aggregators, SIEM pipelines, or structured log processors of their choice.
flowchart TD
CLI["cli.py<br/>argparse + validation + config merge"]
Crawler["crawler.py<br/>unified frontier + worker pool"]
Post["filters / normalize / output<br/>post-processing"]
Frontier["LifoQueue (DFS) / Queue (BFS)<br/>bounded at max_urls * 2"]
RateLimiter["PerHostRateLimiter<br/>per-netloc serialisation"]
Fetcher["Fetcher protocol<br/>default: fetch_page"]
Extractor["Extractor protocol<br/>default: extract_urls"]
Robots["robots.py"]
Cache["cache.py<br/>atomic visited + kept-open results"]
Sitemap["sitemap.py<br/>defusedxml"]
Status["status.py<br/>same aiohttp session"]
Connector["aiohttp TCPConnector"]
Resolver["SafeResolver<br/>re-applies SSRF classifier"]
Metrics["MetricsSink<br/>on_page_fetched / failed / robots_blocked"]
Filters["filter_by_domain / protocol<br/>regex with per-URL timeout"]
Output["plain / json<br/>csv / html"]
CLI --> Crawler
CLI --> Post
Crawler --> Frontier
Crawler --> RateLimiter
Crawler --> Fetcher
Crawler --> Extractor
Crawler --> Robots
Crawler --> Cache
Crawler --> Sitemap
Crawler --> Status
Crawler --> Metrics
Fetcher --> Connector
Status --> Connector
Connector --> Resolver
Post --> Filters
Post --> Output
| Module | Purpose |
|---|---|
nostrax.cli |
Command-line interface, argument parsing, input validation |
nostrax.crawler |
Unified async crawl engine (frontier + worker pool), PerHostRateLimiter |
nostrax.extractor |
HTML parsing and URL extraction (lxml + SoupStrainer, <base>, srcset, meta-refresh) |
nostrax.filters |
Domain, protocol, and regex-backed timeout-bounded filters |
nostrax.output |
Output formatting (plain, JSON, CSV) |
nostrax.report |
HTML report generation |
nostrax.models |
UrlResult dataclass |
nostrax.normalize |
URL normalization |
nostrax.sitemap |
sitemap.xml parser through defusedxml |
nostrax.status |
Async HTTP status checker |
nostrax.robots |
robots.txt compliance |
nostrax.cache |
On-disk crawl cache with atomic visited save + kept-open results handle |
nostrax.config |
.nostraxrc loader |
nostrax.validation |
Input validation (SSRF classifier, bounds, headers, credential redaction) |
nostrax.resolver |
SafeResolver for aiohttp, applies the SSRF classifier to every DNS resolution |
nostrax.metrics |
MetricsSink protocol and NullMetricsSink default |
nostrax.protocols |
Fetcher and Extractor protocols for extensibility |
nostrax.updater |
PyPI version check |
nostrax.exceptions |
Custom exception hierarchy |
nostrax is hardened by default against common attack vectors in crawler and scraper tooling. Defenses are implemented as non-optional controls inside nostrax.validation, nostrax.sitemap, nostrax.normalize, and the request layer - they are not opt-in flags.
| Control | Mitigation |
|---|---|
| SSRF prevention | Target URL validation rejects file://, private IPs (RFC 1918), loopback, link-local, multicast, reserved, unspecified, IPv4-mapped IPv6 variants of the above, and cloud metadata endpoints (169.254.169.254). Domain-name targets are resolved via socket.getaddrinfo and every returned address is run through the same unsafe-IP classifier. A custom aiohttp resolver re-applies the classifier at connection time, so every DNS resolution during the crawl is re-validated and a TTL=0 DNS rebinding attack that flips the authoritative record between CLI validation and fetch cannot slip past the filter. |
| XXE prevention | Sitemap XML is parsed through defusedxml, which refuses <!DOCTYPE>, <!ENTITY>, external DTDs, and classic billion-laughs / entity-expansion payloads at parse time. A redundant string-level DOCTYPE/ENTITY rejection is kept as belt-and-suspenders in case the library is ever swapped or downgraded. |
| Sitemap loop protection | Max recursion depth of 5 with cycle detection on sitemap indexes |
| Path traversal prevention | Cache and output file paths are restricted to the working directory |
| Header injection prevention | User-Agent values are validated for CR/LF and length bounds |
| Open redirect prevention | HTTP redirect following is disabled on all fetches |
| ReDoS mitigation | User-supplied --pattern and --exclude regexes are parsed and matched via the regex package, whose engine is largely immune to catastrophic backtracking out of the box. Each per-URL match is additionally wrapped in a 0.5 s budget; if it expires the URL is skipped with a warning rather than allowed to consume the crawl. |
| Response size limits | Default 10 MB per-response cap prevents memory exhaustion |
| Credential scrubbing | userinfo (user:pass@) components are stripped during URL normalization |
| Content-Type gating | Non-HTML responses are skipped before download where possible |
- All runtime dependencies are published, pinned-by-floor packages:
aiohttp,beautifulsoup4,defusedxml,lxml,packaging, plustomlion Python 3.10 (stdlibtomllibon 3.11+). - No runtime code execution of scraped content - HTML is parsed, never rendered in a JavaScript engine.
- No outbound telemetry. The only optional outbound call outside of targets is
--check-update, which queries PyPI on explicit user request.
See SECURITY.md for the full vulnerability reporting policy. In short: do not file public issues for suspected vulnerabilities; open a private GitHub security advisory or contact the maintainers privately. Acknowledgement within 5 business days; triage and remediation plan within 10.
nostrax follows Semantic Versioning 2.0.0.
- Major (
X.0.0) - backwards-incompatible changes to the public Python API (nostrax.crawl,nostrax.crawl_async,nostrax.extract_urls,nostrax.normalize_url,nostrax.UrlResult,nostrax.NostraxError) or to documented CLI flags and output schemas. - Minor (
2.X.0) - additive changes: new flags, new fields in structured output (JSON/CSV additive only), new optional parameters with safe defaults. - Patch (
2.0.X) - bug fixes, security fixes, documentation, dependency floor bumps that do not change behaviour.
The public surface is the symbols exported from nostrax.__init__ and the CLI flags documented in this README. Anything under nostrax._* or any module not re-exported from the top-level package is considered internal and may change without notice.
Security-only patch releases are issued out-of-band for confirmed vulnerabilities with the corresponding advisory.
Per-release changes are tracked in CHANGELOG.md.
| Channel | Purpose |
|---|---|
| GitHub Issues | Bug reports, feature requests, usage questions |
| GitHub Discussions | Design discussion, Q&A, community workflows |
| Security advisories | Private disclosure of suspected vulnerabilities |
| Releases | Release notes, changelog, upgrade guidance |
This project is maintained on a best-effort basis by the authors and community contributors under the MIT License. No commercial SLA is offered through this repository; organisations requiring a formal support arrangement should contact the maintainers directly.
pip install -e ".[dev]"
pytest # 150+ tests across 16 modules
pytest --cov=nostrax # With coverage
pytest tests/test_crawler.py # Single moduleTests use pytest, pytest-asyncio, pytest-cov, and aioresponses - all declared in the dev optional dependency group. The suite covers crawling, extraction, filtering, normalization, sitemap parsing, robots compliance, cache resume, CLI argument parsing, input validation (SSRF, path traversal, header injection, ReDoS), and output formatting.
black nostrax tests # FormatContributions are welcome. Please:
- Fork the repository and create a feature branch (
feature/my-feature). - Add or update tests for your change - new code is expected to keep coverage steady or improve it.
- Run
pytestandblack nostrax testsbefore opening a pull request. - Keep commits focused; prefer small, reviewable PRs over large ones.
- Sign off or otherwise indicate you have the right to contribute the code under the MIT License.
Commit messages follow Conventional Commits: feat:, fix:, docs:, refactor:, test:, chore:, perf:, ci:.
Created by prodrom3 at radamic.
Released under the MIT License. The MIT License imposes no restrictions on commercial use, modification, distribution, or private use, subject to inclusion of the copyright notice and license text. Third-party dependencies retain their own licenses; review them before redistributing bundled artifacts.
Built on the work of the open-source community, particularly aiohttp, beautifulsoup4, and lxml.
