Spartan Scraper is a local-first scraping workbench for turning a URL into a clean result, a bounded crawl, or a research job without standing up cloud infrastructure.
It is built for people who want one dependable workflow from fetch to stored artifacts: open the UI or CLI, submit work, inspect results locally, promote a verified job into reusable automation, and return to saved watch/export history for daily operations or failure recovery.
A healthy first run works without any AI, proxy-pool, or retention setup. Those are optional subsystems you can enable later when a workflow actually calls for them.
If you want the fastest path in, start with the 5-minute demo below. If you are integrating it into a real workflow, the API, MCP server, schedules, and local artifact model all build on that same core path.
Planning and future work live in docs/roadmap.md. That document is the canonical source of truth for what is in flight, next, and explicitly out of scope for the current cutover.
- Start from a URL and get something useful quickly: extracted content, crawl output, or a research bundle.
- Keep everything local by default: jobs, artifacts, auth profiles, schedules, and render rules stay on disk.
- Use the same job model everywhere: Web UI, CLI, API, TUI, and MCP all operate on the same persisted workflows.
- Stay practical for real sites: HTTP-first by default, Chromedp/Playwright when pages are JS-heavy or need login flows.
This walkthrough uses the default out-of-the-box path. Leave AI, proxy pooling, and retention off; they are optional and not needed for the first successful scrape.
git clone https://github.com/fitchmultz/spartan-scraper.git
cd spartan-scraper
make verify-toolchain
make install
make generate
make build
# terminal 1
./bin/spartan server
# terminal 2
make web-dev
If startup detects a legacy or unsupported .data directory, Spartan now serves guided setup mode instead of failing only in the terminal. Run ./bin/spartan reset-data once to archive the current data directory under output/cutover/ and recreate .data for a fresh Balanced 1.0 boot.
Open http://localhost:5173, submit a scrape job for https://example.com, and expect:
- the dashboard to show a new job move into succeeded
- the results panel to include Example Domain
- the metrics widget to show a live WebSocket connection
If you want a CLI-only proof first:
./bin/spartan scrape --url https://example.com --out ./out/example.json
cat ./out/example.json
Expected output includes the page title text Example Domain.
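For scripted validation, a plain text match on the saved file is usually enough. The snippet below is a sketch: it fabricates a stand-in result file (the real result's exact JSON shape is not assumed here), so only the check pattern itself is illustrated; a real run would point grep at the file produced by --out.

```shell
# Illustrative stand-in for a saved scrape result; a real run
# would check the file produced by --out instead of creating one.
mkdir -p ./out
printf '{"title":"Example Domain"}\n' > ./out/example.json
# Exits 0 and prints a confirmation when the title text is present.
grep -q 'Example Domain' ./out/example.json && echo "scrape looks good"
```

This exit-code style check drops cleanly into CI scripts or a Makefile target.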
For a more guided version with expected checkpoints, see docs/demo.md.
The current operator path is:
- submit a scrape, crawl, or research job
- inspect the saved result on the job detail route
- promote that verified job into a template, watch, or export schedule draft
- save the automation from its destination surface
- inspect persisted watch checks and export outcomes from /automation/watches and /automation/exports
- follow guided recovery actions when a watch check or export run fails
The README keeps the fast first-success path near the top. For the guided continuation into promotion, saved automation, history inspection, and failure recovery, use docs/operator.md.
- Single pages, full websites, and deep research workflows.
- Works for static HTML and JS‑heavy sites (headless Chromium or Playwright).
- Unified interfaces: CLI, TUI, and Web UI.
- Clean API contract (OpenAPI) with generated TS client.
- Local, self‑hosted, no SaaS dependencies.
- Webhook integrations that now distinguish JSON job events from multipart export deliveries, so downstream receivers get actual exported bytes on export.completed instead of placeholder path metadata.
Spartan Scraper is in 1.0.0-rc1 release prep. The latest published release-readiness evidence snapshot lives in docs/evidence/release-readiness/2026-04-05/README.md.
# Quick install (CLI-focused; requires Go 1.26+)
go install github.com/fitchmultz/spartan-scraper/cmd/spartan@latest
# Full local setup (recommended for contributors and operators)
git clone https://github.com/fitchmultz/spartan-scraper.git
cd spartan-scraper
make verify-toolchain
make install
make generate
make build
./bin/spartan --help
./bin/spartan server
make web-dev
If the default .data directory came from a pre-cutover build, reset it with:
./bin/spartan reset-data
Open http://localhost:5173, submit a scrape for https://example.com, and confirm the saved result contains Example Domain.
For a full local validation pass, run:
make ci-pr
Optional: install the binary into ~/.local/bin (or $XDG_BIN_HOME) with:
make install-bin
- Agents get an MCP surface, a deterministic local API, and a persistent job store they can inspect and reuse.
- Developers get one local system for UI, CLI, and API validation instead of separate throwaway scripts.
- Saved results can be re-authored locally too: bounded AI helpers can now refine research output, shape recurring exports, and generate validated JMESPath/JSONata transforms from representative persisted artifacts without launching new jobs.
- Teams get reproducible CI, generated API types, and stored artifacts that make behavior easier to verify.
- LICENSE - Apache License 2.0
- NOTICE - Apache 2.0 notice file
- CONTRIBUTING.md - How to contribute
- CODE_OF_CONDUCT.md - Code of conduct
- SECURITY.md - Security policy
- CHANGELOG.md - Release history
- RELEASING.md - Release workflow
- docs/README.md: docs index and navigation.
- docs/usage.md: CLI/API/Web/MCP entry points and flags.
- docs/architecture.md: system structure and flow.
- docs/demo.md: a fast clone-to-value walkthrough that continues into promotion and saved automation.
- docs/operator.md: canonical operator workflow for promotion, saved automation, history inspection, and failure recovery.
- docs/validation_checklist.md: copy/paste validation steps for setup, runtime checks, and public-readiness smoke tests.
- RELEASING.md: release workflow and pre-tag checklist.
- docs/ci.md: CI tiers, runtime expectations, and resource profile guidance.
- docs/performance.md: tuning and scaling guidance.
- docs/landscape.md: ecosystem positioning and design trade-offs.
# Single page scrape (HTTP)
./bin/spartan scrape \
--url https://example.com \
--out ./out/example.json
# Headless scrape with login (form-based)
./bin/spartan scrape \
--url https://example.com/dashboard \
--headless \
--playwright \
--login-url https://example.com/login \
--login-user-selector '#email' \
--login-pass-selector '#password' \
--login-submit-selector 'button[type=submit]' \
--login-user you@example.com \
--login-pass 'demo-password' \
--out ./out/dashboard.json
# Crawl a site (depth-limited)
./bin/spartan crawl \
--url https://example.com \
--max-depth 2 \
--max-pages 200 \
--out ./out/site.jsonl
# Deep research
./bin/spartan research \
--query "pricing model" \
--urls https://example.com,https://example.com/docs \
--out ./out/research.jsonl
# If you later load a proxy pool, prefer residential us-east proxies from it for one request
./bin/spartan scrape \
--url https://example.com \
--proxy-region us-east \
--proxy-tag residential \
--out ./out/example.json
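This README does not show the pool file format, so purely as a hypothetical sketch: whatever shape PROXY_POOL_FILE actually takes, each entry would need at least the fields the selection flags filter on (region and tag here mirror --proxy-region and --proxy-tag above; the url field and JSON layout are assumptions, not the documented schema).

```json
[
  {
    "url": "http://user:pass@proxy-1.example.net:8080",
    "region": "us-east",
    "tag": "residential"
  }
]
```

Check the project docs for the real schema before building a pool file.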
# MCP server (stdio)
./bin/spartan mcp
# Auth profiles
./bin/spartan auth set --name acme --auth-basic user:pass --header "X-API-Key: token-from-provider"
./bin/spartan scrape --url https://example.com --auth-profile acme
# Extraction Templates
./bin/spartan scrape --url https://example.com/product/123 --extract-template product
./bin/spartan scrape --url https://example.com --extract-config my-template.json
# Export results
./bin/spartan export --job-id <id> --format md --out ./out/report.md
./bin/spartan export --job-id <id> --schedule-id <export-schedule-id>
./bin/spartan export --inspect-id <export-id>
./bin/spartan export --history-job-id <job-id>
./bin/spartan export-schedule history --id <export-schedule-id>
# Watch inspection
./bin/spartan watch check <watch-id>
./bin/spartan watch history <watch-id>
./bin/spartan watch history <watch-id> --check-id <check-id>
# Schedules
./bin/spartan schedule add --kind scrape --interval 3600 --url https://example.com
./bin/spartan schedule list
# Run API server + background worker (API binds to localhost by default)
./bin/spartan server
# Archive and recreate a legacy/default .data directory for Balanced 1.0
./bin/spartan reset-data
# Expose API on all interfaces (use with caution)
# API key auth is auto-enforced when BIND_ADDR is non-localhost.
BIND_ADDR=0.0.0.0 ./bin/spartan server
# Launch TUI
./bin/spartan tui
./bin/spartan server
# in a second terminal:
make web-dev
Open http://localhost:5173 for the UI.
Validated operator continuation after a successful job:
- /jobs/:id is the promotion starting point for templates, watches, and export schedules
- /templates is where promoted templates are saved and previewed
- /automation/watches is where promoted watches are saved, checked, and inspected through persisted history
- /automation/exports is where promoted export schedules are saved and where export outcome history stays inspectable across success and failure states
The dev server proxies API requests to http://localhost:8741 by default.
WebSocket upgrades to /v1/ws accept browser origins from loopback hosts only (localhost, 127.0.0.1, ::1).
Non-browser clients without an Origin header remain supported.
If you run the backend on a different local port, set DEV_API_PROXY_TARGET=http://127.0.0.1:<port> in web/.env so the dev proxy stays same-origin.
Use VITE_API_BASE_URL only for deployed cross-origin builds where the browser should call a remote API directly.
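Putting the two variables together, a web/.env for a backend moved to a different local port might look like this sketch (the port value 9000 is an example, not a project default):

```shell
# web/.env — keep the dev proxy same-origin when the backend port changes
DEV_API_PROXY_TARGET=http://127.0.0.1:9000

# Only for deployed cross-origin builds where the browser calls a remote API:
# VITE_API_BASE_URL=https://api.example.com
```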
Repo-local AI defaults live in .env and config/pi-routes.json, but core scrape/crawl/research workflows work without AI. Leave PI_ENABLED=false for the default first run; when you later set PI_ENABLED=true, Spartan asks pi for routes in this order: kimi-coding/k2p5, zai/glm-5, openai-codex/gpt-5.4. Auth, account selection, and billing stay in pi; if you want a different route order or different provider/model IDs, override PI_CONFIG_PATH or edit that routes file locally.
Proxy pooling and retention are optional too: leave PROXY_POOL_FILE unset and RETENTION_ENABLED=false until you actually need pooled routing or automated cleanup.
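As a sketch of that default-off posture, the relevant .env lines for a first run would look like this (variable names are the ones mentioned above; keep these values until a workflow needs the subsystem):

```shell
# .env — optional subsystems stay off for the first successful scrape
PI_ENABLED=false
RETENTION_ENABLED=false
# PROXY_POOL_FILE stays unset until pooled routing is actually needed
```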
- Web UI for job submission, monitoring, automation, and settings
- CLI for scripting and local automation
- REST + WebSocket APIs for integrations
- MCP server for agent orchestration
- TUI for terminal-first inspection
- cmd/spartan: main entrypoint for CLI/TUI/API server.
- internal/fetch: HTTP + headless fetchers (Chromedp + Playwright).
- internal/extract: HTML parsing + text/metadata extraction.
- internal/crawl: BFS crawler with depth/limit and domain scoping (robots ignored by default; opt-in support available).
- internal/jobs: persistent job store + queue runner.
- internal/api: HTTP API + OpenAPI contract.
- web: Vite + React UI; generated API client under web/src/api.
- internal/research: multi-source workflow (scrape/crawl → evidence → simhash dedup → clustering → citations + confidence → summary).
- internal/mcp: MCP stdio server for agent orchestration.
- internal/exporter: exports results to json/jsonl/csv/markdown/xlsx.
- internal/scheduler: recurring job runner with interval schedules.
- Robots.txt is ignored by default; enable compliance with --respect-robots or RESPECT_ROBOTS_TXT=true.
- Auth support: headers, cookies, basic auth, tokens, query params, and form login (headless).
- Auth vault is stored in .data/auth_vault.json (profiles + presets + inheritance).
- Render profiles (adaptive rules) are stored in .data/render_profiles.json.
- Rate limiting + retries are configurable via .env.
- Playwright can be enabled with USE_PLAYWRIGHT=1 or --playwright (CLI/API). Install browsers with make install-playwright. make ci-slow provisions Playwright automatically for clean-machine heavy runs.
Pinned in .tool-versions:
- Go 1.26.2
- Node 25.9 (any 25.9.x patch)
- pnpm 10.33.0
Use a .tool-versions-compatible version manager (for example mise install) to provision those pinned versions, then run make verify-toolchain before build/test work.
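Assuming the conventional asdf/mise entry names (golang, nodejs, pnpm are the common plugin names, but check the repo's actual file), the pins above would correspond to a .tool-versions roughly like:

```text
golang 1.26.2
nodejs 25.9.0
pnpm 10.33.0
```

The Node line may pin only the 25.9 minor in the real file, since any 25.9.x patch is accepted.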
GitHub workflow split:
- PR required: .github/workflows/ci-pr.yml (make ci-pr)
- Nightly/manual heavy checks: .github/workflows/ci-slow.yml (make ci-slow, deterministic local-fixture heavy lane that provisions Playwright browsers)
make verify-toolchain # Print and enforce the exact Go/Node/pnpm contract from .tool-versions
make audit-public # Scan tracked files + branch history for public-readiness leaks/secrets/placeholders
make secret-scan # Deep git-history secret scan (manual/nightly release-tier check)
make ci-pr # PR-equivalent deterministic gate (requires clean git state)
make ci # Full local gate (Go + web + pi-bridge install/build/tests)
make ci-slow # Deterministic heavy stress/e2e checks (local fixture; provisions Playwright)
make ci-network # Optional live-Internet smoke validation
CI_VITEST_MAX_WORKERS=2 make ci-pr # Optional local worker cap override
make ci-manual # Manual full heavy sweep (ci-slow + ci-network)