Same input + same seed = identical crawl, identical artifacts, identical replay. Everything bends around this. If we lose this, we've built noise — not history.
A crawl kernel, not a crawler. Provides deterministic scheduling, reproducible fetch execution, verifiable artifact generation, replay-grade normalization, and temporal indexing as a first-class primitive. CLI, UI, dashboards are thin skin. The kernel is the product.
| Crate | Responsibility | Key Invariant |
|---|---|---|
| `palimpsest-core` | Shared types, error taxonomy, hash primitives | No IO. Pure types only. |
| `palimpsest-frontier` | Deterministic scheduler, priority, politeness | Same seed = same traversal order |
| `palimpsest-envelope` | Sealed execution context (seed, timestamp, DNS, TLS, headers) | Immutable after construction |
| `palimpsest-fetch` | Raw HTTP (reqwest/hyper) + Browser (CDP/WebKit) | Every fetch wraps an envelope |
| `palimpsest-artifact` | WARC++ serialization: HTTP exchange, DOM, resource graph | Content-addressed outputs |
| `palimpsest-storage` | Content-addressable blobs, chunked, compressed | Dedup is native, not post-process |
| `palimpsest-index` | Temporal graph: URL x time x hash x crawl context | Queryable history, not lookup |
| `palimpsest-replay` | HTTP reconstruction, DOM rehydration, resource graph rebuild | Bit-identical replay from artifacts |
- Determinism — Frontier ordering is seed-driven. Retry logic is explicit. No hidden randomness anywhere. No `rand` in core paths; seeded PRNG only.
- Idempotence — Same URL + same execution context = identical artifact hash.
- Content Addressability — All artifacts are BLAKE3 hash-addressed. Deduplication is structural.
- Temporal Integrity — Every capture binds wall clock + logical clock + crawl context + dependency chain.
- Replay Fidelity — Stored artifacts must be sufficient to reconstruct the HTTP exchange, DOM state, and resource dependency graph.
- Observability as Proof — Every decision is queryable. Every failure is replayable. Every artifact is verifiable.
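
The determinism invariant can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `CrawlSeed` newtype stands in for `palimpsest_core::CrawlSeed` (whose real definition is not shown in this document), SplitMix64 is simply a convenient small PRNG, and `traversal_order` is a hypothetical helper, not the frontier's actual API.

```rust
/// Hypothetical stand-in for `palimpsest_core::CrawlSeed`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CrawlSeed(pub u64);

/// SplitMix64: a small deterministic PRNG, used here only as an example
/// of "seeded PRNG only". The real kernel may use a different generator.
pub struct SeededRng {
    state: u64,
}

impl SeededRng {
    pub fn new(seed: CrawlSeed) -> Self {
        Self { state: seed.0 }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Produces a traversal order that depends only on the seed and the set of
/// URLs. The URLs are sorted and deduplicated first so that the order in
/// which they were discovered cannot leak into the result, then shuffled
/// with a Fisher-Yates pass driven by the seeded generator.
pub fn traversal_order(seed: CrawlSeed, urls: &[&str]) -> Vec<String> {
    let mut ordered: Vec<String> = urls.iter().map(|u| u.to_string()).collect();
    ordered.sort();
    ordered.dedup();

    let mut rng = SeededRng::new(seed);
    for i in (1..ordered.len()).rev() {
        // Modulo bias is ignored in this sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        ordered.swap(i, j);
    }
    ordered
}
```

Calling `traversal_order` twice with the same seed and the same URL set returns the same sequence, regardless of the order in which the URLs were passed in. A determinism test for the frontier could assert exactly that property.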
Every failure is classified into exactly one of: Network, Protocol, Rendering, Policy, DeterminismViolation, Storage, Replay. No silent retries. No swallowed errors. Failures are stored artifacts — they are part of the crawl record, not noise to discard.
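
As a rough sketch of that taxonomy, a closed enum makes "exactly one of" a compile-time property. The variant names come from the list above; `FailureRecord`, its fields, and `record_failure` are hypothetical names introduced only for this example.

```rust
/// The seven failure classes. A closed enum means every failure carries
/// exactly one class, and every `match` over it must be exhaustive.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum FailureClass {
    Network,
    Protocol,
    Rendering,
    Policy,
    DeterminismViolation,
    Storage,
    Replay,
}

impl FailureClass {
    pub const ALL: [FailureClass; 7] = [
        FailureClass::Network,
        FailureClass::Protocol,
        FailureClass::Rendering,
        FailureClass::Policy,
        FailureClass::DeterminismViolation,
        FailureClass::Storage,
        FailureClass::Replay,
    ];

    /// Stable label for storage and querying. No wildcard arm, so adding
    /// a variant forces this function to be updated.
    pub fn as_str(self) -> &'static str {
        match self {
            FailureClass::Network => "network",
            FailureClass::Protocol => "protocol",
            FailureClass::Rendering => "rendering",
            FailureClass::Policy => "policy",
            FailureClass::DeterminismViolation => "determinism_violation",
            FailureClass::Storage => "storage",
            FailureClass::Replay => "replay",
        }
    }
}

/// Hypothetical failure record: stored as part of the crawl record
/// rather than logged and dropped.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct FailureRecord {
    pub class: FailureClass,
    pub url: String,
    pub detail: String,
}

/// Builds the record to persist. Nothing is retried or swallowed here;
/// any retry would be a separate, explicit scheduling decision.
pub fn record_failure(class: FailureClass, url: &str, detail: &str) -> FailureRecord {
    FailureRecord {
        class,
        url: url.to_string(),
        detail: detail.to_string(),
    }
}
```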
- Zero shared mutable state in the core kernel
- The `ExecutionEnvelope` is the critical abstraction — seed, timestamp, headers, DNS snapshot, TLS fingerprint, browser config. This is what makes replay + determinism possible.
- WARC++ (Palimpsest Format Extension) extends standard WARC with structured metadata, resource graphs, and envelope context
- The temporal index is a graph, not a flat lookup table. Dimensions: URL, time, content hash, crawl context.
- Browser capture is first-class, not bolted on. Same envelope model, same artifact pipeline.
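
One way to picture an envelope that is immutable after construction is a builder that is consumed by a `seal` step, leaving a value with private fields and read-only accessors. This is a minimal sketch under stated assumptions: the field names and types, `EnvelopeBuilder`, and `seal` are all hypothetical, and the real `ExecutionEnvelope` may look quite different.

```rust
/// Sealed execution context. Fields are private and there are no setters,
/// so code outside this module cannot alter a value once it exists.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ExecutionEnvelope {
    seed: u64,
    timestamp_ms: u64,
    headers: Vec<(String, String)>,
    dns_snapshot: Vec<(String, String)>,
    tls_fingerprint: String,
    browser_config: Option<String>,
}

/// Hypothetical builder: all mutation happens here, before sealing.
pub struct EnvelopeBuilder {
    inner: ExecutionEnvelope,
}

impl EnvelopeBuilder {
    pub fn new(seed: u64, timestamp_ms: u64) -> Self {
        Self {
            inner: ExecutionEnvelope {
                seed,
                timestamp_ms,
                headers: Vec::new(),
                dns_snapshot: Vec::new(),
                tls_fingerprint: String::new(),
                browser_config: None,
            },
        }
    }

    pub fn header(mut self, name: &str, value: &str) -> Self {
        self.inner.headers.push((name.to_string(), value.to_string()));
        self
    }

    pub fn dns(mut self, host: &str, addr: &str) -> Self {
        self.inner.dns_snapshot.push((host.to_string(), addr.to_string()));
        self
    }

    pub fn tls_fingerprint(mut self, fingerprint: &str) -> Self {
        self.inner.tls_fingerprint = fingerprint.to_string();
        self
    }

    /// Consumes the builder and returns the sealed envelope.
    pub fn seal(self) -> ExecutionEnvelope {
        self.inner
    }
}

impl ExecutionEnvelope {
    pub fn seed(&self) -> u64 {
        self.seed
    }

    pub fn timestamp_ms(&self) -> u64 {
        self.timestamp_ms
    }

    pub fn headers(&self) -> &[(String, String)] {
        &self.headers
    }

    pub fn dns_snapshot(&self) -> &[(String, String)] {
        &self.dns_snapshot
    }

    pub fn tls_fingerprint(&self) -> &str {
        &self.tls_fingerprint
    }

    pub fn browser_config(&self) -> Option<&str> {
        self.browser_config.as_deref()
    }
}
```

Because the envelope derives `PartialEq`, two fetches can be checked for "same execution context" by plain equality, which is the precondition the idempotence invariant relies on.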
@.claude/rules/rust-invariants.md
@.claude/rules/testing.md
@.claude/rules/security.md
@.claude/rules/warc-format.md
- Prefix: `feat:`, `fix:`, `refactor:`, `test:`, `docs:`, `perf:`, `chore:`
- Scope in parens: `feat(frontier):`, `fix(envelope):`, `refactor(storage):`
- Body explains why, not what. The diff shows what.
- Breaking changes: `BREAKING:` prefix in commit body
- Every commit that touches fetch/artifact/replay must include a replay fidelity test
- Every commit that touches frontier/envelope must include a determinism test
- One concern per PR. No bundled drive-bys.
- Must include tests exercising the invariant being changed
- Benchmark before/after for any performance-sensitive path
- No PR merges without `cargo clippy -- -D warnings` and `cargo test` passing
Minimize external dependencies — every dep is attack surface.
- `tokio` for async runtime (no alternatives)
- `reqwest`/`hyper` for HTTP
- `serde` + `serde_json` for serialization
- `blake3` for content hashing
- `chrono` for temporal types
- `tracing` for structured observability
- No `rand` crate in any core crate. Use `palimpsest_core::CrawlSeed` for all randomness.
- New deps require justification in the PR description.
- Millions of URLs/hour per cluster
- Linear horizontal scaling
- Sub-millisecond frontier dequeue
- Content-hash lookup in constant time
- Zero-copy artifact serialization where possible
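
The content-hash lookup target and structural deduplication can be sketched with a hash-keyed map. To keep the example free of external crates, it uses 64-bit FNV-1a as a stand-in hash; this is an assumption for illustration only, since the project specifies BLAKE3, and a 64-bit hash is far too small for real content addressing. `ContentHash`, `content_hash`, and `BlobStore` are hypothetical names.

```rust
use std::collections::HashMap;

/// Stand-in content address. The real kernel would carry a BLAKE3 digest.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ContentHash(pub u64);

/// 64-bit FNV-1a, used here only because it needs no dependencies.
pub fn content_hash(bytes: &[u8]) -> ContentHash {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    ContentHash(h)
}

/// In-memory blob store keyed by content hash.
#[derive(Default)]
pub struct BlobStore {
    blobs: HashMap<ContentHash, Vec<u8>>,
}

impl BlobStore {
    /// Stores the bytes under their own hash. Writing identical content
    /// twice keeps a single copy, so deduplication falls out of the key.
    pub fn put(&mut self, bytes: &[u8]) -> ContentHash {
        let key = content_hash(bytes);
        self.blobs.entry(key).or_insert_with(|| bytes.to_vec());
        key
    }

    /// Average-case constant-time lookup by content hash.
    pub fn get(&self, key: ContentHash) -> Option<&[u8]> {
        self.blobs.get(&key).map(|v| v.as_slice())
    }

    pub fn len(&self) -> usize {
        self.blobs.len()
    }

    pub fn is_empty(&self) -> bool {
        self.blobs.is_empty()
    }
}
```

A `HashMap` gives constant-time lookup only on average and only in memory; a persistent, chunked, compressed store would need a different structure to meet the same target.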
Not a crawler. Not a Wayback clone. The canonical memory layer of the web. Something auditors trust, AI systems consume, historians depend on, and adversaries cannot easily corrupt.
If the web disappeared tomorrow, Palimpsest could rebuild it — provably, deterministically, and without ambiguity.