PALIMPSEST — Deterministic Crawl Kernel

Same input + same seed = identical crawl, identical artifacts, identical replay. Everything bends around this. If we lose this, we've built noise — not history.

What This Is

A crawl kernel, not a crawler. Provides deterministic scheduling, reproducible fetch execution, verifiable artifact generation, replay-grade normalization, and temporal indexing as a first-class primitive. CLI, UI, dashboards are thin skin. The kernel is the product.

Crate Map

| Crate | Responsibility | Key Invariant |
| --- | --- | --- |
| palimpsest-core | Shared types, error taxonomy, hash primitives | No IO. Pure types only. |
| palimpsest-frontier | Deterministic scheduler, priority, politeness | Same seed = same traversal order |
| palimpsest-envelope | Sealed execution context (seed, timestamp, DNS, TLS, headers) | Immutable after construction |
| palimpsest-fetch | Raw HTTP (reqwest/hyper) + Browser (CDP/WebKit) | Every fetch wraps an envelope |
| palimpsest-artifact | WARC++ serialization: HTTP exchange, DOM, resource graph | Content-addressed outputs |
| palimpsest-storage | Content-addressable blobs, chunked, compressed | Dedup is native, not post-process |
| palimpsest-index | Temporal graph: URL × time × hash × crawl context | Queryable history, not lookup |
| palimpsest-replay | HTTP reconstruction, DOM rehydration, resource graph rebuild | Bit-identical replay from artifacts |

The Six Laws (Never Break These)

  1. Determinism — Frontier ordering is seed-driven. Retry logic is explicit. No hidden randomness anywhere. No rand in core paths — seeded PRNG only.
  2. Idempotence — Same URL + same execution context = identical artifact hash.
  3. Content Addressability — All artifacts are BLAKE3 hash-addressed. Deduplication is structural.
  4. Temporal Integrity — Every capture binds wall clock + logical clock + crawl context + dependency chain.
  5. Replay Fidelity — Stored artifacts must be sufficient to reconstruct the HTTP exchange, DOM state, and resource dependency graph.
  6. Observability as Proof — Every decision is queryable. Every failure is replayable. Every artifact is verifiable.
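
Laws 2 and 3 together mean the storage key is derived from the bytes themselves, so writing the same artifact twice is a no-op. The following is a minimal sketch of that idea, not the palimpsest-storage implementation. `BlobStore`, `put`, `get`, and `blob_count` are hypothetical names, and a 64-bit FNV-1a digest stands in for BLAKE3 only so the example runs with the standard library alone.

```rust
use std::collections::HashMap;

/// Stand-in digest: 64-bit FNV-1a. This is NOT BLAKE3. It is used here only
/// to keep the sketch dependency-free; a 64-bit digest is far too short for
/// real content addressing.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}

/// Hypothetical content-addressed store: the key is the digest of the value.
#[derive(Default)]
struct BlobStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl BlobStore {
    /// Stores `bytes` and returns its address. Storing identical bytes again
    /// returns the same address and keeps a single copy.
    fn put(&mut self, bytes: &[u8]) -> u64 {
        let addr = fnv1a64(bytes);
        self.blobs.entry(addr).or_insert_with(|| bytes.to_vec());
        addr
    }

    /// Looks a blob up by address.
    fn get(&self, addr: u64) -> Option<&[u8]> {
        self.blobs.get(&addr).map(Vec::as_slice)
    }

    /// Number of distinct blobs held.
    fn blob_count(&self) -> usize {
        self.blobs.len()
    }
}
```

Because the address is a pure function of the content, deduplication needs no separate pass: a second `put` of the same bytes finds the entry already present.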

Error Philosophy

Every failure is classified into exactly one of: Network, Protocol, Rendering, Policy, DeterminismViolation, Storage, Replay. No silent retries. No swallowed errors. Failures are stored artifacts — they are part of the crawl record, not noise to discard.
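
As an illustration, the seven classes map naturally onto a closed enum, with failures appended to the crawl record rather than dropped. This is a sketch under assumptions: only the seven variant names come from the list above, while `FailureClass`, `FailureRecord`, `CrawlRecord`, and their fields are hypothetical and not taken from palimpsest-core.

```rust
use std::fmt;

/// The seven failure classes named above. Every failure gets exactly one.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum FailureClass {
    Network,
    Protocol,
    Rendering,
    Policy,
    DeterminismViolation,
    Storage,
    Replay,
}

/// Hypothetical failure record: a classified failure kept as data.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FailureRecord {
    class: FailureClass,
    url: String,
    attempt: u32,
    detail: String,
}

impl fmt::Display for FailureRecord {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{:?} failure for {} (attempt {}): {}",
            self.class, self.url, self.attempt, self.detail
        )
    }
}

/// Hypothetical crawl record: failures are appended and never discarded.
#[derive(Default)]
struct CrawlRecord {
    failures: Vec<FailureRecord>,
}

impl CrawlRecord {
    /// Stores the failure as part of the record instead of swallowing it.
    fn record_failure(&mut self, failure: FailureRecord) {
        self.failures.push(failure);
    }

    /// Counts stored failures of one class.
    fn count_by_class(&self, class: FailureClass) -> usize {
        self.failures.iter().filter(|f| f.class == class).count()
    }
}
```

Keeping the enum closed, with no catch-all `Other` variant, forces every new failure mode to be classified explicitly at the point where it is introduced.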

Architecture Principles

  • Zero shared mutable state in the core kernel
  • The ExecutionEnvelope is the critical abstraction — seed, timestamp, headers, DNS snapshot, TLS fingerprint, browser config. This is what makes replay + determinism possible.
  • WARC++ (Palimpsest Format Extension) extends standard WARC with structured metadata, resource graphs, and envelope context
  • The temporal index is a graph, not a flat lookup table. Dimensions: URL, time, content hash, crawl context.
  • Browser capture is first-class, not bolted on. Same envelope model, same artifact pipeline.
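
One way to picture the "immutable after construction" property of the ExecutionEnvelope is a struct with private fields, a single constructor, and read-only accessors. The sketch below assumes field types and method names that do not appear in this document (`timestamp_unix_ms`, `header`, `resolve`, a string TLS fingerprint, an optional browser config string). It is not the palimpsest-envelope API.

```rust
mod envelope {
    use std::collections::BTreeMap;
    use std::net::IpAddr;

    /// Sealed execution context. Fields are private and there are no
    /// setters, so code outside this module cannot mutate an envelope
    /// once it has been constructed.
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct ExecutionEnvelope {
        seed: u64,
        timestamp_unix_ms: u64,
        // BTreeMap keeps iteration order deterministic.
        headers: BTreeMap<String, String>,
        dns_snapshot: BTreeMap<String, Vec<IpAddr>>,
        tls_fingerprint: String,
        browser_config: Option<String>,
    }

    impl ExecutionEnvelope {
        /// The only way to build an envelope; all inputs are fixed here.
        pub fn new(
            seed: u64,
            timestamp_unix_ms: u64,
            headers: BTreeMap<String, String>,
            dns_snapshot: BTreeMap<String, Vec<IpAddr>>,
            tls_fingerprint: String,
            browser_config: Option<String>,
        ) -> Self {
            Self { seed, timestamp_unix_ms, headers, dns_snapshot, tls_fingerprint, browser_config }
        }

        pub fn seed(&self) -> u64 {
            self.seed
        }

        pub fn timestamp_unix_ms(&self) -> u64 {
            self.timestamp_unix_ms
        }

        /// Request header as captured when the envelope was sealed.
        pub fn header(&self, name: &str) -> Option<&str> {
            self.headers.get(name).map(String::as_str)
        }

        /// Resolves a host from the snapshot only, never from live DNS.
        pub fn resolve(&self, host: &str) -> Option<&[IpAddr]> {
            self.dns_snapshot.get(host).map(Vec::as_slice)
        }

        pub fn tls_fingerprint(&self) -> &str {
            &self.tls_fingerprint
        }

        pub fn browser_config(&self) -> Option<&str> {
            self.browser_config.as_deref()
        }
    }
}

use envelope::ExecutionEnvelope;
```

Resolving hosts from the stored snapshot rather than the live resolver is what would let a replay see the same addresses the original fetch saw.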

Rust Standards

@.claude/rules/rust-invariants.md

Testing

@.claude/rules/testing.md

Security

@.claude/rules/security.md

WARC++ Format

@.claude/rules/warc-format.md

Commit Convention

  • Prefix: feat:, fix:, refactor:, test:, docs:, perf:, chore:
  • Scope in parens: feat(frontier):, fix(envelope):, refactor(storage):
  • Body explains why, not what. The diff shows what.
  • Breaking changes: BREAKING: prefix in commit body
  • Every commit that touches fetch/artifact/replay must include a replay fidelity test
  • Every commit that touches frontier/envelope must include a determinism test

PR Convention

  • One concern per PR. No bundled drive-bys.
  • Must include tests exercising the invariant being changed
  • Benchmark before/after for any performance-sensitive path
  • No PR merges without cargo clippy -- -D warnings and cargo test passing

Dependency Policy

Minimize external dependencies — every dep is attack surface.

  • tokio for async runtime (no alternatives)
  • reqwest/hyper for HTTP
  • serde + serde_json for serialization
  • blake3 for content hashing
  • chrono for temporal types
  • tracing for structured observability
  • No rand crate in any core crate. Use palimpsest_core::CrawlSeed for all randomness.
  • New deps require justification in the PR description.
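
To illustrate what seed-only randomness can look like without the rand crate, here is a minimal sketch. The real `palimpsest_core::CrawlSeed` API is not shown in this document, so the `CrawlSeed` below is a hypothetical local stand-in, as are `SeededRng` and `seeded_order`. SplitMix64 is used as one simple, well-known generator. It is an assumption, not necessarily the project's choice.

```rust
/// Hypothetical stand-in for `palimpsest_core::CrawlSeed`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CrawlSeed(u64);

/// SplitMix64 generator: all state comes from the seed, so the output
/// sequence is a pure function of it.
struct SeededRng {
    state: u64,
}

impl SeededRng {
    fn from_seed(seed: CrawlSeed) -> Self {
        Self { state: seed.0 }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in `0..bound`. Plain modulo has a slight bias, which is
    /// acceptable for a sketch. `bound` must be non-zero.
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Returns the URLs in an order that depends only on the seed and the set
/// of URLs, not on the order in which they were discovered.
fn seeded_order(seed: CrawlSeed, urls: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = urls.iter().map(|u| u.to_string()).collect();
    out.sort(); // canonicalize so discovery order cannot leak in
    let mut rng = SeededRng::from_seed(seed);
    // Fisher-Yates shuffle driven by the seeded generator.
    for i in (1..out.len()).rev() {
        let j = rng.below(i as u64 + 1) as usize;
        out.swap(i, j);
    }
    out
}
```

A determinism test for code like this reduces to calling it twice with the same seed and asserting that the results are equal.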

Performance Targets

  • Millions of URLs/hour per cluster
  • Linear horizontal scaling
  • Sub-millisecond frontier dequeue
  • Content-hash lookup in constant time
  • Zero-copy artifact serialization where possible

What This Becomes

Not a crawler. Not a Wayback clone. The canonical memory layer of the web. Something auditors trust, AI systems consume, historians depend on, and adversaries cannot easily corrupt.

If the web disappeared tomorrow, Palimpsest could rebuild it — provably, deterministically, and without ambiguity.