High-throughput web relevancy scoring service built in Rust. Crawls URLs at scale, extracts structured metadata, and delegates AI-powered relevancy scoring via the Anthropic Claude API. Designed for sustained operation at 10K+ URLs/day with minimal resource overhead.
```
┌──────────────────────────────────────────────────────┐
│                    Axum HTTP API                     │
│    /campaigns /urls /config /jobs /stats /schema     │
└────────┬───────────────┬───────────────┬─────────────┘
         │               │               │
    ┌────▼────┐    ┌─────▼─────┐   ┌─────▼─────┐
    │ Crawler │    │  Scorer   │   │   Queue   │
    │ Service │    │ (Claude)  │   │  Worker   │
    └────┬────┘    └─────┬─────┘   └─────┬─────┘
         │               │               │
         └───────────────▼───────────────┘
                    PostgreSQL
```
The backend exposes a REST API consumed by the admin portal. Long-running operations (crawling, scoring) are dispatched through a job queue backed by PostgreSQL, allowing the API to return immediately while work proceeds asynchronously.
| Layer | Choice | Rationale |
|---|---|---|
| Language | Rust 1.93+ | Memory safety, zero-cost abstractions, predictable latency |
| Framework | Axum 0.8 | Tower-based, async-native, composable middleware |
| Database | PostgreSQL via SQLx 0.8 | Compile-time query checking, async, migrations |
| HTTP Client | Reqwest 0.12 | Connection pooling, gzip/brotli, redirect handling |
| HTML Parsing | Scraper 0.22 | CSS selector-based extraction, built on html5ever |
| Observability | tracing + tracing-subscriber | Structured logging, env-based filtering |
```
src/
├── main.rs              # Entrypoint: pool init, migrations, router composition
├── api/
│   ├── mod.rs           # Route tree
│   ├── campaigns.rs     # CRUD + start + results
│   ├── urls.rs          # Bulk import, list, delete, CSV export
│   ├── config.rs        # Runtime config CRUD (stored in DB)
│   ├── jobs.rs          # Queue inspection and cancellation
│   └── crawl.rs         # Manual crawl trigger
├── models/
│   └── mod.rs           # Domain types: UrlMetadata, Campaign, RelevancyCheck, Job, SystemConfig
├── crawler/
│   └── mod.rs           # URL fetcher + HTML metadata extractor
├── scorer/
│   └── mod.rs           # Claude API integration (batched scoring)
├── queue/
│   └── mod.rs           # Background job worker (poll + dispatch)
├── db/
│   └── mod.rs           # Database utilities
└── config/
    └── mod.rs           # Typed application configuration
migrations/
└── 20260221_initial.sql # Schema + indexes + seed config
```
All endpoints return JSON. Base path: /api.
| Method | Path | Description |
|---|---|---|
| GET | /health | Returns "ok" |
| Method | Path | Description |
|---|---|---|
| GET | /campaigns | List campaigns (paginated: `?page=&per_page=`) |
| POST | /campaigns | Create campaign `{ name, reference_url?, urls[] }` |
| GET | /campaigns/:id | Get campaign detail |
| PUT | /campaigns/:id | Update campaign fields |
| DELETE | /campaigns/:id | Delete campaign and associated results |
| POST | /campaigns/:id/start | Queue campaign for processing |
| GET | /campaigns/:id/results | Paginated relevancy check results |
| Method | Path | Description |
|---|---|---|
| GET | /urls | List URLs (paginated) |
| POST | /urls | Bulk import `{ urls: string[] }`. Deduplicates on insert. |
| GET | /urls/:id | Get single URL metadata |
| DELETE | /urls/:id | Remove URL from database |
| GET | /urls/export | CSV export of all URL metadata |
| Method | Path | Description |
|---|---|---|
| GET | /config | List all config entries |
| GET | /config/:key | Get single config value |
| POST | /config | Upsert `{ key, value, description? }` |
Runtime-editable keys seeded at migration:
- `scoring_weights` — `{ relevancy, quality, authority }` point allocation
- `decision_thresholds` — `{ approve, reject }` score boundaries
- `schema_auto_reject` — array of schema.org types to filter pre-AI
- `ai_model` — Claude model identifier
- `ai_batch_size` — URLs per API call
- `crawl_concurrency` — max parallel fetches
- `crawl_timeout_secs` — per-request timeout
- `stale_threshold_days` — cache expiry window
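Reading JSONB config into typed structs is what makes "validate on read" workable. A minimal sketch, assuming the field names from the seeded `scoring_weights` key above; the sum-to-100 validation rule is an illustrative assumption, not documented behavior:

```rust
/// Typed view of the `scoring_weights` config entry.
/// Field names mirror the seeded key's documented shape;
/// the "must sum to 100" rule is an assumption for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ScoringWeights {
    relevancy: u32,
    quality: u32,
    authority: u32,
}

impl ScoringWeights {
    /// Reject weight sets that do not allocate exactly 100 points.
    fn validate(&self) -> Result<(), String> {
        let total = self.relevancy + self.quality + self.authority;
        if total == 100 {
            Ok(())
        } else {
            Err(format!("weights must sum to 100, got {total}"))
        }
    }
}

fn main() {
    let weights = ScoringWeights { relevancy: 50, quality: 30, authority: 20 };
    assert!(weights.validate().is_ok());
}
```

Validating on read (rather than write) means a bad value stored via the API degrades a single request instead of poisoning the stored config.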
| Method | Path | Description |
|---|---|---|
| GET | /jobs | List all jobs (paginated) |
| GET | /jobs/:id | Get job detail |
| POST | /jobs/:id/cancel | Cancel a pending or running job |
| Method | Path | Description |
|---|---|---|
| GET | /stats/dashboard | Aggregated operational metrics |
| GET | /stats/costs | Cost breakdown (daily/weekly/monthly) |
| Method | Path | Description |
|---|---|---|
| GET | /schema/columns | List custom columns on `url_metadata` |
| POST | /schema/columns | Add a column to `url_metadata` dynamically |
Four core tables plus a job queue:
- `url_metadata` — Crawled URL data with GIN-indexed `schema_types` for fast filtering
- `campaigns` — Scoring runs with cost/performance tracking
- `relevancy_checks` — Per-URL scores and AI reasoning, foreign-keyed to campaigns
- `system_config` — JSONB key-value store for runtime configuration
- `jobs` — Queue table with status, priority, retry tracking, and error capture
Indexes cover the primary query patterns: domain lookup, status filtering, score ordering, campaign association, and schema type containment.
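For illustration, schema-type containment is the kind of query the GIN index on `schema_types` accelerates. A sketch of what such a query could look like; the column selection and parameter style (`$1` for SQLx/Postgres) are assumptions about the actual codebase:

```rust
/// Hypothetical query: find URLs carrying a given schema.org type.
/// `schema_types` is the GIN-indexed JSONB column described above.
/// The JSONB containment operator (@>) is what the GIN index serves.
const FIND_BY_SCHEMA_TYPE: &str = "\
SELECT id, url
FROM url_metadata
WHERE schema_types @> $1::jsonb";

fn main() {
    // Containment, not equality: a page tagged ["Article", "Product"]
    // still matches a probe for ["Product"].
    assert!(FIND_BY_SCHEMA_TYPE.contains("@>"));
}
```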
- Rust 1.93+ (`rustup update stable`)
- PostgreSQL 15+
- An Anthropic API key (for scoring)
```bash
# Creates the database parsed from DATABASE_URL in .env
# Idempotent: safe to run multiple times
./scripts/create-db.sh

# Or pass the URL directly:
DATABASE_URL=postgres://user:pass@localhost:5432/relevance_engine ./scripts/create-db.sh
```

```bash
cp .env.example .env
# Edit .env:
# DATABASE_URL=postgres://localhost:5432/relevance_engine
# BIND_ADDR=0.0.0.0:3001
# ANTHROPIC_API_KEY=sk-ant-...
```

```bash
cargo build --release
./target/release/relevance-engine
```

Migrations run automatically on startup via SQLx.

```bash
cargo run
# Server starts on http://localhost:3001
# Logs controlled via RUST_LOG env var
```

PostgreSQL as job queue. For this throughput (10K jobs/day), a dedicated broker adds operational complexity without proportional benefit. The `jobs` table, polled with `SELECT ... FOR UPDATE SKIP LOCKED`, guarantees each job is claimed by exactly one worker at a time, with zero additional infrastructure.
Config in database, not files. The spec requires non-technical operators to change scoring prompts, weights, and thresholds without redeployment. JSONB columns give schema flexibility; the API layer validates on read.
Schema pre-filtering. URLs with schema types like Product or JobPosting are rejected before reaching the Claude API, targeting 35-45% cost reduction on typical datasets.
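The pre-filter itself reduces to a containment check between a URL's extracted schema types and the `schema_auto_reject` config list. A minimal sketch; the function name and exact-match semantics are assumptions:

```rust
/// Returns true when a URL should be rejected before AI scoring.
/// `reject_list` corresponds to the `schema_auto_reject` config key;
/// exact type-name matching is an assumed, simplified policy.
fn should_auto_reject(schema_types: &[&str], reject_list: &[&str]) -> bool {
    schema_types.iter().any(|t| reject_list.contains(t))
}

fn main() {
    let reject = ["Product", "JobPosting"];
    // A page tagged both Article and Product is still filtered out.
    assert!(should_auto_reject(&["Article", "Product"], &reject));
    assert!(!should_auto_reject(&["Article"], &reject));
}
```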
Batch scoring. URLs are sent to Claude in configurable batches (default: 20) to amortize prompt overhead and reduce API call count.
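The batching step is a plain chunking of the pending URL list by `ai_batch_size`. A sketch of that split; the helper name is illustrative:

```rust
/// Split pending URLs into scoring batches. Each chunk becomes one
/// Claude API call; `batch_size` comes from the `ai_batch_size`
/// config key (default 20). A zero size is clamped to 1.
fn batches(urls: &[String], batch_size: usize) -> Vec<&[String]> {
    urls.chunks(batch_size.max(1)).collect()
}

fn main() {
    let urls: Vec<String> = (0..45).map(|i| format!("https://example.com/{i}")).collect();
    let b = batches(&urls, 20);
    assert_eq!(b.len(), 3); // 20 + 20 + 5
    assert_eq!(b[2].len(), 5);
}
```

At the default size, 45 URLs cost three API calls instead of forty-five, which is where the prompt-overhead amortization comes from.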
Proprietary. All rights reserved.