Relevance Engine — Backend

High-throughput web relevancy scoring service built in Rust. Crawls URLs at scale, extracts structured metadata, and delegates AI-powered relevancy scoring via the Anthropic Claude API. Designed for sustained operation at 10K+ URLs/day with minimal resource overhead.

Architecture

┌──────────────────────────────────────────────────────┐
│                     Axum HTTP API                    │
│  /campaigns  /urls  /config  /jobs  /stats  /schema  │
└────────┬───────────────┬───────────────┬─────────────┘
         │               │               │
    ┌────▼────┐    ┌─────▼─────┐   ┌─────▼─────┐
    │ Crawler │    │  Scorer   │   │   Queue   │
    │ Service │    │ (Claude)  │   │  Worker   │
    └────┬────┘    └─────┬─────┘   └─────┬─────┘
         │               │               │
         └───────────────▼───────────────┘
                   PostgreSQL

The backend exposes a REST API consumed by the admin portal. Long-running operations (crawling, scoring) are dispatched through a job queue backed by PostgreSQL, allowing the API to return immediately while work proceeds asynchronously.
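The job lifecycle behind that queue can be sketched roughly as follows. This is an illustrative model only: the `Job` and `JobStatus` names mirror the descriptions in this README (status, retry tracking, error capture on the `jobs` table), not the actual source.

```rust
// Hypothetical sketch of the job lifecycle the queue worker drives.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobStatus {
    Pending,
    Running,
    Completed,
    Failed,
    Cancelled,
}

struct Job {
    status: JobStatus,
    attempts: u32,
    max_attempts: u32,
}

impl Job {
    /// A failed attempt either re-queues the job for retry or marks it
    /// failed, matching the retry tracking described for the jobs table.
    fn record_failure(&mut self) {
        self.attempts += 1;
        self.status = if self.attempts >= self.max_attempts {
            JobStatus::Failed
        } else {
            JobStatus::Pending
        };
    }
}

fn main() {
    let mut job = Job { status: JobStatus::Running, attempts: 0, max_attempts: 3 };
    job.record_failure();
    assert_eq!(job.status, JobStatus::Pending); // re-queued, retries remain
}
```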

Tech Stack

| Layer | Choice | Rationale |
| --- | --- | --- |
| Language | Rust 1.93+ | Memory safety, zero-cost abstractions, predictable latency |
| Framework | Axum 0.8 | Tower-based, async-native, composable middleware |
| Database | PostgreSQL via SQLx 0.8 | Compile-time query checking, async, migrations |
| HTTP Client | Reqwest 0.12 | Connection pooling, gzip/brotli, redirect handling |
| HTML Parsing | Scraper 0.22 | CSS selector-based extraction, built on html5ever |
| Observability | tracing + tracing-subscriber | Structured logging, env-based filtering |

Project Structure

src/
├── main.rs              # Entrypoint: pool init, migrations, router composition
├── api/
│   ├── mod.rs           # Route tree
│   ├── campaigns.rs     # CRUD + start + results
│   ├── urls.rs          # Bulk import, list, delete, CSV export
│   ├── config.rs        # Runtime config CRUD (stored in DB)
│   ├── jobs.rs          # Queue inspection and cancellation
│   └── crawl.rs         # Manual crawl trigger
├── models/
│   └── mod.rs           # Domain types: UrlMetadata, Campaign, RelevancyCheck, Job, SystemConfig
├── crawler/
│   └── mod.rs           # URL fetcher + HTML metadata extractor
├── scorer/
│   └── mod.rs           # Claude API integration (batched scoring)
├── queue/
│   └── mod.rs           # Background job worker (poll + dispatch)
├── db/
│   └── mod.rs           # Database utilities
└── config/
    └── mod.rs           # Typed application configuration

migrations/
└── 20260221_initial.sql # Schema + indexes + seed config

API Reference

All endpoints return JSON. Base path: /api.

Health

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Returns "ok" |

Campaigns

| Method | Path | Description |
| --- | --- | --- |
| GET | /campaigns | List campaigns (paginated: ?page=&per_page=) |
| POST | /campaigns | Create campaign { name, reference_url?, urls[] } |
| GET | /campaigns/:id | Get campaign detail |
| PUT | /campaigns/:id | Update campaign fields |
| DELETE | /campaigns/:id | Delete campaign and associated results |
| POST | /campaigns/:id/start | Queue campaign for processing |
| GET | /campaigns/:id/results | Paginated relevancy check results |

URLs

| Method | Path | Description |
| --- | --- | --- |
| GET | /urls | List URLs (paginated) |
| POST | /urls | Bulk import { urls: string[] }. Deduplicates on insert. |
| GET | /urls/:id | Get single URL metadata |
| DELETE | /urls/:id | Remove URL from database |
| GET | /urls/export | CSV export of all URL metadata |
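Deduplication on bulk import can be sketched like this. The URL normalization shown (trailing-slash trim, lowercasing) is an assumption for illustration; the real service likely also relies on a database UNIQUE constraint as a backstop.

```rust
use std::collections::HashSet;

// Illustrative sketch only: drop duplicate URLs from an incoming batch
// before insert, mirroring the "deduplicates on insert" behavior of
// POST /urls. The normalization rules here are assumptions.
fn dedupe_urls(urls: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.into_iter()
        .filter(|u| seen.insert(u.trim_end_matches('/').to_ascii_lowercase()))
        .collect()
}

fn main() {
    let batch = vec![
        "https://example.com/a".to_string(),
        "https://example.com/a/".to_string(), // trailing-slash duplicate
        "https://example.com/b".to_string(),
    ];
    assert_eq!(dedupe_urls(batch).len(), 2);
}
```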

Configuration

| Method | Path | Description |
| --- | --- | --- |
| GET | /config | List all config entries |
| GET | /config/:key | Get single config value |
| POST | /config | Upsert { key, value, description? } |

Runtime-editable keys seeded at migration:

  • scoring_weights — { relevancy, quality, authority } point allocation
  • decision_thresholds — { approve, reject } score boundaries
  • schema_auto_reject — array of schema.org types to filter pre-AI
  • ai_model — Claude model identifier
  • ai_batch_size — URLs per API call
  • crawl_concurrency — max parallel fetches
  • crawl_timeout_secs — per-request timeout
  • stale_threshold_days — cache expiry window
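Applying decision_thresholds could look roughly like the sketch below. Only the approve/reject keys come from this README; the middle "review" band and the example threshold values are assumptions.

```rust
// Hypothetical sketch of threshold-based decisions. Scores at or above
// the approve boundary pass, scores at or below the reject boundary are
// dropped, and anything between lands in an assumed review band.
#[derive(Debug, PartialEq)]
enum Decision {
    Approve,
    Review,
    Reject,
}

fn decide(score: f64, approve: f64, reject: f64) -> Decision {
    if score >= approve {
        Decision::Approve
    } else if score <= reject {
        Decision::Reject
    } else {
        Decision::Review
    }
}

fn main() {
    // Example boundaries: approve at 75+, reject at 40 or below.
    assert_eq!(decide(80.0, 75.0, 40.0), Decision::Approve);
    assert_eq!(decide(55.0, 75.0, 40.0), Decision::Review);
    assert_eq!(decide(30.0, 75.0, 40.0), Decision::Reject);
}
```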

Jobs

| Method | Path | Description |
| --- | --- | --- |
| GET | /jobs | List all jobs (paginated) |
| GET | /jobs/:id | Get job detail |
| POST | /jobs/:id/cancel | Cancel a pending or running job |

Stats

| Method | Path | Description |
| --- | --- | --- |
| GET | /stats/dashboard | Aggregated operational metrics |
| GET | /stats/costs | Cost breakdown (daily/weekly/monthly) |

Schema Management

| Method | Path | Description |
| --- | --- | --- |
| GET | /schema/columns | List custom columns on url_metadata |
| POST | /schema/columns | Add a column to url_metadata dynamically |

Database Schema

Four core tables plus a job queue:

  • url_metadata — Crawled URL data with GIN-indexed schema_types for fast filtering
  • campaigns — Scoring runs with cost/performance tracking
  • relevancy_checks — Per-URL scores and AI reasoning, foreign-keyed to campaigns
  • system_config — JSONB key-value store for runtime configuration
  • jobs — Queue table with status, priority, retry tracking, and error capture

Indexes cover the primary query patterns: domain lookup, status filtering, score ordering, campaign association, and schema type containment.

Setup

Prerequisites

  • Rust 1.93+ (rustup update stable)
  • PostgreSQL 15+
  • An Anthropic API key (for scoring)

Database

# Creates the database parsed from DATABASE_URL in .env
# Idempotent: safe to run multiple times
./scripts/create-db.sh

# Or pass the URL directly:
DATABASE_URL=postgres://user:pass@localhost:5432/relevance_engine ./scripts/create-db.sh

Configuration

cp .env.example .env
# Edit .env:
#   DATABASE_URL=postgres://localhost:5432/relevance_engine
#   BIND_ADDR=0.0.0.0:3001
#   ANTHROPIC_API_KEY=sk-ant-...

Build and Run

cargo build --release
./target/release/relevance-engine

Migrations run automatically on startup via SQLx.

Development

cargo run
# Server starts on http://localhost:3001
# Logs controlled via RUST_LOG env var

Design Decisions

PostgreSQL as job queue. For this throughput (10K jobs/day), a dedicated broker adds operational complexity without proportional benefit. The jobs table with SELECT ... FOR UPDATE SKIP LOCKED guarantees each pending job is claimed by exactly one worker at a time; combined with the table's retry tracking, this gives reliable at-least-once processing with zero additional infrastructure.
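The claim query behind that pattern can be sketched as below. Table and column names here are assumptions for illustration; the actual schema lives in migrations/20260221_initial.sql.

```rust
// Sketch of the SKIP LOCKED claim pattern described above. Two workers
// running this concurrently can never claim the same row: the second
// worker simply skips any candidate row the first has locked.
const CLAIM_JOB: &str = "\
    UPDATE jobs
       SET status = 'running', started_at = now()
     WHERE id = (
             SELECT id FROM jobs
              WHERE status = 'pending'
              ORDER BY priority DESC, created_at
              FOR UPDATE SKIP LOCKED
              LIMIT 1
           )
    RETURNING id";

fn main() {
    assert!(CLAIM_JOB.contains("FOR UPDATE SKIP LOCKED"));
}
```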

Config in database, not files. The spec requires non-technical operators to change scoring prompts, weights, and thresholds without redeployment. JSONB columns give schema flexibility; the API layer validates on read.

Schema pre-filtering. URLs with schema types like Product or JobPosting are rejected before reaching the Claude API, targeting 35-45% cost reduction on typical datasets.
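The filter itself is a cheap set-membership check before any API call. A minimal sketch, assuming schema_auto_reject holds plain schema.org type names and the function name is illustrative:

```rust
// Minimal sketch of the pre-AI schema filter: reject a URL if any of its
// extracted schema.org types appears in the configured auto-reject list.
fn should_auto_reject(schema_types: &[&str], auto_reject: &[&str]) -> bool {
    schema_types.iter().any(|t| auto_reject.contains(t))
}

fn main() {
    let reject_list = ["Product", "JobPosting"];
    assert!(should_auto_reject(&["Article", "Product"], &reject_list));
    assert!(!should_auto_reject(&["Article", "BlogPosting"], &reject_list));
}
```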

Batch scoring. URLs are sent to Claude in configurable batches (default: 20) to amortize prompt overhead and reduce API call count.
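Batch assembly amounts to chunking the pending URL list. The default of 20 comes from this README; batch_urls is an illustrative helper name, not the actual function:

```rust
// Sketch of batch assembly for the scorer: split pending URLs into
// API-call-sized groups so one Claude request scores many URLs.
fn batch_urls(urls: &[String], batch_size: usize) -> Vec<&[String]> {
    urls.chunks(batch_size).collect()
}

fn main() {
    let urls: Vec<String> = (0..45).map(|i| format!("https://example.com/{i}")).collect();
    let batches = batch_urls(&urls, 20); // default ai_batch_size
    assert_eq!(batches.len(), 3); // 20 + 20 + 5
    assert_eq!(batches[2].len(), 5);
}
```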

License

Proprietary. All rights reserved.
