Wayback Archiver

The Memory of Your Internet — Archive Everything You Browse.

A self-hosted personal web archiving system that automatically captures and preserves web pages you visit in Chrome — HTML, CSS, JavaScript, images, and all. When the original page goes offline, you can still browse your archived copy with styles and layout intact.

How It Works

Chrome + Tampermonkey ──HTTP POST──▶ Go Server ──▶ PostgreSQL (metadata)
  (auto-capture on                    │               + File System (assets)
   page load)                         │
                                      ▼
                                   Web UI ──▶ Browse / Search / Replay

A Tampermonkey userscript runs in your browser, automatically capturing the full DOM and resources once the page finishes loading. If significant DOM changes occur afterward, it submits one additional update.
The Go server acknowledges archive/update requests as soon as the snapshot metadata is stored. Resource downloading, deduplication, URL rewriting, and snapshot finalization continue in the background.
A built-in Web UI lets you list, search, and replay any archived page — fully offline, no external dependencies.

Features

High-fidelity replay — CSSOM serialization, computed styles inlining, and anti-refresh protection reproduce pages as close to the original as possible
Full-page capture — HTML, CSS, JS, images, fonts; resource URLs are rewritten to local paths
Cross-origin resource recovery — server-side extraction and download of resources blocked by CORS
Content-hash deduplication — identical resources shared across pages are stored only once (SHA-256)
Version history — same URL archived multiple times, distinguished by timestamp
Timeline view — browse all snapshots of a URL on a visual timeline (like web.archive.org), with prev/next navigation between snapshots
Smart dedup — session-level + server-level dedup prevents redundant captures; content-hash comparison skips unchanged pages
Dynamic content support — captures the live DOM state; MutationObserver triggers one auto-update if significant changes occur after initial capture
SPA-aware — detects SPA navigation, resets capture state per route
Anti-refresh protection — archived pages are frozen: timers, WebSockets, and navigation APIs are neutralized
Web UI — responsive interface to browse, full-text search (page content, URL, and title), filter by date range and domain, and replay archived pages
RESTful API — programmatic access to all archiving and query operations

Prerequisites

PostgreSQL 14+
Chrome or Firefox + Tampermonkey extension (v5.3+)

Quick Start

Option A: Docker (Recommended)

The fastest way to get started. Docker Compose will set up both the server and PostgreSQL automatically.

# Clone the repository
git clone https://github.com/icodeface/wayback-archiver.git
cd wayback-archiver

# Start all services
docker compose up -d

# View logs
docker compose logs -f wayback

The server will be available at http://localhost:8080. Skip to step 4 (Install the Userscript).

For detailed Docker configuration and deployment options, see docs/DOCKER.md.

Option B: Pre-built Binaries

1. Download Pre-built Binaries

Download the latest release from the Releases page:

macOS: wayback-server-darwin-amd64.tar.gz (Intel) or wayback-server-darwin-arm64.tar.gz (Apple Silicon)
Linux: wayback-server-linux-amd64.tar.gz or wayback-server-linux-arm64.tar.gz
Windows: wayback-server-windows-amd64.zip
Userscript: wayback-userscript.js

Extract the archive:

# macOS/Linux
tar -xzf wayback-server-*.tar.gz

# Windows: extract the .zip file

Building from source? See docs/BUILD.md for manual compilation instructions.

2. Database Setup

# PostgreSQL uses your current system username as the default database user
# If your system username is alice, this is equivalent to: createdb -U alice wayback
createdb wayback

# Run the schema (init_db.sql is included in the release archive)
psql wayback < init_db.sql

Upgrade note for existing installs: Any installation upgrading from < 1.1.0 to 1.1.0 or later needs the pages.snapshot_state column. Current server builds automatically apply this change during startup; if your database user does not have ALTER TABLE permission, run the following idempotent SQL manually first:
ALTER TABLE pages ADD COLUMN IF NOT EXISTS snapshot_state VARCHAR(16);
UPDATE pages SET snapshot_state = 'ready' WHERE snapshot_state IS NULL OR snapshot_state = '';
ALTER TABLE pages ALTER COLUMN snapshot_state SET DEFAULT 'pending';
ALTER TABLE pages ALTER COLUMN snapshot_state SET NOT NULL;

3. Start the Server

# Optional: create .env file for custom configuration
# See Configuration section below for available options

./wayback-server

The server starts at http://localhost:8080 by default.

If you need a proxy for downloading external resources:

export http_proxy=http://127.0.0.1:7897
export https_proxy=http://127.0.0.1:7897
./wayback-server

4. Install the Userscript

Download wayback-userscript.js from the Releases page
Open Tampermonkey dashboard in your browser
Click "Create a new script"
Paste the contents of wayback-userscript.js
Save and enable

Chrome users: Right-click the Tampermonkey icon → Manage extension, then enable the "Allow user scripts" toggle. Firefox does not require this step.

5. Start Browsing

That's it. Pages are automatically archived as soon as they load. Open http://localhost:8080 to browse your archive.

Puppeteer Integration: For automated archiving, see docs/PUPPETEER.md.

Configuration

Environment variables (or .env file in the project root):

The server automatically loads .env from the working directory if it exists. You can also set environment variables directly.

Variable	Default	Description
`DB_HOST`	`localhost`	PostgreSQL host
`DB_PORT`	`5432`	PostgreSQL port
`DB_USER`	`postgres`	Database user. PostgreSQL defaults to the current system username, so leaving this unset is usually recommended.
`DB_PASSWORD`	(empty)	Database password
`DB_NAME`	`wayback`	Database name
`DB_SSLMODE`	`disable`	PostgreSQL SSL mode used by the server connection
`SERVER_HOST`	`127.0.0.1`	Server bind address (`0.0.0.0` = all interfaces, `127.0.0.1` = localhost only)
`SERVER_PORT`	`8080`	HTTP server port
`ALLOWED_ORIGINS`	`http://localhost:8080,http://127.0.0.1:8080`	CORS allowed origins (comma-separated). For remote deployment, add your domain: `https://your-domain.com`
`DATA_DIR`	`./data`	Storage directory for HTML and resources
`LOG_DIR`	`./data/logs`	Log file directory
`AUTH_PASSWORD`	(empty)	HTTP Basic Auth password (disabled when empty, username: `wayback`). REQUIRED for remote deployment
`RESOURCE_METADATA_CACHE_MB`	10% of system memory	Metadata cache budget for resource URL lookups plus HTTP freshness/validator reuse and revalidation. `RESOURCE_CACHE_MB` is still accepted as a legacy alias.
`ENABLE_DEBUG_API`	`false`	Enable `/api/debug/*` endpoints (`memstats`, `gc`, `pprof`). Keep disabled unless you are actively debugging.
`COMPRESSION_LEVEL`	`-1`	Compression level: 1 (fastest) to 9 (best), -1 (default/balanced). Response compression always enabled, auto-negotiated via Accept-Encoding

Compression Settings

Server-side (responses): Always enabled, auto-negotiated

Clients that send Accept-Encoding: gzip get compressed responses
Clients that don't support gzip get uncompressed responses
No configuration needed - works automatically

Client-side (uploads): Configurable in browser/src/config.ts

For local deployment (default): Keep ENABLE_COMPRESSION: false
- Localhost transfer is already fast
- No CPU overhead from compression
For remote deployment: Set ENABLE_COMPRESSION: true
- 95%+ reduction for uploads (large HTML snapshots)
- Rebuild userscript: cd browser && npm run build

Remote Deployment

For deploying to a remote server, see REMOTE_DEPLOYMENT.md for detailed instructions.

Quick setup:

# .env configuration
ALLOWED_ORIGINS=https://your-domain.com
AUTH_PASSWORD=your_secure_password
SERVER_HOST=0.0.0.0

# Browser config.ts
SERVER_URL: 'https://your-domain.com/api/archive'
AUTH_PASSWORD: 'your_secure_password'
ENABLE_COMPRESSION: true  # Enable upload compression for remote deployment

Security Notes:

Always use HTTPS for remote deployment
Set a strong AUTH_PASSWORD
Limit ALLOWED_ORIGINS to trusted domains only
Origin: null is intentionally rejected because it also covers sandboxed iframes and data/file-backed opaque origins
Both CORS and Basic Auth are required for security (defense in depth)

Performance Notes:

Enable ENABLE_COMPRESSION in browser config for remote deployment
Reduces upload bandwidth by 95%+ (especially for large HTML snapshots)
Response compression is automatic (no configuration needed)
Minimal CPU overhead, significant network savings

Capture Notes:

/api/archive and /api/archive/:id reject decompressed JSON bodies larger than 32 MiB with HTTP 413
Cross-origin iframe snapshots remain enabled, but the browser bridge now signs requests and returns frame HTML over a private MessageChannel instead of public window.postMessage
Resource downloads only forward cookies that still match the target URL under browser rules (hostOnly, domain, path, secure, expiry, SameSite, partition top-level site)
The browser script skips local/private targets including localhost, 127.0.0.1, 172.16.0.0/12, 169.254.0.0/16, ::1, fc00::/7, fe80::/10, and .local

API

Method	Endpoint	Description
`GET`	`/api/version`	Server version and build info
`POST`	`/api/archive`	Create a page archive
`PUT`	`/api/archive/:id`	Update an existing archive snapshot
`GET`	`/api/pages`	List all archived pages
`GET`	`/api/pages/:id`	Get page details
`GET`	`/api/pages/:id/content`	Get page content as Markdown (for AI/LLM consumption)
`GET`	`/api/search?q=keyword`	Search pages by URL or title
`GET`	`/api/pages/timeline?url=URL`	Get all snapshots of a URL (timeline view)
`GET`	`/api/logs`	List available log files
`GET`	`/api/logs/latest`	Get latest log file content (supports `?tail=N&grep=keyword`)
`GET`	`/api/logs/:filename`	Get log file content (supports `?tail=N&grep=keyword`)
`GET`	`/view/:id`	Replay an archived page
`GET`	`/timeline?url=URL`	Visual timeline page for a URL
`GET`	`/logs`	Server logs viewer

POST /api/archive

Returns { status, page_id, action } where action is created or unchanged (content identical, only last_visited updated). When action is created, the response is sent immediately after the page row and raw HTML are stored; resource downloads and HTML rewriting continue in the background. The server tracks background create progress with snapshot_state (pending, ready, failed). Only ready snapshots are treated as fully deduplicated; if background finalize fails, a later capture of the same URL and content hash reuses the same page row and automatically retries finalize instead of getting stuck as unchanged.

PUT /api/archive/:id

Accepts the same body as POST. Returns immediately once the server has accepted the update request. If the content changed, resource re-processing and the final snapshot swap continue in the background; old HTML is queued for delayed deletion after a successful swap. Returns { status, page_id, action } where action is updated or unchanged. The request body url must exactly match the existing page URL for :id; otherwise the server rejects the update with HTTP 400 to prevent cross-page snapshot corruption.

Project Structure

wayback-archiver/
├── Makefile                  # Build, test, cross-compile
├── bin/                      # Build output (server binary + userscript)
├── browser/                  # Tampermonkey userscript (TypeScript)
│   ├── src/
│   │   ├── main.ts           # Entry point & orchestration
│   │   ├── config.ts         # Constants
│   │   ├── types.ts          # TypeScript interfaces
│   │   ├── page-filter.ts    # URL filtering logic
│   │   ├── page-freezer.ts   # Freeze page runtime state
│   │   ├── dom-collector.ts  # DOM serialization
│   │   └── archiver.ts       # Server communication
│   ├── dist/                 # Built userscript
│   └── build.js              # Bundle script
│
├── server/                   # Go backend
│   ├── cmd/server/main.go    # Entry point
│   ├── internal/
│   │   ├── api/              # HTTP handlers (modular)
│   │   ├── config/           # Environment-based config
│   │   ├── database/         # PostgreSQL operations
│   │   ├── logging/          # File-based logging with rotation
│   │   ├── models/           # Data models
│   │   └── storage/          # File storage & dedup
│   └── web/                  # Web UI static files
│
├── .env.example              # Configuration template
└── tests/                    # Test suites
    ├── browser/              # Browser-side tests
    └── server/               # Server-side & E2E tests

Storage Layout

data/
├── html/                     # HTML snapshots, organized by date
│   └── 2026/03/09/
│       └── <timestamp>_<hash>.html
├── logs/                     # Server logs, rotated by size (10MB) and date (7-day retention)
│   ├── wayback-2026-03-12.001.log
│   └── wayback-2026-03-12.002.log
└── resources/                # Deduplicated static resources
    └── ab/cd/
        └── <sha256>.css

Building from Source

See docs/BUILD.md for build instructions, cross-compilation, and testing.

Agent Integration

This project includes an Agent skill for AI-assisted querying and exploration of your archived pages. Use it to search, analyze, and interact with your archive through natural language.

Known Limitations

Some cross-origin resources may still fail due to server-side 403/404 responses
Dynamically injected scripts (loaded via JS at runtime) may not be captured
Tracking pixels and analytics URLs with dynamic parameters are not preserved (they don't affect page rendering)
Very large media files (video, large images) will consume significant storage

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 128 Commits
.github/workflows		.github/workflows
browser		browser
docs		docs
screenshot		screenshot
server		server
tests		tests
vendor/proxy-agent		vendor/proxy-agent
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README-zh.md		README-zh.md
README.md		README.md
docker-compose.yml		docker-compose.yml
package-lock.json		package-lock.json
package.json		package.json
skill.md		skill.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wayback Archiver

How It Works

Features

Prerequisites

Quick Start

Option A: Docker (Recommended)

Option B: Pre-built Binaries

1. Download Pre-built Binaries

2. Database Setup

3. Start the Server

4. Install the Userscript

5. Start Browsing

Configuration

Compression Settings

Remote Deployment

API

POST /api/archive

PUT /api/archive/:id

Project Structure

Storage Layout

Building from Source

Agent Integration

Known Limitations

License

About

Uh oh!

Releases 17

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Wayback Archiver

How It Works

Features

Prerequisites

Quick Start

Option A: Docker (Recommended)

Option B: Pre-built Binaries

1. Download Pre-built Binaries

2. Database Setup

3. Start the Server

4. Install the Userscript

5. Start Browsing

Configuration

Compression Settings

Remote Deployment

API

POST /api/archive

PUT /api/archive/:id

Project Structure

Storage Layout

Building from Source

Agent Integration

Known Limitations

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 17

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages