
Claude Code Guidelines for ChatToMap Core Library

Project Overview

ChatToMap core library is a pure TypeScript library that transforms chat exports into geocoded activity suggestions. It provides shared logic for both the open-source CLI and the commercial SaaS.

License: AGPL-3.0
Runtime: Bun (strict requirement)

Architecture Principle

Pure functions only. No IO, no progress reporting, no orchestration.

The library is stateless and side-effect-free (except for API calls to external services). Orchestration (parallelization, progress, rate limiting) is the coordinator's responsibility:

  • CLI - Spawns its own parallel workers locally
  • Cloudflare Workers - ARE the parallelization units in the SaaS

CLI Worker Pool Convention

All CLI parallelism uses runWorkerPool from src/cli/worker-pool.ts.

import { runWorkerPool } from '../worker-pool'

const { successes, errors } = await runWorkerPool(
  items,
  async (item, index) => processItem(item),
  {
    concurrency: 5,
    onProgress: ({ completed, total }) => logger.log(`${completed}/${total}`)
  }
)

This ensures consistent:

  • Concurrency control (default 5 workers)
  • Progress reporting
  • Error handling
  • Result ordering

Never use manual Promise.all loops for parallel work in CLI steps.
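
The semantics described above can be sketched as follows. This is an illustrative model, not the actual implementation in src/cli/worker-pool.ts: a fixed number of workers pull from a shared index, results preserve input order, and failures are collected rather than aborting the run (the real API may differ in details).

```typescript
// Illustrative sketch of the worker-pool semantics (names hypothetical).
type PoolResult<R> = {
  successes: R[]
  errors: { index: number; error: unknown }[]
}

async function runPool<T, R>(
  items: readonly T[],
  worker: (item: T, index: number) => Promise<R>,
  opts: {
    concurrency?: number
    onProgress?: (p: { completed: number; total: number }) => void
  } = {},
): Promise<PoolResult<R>> {
  const concurrency = opts.concurrency ?? 5 // default 5 workers
  const results: (R | undefined)[] = new Array(items.length)
  const errors: { index: number; error: unknown }[] = []
  let next = 0
  let completed = 0

  // Each worker pulls the next unclaimed index until items are exhausted.
  // Safe without locks: next++ is synchronous between awaits.
  async function drain(): Promise<void> {
    while (next < items.length) {
      const index = next++
      try {
        results[index] = await worker(items[index] as T, index)
      } catch (error) {
        errors.push({ index, error })
      }
      completed++
      opts.onProgress?.({ completed, total: items.length })
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, () => drain()),
  )
  // Note: filters out failed slots; a worker that legitimately returns
  // undefined would also be dropped in this simplified sketch.
  const successes = results.filter((r): r is R => r !== undefined)
  return { successes, errors }
}
```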

Key Files

File Purpose
src/types.ts All TypeScript type definitions
src/index.ts Public API exports
src/cli.ts CLI entry point (orchestrator)
src/parser/ WhatsApp export parsing
src/extractor/ Regex/URL candidate extraction
src/embeddings/ OpenAI embeddings + semantic search
src/classifier/ AI classification (Claude/OpenAI)
src/geocoder/ Google Maps geocoding
src/export/ CSV, Excel, JSON, Map, PDF generation

Python Prototype Reference

The Python prototype in src/*.py serves as the reference implementation:

Python File What to Learn
src/parser.py WhatsApp export format patterns
src/suggestion_extractor.py Regex patterns that work
src/embeddings.py Semantic search approach
src/classifier.py Claude prompt structure
src/geocoder.py Geocoding approach
src/export.py Leaflet.js map generation

Goal: the TypeScript version should produce identical results to the Python prototype for the same input.

Quality Standards

Non-Negotiable Rules

Rule Limit
File length (code) 500 lines max (NEVER remove comments to reduce - REFACTOR and SPLIT the file)
File length (tests) 1000 lines max
Function length 50 lines max
Line length 100 chars max
Cognitive complexity 15 max
Code duplication Zero tolerance
any types Forbidden
biome-ignore Forbidden
--no-verify Forbidden
Test coverage 90%+ statements, 80%+ branches

Before Marking ANY Task Complete

task ci

This runs: typecheck, lint, check-ignores, duplication, file-length, test. Must pass completely.

CLI Pipeline

Commands build on each other. Each command runs earlier steps if not cached.

parse            → Parse messages from chat export
scan             → Heuristic candidate extraction (quick, free)
embed            → Embed messages for semantic search (~$0.001/1000 msgs)
filter           → scan + embed + semantic search + merge → candidates.all
scrape-urls      → Get URL metadata for candidates
preview          → Quick AI classification on scan results only (~$0.01)
classify         → filter → scrape-urls → AI classify (full pipeline)
geocode          → classify → geocode locations
fetch-image-urls → geocode → fetch images from various sources
analyze          → Full pipeline with export (not yet tested)

Key points:

  • scan is the quick/free heuristic preview
  • filter combines heuristics + embeddings into candidates.all
  • classify runs the FULL pipeline (filter → scrape → classify)
  • geocode builds on classify results
  • Each step uses pipeline cache if already run
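
The "each command runs earlier steps if not cached" behavior can be sketched as a memoized step graph. This is a hypothetical model of the caching behavior, not the actual CLI code:

```typescript
// Hypothetical sketch: a step returns its cached output when present,
// otherwise runs its dependencies first, computes, and stores the result.
const pipelineCache = new Map<string, unknown>()

interface Step<T> {
  name: string
  deps: readonly Step<unknown>[]
  compute: () => T
}

function runStep<T>(step: Step<T>): T {
  if (pipelineCache.has(step.name)) {
    return pipelineCache.get(step.name) as T
  }
  for (const dep of step.deps) runStep(dep) // run earlier stages first
  const result = step.compute()
  pipelineCache.set(step.name, result)
  return result
}
```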

CLI Command Reference

Run with bun run cli <command> or, after building, with chat-to-map <command>.

⚠️ Default output is limited to 10 results. Use -a/--all or -n/--max-results to see more.

Command Purpose Key Options
parse <input> Parse chat, show stats --json [file], -m <num>
scan <input> Heuristic extraction (free) -n <num> (default: 10)
embed <input> Embed for semantic search --dry-run
filter <input> Heuristics + embeddings --method, --json [file], -a
scrape-urls <input> Scrape URL metadata --dry-run, --json [file]
preview <input> Quick AI preview (scan only) -c <country>, --dry-run
classify <input> AI classification -c <country>, --json [file], -a
geocode <input> Google Maps geocoding -c <country>, --json [file], -a
fetch-image-urls <input> Fetch images for activities --no-image-cdn, --skip-pixabay, -a

Common options (all commands):

  • --no-cache - Skip cache, regenerate results
  • --cache-dir <dir> - Custom cache directory
  • -m, --max-messages <num> - Limit messages processed
  • --dry-run - Show cost estimate without API calls

JSON output: Most commands support --json [file] to output JSON to stdout or a file.

Examples:

bun run cli scan ./chat.txt -n 50           # Show 50 heuristic candidates
bun run cli filter ./chat.txt --all         # Show ALL candidates
bun run cli classify ./chat.txt --json out.json -c "New Zealand"
bun run cli geocode ./chat.txt -a           # Show all geocoded activities
bun run cli fetch-image-urls ./chat.txt --no-image-cdn --all  # Fetch images from APIs

Commands

# Development
task dev              # Run CLI in watch mode
task build            # Build library and CLI
task build:binary     # Build standalone binary

# Quality
task ci               # Run ALL CI checks
task lint             # Check linting
task lint:fix         # Auto-fix linting
task typecheck        # TypeScript checking
task duplication      # Check for duplication
task file-length      # Check file lengths
task check-ignores    # Verify no biome-ignore

# Testing
task test             # Run unit tests (excludes E2E)
task test:watch       # Run unit tests in watch mode
task test:cov         # Run unit tests with coverage
task test:e2e         # Run CLI E2E tests (separate vitest config)

# Git hooks
task hooks:install    # Install lefthook hooks
task hooks:run        # Run pre-commit manually

Code Standards

TypeScript

  • Strict mode enabled
  • No any types
  • Explicit return types on exported functions
  • Use interface for object types, type for unions/aliases
  • Use readonly for immutable data
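
The conventions above look like this in practice (names and behavior here are illustrative, not from the codebase):

```typescript
// interface for an object shape, readonly for immutable data
interface ParsedMessage {
  readonly sender: string
  readonly text: string
}

// type for a union
type CandidateSource = 'heuristic' | 'embedding'

// SCREAMING_SNAKE_CASE constant
const MAX_PREVIEW_RESULTS = 10

// Explicit return type on an exported function
export function previewMessages(messages: readonly ParsedMessage[]): ParsedMessage[] {
  return messages.slice(0, MAX_PREVIEW_RESULTS)
}
```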

Naming Conventions

  • Functions: camelCase
  • Types/Interfaces: PascalCase
  • Constants: SCREAMING_SNAKE_CASE
  • Files: kebab-case.ts

Testing

🚨 NEVER use bun test - ALWAYS use bun run test or task test:

  • bun test → Bun's native runner - NO .env, NO vitest config, BROKEN setup files
  • bun run test → Vitest - loads .env, proper config, works correctly
  • task test → Same as above (preferred)
  • task test:e2e → E2E tests only (uses separate vitest config)

E2E Tests:

  • Located in src/cli/e2e/
  • Use their own vitest config: src/cli/e2e/vitest.config.ts
  • Excluded from regular task test runs
  • Run with task test:e2e or bun run test:e2e
  • To run a single E2E test file: task test:e2e -- src/cli/e2e/07-classify.test.ts
  • To preserve temp cache dir for debugging: DEBUG_E2E=1 task test:e2e -- src/cli/e2e/07-classify.test.ts
  • To update cache fixture (allow real API calls): UPDATE_E2E_CACHE=true task test:e2e
  • To rebuild cache fixture from scratch: REPLACE_E2E_CACHE=true task test:e2e

VCR Testing Model: Tests are NEVER skipped. API responses are recorded locally and replayed on CI:

  1. Run tests locally with API keys in .env → responses cached to fixtures
  2. Commit fixture files (tests/fixtures/)
  3. CI replays from cached fixtures (no real API calls needed)

// Test file naming
src/parser/whatsapp.ts        // Implementation
src/parser/whatsapp.test.ts   // Tests

// Use vitest
import { describe, expect, it } from 'vitest'

Test Fixtures & Caching

Three mechanisms for caching external API responses in tests:

Class What It Caches Notes
FixtureCache AI API responses (classifier, embeddings, geocoder) Single .json.gz file, passed to functions as a cache param
HttpRecorder Raw HTTP responses (scrapers) Auto-records to fixtures dir, provides recorder.fetch
FilesystemCache General response cache Production-style caching in tests

FixtureCache - For AI API calls (classifier, embeddings, geocoder):

import { FixtureCache } from '../test-support/fixture-cache.js'

const cache = new FixtureCache('tests/fixtures/my-test.json.gz')
await cache.load()

// Pass to classifier/embeddings/geocoder - auto records on first run
const result = await classifyMessages(candidates, config, cache)

await cache.save() // Writes new entries to fixture

HttpRecorder - For HTTP scraper tests:

import { HttpRecorder } from './test-support/http-recorder.js'

const recorder = new HttpRecorder('tests/fixtures/scraper-name')

// Pass recorder.fetch to scrapers - auto records/replays
const result = await scrapeTikTok(url, { fetch: recorder.fetch })

CI HTTP Guard - Blocks uncached requests in CI:

// In src/http.ts - throws UncachedHttpRequestError when:
// - CI=true AND (NODE_ENV=test OR VITEST=true)
// Run: CI=true bun run test  # Verifies all HTTP is cached
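
The guard condition described in the comment can be sketched like this. The actual logic lives in src/http.ts and may differ; the names below are hypothetical:

```typescript
type Env = Record<string, string | undefined>

class UncachedHttpRequestError extends Error {}

// True only when both conditions from src/http.ts's comment hold:
// CI=true AND (NODE_ENV=test OR VITEST=true)
function isCiTestRun(env: Env): boolean {
  const onCi = env.CI === 'true'
  const inTests = env.NODE_ENV === 'test' || env.VITEST === 'true'
  return onCi && inTests
}

// Called before any real network request; on CI test runs, every
// request must instead be served from a committed fixture.
function assertHttpAllowed(url: string, env: Env): void {
  if (isCiTestRun(env)) {
    throw new UncachedHttpRequestError(`Uncached HTTP request blocked on CI: ${url}`)
  }
}
```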

Documentation

Document Location
Core Library PRD project/PRD_CORE.md
CLI PRD project/PRD_CLI.md
Phase 8 TODO project/todo/PHASE_8_TODO.md

Cache System

The CLI uses a two-layer caching system:

Layer Purpose Location
Pipeline Cache Per-run stage outputs (messages, candidates, classifications) ~/.cache/chat-to-map/chats/<input>/<datetime>-<hash>/
API Cache Deduplicate API calls (embeddings, classification, geocoding, scraping) ~/.cache/chat-to-map/requests/

Cache Location

~/.cache/chat-to-map/
├── chats/                              # Pipeline cache (per-run outputs)
│   └── WhatsApp_Chat/
│       └── 2025-01-15T10-30-45-abc123/ # datetime-filehash
│           ├── chat.txt
│           ├── messages.json
│           ├── candidates.heuristics.json
│           ├── classifications.json
│           └── ...
└── requests/                           # API cache (response deduplication)
    ├── ai/openai/text-embedding-3-large/{hash}.json
    ├── ai/anthropic/claude-haiku-4-5/{hash}.json
    ├── geo/google/{hash}.json
    └── web/https_example_com_{hash}.json

Configuration

# Custom cache directory
chat-to-map analyze ./chat.zip --cache-dir /tmp/cache
export CHAT_TO_MAP_CACHE_DIR="/custom/path"

# Skip all caching
chat-to-map analyze ./chat.zip --no-cache

Cache Key Generation

Keys are deterministic SHA256 hashes with sorted object properties:

import { generateCacheKey, generateUrlCacheKey } from 'src/cache/key'

// Same key regardless of property order
generateCacheKey({ a: 1, b: 2 }) === generateCacheKey({ b: 2, a: 1 })

// URL cache key includes sanitized URL + hash
generateUrlCacheKey('https://example.com/path')
// → 'web/https_example_com_path_abc12345.json'
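
A minimal sketch of how such order-independent keys can be produced (the real helpers live in src/cache/key and may differ): serialize with sorted object properties, then SHA256 the result.

```typescript
import { createHash } from 'node:crypto'

// Deterministic serialization: object properties are sorted so that
// { a: 1, b: 2 } and { b: 2, a: 1 } produce the same string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`)
    return `{${entries.join(',')}}`
  }
  return JSON.stringify(value) ?? 'null'
}

// Hypothetical stand-in for generateCacheKey: SHA256 of the stable form.
function cacheKey(params: Record<string, unknown>): string {
  return createHash('sha256').update(stableStringify(params)).digest('hex')
}
```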

No TTL

Both caches store entries forever. Manual cleanup:

rm -rf ~/.cache/chat-to-map           # Clear all
rm -rf ~/.cache/chat-to-map/requests  # Clear API cache only

User CLI Config

The CLI config file is at ~/.config/chat-to-map/config.json:

{
  "homeCountry": "New Zealand",
  "timezone": "Pacific/Auckland",
  "mediaLibraryPath": "/Users/ndbroadbent/code/chat_to_map_worktrees/media-library/media_library/images",
  "pdfThumbnails": true,
  "fetchImages": true
}

Media Library

Local path: /Users/ndbroadbent/code/chat_to_map_worktrees/media-library/media_library/images/
CDN: https://media.chattomap.com/images/ for image assets, with the index at https://media.chattomap.com/index.json.gz (synced via rclone to R2)

images/
├── objects/          # ~700 curated activity images (swimming, restaurant, etc.)
├── categories/       # Category fallback images (references to objects)
├── countries/        # Country images (France, Japan, New Zealand, etc.)
├── regions/          # Region images (future)
├── cities/           # City images (future)
├── venues/           # Venue images (future)
├── index.json        # Index with all entries and synonyms (local)
└── index.json.gz     # Compressed index (CDN)

Use mediaLibraryPath in config or --media-library-path CLI flag to use local images instead of CDN.

What NOT to Do

  • ❌ Add IO operations to core library functions
  • ❌ Add progress callbacks or events
  • ❌ Add database operations
  • ❌ Add rate limiting logic (coordinator's job)
  • ❌ Use biome-ignore comments
  • ❌ Skip task ci before completing work
  • ❌ Forget to update project/TODO.md
  • ❌ Use inline imports like import('../../types').SomeType - add proper imports at the top of the file
  • ❌ Investigate bugs by reading code first - always write a failing test FIRST to reproduce the issue
  • ❌ Assume CLI stdout shows all results - commands default to 10 results max, use --max-results or --all to see more

Dependencies

Core dependencies are minimal:

  • exceljs - Excel export
  • jszip - Zip file handling
  • pdfkit - PDF generation

AI SDKs are peer dependencies (optional):

  • openai - Embeddings
  • @anthropic-ai/sdk - Classification

Default AI Models

You MUST use current model IDs. Outdated models will fail or produce poor results.

CLI Model Selection

Model ID determines provider. Set via CLASSIFIER_MODEL env var.

Model ID Provider API Model ID Required Env Var
gemini-3-flash google gemini-3-flash-preview GOOGLE_AI_API_KEY
gemini-3-flash-or openrouter google/gemini-3-flash-preview OPENROUTER_API_KEY
haiku-4.5 anthropic claude-haiku-4-5 ANTHROPIC_API_KEY
haiku-4.5-or openrouter anthropic/claude-3-5-haiku-latest OPENROUTER_API_KEY
gpt-5-mini openai gpt-5-mini OPENAI_API_KEY

Default: gemini-3-flash (falls back to haiku-4.5 if no Google AI key)

Keep these updated! Model constants are in src/classifier/models.ts - update LATEST_* when new models are released.


Last updated: 2025-12-25