Refactor: Documentation Fetcher

What

Rewrite scripts/fetchccdocs.py → scripts/fetcher.py

Clean 1:1 mirror of sitemap. Auto-discover sections. Zero hardcoding.

Why

Current problems:

Hardcoded patterns (fragile, needs updates for new sections)
Convoluted mapping (api/agent-sdk → claude-code-docs/api/api/agent-sdk)
50+ lines of path logic
Missing 6 API endpoints
Adding sections requires code changes

Goals:

100% coverage (all 269 sitemap URLs)
Zero hardcoding (auto-discover from sitemap)
Simple mapping (3 lines vs 50 lines)
Future-proof (new languages/sections auto-work)
Fast (~5 sec, async concurrent fetching)

How

Target Structure

True 1:1 mirror:

content/
├── en/
│   ├── docs/
│   ├── api/
│   ├── resources/
│   └── release-notes/
├── zh/                      # Future
├── blog/                    # anthropic.com posts
├── CHANGELOG.md             # GitHub changelog
├── claude-code-manifest.json
└── .metadata.json

Mapping: https://docs.claude.com/{path} → content/{path}.md

Core Changes

1. Auto-discover sections

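from collections import defaultdict

async def extract_urls(session):
    sitemap = await fetch_sitemap(session)
    urls = defaultdict(list)

    for url in parse_sitemap_urls(sitemap):
        if '/en/' not in url:
            continue
        # Group by section (first two path segments, e.g. "en/docs")
        path = url.replace('https://docs.claude.com/', '')
        section = '/'.join(path.split('/')[:2])
        urls[section].append(url)

    return urls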
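
The discovery step above relies on parse_sitemap_urls, which this plan names but does not define. A minimal sketch of one way to write it, assuming the sitemap is a standard sitemaps.org urlset document; the group_by_section helper and the page paths in the test data are hypothetical:

```python
from collections import defaultdict
from urllib.parse import urlparse
from xml.etree import ElementTree

# Namespace used by standard sitemap documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap <urlset>."""
    root = ElementTree.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def group_by_section(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by their first two path segments, e.g. 'en/docs'."""
    sections: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        parts = urlparse(url).path.strip("/").split("/")
        sections["/".join(parts[:2])].append(url)
    return dict(sections)
```

Because sections are derived from the URLs themselves, a new top-level section in the sitemap would show up as a new key without any code change.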

2. Simple path mapping

from pathlib import Path

output_dir = Path('content')

def get_output_path(url: str) -> Path:
    path = url.replace('https://docs.claude.com/', '')
    return output_dir / f"{path}.md"
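
To illustrate the mapping on concrete inputs, here is a slightly more defensive variant. It is a sketch, not the planned implementation: it assumes trailing slashes, query strings, and fragments should be ignored, which the plan does not state.

```python
from pathlib import Path
from urllib.parse import urlparse


def get_output_path(url: str, output_dir: Path = Path("content")) -> Path:
    """Map a docs URL to its mirrored Markdown file under output_dir."""
    # Use only the URL path so query strings and fragments never reach the filename.
    path = urlparse(url).path.strip("/")
    return output_dir / f"{path}.md"
```

With this variant, https://docs.claude.com/en/api/agent-sdk maps to content/en/api/agent-sdk.md, replacing the old claude-code-docs/api/api/agent-sdk layout.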

3. Special files (outside /en/)

# NPM manifest → content/claude-code-manifest.json
await fetch_npm_manifest()

# GitHub CHANGELOG → content/CHANGELOG.md
await fetch_github_changelog()

# Blog posts (anthropic.com) → content/blog/{slug}.md
await fetch_blog_posts()

New Features

--tree: Browse sitemap structure

$ fetcher.py --tree

en/ (269)
  docs/ (116)
    claude-code/ (44)
    build-with-claude/ (30)
    ...
  api/ (84)
    agent-sdk/ (16)
    admin-api/ (25)
    ...
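
The counts shown by --tree can be derived directly from the URL list. A minimal sketch, assuming each count is the number of pages under that directory prefix; count_prefixes and render_tree are hypothetical names, and this version sorts alphabetically rather than in the order shown above:

```python
from collections import Counter
from urllib.parse import urlparse


def count_prefixes(urls: list[str]) -> Counter:
    """Count how many URLs fall under each directory prefix."""
    counts: Counter = Counter()
    for url in urls:
        parts = urlparse(url).path.strip("/").split("/")
        # The last segment is the page itself, so only its parents are directories.
        for depth in range(1, len(parts)):
            counts["/".join(parts[:depth])] += 1
    return counts


def render_tree(counts: Counter) -> str:
    """Render the prefix counts as an indented tree, two spaces per level."""
    lines = []
    # Sort by path segments so children always follow their parent.
    for prefix in sorted(counts, key=lambda p: p.split("/")):
        depth = prefix.count("/")
        name = prefix.rsplit("/", 1)[-1]
        lines.append(f"{'  ' * depth}{name}/ ({counts[prefix]})")
    return "\n".join(lines)
```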

--section: Filter by path

$ fetcher.py --section en/docs/claude-code
$ fetcher.py --section en/api/agent-sdk
$ fetcher.py --section en/docs
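
Section filtering is a prefix match on the URL path, not on the full URL. A minimal sketch, assuming the match should respect segment boundaries so that en/docs does not also select a sibling such as en/docs-archive (a hypothetical section); in_section is a hypothetical helper name:

```python
from urllib.parse import urlparse


def in_section(url: str, section: str) -> bool:
    """Return True if the URL's path is the section itself or lies beneath it."""
    path = urlparse(url).path.strip("/")
    section = section.strip("/")
    return path == section or path.startswith(section + "/")
```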

CLI

fetcher.py                          Fetch all
fetcher.py --tree                   Show structure
fetcher.py --section en/docs        Filter by path
fetcher.py --incremental            Skip existing
fetcher.py --jobs 100               Parallel jobs
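
The flags above map naturally onto argparse. A sketch under the assumption that --jobs defaults to 50 (the default named under Efficiency); the help strings and build_parser name are illustrative only:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line interface described in the CLI table."""
    parser = argparse.ArgumentParser(
        prog="fetcher.py",
        description="Mirror the docs sitemap into content/.",
    )
    parser.add_argument("--tree", action="store_true",
                        help="show sitemap structure and exit")
    parser.add_argument("--section", metavar="PATH",
                        help="only fetch URLs under PATH, e.g. en/docs")
    parser.add_argument("--incremental", action="store_true",
                        help="skip files that already exist")
    parser.add_argument("--jobs", type=int, default=50, metavar="N",
                        help="number of concurrent downloads")
    return parser
```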

Implementation

Single-file script following @repo/sbin/AGENTS.md:

Minimal comments; --help is docs
Clean output (no emojis, minimal logging)
Fail fast
Type hints
PEP 723 inline deps
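
For reference, a PEP 723 metadata block at the top of scripts/fetcher.py might look like the following. The Python version bound is an assumption based on the 3.14t target mentioned in this plan, and aiohttp is the only dependency the plan names:

```python
# /// script
# requires-python = ">=3.14"
# dependencies = [
#     "aiohttp",
# ]
# ///
```

With this header in place, uv run scripts/fetcher.py resolves the dependencies itself, so the workflow needs no separate install step.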

Core logic:

class Fetcher:
    async def fetch_all(self) -> None:
        # Fetch special files first
        await self.fetch_npm_manifest()
        await self.fetch_github_changelog()
        await self.fetch_blog_posts()

        # Auto-discover sitemap URLs, flattened across sections
        sections = await self.extract_urls()
        urls = [u for group in sections.values() for u in group]

        # Filter if --section specified (self.section is a path such as "en/docs")
        if self.section:
            prefix = f"https://docs.claude.com/{self.section.strip('/')}/"
            urls = [u for u in urls if u.startswith(prefix)]

        # Concurrent download, at most self.jobs requests in flight
        semaphore = asyncio.Semaphore(self.jobs)
        tasks = [self.download(url, semaphore) for url in urls]
        await asyncio.gather(*tasks)

    def get_output_path(self, url: str) -> Path:
        path = url.replace('https://docs.claude.com/', '')
        return self.output_dir / f"{path}.md"

Efficiency:

Async/await with aiohttp
Concurrent downloads (default 50 jobs)
Single sitemap fetch
Incremental mode (skip existing)
Python 3.14t free-threaded
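
The bounded concurrency and incremental mode listed above can be sketched without any HTTP library by passing the fetch coroutine in as a parameter. In the real script that coroutine would presumably wrap an aiohttp request; download_all and its signature are hypothetical:

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable


async def download_all(
    targets: dict[str, Path],
    fetch: Callable[[str], Awaitable[str]],
    jobs: int = 50,
    incremental: bool = False,
) -> int:
    """Fetch every URL with at most `jobs` requests in flight; return files written."""
    semaphore = asyncio.Semaphore(jobs)

    async def download(url: str, dest: Path) -> bool:
        if incremental and dest.exists():
            return False  # already mirrored, so skip the request entirely
        async with semaphore:
            body = await fetch(url)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(body, encoding="utf-8")
        return True

    results = await asyncio.gather(*(download(u, p) for u, p in targets.items()))
    return sum(results)
```

The semaphore only guards the network call, so file writes for finished pages do not hold up a request slot.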

Update CLAUDE.md

Replace all doc references:

@./content/claude-code-docs/          → @./content/en/docs/claude-code/
@./content/build-with-claude/         → @./content/en/docs/build-with-claude/
@./content/agents-and-tools/          → @./content/en/docs/agents-and-tools/
@./content/about-claude/              → @./content/en/docs/about-claude/
@./content/test-and-evaluate/         → @./content/en/docs/test-and-evaluate/
@./content/resources/                 → @./content/en/resources/
@./content/release-notes/             → @./content/en/release-notes/

Keep:

@./content/claude-code-manifest.json  (unchanged)
@./content/CHANGELOG.md               (new location)
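
The reference rewrite can be applied mechanically. A small sketch that uses the mapping table above verbatim; rewrite_references is a hypothetical helper, and the hooks.md filename in the usage note is illustrative only:

```python
# Old prefix -> new prefix, copied from the replacement table above.
REFERENCE_MAP = {
    "@./content/claude-code-docs/": "@./content/en/docs/claude-code/",
    "@./content/build-with-claude/": "@./content/en/docs/build-with-claude/",
    "@./content/agents-and-tools/": "@./content/en/docs/agents-and-tools/",
    "@./content/about-claude/": "@./content/en/docs/about-claude/",
    "@./content/test-and-evaluate/": "@./content/en/docs/test-and-evaluate/",
    "@./content/resources/": "@./content/en/resources/",
    "@./content/release-notes/": "@./content/en/release-notes/",
}


def rewrite_references(text: str) -> str:
    """Replace every old content reference with its new mirrored location."""
    for old, new in REFERENCE_MAP.items():
        text = text.replace(old, new)
    return text
```

For example, a line referencing @./content/claude-code-docs/hooks.md becomes @./content/en/docs/claude-code/hooks.md, while @./content/claude-code-manifest.json is left untouched. No new prefix contains an old one, so running the rewrite twice is harmless.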

Workflow Updates

.github/workflows/fetch-claude-docs.yml:

- name: Fetch docs
  run: uv run scripts/fetcher.py

No other changes needed.

Verification

After implementation:

Run fetcher.py and verify that all 269 URLs are fetched
Check content/en/ structure mirrors sitemap
Verify special files (claude-code-manifest.json, CHANGELOG.md, blog/)
Test --tree output
Test --section filtering
Update and test CLAUDE.md references
Remove old content/ directories
Delete scripts/fetchccdocs.py and scripts/fetchccdocs.sh

Done when:

100% coverage (269/269 URLs + manifest + changelog + blog)
Clean 1:1 structure
CLAUDE.md references work
Workflow runs successfully
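
The coverage criterion can be checked by comparing the sitemap URL list against the files on disk. A minimal sketch, assuming a page counts as fetched only if its mirrored file exists and is non-empty; missing_pages is a hypothetical helper:

```python
from pathlib import Path
from urllib.parse import urlparse


def missing_pages(urls: list[str], output_dir: Path) -> list[str]:
    """Return the URLs whose mirrored .md file is absent or empty."""
    missing = []
    for url in urls:
        dest = output_dir / f"{urlparse(url).path.strip('/')}.md"
        if not dest.is_file() or dest.stat().st_size == 0:
            missing.append(url)
    return missing
```

An empty result for the full sitemap corresponds to the 269/269 target; any returned URL points at a page to re-fetch.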
