| name | web-scraping |
|---|---|
| description | This skill activates for web scraping and Actor development. It proactively discovers APIs via traffic interception, recommends optimal strategy (traffic interception/sitemap/API/DOM scraping/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI. |
| license | MIT |
Activate automatically when user requests:
- "Scrape [website]"
- "Extract data from [site]"
- "Get product information from [URL]"
- "Find all links/pages on [site]"
- "I'm getting blocked" or "Getting 403 errors" (loads strategies/anti-blocking.md)
- "Make this an Apify Actor" (loads apify/ subdirectory)
- "Productionize this scraper"
Determine reconnaissance depth from user request:
| User Says | Mode | Phases Run |
|---|---|---|
| "quick recon", "just check", "what framework" | Quick | Phase 0 only |
| "scrape X", "extract data from X" (default) | Standard | Phases 0-3 + 5, Phase 4 only if protection signals detected |
| "full recon", "deep scan", "production scraping" | Full | All phases (0-5) including protection testing |
Default is Standard mode. Escalate to Full if protection signals appear during any phase.
This skill uses an adaptive phased workflow with quality gates. Each gate asks "Do I have enough?" — continue only when the answer is no.
See: strategies/framework-signatures.md for framework detection tables referenced throughout.
Gather maximum intelligence with minimum cost — a single HTTP request.
Step 0a: Fetch raw HTML and headers
curl -s -D- -L "https://target.com/page" -o response.html
Step 0b: Check response headers
- Match headers against strategies/framework-signatures.md → Response Header Signatures table
- Note `Server`, `X-Powered-By`, `X-Shopify-Stage`, `Set-Cookie` (protection markers)
- Check HTTP status code (200 = accessible, 403 = protected, 3xx = redirects)
Step 0c: Check Known Major Sites table
- Match domain against strategies/framework-signatures.md → Known Major Sites
- If matched: use the specified data strategy, skip generic pattern scanning
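The lookup itself is trivial; a sketch under stated assumptions follows. The table entries below are invented placeholders — the real domain-to-strategy entries live in strategies/framework-signatures.md.

```javascript
// Direct strategy lookup for known major sites, bypassing generic scans.
// KNOWN_SITES contents are illustrative placeholders only.
const KNOWN_SITES = {
  'www.amazon.com': 'dom-scraping',
  'www.reddit.com': 'json-api',
};

function knownStrategy(url) {
  return KNOWN_SITES[new URL(url).hostname] ?? null;
}
```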
Step 0d: Detect framework from HTML
- Search raw HTML for signatures in strategies/framework-signatures.md → HTML Signatures table
- Look for `__NEXT_DATA__`, `__NUXT__`, `ld+json`, `/wp-content/`, `data-reactroot`
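A signature scan over the raw HTML can be as simple as substring matching. This sketch uses a small subset of the signatures listed above; the full table is in strategies/framework-signatures.md.

```javascript
// Scan raw HTML for framework signatures (illustrative subset).
const SIGNATURES = [
  { pattern: '__NEXT_DATA__', framework: 'nextjs' },
  { pattern: '__NUXT__', framework: 'nuxt' },
  { pattern: '/wp-content/', framework: 'wordpress' },
  { pattern: 'data-reactroot', framework: 'react' },
];

function detectFrameworks(html) {
  return SIGNATURES.filter(s => html.includes(s.pattern)).map(s => s.framework);
}
```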
Step 0e: Search for target data points
- For each data point the user wants: search raw HTML for that content
- Track which data points are found vs missing
- Check for sitemaps:
curl -s https://[site]/robots.txt | grep -i Sitemap
Step 0f: Note protection signals
- 403/503 status, Cloudflare challenge HTML, CAPTCHA elements, `cf-ray` header
- Record for Phase 4 decision
See: strategies/cheerio-vs-browser-test.md for the Cheerio viability assessment
QUALITY GATE A: All target data points found in raw HTML and no protection signals?
→ YES: Skip to Phase 3 (Validate Findings). No browser needed.
→ NO: Continue to Phase 1.
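The gate logic can be expressed as a small predicate. This is a sketch: the `foundInRawHtml` field name and the return labels are assumptions for illustration, not part of the skill's interface.

```javascript
// Quality Gate A: skip to validation only when every requested data
// point was found in raw HTML and no protection signal was recorded.
function gateA(dataPoints, protectionSignals) {
  const allFound = dataPoints.every(dp => dp.foundInRawHtml);
  return allFound && protectionSignals.length === 0
    ? 'skip-to-phase-3'
    : 'continue-to-phase-1';
}
```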
Launch browser only for data points missing from raw HTML or when JavaScript rendering is required.
Step 1a: Initialize browser session
- `proxy_start()` → Start traffic interception proxy
- `interceptor_chrome_launch(url, stealthMode: true)` → Launch Chrome with anti-detection
- `interceptor_chrome_devtools_attach(target_id)` → Attach DevTools bridge
- `interceptor_chrome_devtools_screenshot()` → Capture visual state
Step 1b: Capture traffic and rendered DOM
- `proxy_list_traffic()` → Review all traffic from page load
- `proxy_search_traffic(query: "application/json")` → Find JSON responses
- `interceptor_chrome_devtools_list_network(resource_types: ["xhr", "fetch"])` → XHR/fetch calls
- `interceptor_chrome_devtools_snapshot()` → Accessibility tree (rendered DOM)
Step 1c: Search rendered DOM for missing data points
- For each data point NOT found in Phase 0: search rendered DOM
- Use the framework-specific search strategy from strategies/framework-signatures.md → Framework → Search Strategy table
- Only search patterns relevant to the detected framework
Step 1d: Inspect discovered endpoints
- `proxy_get_exchange(exchange_id)` → Full request/response for promising endpoints
- Document: method, headers, auth, response structure, pagination
QUALITY GATE B: All target data points now covered (raw HTML + rendered DOM + traffic)?
→ YES: Skip to Phase 3 (Validate Findings). No deep scan needed.
→ NO: Continue to Phase 2 for missing data points only.
Targeted investigation for data points not yet found. Only search for what's missing.
Step 2a: Test interactions for missing data
- `proxy_clear_traffic()` before each action → Isolate API calls
- `humanizer_click(target_id, selector)` → Trigger dynamic content loads
- `humanizer_scroll(target_id, direction, amount)` → Trigger lazy loading / infinite scroll
- `humanizer_idle(target_id, duration_ms)` → Wait for delayed content
- After each action: `proxy_list_traffic()` → Check for new API calls
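The clear-act-list cycle amounts to diffing the traffic list around each interaction. A minimal sketch follows; the exchange shape (`{ id, url }`) is an assumption about what the traffic-listing tool returns.

```javascript
// Return only the exchanges that appeared after an interaction,
// by diffing traffic snapshots taken before and after the action.
function newExchanges(before, after) {
  const seen = new Set(before.map(e => e.id));
  return after.filter(e => !seen.has(e.id));
}
```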
Step 2b: Sniff APIs (framework-aware)
- Search only patterns relevant to the detected framework:
  - Next.js → `proxy_list_traffic(url_filter: "/_next/data/")`
  - WordPress → `proxy_list_traffic(url_filter: "/wp-json/")`
  - GraphQL → `proxy_search_traffic(query: "graphql")`
  - Generic → `proxy_list_traffic(url_filter: "/api/")` + `proxy_search_traffic(query: "application/json")`
- Skip patterns that don't apply to the detected framework
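The framework-to-filter mapping above can be kept as a small lookup, with the generic patterns as the fallback. A sketch, assuming the returned strings are what gets passed as `url_filter` values:

```javascript
// Framework-aware URL filters: search only where the detected
// framework actually serves data; fall back to generic patterns.
const API_FILTERS = {
  nextjs: ['/_next/data/'],
  wordpress: ['/wp-json/'],
  graphql: ['graphql'],
  generic: ['/api/'],
};

function filtersFor(framework) {
  return API_FILTERS[framework] ?? API_FILTERS.generic;
}
```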
Step 2c: Test pagination and filtering
- Only if pagination data is a missing data point or needed for coverage assessment
- `proxy_clear_traffic()` → click next page → `proxy_list_traffic(url_filter: "page=")`
- Document pagination type (URL-based, API offset, cursor, infinite scroll)
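Classifying the pagination type from the captured request URLs can be done with a few heuristics. The parameter names below (`cursor`, `after`, `offset`, `start`, `page`) are common conventions, not guarantees; treat the result as a starting hypothesis.

```javascript
// Rough pagination-type classifier over request URLs captured
// while clicking "next page". Heuristics are illustrative.
function classifyPagination(urls) {
  if (urls.some(u => /[?&](cursor|after)=/.test(u))) return 'cursor';
  if (urls.some(u => /[?&](offset|start)=\d+/.test(u))) return 'api-offset';
  if (urls.some(u => /[?&]page=\d+/.test(u))) return 'url-based';
  return 'unknown';
}
```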
QUALITY GATE C: Enough data points covered for a useful report?
→ YES: Go to Phase 3.
→ NO: Document gaps, go to Phase 3 anyway (the report will note missing data in the self-critique).
Every claimed extraction method must be verified. A data point is not "found" until the extraction path is specified and tested.
See: strategies/cheerio-vs-browser-test.md for validation methodology
Step 3a: Validate CSS selectors
- For each Cheerio/selector-based method: confirm the selector matches actual HTML
- Test against raw HTML (curl output) or rendered DOM (snapshot)
- Confirm selector extracts the correct value, not a different element
Step 3b: Validate JSON paths
- For each JSON extraction (e.g., `__NEXT_DATA__`, API response): confirm the path resolves
- Parse the JSON, follow the path, verify it returns the expected data type and value
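Walking a claimed path through parsed JSON is a one-liner worth having. A sketch; the sample `__NEXT_DATA__`-style object and its field names are invented for illustration.

```javascript
// Validate a claimed JSON path by actually walking it.
// Returns undefined when any segment of the path is missing.
function resolvePath(obj, path) {
  return path.split('.').reduce(
    (node, key) => (node == null ? undefined : node[key]),
    obj
  );
}

const nextData = { props: { pageProps: { product: { price: 19.99 } } } };
console.log(resolvePath(nextData, 'props.pageProps.product.price')); // 19.99
```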
Step 3c: Validate API endpoints
- For each discovered API: replay the request (curl or `proxy_get_exchange`)
- Confirm: response status 200, expected data structure, correct values
- Test pagination if claimed (at least page 1 and page 2)
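The confirm step maps directly onto the report's YES / PARTIAL / NO statuses. A sketch of that check over a replayed response; the plain `{ status, body }` shape stands in for whatever the replay tool actually returns.

```javascript
// Check a replayed API response against the claims made for it:
// status code, then presence of the expected top-level keys.
function validateApiClaim(response, expectedKeys) {
  if (response.status !== 200) {
    return { validated: 'NO', reason: `status ${response.status}` };
  }
  const missing = expectedKeys.filter(k => !(k in response.body));
  if (missing.length) {
    return { validated: 'PARTIAL', reason: `missing keys: ${missing.join(', ')}` };
  }
  return { validated: 'YES' };
}
```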
Step 3d: Downgrade or re-investigate failures
- If a selector doesn't match: try alternative selectors, or downgrade to PARTIAL confidence
- If an API returns 403: note protection requirement, flag for Phase 4
- If a JSON path is wrong: re-examine the JSON structure, correct the path
See: strategies/proxy-escalation.md for complete skip/run decision logic
Skip Phase 4 when ALL true:
- No protection signals detected in Phases 0-2
- All data points have validated extraction methods
- User didn't request "full recon"
Run Phase 4 when ANY true:
- 403/challenge page observed during any phase
- Known high-protection domain
- High-volume or production intent
- User explicitly requested it
If running:
Step 4a: Test raw HTTP access
curl -s -o /dev/null -w "%{http_code}" "https://target.com/page"
- 200 → Cheerio viable, no browser needed for accessible endpoints
- 403/503 → Escalate to stealth browser
Step 4b: Test with stealth browser (if needed)
- Already running from Phase 1 — check if pages loaded without challenges
- `interceptor_chrome_devtools_list_cookies(domain_filter: "cloudflare")` → Protection cookies
- `interceptor_chrome_devtools_list_storage_keys(storage_type: "local")` → Fingerprint markers
- `proxy_get_tls_fingerprints()` → TLS fingerprint analysis
Step 4c: Test with upstream proxy (if needed)
- `proxy_set_upstream("http://user:pass@proxy-provider:port")`
- Re-test blocked endpoints through the proxy
- Document minimum access level for each data point
Step 4d: Document protection profile
- What protections exist, what worked to bypass them, what production scrapers will need
Generate the intelligence report, then critically review it for gaps.
See: reference/report-schema.md for complete report format
Step 5a: Generate report
- Follow the reference/report-schema.md schema (Sections 1-6)
- Include a Validated? status for every strategy (YES / PARTIAL / NO)
- Include all discovered endpoints with full specs
Step 5b: Self-critique
- Write Section 7 (Self-Critique) per reference/report-schema.md:
  - Gaps: data points not found, why, and what would find them
  - Skipped steps: which phases were skipped, with quality-gate reasoning
  - Unvalidated claims: anything marked PARTIAL or NO
  - Assumptions: things not verified (e.g., "consistent layout across categories")
  - Staleness risk: geo-dependent prices, A/B layouts, session-specific content
  - Recommendations: targeted next steps (not "re-run everything")
Step 5c: Fix gaps with targeted re-investigation
- If self-critique reveals fixable gaps: go back to the specific phase/step, not a full re-run
- Example: "Price selector untested" → run one curl + parse, don't re-launch browser
- Update report with results
Step 5d: Record session (if browser was used)
- `proxy_session_start(name)` → `proxy_session_stop(session_id)` → `proxy_export_har(session_id, path)`
- The HAR file captures all traffic for replay. See strategies/session-workflows.md
After reconnaissance report is accepted, implement scraper iteratively.
Core Pattern:
- Implement recommended approach (minimal code)
- Test with small batch (5-10 items)
- Validate data quality
- Scale to full dataset or fallback
- Handle blocking if encountered
- Add robustness (error handling, retries, logging)
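The small-batch-then-validate steps above can be sketched as a helper that runs the extractor on a handful of items and checks field coverage before scaling. `scrapeItem` and the `requiredFields` check are illustrative stand-ins for your actual extractor and quality criteria.

```javascript
// Small-batch-first: scrape a few items, verify every required
// field is present, and report whether it is safe to scale up.
async function smallBatchCheck(urls, scrapeItem, requiredFields, batchSize = 5) {
  const batch = urls.slice(0, batchSize);
  const results = await Promise.all(batch.map(scrapeItem));
  const bad = results.filter(r => requiredFields.some(f => r[f] == null));
  return { ok: bad.length === 0, sampled: results.length, failed: bad.length };
}
```

If the check fails, fix the extractor (or fall back to the next strategy) before running the full dataset.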
See: workflows/implementation.md for complete implementation patterns and code examples
Convert scraper to production-ready Apify Actor.
Activation triggers: "Make this an Apify Actor", "Productionize this", "Deploy to Apify"
Core Pattern:
- Confirm TypeScript preference (STRONGLY RECOMMENDED)
- Initialize with the `apify create` command (CRITICAL)
- Port scraping logic to Actor format
- Test locally and deploy
Note: During development, proxy-mcp provides reconnaissance and traffic analysis. For production Actors, use Crawlee crawlers (CheerioCrawler/PlaywrightCrawler) on Apify infrastructure.
See: workflows/productionization.md for complete workflow and apify/ for Actor development guides
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Adaptive Phases 0-5 | workflows/reconnaissance.md |
| Framework detection | Header + HTML signature matching | strategies/framework-signatures.md |
| Cheerio vs Browser | Three-way test + early exit | strategies/cheerio-vs-browser-test.md |
| Traffic analysis | proxy_list_traffic() + proxy_get_exchange() | strategies/traffic-interception.md |
| Protection testing | Conditional escalation | strategies/proxy-escalation.md |
| Report format | Sections 1-7 with self-critique | reference/report-schema.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | Traffic capture (automatic) | strategies/api-discovery.md |
| DOM scraping | DevTools bridge + humanizer | strategies/dom-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | Stealth mode + upstream proxies | strategies/anti-blocking.md |
| Session recording | proxy_session_start() / proxy_export_har() | strategies/session-workflows.md |
| Proxy-MCP tools | Complete reference | reference/proxy-tool-reference.md |
| Fingerprint configs | Stealth + TLS presets | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |
```javascript
import { RobotsFile, CheerioCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const data = {
      title: $('h1').text().trim(),
      // ... extract data
    };
    await Dataset.pushData(data);
  },
});

await crawler.addRequests(urls);
await crawler.run();
```
See examples/sitemap-basic.js for the complete example.
```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
  const response = await gotScraping({
    url: `https://api.example.com/products/${id}`,
    responseType: 'json',
  });
  console.log(response.body);
}
```
See examples/api-scraper.js for the complete example.
```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
  .map(url => url.match(/\/products\/(\d+)/)?.[1])
  .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
  const data = await gotScraping({
    url: `https://api.shop.com/v1/products/${id}`,
    responseType: 'json',
  });
  // Process data
}
```
See examples/hybrid-sitemap-api.js for the complete example.
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
For: Step-by-step workflow guides for each phase
- workflows/reconnaissance.md - Interactive reconnaissance workflow (CRITICAL)
- workflows/implementation.md - Iterative implementation patterns
- workflows/productionization.md - Apify Actor creation workflow
For: Detailed guides on specific scraping approaches
- strategies/framework-signatures.md - Framework detection lookup tables (Phase 0/1)
- strategies/cheerio-vs-browser-test.md - Cheerio vs Browser decision test with early exit
- strategies/proxy-escalation.md - Protection testing skip/run conditions (Phase 4)
- strategies/traffic-interception.md - Traffic interception via MITM proxy
- strategies/sitemap-discovery.md - Complete sitemap guide (4 patterns)
- strategies/api-discovery.md - Finding and using APIs
- strategies/dom-scraping.md - DOM scraping via DevTools bridge
- strategies/cheerio-scraping.md - HTTP-only scraping
- strategies/hybrid-approaches.md - Combining strategies
- strategies/anti-blocking.md - Multi-layer anti-detection (stealth, humanizer, proxies, TLS)
- strategies/session-workflows.md - Session recording, HAR export, replay
For: Working code to reference or execute
JavaScript Learning Examples (Simple standalone scripts):
- examples/sitemap-basic.js - Simple sitemap scraper
- examples/api-scraper.js - Pure API approach
- examples/traffic-interception-basic.js - Proxy-based reconnaissance
- examples/hybrid-sitemap-api.js - Combined approach
- examples/iterative-fallback.js - Try traffic interception → sitemap → API → DOM scraping
TypeScript Production Examples (Complete Actors):
- apify/examples/basic-scraper/ - Sitemap + Playwright
- apify/examples/anti-blocking/ - Fingerprinting + proxies
- apify/examples/hybrid-api/ - Sitemap + API (optimal)
For: Quick patterns and troubleshooting
- reference/report-schema.md - Intelligence report format (Sections 1-7 + self-critique)
- reference/proxy-tool-reference.md - Proxy-MCP tool reference (all 80+ tools)
- reference/regex-patterns.md - Common URL regex patterns
- reference/fingerprint-patterns.md - Stealth mode + TLS fingerprint presets
- reference/anti-patterns.md - What NOT to do
For: Creating production Apify Actors
- apify/README.md - When and how to use Apify
- apify/typescript-first.md - Why TypeScript for Actors
- apify/cli-workflow.md - apify create workflow (CRITICAL)
- apify/initialization.md - Complete setup guide
- apify/input-schemas.md - Input validation patterns
- apify/configuration.md - actor.json setup
- apify/deployment.md - Testing and deployment
- apify/templates/ - TypeScript boilerplate
Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
Start cheap (curl), escalate only when needed:
- Phase 0 (curl) before Phase 1 (browser) before Phase 2 (deep scan)
- Quality gates skip phases when data is sufficient
- Never launch a browser if curl gives you everything
Use framework detection to focus searches:
- Match against strategies/framework-signatures.md before scanning
- Skip patterns that don't apply (no `__NEXT_DATA__` on Amazon)
- Known major sites get a direct strategy lookup
Every claimed extraction method must be tested:
- "Found text in HTML" is not enough — need a working selector/path
- Phase 3 validates every finding before the report
- Unvalidated claims are marked PARTIAL or NO in the report
Build incrementally:
- Small test batch first (5-10 items)
- Validate quality
- Scale or fallback
- Add robustness last
When productionizing:
- Use TypeScript (strongly recommended)
- Use `apify create` (never manual setup)
- Add proper error handling
- Include logging and monitoring
Remember: Traffic interception first, sitemaps second, APIs third, DOM scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.