From 7d3b177318f0652e085ac5fac89dd555b7d8f682 Mon Sep 17 00:00:00 2001
From: Mohammad Faiz
Date: Mon, 22 Jun 2026 19:06:10 +0530
Subject: [PATCH 1/4] fix: fix cache markdown crash, viewport aspect ratio, dead code, typos, and cleanup docs emojis

- fix(critical): cache markdown dict collapsing causes AttributeError on cached URLs
- fix(high): viewport aspect ratio formula was inverted, producing wrong sizes
- fix(high): remove unreachable duplicate except block in execute_user_script
- fix(medium): remove dead duplicate normalize_url, normalize_url_tmp definitions
- fix(medium): remove dead html2text.HTML2Text() instantiation
- fix(medium): fix typo fa_user_agenr_generator -> ua_generator
- fix(medium): SCREENSHOT_HEIGHT_TRESHOLD -> THRESHOLD with backward compat
- fix(medium): remove orphaned adaptive_crawler copy.py
- fix(low): fix configure_windows_event_loop import path in docstring
- fix(low): remove duplicate exclude_internal_links docstring
- fix: complete markdown cache read rewrite to handle all json.loads() outcomes
- fix: add type safety in CrawlResult.__init__ for markdown parameter
- fix: acache_url MarkdownGenerationResult() Pydantic v2 crash with missing fields
- docs: remove decorative emojis from 46 .md files, keep only functional ones
---
 CHANGELOG.md | 86 +-
 README-first.md | 154 +-
 README.md | 244 +--
 ROADMAP.md | 12 +-
 SPONSORS.md | 22 +-
 crawl4ai/adaptive_crawler copy.py | 1847 -----------------
 crawl4ai/async_configs.py | 12 +-
 crawl4ai/async_crawler_strategy.py | 14 +-
 crawl4ai/async_database.py | 61 +-
 crawl4ai/config.py | 3 +-
 crawl4ai/models.py | 13 +-
 crawl4ai/utils.py | 59 +-
 deploy/docker/README.md | 24 +-
 docs/blog/release-v0.7.0.md | 14 +-
 docs/blog/release-v0.7.1.md | 8 +-
 docs/blog/release-v0.7.3.md | 46 +-
 docs/blog/release-v0.7.4.md | 24 +-
 docs/blog/release-v0.7.5.md | 18 +-
 docs/blog/release-v0.7.6.md | 26 +-
 docs/blog/release-v0.7.7.md | 76 +-
 docs/deprecated/docker-deployment.md | 10 +-
 .../c4a_script/amazon_example/README.md | 18 +-
 docs/examples/c4a_script/tutorial/README.md | 40 +-
 docs/examples/chainlit.md | 2 +-
 .../url_seeder/tutorial_url_seeder.md | 86 +-
 docs/md_v2/advanced/session-management.md | 2 +-
 docs/md_v2/api/c4a-script-reference.md | 16 +-
 docs/md_v2/apps/c4a-script/README.md | 40 +-
 docs/md_v2/apps/crawl4ai-assistant/README.md | 16 +-
 docs/md_v2/apps/index.md | 30 +-
 docs/md_v2/basic/installation.md | 6 +-
 .../articles/adaptive-crawling-revolution.md | 8 +-
 .../blog/articles/llm-context-revolution.md | 8 +-
 .../articles/virtual-scroll-revolution.md | 6 +-
 docs/md_v2/blog/index.md | 12 +-
 docs/md_v2/blog/releases/0.4.1.md | 4 +-
 docs/md_v2/blog/releases/0.4.2.md | 12 +-
 docs/md_v2/blog/releases/0.6.0.md | 4 +-
 docs/md_v2/blog/releases/0.7.0.md | 14 +-
 docs/md_v2/blog/releases/0.7.1.md | 8 +-
 docs/md_v2/blog/releases/0.7.2.md | 14 +-
 docs/md_v2/blog/releases/0.7.3.md | 16 +-
 docs/md_v2/blog/releases/0.7.6.md | 26 +-
 docs/md_v2/blog/releases/v0.4.3b1.md | 6 +-
 docs/md_v2/blog/releases/v0.7.5.md | 18 +-
 docs/md_v2/blog/releases/v0.7.7.md | 76 +-
 docs/md_v2/branding/index.md | 24 +-
 docs/md_v2/core/c4a-script.md | 22 +-
 docs/md_v2/core/deep-crawling.md | 2 +-
 docs/md_v2/core/self-hosting.md | 58 +-
 docs/md_v2/core/table_extraction.md | 2 +-
 docs/md_v2/index.md | 12 +-
 52 files changed, 748 insertions(+), 2633 deletions(-)
 delete mode 100644 crawl4ai/adaptive_crawler copy.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c09b7d2e1..fbd6a3ff2 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -166,17 +166,17 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
 ### Added
 - **🚀 init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions
-- **🔄 CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
-- **💾 Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
-- **📄 PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
-- **📸 Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
+- ** CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
+- ** Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
+- ** PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
+- ** Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
 - **🔗 base_url Parameter**: Proper URL resolution for raw: HTML processing
-- **⚡ Prefetch Mode**: Two-phase deep crawling with fast link extraction
-- **🔀 Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
-- **🌐 HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
-- **🖥️ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
-- **📋 Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
-- **📚 Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines
+- ** Prefetch Mode**: Two-phase deep crawling with fast link extraction
+- ** Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
+- ** HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
+- ** Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
+- ** Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
+- ** Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines
 ### Fixed
 - **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`)
@@ -201,7 +201,7 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
 ## [0.7.3] - 2025-08-09
 ### Added
-- **🕵️ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
+- ** Undetected Browser Support**: New browser adapter pattern with stealth capabilities
 - `browser_adapter.py` with undetected Chrome integration
 - Bypass sophisticated bot detection systems (Cloudflare, Akamai, custom solutions)
 - Support for headless stealth mode with anti-detection techniques
@@ -209,7 +209,7 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
 - Comprehensive examples for anti-bot strategies and stealth crawling
 - Full documentation guide for undetected browser usage
-- **🎨 Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
+- ** Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
 - Different crawling strategies for different URL patterns in a single batch
 - Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
 - Lambda function matchers for complex URL logic
@@ -217,7 +217,7 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
 - Fallback configuration support when no patterns match
 - First-match-wins configuration selection with optional fallback
-- **🧠 Memory Monitoring & Optimization**: Comprehensive memory usage tracking
+- ** Memory Monitoring & Optimization**: Comprehensive memory usage tracking
 - New `memory_utils.py` module for memory monitoring and optimization
 - Real-time memory usage tracking during crawl sessions
 - Memory leak detection and reporting
@@ -225,21 +225,21 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
 - Peak memory usage analysis and efficiency metrics
 - Automatic cleanup suggestions for memory-intensive operations
-- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
+- ** Enhanced Table Extraction**: Improved table access and DataFrame conversion
 - Direct `result.tables` interface replacing generic `result.media` approach
 - Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
 - Enhanced table detection algorithms for better accuracy
 - Table metadata including source XPath and headers
 - Improved table structure preservation during extraction
-- **💰 GitHub Sponsors Integration**: 4-tier sponsorship system
+- ** GitHub Sponsors Integration**: 4-tier sponsorship system
 - Supporter ($5/month): Community support + early feature previews
 - Professional ($25/month): Priority support + beta access
 - Business ($100/month): Direct consultation + custom integrations
 - Enterprise ($500/month): Dedicated support + feature development
 - Custom arrangement options for larger organizations
-- **🐳 Docker LLM Provider Flexibility**: Environment-based LLM configuration
+- ** Docker LLM Provider Flexibility**: Environment-based LLM configuration
 - `LLM_PROVIDER` environment variable support for dynamic provider switching
 - `.llm.env` file support for secure configuration management
 - Per-request provider override capabilities in API endpoints
@@ -1172,14 +1172,14 @@ asyncio.run(browser_management_demo())
 - Introduced `CacheMode` enum (`ENABLED`, `DISABLED`, `READ_ONLY`, `WRITE_ONLY`, `BYPASS`) and `always_bypass_cache` parameter in AsyncWebCrawler for fine-grained cache control. This replaces `bypass_cache`, `no_cache_read`, `no_cache_write`, and `always_by_pass_cache`.
-### 🗑️ Removals
+### Removals
 - Removed deprecated: `crawl4ai/content_cleaning_strategy.py`.
 - Removed internal class ContentCleaningStrategy
 - Removed legacy cache control flags: `bypass_cache`, `disable_cache`, `no_cache_read`, `no_cache_write`, and `always_by_pass_cache`. These have been superseded by `cache_mode`.
-### ⚙️ Other Changes
+### Other Changes
 - Moved version file to `crawl4ai/__version__.py`.
 - Added `crawl4ai/cache_context.py`.
@@ -1196,7 +1196,7 @@ asyncio.run(browser_management_demo())
 - The synchronous version of `WebCrawler` is being phased out. While still available via `crawl4ai[sync]`, it will eventually be removed. Transition to `AsyncWebCrawler` is strongly recommended. Boolean cache control flags in `arun` are also deprecated, migrate to using the `cache_mode` parameter. See examples in the "New Features" section above for correct usage.
-### 🐛 Bug Fixes
+### Bug Fixes
 - Resolved issue with browser context closing unexpectedly in Docker. This significantly improves stability, particularly within containerized environments.
 - Fixed memory leaks associated with incorrect asynchronous cleanup by removing the `__del__` method and ensuring the browser context is closed explicitly using context managers.
@@ -1680,8 +1680,8 @@ Significant improvements in text processing and performance:
 - 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
 - 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
-- ⚡ **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
-- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
 These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
@@ -1689,40 +1689,40 @@ These changes address issue #68 and provide a foundation for faster, more effici
 Major improvements in functionality, performance, and cross-platform compatibility! 🚀
-- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
-- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
-- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
-- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
-- ⚡ **Performance boost**: Various improvements to enhance overall speed and performance.
+- **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
+- **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
+- **Performance boost**: Various improvements to enhance overall speed and performance.
 A big shoutout to our amazing community contributors:
 - [@aravindkarnam](https://github.com/aravindkarnam) for developing the textual description extraction feature.
 - [@FractalMind](https://github.com/FractalMind) for creating the first official Docker Hub image and fixing Dockerfile errors.
 - [@ketonkss4](https://github.com/ketonkss4) for identifying Selenium's new capabilities, helping us reduce dependencies.
-Your contributions are driving Crawl4AI forward! 🙌
+Your contributions are driving Crawl4AI forward!
 ## [v0.2.75] - 2024-07-19
 Minor improvements for a more maintainable codebase:
-- 🔄 Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
-- 🔄 Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
+- Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
+- Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
 These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
 ## [v0.2.74] - 2024-07-08
-A slew of exciting updates to improve the crawler's stability and robustness! 🎉
+A slew of exciting updates to improve the crawler's stability and robustness!
-- 💻 **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
+- **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
 - 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
-- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
-- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
+- **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
+- **Database cleanup**: Removed existing database file and initialized a new one.
 ## [v0.2.73] - 2024-07-03
-💡 In this release, we've bumped the version to v0.2.73 and refreshed our documentation to ensure you have the best experience with our project.
+ In this release, we've bumped the version to v0.2.73 and refreshed our documentation to ensure you have the best experience with our project.
 * Supporting website need "with-head" mode to crawl the website with head.
 * Fixing the installation issues for setup.py and dockerfile.
@@ -1730,23 +1730,23 @@ A slew of exciting updates to improve the crawler's stability and robustness!
 ## [v0.2.72] - 2024-06-30
-This release brings exciting updates and improvements to our project! 🎉
+This release brings exciting updates and improvements to our project!
-* 📚 **Documentation Updates**: Our documentation has been revamped to reflect the latest changes and additions.
+* **Documentation Updates**: Our documentation has been revamped to reflect the latest changes and additions.
 * 🚀 **New Modes in setup.py**: We've added support for three new modes in setup.py: default, torch, and transformers. This enhances the project's flexibility and usability.
-* 🐳 **Docker File Updates**: The Docker file has been updated to ensure seamless compatibility with the new modes and improvements.
-* 🕷️ **Temporary Solution for Headless Crawling**: We've implemented a temporary solution to overcome issues with crawling websites in headless mode.
+* **Docker File Updates**: The Docker file has been updated to ensure seamless compatibility with the new modes and improvements.
+* **Temporary Solution for Headless Crawling**: We've implemented a temporary solution to overcome issues with crawling websites in headless mode.
 These changes aim to improve the overall user experience, provide more flexibility, and enhance the project's performance. We're thrilled to share these updates with you and look forward to continuing to evolve and improve our project!
 ## [0.2.71] - 2024-06-26
-**Improved Error Handling and Performance** 🚧
+**Improved Error Handling and Performance**
-* 🚫 Refactored `crawler_strategy.py` to handle exceptions and provide better error messages, making it more robust and reliable.
-* 💻 Optimized the `get_content_of_website_optimized` function in `utils.py` for improved performance, reducing potential bottlenecks.
-* 💻 Updated `utils.py` with the latest changes, ensuring consistency and accuracy.
-* 🚫 Migrated to `ChromeDriverManager` to resolve Chrome driver download issues, providing a smoother user experience.
+* Refactored `crawler_strategy.py` to handle exceptions and provide better error messages, making it more robust and reliable.
+* Optimized the `get_content_of_website_optimized` function in `utils.py` for improved performance, reducing potential bottlenecks.
+* Updated `utils.py` with the latest changes, ensuring consistency and accuracy.
+* Migrated to `ChromeDriverManager` to resolve Chrome driver download issues, providing a smoother user experience.
 These changes focus on refining the existing codebase, resulting in a more stable, efficient, and user-friendly experience. With these improvements, you can expect fewer errors and better performance in the crawler strategy and utility functions.
diff --git a/README-first.md b/README-first.md
index 2a21df395..d369627b8 100644
--- a/README-first.md
+++ b/README-first.md
@@ -27,12 +27,12 @@
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
-[✨ Check out latest update v0.7.0](#-recent-updates)
+[ Check out latest update v0.7.0](#-recent-updates)
-🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
+ **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
-🤓 My Personal Story
+ My Personal Story
 My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications a challenging yet rewarding experience that honed my skills in data extraction.
@@ -43,7 +43,7 @@ I made Crawl4AI open-source for two reasons. First, it’s my way of giving back
 Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
-## 🧐 Why Crawl4AI?
+## Why Crawl4AI?
 1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
 2. **Lightning Fast**: Delivers results faster with real-time, cost-efficient performance.
@@ -102,96 +102,96 @@ crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
 crwl https://www.example.com/products -q "Extract all product prices"
 ```
-## ✨ Features
+## Features
-📝 Markdown Generation
+ Markdown Generation
-- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
-- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
+- **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
+- **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
 - 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
-- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
-- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
+- **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
+- **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
-📊 Structured Data Extraction
+ Structured Data Extraction
 - 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
-- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
-- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
-- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
-- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
+- **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
+- **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
+- **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
+- **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
-🌐 Browser Integration
+ Browser Integration
-- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
-- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
-- 👤 **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
+- **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
+- **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
+- **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
 - 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
-- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
-- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
-- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
-- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
+- **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
+- **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
+- **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
+- **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
-🔎 Crawling & Scraping
+ Crawling & Scraping
-- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
+- **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
 - 🚀 **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction.
-- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
-- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
+- **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
+- **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
 - 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
-- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
-- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
-- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
-- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
-- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
-- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
+- **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
+- **Caching**: Cache data for improved speed and to avoid redundant fetches.
+- **Metadata Extraction**: Retrieve structured metadata from web pages.
+- **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
+- **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
+- **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
 🚀 Deployment
-- 🐳 **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
-- 🔑 **Secure Authentication**: Built-in JWT token authentication for API security.
-- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
-- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
-- ☁️ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
+- **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
+- **Secure Authentication**: Built-in JWT token authentication for API security.
+- **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
+- **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
+- **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
-🎯 Additional Features
+ Additional Features
-- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
-- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
+- **Stealth Mode**: Avoid bot detection by mimicking real users.
+- **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
 - 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
 - 🛡️ **Error Handling**: Robust error management for seamless execution.
-- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
-- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
-- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.
+- **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
+- **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
+- **Community Recognition**: Acknowledges contributors and pull requests for transparency.
 ## Try it Now!
-✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
+ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
-✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
+ Visit our [Documentation Website](https://docs.crawl4ai.com/)
-## Installation 🛠️
+## Installation
 Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
-🐍 Using pip
+ Using pip
 Choose the installation option that best fits your needs:
@@ -206,7 +206,7 @@ crawl4ai-setup # Setup the browser
 By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
-👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
+ **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
 1. Through the command line:
@@ -257,7 +257,7 @@ pip install -e ".[all]" # Install all optional features
-🐳 Docker Deployment
+ Docker Deployment
 > 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
@@ -318,12 +318,12 @@ For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4
-## 🔬 Advanced Usage Examples 🔬
+## Advanced Usage Examples
 You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
-📝 Heuristic Markdown Generation with Clean and Fit Markdown
+ Heuristic Markdown Generation with Clean and Fit Markdown
 ```python
 import asyncio
@@ -361,7 +361,7 @@ if __name__ == "__main__":
-🖥️ Executing JavaScript & Extract Structured Data without LLMs
+ Executing JavaScript & Extract Structured Data without LLMs
 ```python
 import asyncio
@@ -434,7 +434,7 @@ if __name__ == "__main__":
-📚 Extracting Structured Data with LLMs
+ Extracting Structured Data with LLMs
 ```python
 import os
@@ -517,11 +517,11 @@ async def test_news_crawl():
-## ✨ Recent Updates
+## Recent Updates
 ### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
-- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
+- ** Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
 ```python
 config = AdaptiveConfig(
     confidence_threshold=0.7, # Min confidence to stop crawling
@@ -539,7 +539,7 @@ async def test_news_crawl():
 # Crawler learns patterns and improves extraction over time
 ```
-- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
+- ** Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
 ```python
 scroll_config = VirtualScrollConfig(
     container_selector="[data-testid='feed']",
@@ -568,7 +568,7 @@ async def test_news_crawl():
 # Links ranked by relevance and quality
 ```
-- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
+- ** Async URL Seeder**: Discover thousands of URLs in seconds:
 ```python
 seeder = AsyncUrlSeeder(SeedingConfig(
     source="sitemap+cc",
@@ -580,7 +580,7 @@ async def test_news_crawl():
 urls = await seeder.discover("https://example.com")
 ```
-- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
+- ** Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
 Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
@@ -625,16 +625,16 @@ We use pre-releases to:
 For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
-## 📖 Documentation & Roadmap
+## Documentation & Roadmap
-> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
+> **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
 For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
 To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
-📈 Development TODOs + Development TODOs - [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction - [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction @@ -651,7 +651,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
-## ๐Ÿค Contributing +## Contributing We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information. @@ -659,7 +659,7 @@ I'll help modify the license section with badges. For the halftone effect, here' Here's the updated license section: -## ๐Ÿ“„ License & Attribution +## License & Attribution This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details. @@ -711,7 +711,7 @@ Add this line to your documentation: This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction. ``` -## ๐Ÿ“š Citation +## Citation If you use Crawl4AI in your research or project, please cite: @@ -733,7 +733,7 @@ UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Com GitHub. https://github.com/unclecode/crawl4ai ``` -## ๐Ÿ“ง Contact +## Contact For questions, suggestions, or feedback, feel free to reach out: @@ -741,11 +741,11 @@ For questions, suggestions, or feedback, feel free to reach out: - Twitter: [@unclecode](https://twitter.com/unclecode) - Website: [crawl4ai.com](https://crawl4ai.com) -Happy Crawling! ๐Ÿ•ธ๏ธ๐Ÿš€ +Happy Crawling! ๐Ÿš€ -## ๐Ÿ’– Support Crawl4AI +## Support Crawl4AI -> ๐ŸŽ‰ **Sponsorship Program Just Launched!** Be among the first 50 **Founding Sponsors** and get permanent recognition in our Hall of Fame! +> **Sponsorship Program Just Launched!** Be among the first 50 **Founding Sponsors** and get permanent recognition in our Hall of Fame! Crawl4AI is the #1 trending open-source web crawler with 51K+ stars. Your support ensures we stay independent, innovative, and free forever. @@ -756,20 +756,20 @@ Crawl4AI is the #1 trending open-source web crawler with 51K+ stars. 
Your suppor -### 🤝 Sponsorship Tiers +### Sponsorship Tiers -- **🌱 Believer ($5/mo)**: Join the movement for data democratization - **🚀 Builder ($50/mo)**: Get priority support and early feature access -- **💼 Growing Team ($500/mo)**: Bi-weekly syncs and optimization help -- **🏢 Data Infrastructure Partner ($2000/mo)**: Full partnership with dedicated support +- ** Believer ($5/mo)**: Join the movement for data democratization - **🚀 Builder ($50/mo)**: Get priority support and early feature access +- ** Growing Team ($500/mo)**: Bi-weekly syncs and optimization help +- ** Data Infrastructure Partner ($2000/mo)**: Full partnership with dedicated support **Why sponsor?** Every tier includes real benefits. No more rate-limited APIs. Own your data pipeline. Build data sovereignty together. [View All Tiers & Benefits →](https://github.com/sponsors/unclecode) -### 🏆 Our Sponsors +### Our Sponsors -#### 👑 Founding Sponsors (First 50) +#### Founding Sponsors (First 50) *Be part of history - [Become a Founding Sponsor](https://github.com/sponsors/unclecode)* @@ -779,14 +779,14 @@ Thank you to all our sponsors who make this project possible! -## 🗾 Mission +## Mission Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy. We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
-๐Ÿ”‘ Key Opportunities + Key Opportunities - **Data Capitalization**: Transform digital footprints into measurable, valuable assets. - **Authentic AI Data**: Provide AI systems with real human insights. diff --git a/README.md b/README.md index 733678be3..51353c2c4 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ #### ๐Ÿš€ Crawl4AI Cloud API โ€” Closed Beta (Launching Soon) Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions. -๐Ÿ‘‰ **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access** + **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access** _Weโ€™ll be onboarding in phases and working closely with early users. Limited slots._ @@ -37,20 +37,20 @@ Limited slots._ Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community. -[โœจ Check out latest update v0.9](#-recent-updates) +[ Check out latest update v0.9](#-recent-updates) -โœจ **New in v0.9**: Major secure-by-default release of the Docker API server. Auth is on by default, the server binds loopback unless given a token, and the request body is now an untrusted trust boundary. Breaking changes for the self-hosted server only; the pip library is unchanged. If you self-host the Docker API, read the [migration guide](https://github.com/unclecode/crawl4ai/blob/main/deploy/docker/MIGRATION.md) before upgrading. [Release notes โ†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.9.0.md) + **New in v0.9**: Major secure-by-default release of the Docker API server. Auth is on by default, the server binds loopback unless given a token, and the request body is now an untrusted trust boundary. Breaking changes for the self-hosted server only; the pip library is unchanged. 
If you self-host the Docker API, read the [migration guide](https://github.com/unclecode/crawl4ai/blob/main/deploy/docker/MIGRATION.md) before upgrading. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.9.0.md) -✨ Recent v0.8.7: Security-hardening release. Fixes critical Docker API vulnerabilities (RCE, SSRF, auth bypass, file write, XSS, hardcoded JWT secret), adds DomainMapper, and ships scraping, deep-crawl, and LLM fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.7.md) + Recent v0.8.7: Security-hardening release. Fixes critical Docker API vulnerabilities (RCE, SSRF, auth bypass, file write, XSS, hardcoded JWT secret), adds DomainMapper, and ships scraping, deep-crawl, and LLM fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.7.md) -✨ Recent v0.8.6: Security hotfix that replaced `litellm` with `unclecode-litellm` due to a PyPI supply chain compromise. + Recent v0.8.6: Security hotfix that replaced `litellm` with `unclecode-litellm` due to a PyPI supply chain compromise. -✨ Previous v0.8.0: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md) + Previous v0.8.0: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md) -✨ Previous v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates.
[Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md) + Previous v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
- ๐Ÿค“ My Personal Story + My Personal Story I grew up on an Amstrad, thanks to my dad, and never stopped building. In grad school I specialized in NLP and built crawlers for research. Thatโ€™s where I learned how much extraction matters. @@ -121,9 +121,9 @@ crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10 crwl https://www.example.com/products -q "Extract all product prices" ``` -## ๐Ÿ’– Support Crawl4AI +## Support Crawl4AI -> ๐ŸŽ‰ **Sponsorship Program Now Open!** After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for **startups** and **enterprises**. Be among the first 50 **Founding Sponsors** for permanent recognition in our Hall of Fame. +> **Sponsorship Program Now Open!** After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for **startups** and **enterprises**. Be among the first 50 **Founding Sponsors** for permanent recognition in our Hall of Fame. Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keeps it independent, innovative, and free for the community โ€” while giving you direct access to premium benefits. @@ -134,12 +134,12 @@ Crawl4AI is the #1 trending open-source web crawler on GitHub. 
Your support keep -### ๐Ÿค Sponsorship Tiers +### Sponsorship Tiers -- **๐ŸŒฑ Believer ($5/mo)** โ€” Join the movement for data democratization +- ** Believer ($5/mo)** โ€” Join the movement for data democratization - **๐Ÿš€ Builder ($50/mo)** โ€” Priority support & early access to features -- **๐Ÿ’ผ Growing Team ($500/mo)** โ€” Bi-weekly syncs & optimization help -- **๐Ÿข Data Infrastructure Partner ($2000/mo)** โ€” Full partnership with dedicated support +- ** Growing Team ($500/mo)** โ€” Bi-weekly syncs & optimization help +- ** Data Infrastructure Partner ($2000/mo)** โ€” Full partnership with dedicated support *Custom arrangements available - see [SPONSORS.md](SPONSORS.md) for details & contact* **Why sponsor?** @@ -148,96 +148,96 @@ No rate-limited APIs. No lock-in. Build and own your data pipeline with direct g [See All Tiers & Benefits โ†’](https://github.com/sponsors/unclecode) -## โœจ Features +## Features
-📝 Markdown Generation + Markdown Generation -- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting. -- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing. +- **Clean Markdown**: Generates clean, structured Markdown with accurate formatting. +- **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing. - 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations. -- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs. -- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content. +- **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs. +- **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
-📊 Structured Data Extraction + Structured Data Extraction - 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction. -- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing. -- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction. -- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors. -- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns. +- **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing. +- **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction. +- **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors. +- **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
-๐ŸŒ Browser Integration + Browser Integration -- ๐Ÿ–ฅ๏ธ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection. -- ๐Ÿ”„ **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction. -- ๐Ÿ‘ค **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings. +- **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection. +- **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction. +- **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings. - ๐Ÿ”’ **Session Management**: Preserve browser states and reuse them for multi-step crawling. -- ๐Ÿงฉ **Proxy Support**: Seamlessly connect to proxies with authentication for secure access. -- โš™๏ธ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups. -- ๐ŸŒ **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit. -- ๐Ÿ“ **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements. +- **Proxy Support**: Seamlessly connect to proxies with authentication for secure access. +- **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups. +- **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit. +- **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
-🔎 Crawling & Scraping + Crawling & Scraping -- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`. +- **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`. - 🚀 **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction. -- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis. -- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`). +- **Screenshots**: Capture page screenshots during crawling for debugging or analysis. +- **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`). - 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content. -- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs). -- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches. -- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages. -- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content. -- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading. -- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages. +- **Customizable Hooks**: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs). +- **Caching**: Cache data for improved speed and to avoid redundant fetches. +- **Metadata Extraction**: Retrieve structured metadata from web pages. +- **IFrame Content Extraction**: Seamless extraction from embedded iframe content. +- **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
+- **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
๐Ÿš€ Deployment -- ๐Ÿณ **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment. -- ๐Ÿ”‘ **Secure Authentication**: Built-in JWT token authentication for API security. -- ๐Ÿ”„ **API Gateway**: One-click deployment with secure token authentication for API-based workflows. -- ๐ŸŒ **Scalable Architecture**: Designed for mass-scale production and optimized server performance. -- โ˜๏ธ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms. +- **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment. +- **Secure Authentication**: Built-in JWT token authentication for API security. +- **API Gateway**: One-click deployment with secure token authentication for API-based workflows. +- **Scalable Architecture**: Designed for mass-scale production and optimized server performance. +- **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
-🎯 Additional Features + Additional Features -- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users. -- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata. +- **Stealth Mode**: Avoid bot detection by mimicking real users. +- **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata. - 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration. - 🛡️ **Error Handling**: Robust error management for seamless execution. -- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests. -- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage. -- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency. +- **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests. +- **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage. +- **Community Recognition**: Acknowledges contributors and pull requests for transparency.
## Try it Now! -✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing) + Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing) -✨ Visit our [Documentation Website](https://docs.crawl4ai.com/) + Visit our [Documentation Website](https://docs.crawl4ai.com/) -## Installation 🛠️ +## Installation Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
-🐍 Using pip + Using pip Choose the installation option that best fits your needs: @@ -252,7 +252,7 @@ crawl4ai-setup # Setup the browser By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling. -👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods: + **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods: 1. Through the command line: @@ -303,7 +303,7 @@ pip install -e ".[all]" # Install all optional features
-๐Ÿณ Docker Deployment + Docker Deployment > ๐Ÿš€ **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever. @@ -361,12 +361,12 @@ For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4 --- -## ๐Ÿ”ฌ Advanced Usage Examples ๐Ÿ”ฌ +## Advanced Usage Examples You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
-📝 Heuristic Markdown Generation with Clean and Fit Markdown + Heuristic Markdown Generation with Clean and Fit Markdown ```python import asyncio @@ -404,7 +404,7 @@ if __name__ == "__main__":
-🖥️ Executing JavaScript & Extract Structured Data without LLMs + Executing JavaScript & Extract Structured Data without LLMs ```python import asyncio @@ -477,7 +477,7 @@ if __name__ == "__main__":
-📚 Extracting Structured Data with LLMs + Extracting Structured Data with LLMs ```python import os @@ -562,9 +562,9 @@ async def test_news_crawl(): --- -> **💡 Tip:** Some websites may use **CAPTCHA** based verification mechanisms to prevent automated access. If your workflow encounters such challenges, you may optionally integrate a third-party CAPTCHA-handling service such as [CapSolver](https://www.capsolver.com/blog/Partners/crawl4ai-capsolver/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration). They support reCAPTCHA v2/v3, Cloudflare Turnstile, Challenge, AWS WAF, and more. Please ensure that your usage complies with the target website’s terms of service and applicable laws. +> ** Tip:** Some websites may use **CAPTCHA** based verification mechanisms to prevent automated access. If your workflow encounters such challenges, you may optionally integrate a third-party CAPTCHA-handling service such as [CapSolver](https://www.capsolver.com/blog/Partners/crawl4ai-capsolver/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration). They support reCAPTCHA v2/v3, Cloudflare Turnstile, Challenge, AWS WAF, and more. Please ensure that your usage complies with the target website’s terms of service and applicable laws. -## ✨ Recent Updates +## Recent Updates
Version 0.9.0 Release Highlights - Secure-by-Default Docker Server @@ -624,17 +624,17 @@ Our biggest release since v0.8.0. Anti-bot detection with proxy escalation, Shad ) ``` -- **๐ŸŒ‘ Shadow DOM Flattening**: +- ** Shadow DOM Flattening**: - Extract content hidden inside shadow DOM components ```python config = CrawlerRunConfig(flatten_shadow_dom=True) ``` -- **๐Ÿ›‘ Deep Crawl Cancellation**: +- ** Deep Crawl Cancellation**: - Stop long crawls gracefully with `cancel()` or `should_cancel` callback - Works with BFS, DFS, and BestFirst strategies -- **โš™๏ธ Config Defaults API**: +- ** Config Defaults API**: - `set_defaults()` / `get_defaults()` / `reset_defaults()` on BrowserConfig and CrawlerRunConfig - **๐Ÿ”’ Critical Security Fixes**: @@ -652,7 +652,7 @@ Our biggest release since v0.8.0. Anti-bot detection with proxy escalation, Shad This release introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments. -- **๐Ÿ”„ Deep Crawl Crash Recovery**: +- ** Deep Crawl Crash Recovery**: - `on_state_change` callback fires after each URL for real-time state persistence - `resume_state` parameter to continue from a saved checkpoint - JSON-serializable state for Redis/database storage @@ -667,7 +667,7 @@ This release introduces crash recovery for deep crawls, a new prefetch mode for ) ``` -- **โšก Prefetch Mode for Fast URL Discovery**: +- ** Prefetch Mode for Fast URL Discovery**: - `prefetch=True` skips markdown, extraction, and media processing - 5-10x faster than full processing - Perfect for two-phase crawling: discover first, process selectively @@ -691,7 +691,7 @@ This release introduces crash recovery for deep crawls, a new prefetch mode for This release focuses on stability with 11 bug fixes addressing issues reported by the community. No new features, but significant improvements to reliability. 
-- **๐Ÿณ Docker API Fixes**: +- ** Docker API Fixes**: - Fixed `ContentRelevanceFilter` deserialization in deep crawl requests (#1642) - Fixed `ProxyConfig` JSON serialization in `BrowserConfig.to_dict()` (#1629) - Fixed `.cache` folder permissions in Docker image (#1638) @@ -724,11 +724,11 @@ This release focuses on stability with 11 bug fixes addressing issues reported b - Fixed relative URL resolution after JavaScript redirects (#1268) - Fixed import statement formatting in extracted code (#1181) -- **๐Ÿ“ฆ Dependency Updates**: +- ** Dependency Updates**: - Replaced deprecated PyPDF2 with pypdf (#1412) - Pydantic v2 ConfigDict compatibility - no more deprecation warnings (#678) -- **๐Ÿง  AdaptiveCrawler**: +- ** AdaptiveCrawler**: - Fixed query expansion to actually use LLM instead of hardcoded mock data (#1621) [Full v0.7.8 Release Notes โ†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md) @@ -738,7 +738,7 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
Version 0.7.7 Release Highlights - The Self-Hosting & Monitoring Update -- **๐Ÿ“Š Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility +- ** Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility ```python # Access the monitoring dashboard # Visit: http://localhost:11235/dashboard @@ -751,7 +751,7 @@ This release focuses on stability with 11 bug fixes addressing issues reported b # - Error monitoring with full context ``` -- **๐Ÿ”Œ Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data +- ** Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data ```python import httpx @@ -769,12 +769,12 @@ This release focuses on stability with 11 bug fixes addressing issues reported b stats = await client.get("http://localhost:11235/monitor/endpoints/stats") ``` -- **โšก WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards -- **๐Ÿ”ฅ Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup -- **๐Ÿงน Janitor System**: Automatic resource management with event logging -- **๐ŸŽฎ Control Actions**: Manual browser management (kill, restart, cleanup) via API -- **๐Ÿ“ˆ Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration -- **๐Ÿ› Critical Bug Fixes**: +- ** WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards +- ** Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup +- ** Janitor System**: Automatic resource management with event logging +- ** Control Actions**: Manual browser management (kill, restart, cleanup) via API +- ** Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration +- ** Critical Bug Fixes**: - Fixed async LLM extraction blocking issue (#1055) - Enhanced DFS deep crawl 
strategy (#1607) - Fixed sitemap parsing in AsyncUrlSeeder (#1598) @@ -789,8 +789,8 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
Version 0.7.5 Release Highlights - The Docker Hooks & Security Update -- **๐Ÿ”ง Docker Hooks System**: Complete pipeline customization with user-provided Python functions at 8 key points -- **โœจ Function-Based Hooks API (NEW)**: Write hooks as regular Python functions with full IDE support: +- ** Docker Hooks System**: Complete pipeline customization with user-provided Python functions at 8 key points +- ** Function-Based Hooks API (NEW)**: Write hooks as regular Python functions with full IDE support: ```python from crawl4ai import hooks_to_string from crawl4ai.docker_client import Crawl4aiDockerClient @@ -827,8 +827,8 @@ This release focuses on stability with 11 bug fixes addressing issues reported b - **๐Ÿค– Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration - **๐Ÿ”’ HTTPS Preservation**: Secure internal link handling with `preserve_https_for_internal_links=True` -- **๐Ÿ Python 3.10+ Support**: Modern language features and enhanced performance -- **๐Ÿ› ๏ธ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration +- ** Python 3.10+ Support**: Modern language features and enhanced performance +- ** Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration [Full v0.7.5 Release Notes โ†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md) @@ -858,9 +858,9 @@ This release focuses on stability with 11 bug fixes addressing issues reported b print(f"Extracted table: {len(table['data'])} rows") ``` -- **โšก Dispatcher Bug Fix**: Fixed sequential processing bottleneck in arun_many for fast-completing tasks -- **๐Ÿงน Memory Management Refactor**: Consolidated memory utilities into main utils module for cleaner architecture -- **๐Ÿ”ง Browser Manager Fixes**: Resolved race conditions in concurrent page creation with thread-safe locking +- ** Dispatcher Bug Fix**: 
Fixed sequential processing bottleneck in arun_many for fast-completing tasks +- ** Memory Management Refactor**: Consolidated memory utilities into main utils module for cleaner architecture +- ** Browser Manager Fixes**: Resolved race conditions in concurrent page creation with thread-safe locking - **๐Ÿ”— Advanced URL Processing**: Better handling of raw:// URLs and base tag link resolution - **๐Ÿ›ก๏ธ Enhanced Proxy Support**: Flexible proxy configuration supporting both dict and string formats @@ -871,7 +871,7 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update -- **๐Ÿ•ต๏ธ Undetected Browser Support**: Bypass sophisticated bot detection systems: +- ** Undetected Browser Support**: Bypass sophisticated bot detection systems: ```python from crawl4ai import AsyncWebCrawler, BrowserConfig @@ -889,7 +889,7 @@ This release focuses on stability with 11 bug fixes addressing issues reported b # Successfully bypass Cloudflare, Akamai, and custom bot detection ``` -- **๐ŸŽจ Multi-URL Configuration**: Different strategies for different URL patterns in one batch: +- ** Multi-URL Configuration**: Different strategies for different URL patterns in one batch: ```python from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode @@ -915,7 +915,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode # Each URL gets the perfect configuration automatically ``` -- **๐Ÿง  Memory Monitoring**: Track and optimize memory usage during crawling: +- ** Memory Monitoring**: Track and optimize memory usage during crawling: ```python from crawl4ai.memory_utils import MemoryMonitor @@ -930,7 +930,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode # Get optimization recommendations ``` -- **๐Ÿ“Š Enhanced Table Extraction**: Direct DataFrame conversion from web tables: +- ** Enhanced Table Extraction**: Direct DataFrame conversion from web tables: ```python result = await crawler.arun("https://site-with-tables.com") @@ -942,8 +942,8 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode print(f"Table: {df.shape[0]} rows ร— {df.shape[1]} columns") ``` -- **๐Ÿ’ฐ GitHub Sponsors**: 4-tier sponsorship system for project sustainability -- **๐Ÿณ Docker LLM Flexibility**: Configure providers via environment variables +- ** GitHub Sponsors**: 4-tier sponsorship system for project sustainability +- ** Docker LLM Flexibility**: Configure providers via environment variables [Full v0.7.3 Release Notes 
→](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
@@ -952,7 +952,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
 Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
-- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
+- ** Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
 ```python
 config = AdaptiveConfig(
 confidence_threshold=0.7, # Min confidence to stop crawling
@@ -970,7 +970,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
 # Crawler learns patterns and improves extraction over time
 ```
-- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
+- ** Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
 ```python
 scroll_config = VirtualScrollConfig(
 container_selector="[data-testid='feed']",
@@ -999,7 +999,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
 # Links ranked by relevance and quality
 ```
-- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
+- ** Async URL Seeder**: Discover thousands of URLs in seconds:
 ```python
 seeder = AsyncUrlSeeder(SeedingConfig(
 source="sitemap+cc",
@@ -1011,7 +1011,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
 urls = await seeder.discover("https://example.com")
 ```
-- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
+- ** Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
 Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
@@ -1022,7 +1022,7 @@ Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blo
 Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
-📈 Version Numbers Explained
+ Version Numbers Explained
 Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
@@ -1061,16 +1061,16 @@ For production environments, we recommend using the stable version. For testing
-## 📖 Documentation & Roadmap
+## Documentation & Roadmap
-> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
+> **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
 For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
 To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
-📈 Development TODOs
+ Development TODOs
 - [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
 - [x] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
@@ -1087,7 +1087,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
-## 🤝 Contributing
+## Contributing
 We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
@@ -1095,7 +1095,7 @@ I'll help modify the license section with badges. For the halftone effect, here'
 Here's the updated license section:
-## 📄 License & Attribution
+## License & Attribution
 This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
@@ -1103,7 +1103,7 @@ This project is licensed under the Apache License 2.0, attribution is recommende
 When using Crawl4AI, you must include one of the following attribution methods:
-📈 1. Badge Attribution (Recommended)
+ 1. Badge Attribution (Recommended)
 Add one of these badges to your README, documentation, or website:
 | Theme | Badge |
@@ -1145,14 +1145,14 @@ HTML code for adding the badges:
-📖 2. Text Attribution
+ 2. Text Attribution
 Add this line to your documentation:
 ```
 This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
 ```
-## 📚 Citation
+## Citation
 If you use Crawl4AI in your research or project, please cite:
@@ -1174,7 +1174,7 @@ UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Com
 GitHub. https://github.com/unclecode/crawl4ai
 ```
-## 📧 Contact
+## Contact
 For questions, suggestions, or feedback, feel free to reach out:
@@ -1182,16 +1182,16 @@ For questions, suggestions, or feedback, feel free to reach out:
 - Twitter: [@unclecode](https://twitter.com/unclecode)
 - Website: [crawl4ai.com](https://crawl4ai.com)
-Happy Crawling! 🕸️🚀
+Happy Crawling! 🚀
-## 🗾 Mission
+## Mission
 Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
 We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
-🔑 Key Opportunities
+ Key Opportunities
 - **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
 - **Authentic AI Data**: Provide AI systems with real human insights.
@@ -1209,25 +1209,25 @@ We envision a future where AI is powered by real human knowledge, ensuring data
 For more details, see our [full mission statement](./MISSION.md).
-## 🌟 Current Sponsors
+## Current Sponsors
-### 🏢 Enterprise Sponsors & Partners
+### Enterprise Sponsors & Partners
 Our enterprise sponsors and technology partners help scale Crawl4AI to power production-grade data pipelines.
 | Company | About | Sponsorship Tier |
 |------|------|----------------------------|
-| Thor Data | Leveraging Thordata ensures seamless compatibility with any AI/ML workflows and data infrastructure, massively accessing web data with 99.9% uptime, backed by one-on-one customer support. | 🥈 Silver |
-| nstproxy | NstProxy is a trusted proxy provider with over 110M+ real residential IPs, city-level targeting, 99.99% uptime, and low pricing at $0.1/GB, it delivers unmatched stability, scale, and cost-efficiency. | 🥈 Silver |
-| Scrapeless | Scrapeless provides production-grade infrastructure for Crawling, Automation, and AI Agents, offering Scraping Browser, 4 Proxy Types and Universal Scraping API. | 🥈 Silver |
-| Capsolver | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥉 Bronze |
-| DataSync | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| 🥇 Gold |
-| Kidocode

KidoCode

| Kidocode is a hybrid technology and entrepreneurship school for kids aged 5โ€“18, offering both online and on-campus education. | ๐Ÿฅ‡ Gold | -| Aleph null | Singapore-based Aleph Null is Asiaโ€™s leading edtech hub, dedicated to student-centric, AI-driven educationโ€”empowering learners with the tools to thrive in a fast-changing world. | ๐Ÿฅ‡ Gold | +| Thor Data | Leveraging Thordata ensures seamless compatibility with any AI/ML workflows and data infrastructure, massively accessing web data with 99.9% uptime, backed by one-on-one customer support. | Silver | +| nstproxy | NstProxy is a trusted proxy provider with over 110M+ real residential IPs, city-level targeting, 99.99% uptime, and low pricing at $0.1/GB, it delivers unmatched stability, scale, and cost-efficiency. | Silver | +| Scrapeless | Scrapeless provides production-grade infrastructure for Crawling, Automation, and AI Agents, offering Scraping Browser, 4 Proxy Types and Universal Scraping API. | Silver | +| Capsolver | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | Bronze | +| DataSync | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| Gold | +| Kidocode

KidoCode

| Kidocode is a hybrid technology and entrepreneurship school for kids aged 5โ€“18, offering both online and on-campus education. | Gold | +| Aleph null | Singapore-based Aleph Null is Asiaโ€™s leading edtech hub, dedicated to student-centric, AI-driven educationโ€”empowering learners with the tools to thrive in a fast-changing world. | Gold | -### ๐Ÿง‘โ€๐Ÿค Individual Sponsors +### โ€ Individual Sponsors A heartfelt thanks to our individual supporters! Every contribution helps us keep our opensource mission alive and thriving! diff --git a/ROADMAP.md b/ROADMAP.md index 0fd784c13..b55dc40a7 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -51,7 +51,7 @@ graph TD Crawl4AI is evolving to provide more intelligent, efficient, and versatile web crawling capabilities. This roadmap outlines the key developments and features planned for the project, organized into strategic sections that build upon our current foundation. -## 1. Advanced Crawling Systems ๐Ÿ”ง +## 1. Advanced Crawling Systems This section introduces three powerful crawling systems that extend Crawl4AI's capabilities from basic web crawling to intelligent, purpose-driven data extraction. @@ -163,7 +163,7 @@ async with AsyncWebCrawler() as crawler: print("Success Rate:", result.success_rate) ``` -# Section 2: Specialized Features ๐Ÿ› ๏ธ +# Section 2: Specialized Features This section introduces specialized tools and features that enhance Crawl4AI's capabilities for specific use cases and data extraction needs. @@ -308,15 +308,15 @@ async with AsyncWebCrawler() as crawler: Each of these specialized features builds upon Crawl4AI's core functionality while providing targeted solutions for specific use cases. They can be used independently or combined for more complex data extraction and processing needs. -# Section 3: Development Tools ๐Ÿ”ง +# Section 3: Development Tools This section covers tools designed to enhance the development experience, monitoring, and deployment of Crawl4AI applications. 
-### 3.1 Crawl4AI Playground ๐ŸŽฎ +### 3.1 Crawl4AI Playground The Crawl4AI Playground is an interactive web-based development environment that simplifies web scraping experimentation, development, and deployment. With its intuitive interface and AI-powered assistance, users can quickly prototype, test, and deploy web scraping solutions. -#### Key Features ๐ŸŒŸ +#### Key Features ##### Visual Strategy Builder - Interactive point-and-click interface for building extraction strategies @@ -441,7 +441,7 @@ print(f"Monitor URL: {deployment.monitor_url}") These development tools work together to provide a comprehensive environment for developing, testing, monitoring, and deploying Crawl4AI applications. The Playground helps users experiment and generate optimal configurations, the Performance Monitor ensures smooth operation, and the Cloud Integration tools simplify deployment and scaling. -# Section 4: Community & Growth ๐ŸŒฑ +# Section 4: Community & Growth This section outlines initiatives designed to build and support the Crawl4AI community, provide educational resources, and ensure sustainable project growth. diff --git a/SPONSORS.md b/SPONSORS.md index 6503cdb76..4f3edcaf5 100644 --- a/SPONSORS.md +++ b/SPONSORS.md @@ -1,16 +1,16 @@ -# ๐Ÿ’– Sponsors & Supporters +# Sponsors & Supporters Thank you to everyone supporting Crawl4AI! Your sponsorship helps keep this project open-source and actively maintained. -## ๐Ÿ‘‘ Founding Sponsors +## Founding Sponsors *The first 50 sponsors who believed in our vision - permanently recognized* -๐ŸŽ‰ **Become a Founding Sponsor!** Only [X/50] spots remaining! [Join now โ†’](https://github.com/sponsors/unclecode) + **Become a Founding Sponsor!** Only [X/50] spots remaining! 
[Join now โ†’](https://github.com/sponsors/unclecode) --- -## ๐Ÿข Data Infrastructure Partners ($2000/month) +## Data Infrastructure Partners ($2000/month) *These organizations are building their data sovereignty with Crawl4AI at the core* @@ -18,7 +18,7 @@ Thank you to everyone supporting Crawl4AI! Your sponsorship helps keep this proj --- -## ๐Ÿ’ผ Growing Teams ($500/month) +## Growing Teams ($500/month) *Teams scaling their data extraction with Crawl4AI* @@ -34,7 +34,7 @@ Thank you to everyone supporting Crawl4AI! Your sponsorship helps keep this proj --- -## ๐ŸŒฑ Believers ($5/month) +## Believers ($5/month) *The community supporting data democratization* @@ -42,15 +42,15 @@ Thank you to everyone supporting Crawl4AI! Your sponsorship helps keep this proj --- -## ๐Ÿค Want to Sponsor? +## Want to Sponsor? Crawl4AI is the #1 trending open-source web crawler. We're building the future of data extraction - where organizations own their data pipelines instead of relying on rate-limited APIs. ### Available Sponsorship Tiers: -- **๐ŸŒฑ Believer** ($5/mo) - Support the movement +- ** Believer** ($5/mo) - Support the movement - **๐Ÿš€ Builder** ($50/mo) - Priority support & early access -- **๐Ÿ’ผ Growing Team** ($500/mo) - Bi-weekly syncs & optimization -- **๐Ÿข Data Infrastructure Partner** ($2000/mo) - Full partnership & dedicated support +- ** Growing Team** ($500/mo) - Bi-weekly syncs & optimization +- ** Data Infrastructure Partner** ($2000/mo) - Full partnership & dedicated support [View all tiers and benefits โ†’](https://github.com/sponsors/unclecode) @@ -58,7 +58,7 @@ Crawl4AI is the #1 trending open-source web crawler. We're building the future o Building data extraction at scale? Need dedicated support or infrastructure? Let's talk about a custom partnership. 
-๐Ÿ“ง Contact: [hello@crawl4ai.com](mailto:hello@crawl4ai.com) | ๐Ÿ“… [Schedule a call](https://calendar.app.google/rEpvi2UBgUQjWHfJ9) + Contact: [hello@crawl4ai.com](mailto:hello@crawl4ai.com) | [Schedule a call](https://calendar.app.google/rEpvi2UBgUQjWHfJ9) --- diff --git a/crawl4ai/adaptive_crawler copy.py b/crawl4ai/adaptive_crawler copy.py deleted file mode 100644 index 294a292d4..000000000 --- a/crawl4ai/adaptive_crawler copy.py +++ /dev/null @@ -1,1847 +0,0 @@ -""" -Adaptive Web Crawler for Crawl4AI - -This module implements adaptive information foraging for efficient web crawling. -It determines when sufficient information has been gathered to answer a query, -avoiding unnecessary crawls while ensuring comprehensive coverage. -""" - -from abc import ABC, abstractmethod -from typing import Dict, List, Optional, Set, Tuple, Any, Union -from dataclasses import dataclass, field -import asyncio -import pickle -import os -import json -import math -from collections import defaultdict, Counter -import re -from pathlib import Path - -from crawl4ai.async_webcrawler import AsyncWebCrawler -from crawl4ai.async_configs import CrawlerRunConfig, LinkPreviewConfig -from crawl4ai.models import Link, CrawlResult - - -@dataclass -class CrawlState: - """Tracks the current state of adaptive crawling""" - crawled_urls: Set[str] = field(default_factory=set) - knowledge_base: List[CrawlResult] = field(default_factory=list) - pending_links: List[Link] = field(default_factory=list) - query: str = "" - metrics: Dict[str, float] = field(default_factory=dict) - - # Statistical tracking - term_frequencies: Dict[str, int] = field(default_factory=lambda: defaultdict(int)) - document_frequencies: Dict[str, int] = field(default_factory=lambda: defaultdict(int)) - documents_with_terms: Dict[str, Set[int]] = field(default_factory=lambda: defaultdict(set)) - total_documents: int = 0 - - # History tracking for saturation - new_terms_history: List[int] = field(default_factory=list) - 
crawl_order: List[str] = field(default_factory=list) - - # Embedding-specific tracking (only if strategy is embedding) - kb_embeddings: Optional[Any] = None # Will be numpy array - query_embeddings: Optional[Any] = None # Will be numpy array - expanded_queries: List[str] = field(default_factory=list) - coverage_shape: Optional[Any] = None # Alpha shape - semantic_gaps: List[Tuple[List[float], float]] = field(default_factory=list) # Serializable - embedding_model: str = "" - - def save(self, path: Union[str, Path]): - """Save state to disk for persistence""" - path = Path(path) - path.parent.mkdir(parents=True, exist_ok=True) - - # Convert CrawlResult objects to dicts for serialization - state_dict = { - 'crawled_urls': list(self.crawled_urls), - 'knowledge_base': [self._crawl_result_to_dict(cr) for cr in self.knowledge_base], - 'pending_links': [link.model_dump() for link in self.pending_links], - 'query': self.query, - 'metrics': self.metrics, - 'term_frequencies': dict(self.term_frequencies), - 'document_frequencies': dict(self.document_frequencies), - 'documents_with_terms': {k: list(v) for k, v in self.documents_with_terms.items()}, - 'total_documents': self.total_documents, - 'new_terms_history': self.new_terms_history, - 'crawl_order': self.crawl_order, - # Embedding-specific fields (convert numpy arrays to lists for JSON) - 'kb_embeddings': self.kb_embeddings.tolist() if self.kb_embeddings is not None else None, - 'query_embeddings': self.query_embeddings.tolist() if self.query_embeddings is not None else None, - 'expanded_queries': self.expanded_queries, - 'semantic_gaps': self.semantic_gaps, - 'embedding_model': self.embedding_model - } - - with open(path, 'w') as f: - json.dump(state_dict, f, indent=2) - - @classmethod - def load(cls, path: Union[str, Path]) -> 'CrawlState': - """Load state from disk""" - path = Path(path) - with open(path, 'r') as f: - state_dict = json.load(f) - - state = cls() - state.crawled_urls = set(state_dict['crawled_urls']) - 
state.knowledge_base = [cls._dict_to_crawl_result(d) for d in state_dict['knowledge_base']] - state.pending_links = [Link(**link_dict) for link_dict in state_dict['pending_links']] - state.query = state_dict['query'] - state.metrics = state_dict['metrics'] - state.term_frequencies = defaultdict(int, state_dict['term_frequencies']) - state.document_frequencies = defaultdict(int, state_dict['document_frequencies']) - state.documents_with_terms = defaultdict(set, {k: set(v) for k, v in state_dict['documents_with_terms'].items()}) - state.total_documents = state_dict['total_documents'] - state.new_terms_history = state_dict['new_terms_history'] - state.crawl_order = state_dict['crawl_order'] - - # Load embedding-specific fields (convert lists back to numpy arrays) - import numpy as np - state.kb_embeddings = np.array(state_dict['kb_embeddings']) if state_dict.get('kb_embeddings') is not None else None - state.query_embeddings = np.array(state_dict['query_embeddings']) if state_dict.get('query_embeddings') is not None else None - state.expanded_queries = state_dict.get('expanded_queries', []) - state.semantic_gaps = state_dict.get('semantic_gaps', []) - state.embedding_model = state_dict.get('embedding_model', '') - - return state - - @staticmethod - def _crawl_result_to_dict(cr: CrawlResult) -> Dict: - """Convert CrawlResult to serializable dict""" - # Extract markdown content safely - markdown_content = "" - if hasattr(cr, 'markdown') and cr.markdown: - if hasattr(cr.markdown, 'raw_markdown'): - markdown_content = cr.markdown.raw_markdown - else: - markdown_content = str(cr.markdown) - - return { - 'url': cr.url, - 'content': markdown_content, - 'links': cr.links if hasattr(cr, 'links') else {}, - 'metadata': cr.metadata if hasattr(cr, 'metadata') else {} - } - - @staticmethod - def _dict_to_crawl_result(d: Dict): - """Convert dict back to CrawlResult""" - # Create a mock object that has the minimal interface we need - class MockMarkdown: - def __init__(self, 
content): - self.raw_markdown = content - - class MockCrawlResult: - def __init__(self, url, content, links, metadata): - self.url = url - self.markdown = MockMarkdown(content) - self.links = links - self.metadata = metadata - - return MockCrawlResult( - url=d['url'], - content=d.get('content', ''), - links=d.get('links', {}), - metadata=d.get('metadata', {}) - ) - - -@dataclass -class AdaptiveConfig: - """Configuration for adaptive crawling""" - confidence_threshold: float = 0.7 - max_depth: int = 5 - max_pages: int = 20 - top_k_links: int = 3 - min_gain_threshold: float = 0.1 - strategy: str = "statistical" # statistical, embedding, llm - - # Advanced parameters - saturation_threshold: float = 0.8 - consistency_threshold: float = 0.7 - coverage_weight: float = 0.4 - consistency_weight: float = 0.3 - saturation_weight: float = 0.3 - - # Link scoring parameters - relevance_weight: float = 0.5 - novelty_weight: float = 0.3 - authority_weight: float = 0.2 - - # Persistence - save_state: bool = False - state_path: Optional[str] = None - - # Embedding strategy parameters - embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2" - embedding_llm_config: Optional[Dict] = None # Separate config for embeddings - n_query_variations: int = 10 - coverage_threshold: float = 0.85 - alpha_shape_alpha: float = 0.5 - - # Embedding confidence calculation parameters - embedding_coverage_radius: float = 0.2 # Distance threshold for "covered" query points - # Example: With radius=0.2, a query point is considered covered if ANY document - # is within cosine distance 0.2 (very similar). Smaller = stricter coverage requirement - - embedding_k_exp: float = 3.0 # Exponential decay factor for distance-to-score mapping - # Example: score = exp(-k_exp * distance). With k_exp=1, distance 0.2 โ†’ score 0.82, - # distance 0.5 โ†’ score 0.61. 
Higher k_exp = steeper decay = more emphasis on very close matches - - embedding_nearest_weight: float = 0.7 # Weight for nearest neighbor in hybrid scoring - embedding_top_k_weight: float = 0.3 # Weight for top-k average in hybrid scoring - # Example: If nearest doc has score 0.9 and top-3 avg is 0.6, final = 0.7*0.9 + 0.3*0.6 = 0.81 - # Higher nearest_weight = more focus on best match vs neighborhood density - - # Embedding link selection parameters - embedding_overlap_threshold: float = 0.85 # Similarity threshold for penalizing redundant links - # Example: Links with >0.85 similarity to existing KB get penalized to avoid redundancy - # Lower = more aggressive deduplication, Higher = allow more similar content - - # Embedding stopping criteria parameters - embedding_min_relative_improvement: float = 0.1 # Minimum relative improvement to continue - # Example: If confidence is 0.6, need improvement > 0.06 per batch to continue crawling - # Lower = more patient crawling, Higher = stop earlier when progress slows - - embedding_validation_min_score: float = 0.4 # Minimum validation score to trust convergence - # Example: Even if learning converged, keep crawling if validation score < 0.4 - # This prevents premature stopping when we haven't truly covered the query space - - # Quality confidence mapping parameters (for display to user) - embedding_quality_min_confidence: float = 0.7 # Minimum confidence for validated systems - embedding_quality_max_confidence: float = 0.95 # Maximum realistic confidence - embedding_quality_scale_factor: float = 0.833 # Scaling factor for confidence mapping - # Example: Validated system with learning_score=0.5 โ†’ confidence = 0.7 + (0.5-0.4)*0.833 = 0.78 - # These control how internal scores map to user-friendly confidence percentages - - def validate(self): - """Validate configuration parameters""" - assert 0 <= self.confidence_threshold <= 1, "confidence_threshold must be between 0 and 1" - assert self.max_depth > 0, "max_depth must 
be positive" - assert self.max_pages > 0, "max_pages must be positive" - assert self.top_k_links > 0, "top_k_links must be positive" - assert 0 <= self.min_gain_threshold <= 1, "min_gain_threshold must be between 0 and 1" - - # Check weights sum to 1 - weight_sum = self.coverage_weight + self.consistency_weight + self.saturation_weight - assert abs(weight_sum - 1.0) < 0.001, f"Coverage weights must sum to 1, got {weight_sum}" - - weight_sum = self.relevance_weight + self.novelty_weight + self.authority_weight - assert abs(weight_sum - 1.0) < 0.001, f"Link scoring weights must sum to 1, got {weight_sum}" - - # Validate embedding parameters - assert 0 < self.embedding_coverage_radius < 1, "embedding_coverage_radius must be between 0 and 1" - assert self.embedding_k_exp > 0, "embedding_k_exp must be positive" - assert 0 <= self.embedding_nearest_weight <= 1, "embedding_nearest_weight must be between 0 and 1" - assert 0 <= self.embedding_top_k_weight <= 1, "embedding_top_k_weight must be between 0 and 1" - assert abs(self.embedding_nearest_weight + self.embedding_top_k_weight - 1.0) < 0.001, "Embedding weights must sum to 1" - assert 0 <= self.embedding_overlap_threshold <= 1, "embedding_overlap_threshold must be between 0 and 1" - assert 0 < self.embedding_min_relative_improvement < 1, "embedding_min_relative_improvement must be between 0 and 1" - assert 0 <= self.embedding_validation_min_score <= 1, "embedding_validation_min_score must be between 0 and 1" - assert 0 <= self.embedding_quality_min_confidence <= 1, "embedding_quality_min_confidence must be between 0 and 1" - assert 0 <= self.embedding_quality_max_confidence <= 1, "embedding_quality_max_confidence must be between 0 and 1" - assert self.embedding_quality_scale_factor > 0, "embedding_quality_scale_factor must be positive" - - -class CrawlStrategy(ABC): - """Abstract base class for crawling strategies""" - - @abstractmethod - async def calculate_confidence(self, state: CrawlState) -> float: - """Calculate 
overall confidence that we have sufficient information""" - pass - - @abstractmethod - async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]: - """Rank pending links by expected information gain""" - pass - - @abstractmethod - async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool: - """Determine if crawling should stop""" - pass - - @abstractmethod - async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None: - """Update state with new crawl results""" - pass - - -class StatisticalStrategy(CrawlStrategy): - """Pure statistical approach - no LLM, no embeddings""" - - def __init__(self): - self.idf_cache = {} - self.bm25_k1 = 1.2 # BM25 parameter - self.bm25_b = 0.75 # BM25 parameter - - async def calculate_confidence(self, state: CrawlState) -> float: - """Calculate confidence using coverage, consistency, and saturation""" - if not state.knowledge_base: - return 0.0 - - coverage = self._calculate_coverage(state) - consistency = self._calculate_consistency(state) - saturation = self._calculate_saturation(state) - - # Store individual metrics - state.metrics['coverage'] = coverage - state.metrics['consistency'] = consistency - state.metrics['saturation'] = saturation - - # Weighted combination (weights from config not accessible here, using defaults) - confidence = 0.4 * coverage + 0.3 * consistency + 0.3 * saturation - - return confidence - - def _calculate_coverage(self, state: CrawlState) -> float: - """Coverage scoring - measures query term presence across knowledge base - - Returns a score between 0 and 1, where: - - 0 means no query terms found - - 1 means excellent coverage of all query terms - """ - if not state.query or state.total_documents == 0: - return 0.0 - - query_terms = self._tokenize(state.query.lower()) - if not query_terms: - return 0.0 - - term_scores = [] - max_tf = max(state.term_frequencies.values()) if state.term_frequencies else 1 - - for term in 
query_terms: - tf = state.term_frequencies.get(term, 0) - df = state.document_frequencies.get(term, 0) - - if df > 0: - # Document coverage: what fraction of docs contain this term - doc_coverage = df / state.total_documents - - # Frequency signal: normalized log frequency - freq_signal = math.log(1 + tf) / math.log(1 + max_tf) if max_tf > 0 else 0 - - # Combined score: document coverage with frequency boost - term_score = doc_coverage * (1 + 0.5 * freq_signal) - term_scores.append(term_score) - else: - term_scores.append(0.0) - - # Average across all query terms - coverage = sum(term_scores) / len(term_scores) - - # Apply square root curve to make score more intuitive - # This helps differentiate between partial and good coverage - return min(1.0, math.sqrt(coverage)) - - def _calculate_consistency(self, state: CrawlState) -> float: - """Information overlap between pages - high overlap suggests coherent topic coverage""" - if len(state.knowledge_base) < 2: - return 1.0 # Single or no documents are perfectly consistent - - # Calculate pairwise term overlap - overlaps = [] - - for i in range(len(state.knowledge_base)): - for j in range(i + 1, len(state.knowledge_base)): - # Get terms from both documents - terms_i = set(self._get_document_terms(state.knowledge_base[i])) - terms_j = set(self._get_document_terms(state.knowledge_base[j])) - - if terms_i and terms_j: - # Jaccard similarity - overlap = len(terms_i & terms_j) / len(terms_i | terms_j) - overlaps.append(overlap) - - if overlaps: - # Average overlap as consistency measure - consistency = sum(overlaps) / len(overlaps) - else: - consistency = 0.0 - - return consistency - - def _calculate_saturation(self, state: CrawlState) -> float: - """Diminishing returns indicator - are we still discovering new information?""" - if not state.new_terms_history: - return 0.0 - - if len(state.new_terms_history) < 2: - return 0.0 # Not enough history - - # Calculate rate of new term discovery - recent_rate = 
state.new_terms_history[-1] if state.new_terms_history[-1] > 0 else 1 - initial_rate = state.new_terms_history[0] if state.new_terms_history[0] > 0 else 1 - - # Saturation increases as rate decreases - saturation = 1 - (recent_rate / initial_rate) - - return max(0.0, min(saturation, 1.0)) - - async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]: - """Rank links by expected information gain""" - scored_links = [] - - for link in state.pending_links: - # Skip already crawled URLs - if link.href in state.crawled_urls: - continue - - # Calculate component scores - relevance = self._calculate_relevance(link, state) - novelty = self._calculate_novelty(link, state) - authority = 1.0 - # authority = self._calculate_authority(link) - - # Combined score - score = (config.relevance_weight * relevance + - config.novelty_weight * novelty + - config.authority_weight * authority) - - scored_links.append((link, score)) - - # Sort by score descending - scored_links.sort(key=lambda x: x[1], reverse=True) - - return scored_links - - def _calculate_relevance(self, link: Link, state: CrawlState) -> float: - """BM25 relevance score between link preview and query""" - if not state.query or not link: - return 0.0 - - # Combine available text from link - link_text = ' '.join(filter(None, [ - link.text or '', - link.title or '', - link.head_data.get('meta', {}).get('title', '') if link.head_data else '', - link.head_data.get('meta', {}).get('description', '') if link.head_data else '', - link.head_data.get('meta', {}).get('keywords', '') if link.head_data else '' - ])).lower() - - if not link_text: - return 0.0 - - # Use contextual score if available (from BM25 scoring during crawl) - # if link.contextual_score is not None: - if link.contextual_score and link.contextual_score > 0: - return link.contextual_score - - # Otherwise, calculate simple term overlap - query_terms = set(self._tokenize(state.query.lower())) - link_terms = 
set(self._tokenize(link_text)) - - if not query_terms: - return 0.0 - - overlap = len(query_terms & link_terms) / len(query_terms) - return overlap - - def _calculate_novelty(self, link: Link, state: CrawlState) -> float: - """Estimate how much new information this link might provide""" - if not state.knowledge_base: - return 1.0 # First links are maximally novel - - # Get terms from link preview - link_text = ' '.join(filter(None, [ - link.text or '', - link.title or '', - link.head_data.get('title', '') if link.head_data else '', - link.head_data.get('description', '') if link.head_data else '', - link.head_data.get('keywords', '') if link.head_data else '' - ])).lower() - - link_terms = set(self._tokenize(link_text)) - if not link_terms: - return 0.5 # Unknown novelty - - # Calculate what percentage of link terms are new - existing_terms = set(state.term_frequencies.keys()) - new_terms = link_terms - existing_terms - - novelty = len(new_terms) / len(link_terms) if link_terms else 0.0 - - return novelty - - def _calculate_authority(self, link: Link) -> float: - """Simple authority score based on URL structure and link attributes""" - score = 0.5 # Base score - - if not link.href: - return 0.0 - - url = link.href.lower() - - # Positive indicators - if '/docs/' in url or '/documentation/' in url: - score += 0.2 - if '/api/' in url or '/reference/' in url: - score += 0.2 - if '/guide/' in url or '/tutorial/' in url: - score += 0.1 - - # Check for file extensions - if url.endswith('.pdf'): - score += 0.1 - elif url.endswith(('.jpg', '.png', '.gif')): - score -= 0.3 # Reduce score for images - - # Use intrinsic score if available - if link.intrinsic_score is not None: - score = 0.7 * score + 0.3 * link.intrinsic_score - - return min(score, 1.0) - - async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool: - """Determine if crawling should stop""" - # Check confidence threshold - confidence = state.metrics.get('confidence', 0.0) - if confidence >= 
config.confidence_threshold: - return True - - # Check resource limits - if len(state.crawled_urls) >= config.max_pages: - return True - - # Check if we have any links left - if not state.pending_links: - return True - - # Check saturation - if state.metrics.get('saturation', 0.0) >= config.saturation_threshold: - return True - - return False - - async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None: - """Update state with new crawl results""" - for result in new_results: - # Track new terms - old_term_count = len(state.term_frequencies) - - # Extract and process content - try multiple fields - try: - content = result.markdown.raw_markdown - except AttributeError: - print(f"Warning: CrawlResult {result.url} has no markdown content") - content = "" - # content = "" - # if hasattr(result, 'extracted_content') and result.extracted_content: - # content = result.extracted_content - # elif hasattr(result, 'markdown') and result.markdown: - # content = result.markdown.raw_markdown - # elif hasattr(result, 'cleaned_html') and result.cleaned_html: - # content = result.cleaned_html - # elif hasattr(result, 'html') and result.html: - # # Use raw HTML as last resort - # content = result.html - - - terms = self._tokenize(content.lower()) - - # Update term frequencies - term_set = set() - for term in terms: - state.term_frequencies[term] += 1 - term_set.add(term) - - # Update document frequencies - doc_id = state.total_documents - for term in term_set: - if term not in state.documents_with_terms[term]: - state.document_frequencies[term] += 1 - state.documents_with_terms[term].add(doc_id) - - # Track new terms discovered - new_term_count = len(state.term_frequencies) - new_terms = new_term_count - old_term_count - state.new_terms_history.append(new_terms) - - # Update document count - state.total_documents += 1 - - # Add to crawl order - state.crawl_order.append(result.url) - - def _tokenize(self, text: str) -> List[str]: - """Simple tokenization 
- can be enhanced""" - # Remove punctuation and split - text = re.sub(r'[^\w\s]', ' ', text) - tokens = text.split() - - # Filter short tokens and stop words (basic) - tokens = [t for t in tokens if len(t) > 2] - - return tokens - - def _get_document_terms(self, crawl_result: CrawlResult) -> List[str]: - """Extract terms from a crawl result""" - content = crawl_result.markdown.raw_markdown or "" - return self._tokenize(content.lower()) - - -class EmbeddingStrategy(CrawlStrategy): - """Embedding-based adaptive crawling using semantic space coverage""" - - def __init__(self, embedding_model: str = None, llm_config: Dict = None): - self.embedding_model = embedding_model or "sentence-transformers/all-MiniLM-L6-v2" - self.llm_config = llm_config - self._embedding_cache = {} - self._link_embedding_cache = {} # Cache for link embeddings - self._validation_passed = False # Track if validation passed - - # Performance optimization caches - self._distance_matrix_cache = None # Cache for query-KB distances - self._kb_embeddings_hash = None # Track KB changes - self._validation_embeddings_cache = None # Cache validation query embeddings - self._kb_similarity_threshold = 0.95 # Threshold for deduplication - - async def _get_embeddings(self, texts: List[str]) -> Any: - """Get embeddings using configured method""" - from .utils import get_text_embeddings - embedding_llm_config = { - 'provider': 'openai/text-embedding-3-small', - 'api_token': os.getenv('OPENAI_API_KEY') - } - return await get_text_embeddings( - texts, - embedding_llm_config, - self.embedding_model - ) - - def _compute_distance_matrix(self, query_embeddings: Any, kb_embeddings: Any) -> Any: - """Compute distance matrix using vectorized operations""" - import numpy as np - - if kb_embeddings is None or len(kb_embeddings) == 0: - return None - - # Ensure proper shapes - if len(query_embeddings.shape) == 1: - query_embeddings = query_embeddings.reshape(1, -1) - if len(kb_embeddings.shape) == 1: - kb_embeddings = 
kb_embeddings.reshape(1, -1) - - # Vectorized cosine distance: 1 - cosine_similarity - # Normalize vectors - query_norm = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True) - kb_norm = kb_embeddings / np.linalg.norm(kb_embeddings, axis=1, keepdims=True) - - # Compute cosine similarity matrix - similarity_matrix = np.dot(query_norm, kb_norm.T) - - # Convert to distance - distance_matrix = 1 - similarity_matrix - - return distance_matrix - - def _get_cached_distance_matrix(self, query_embeddings: Any, kb_embeddings: Any) -> Any: - """Get distance matrix with caching""" - import numpy as np - - if kb_embeddings is None or len(kb_embeddings) == 0: - return None - - # Check if KB has changed - kb_hash = hash(kb_embeddings.tobytes()) if kb_embeddings is not None else None - - if (self._distance_matrix_cache is None or - kb_hash != self._kb_embeddings_hash): - # Recompute matrix - self._distance_matrix_cache = self._compute_distance_matrix(query_embeddings, kb_embeddings) - self._kb_embeddings_hash = kb_hash - - return self._distance_matrix_cache - - async def map_query_semantic_space(self, query: str, n_synthetic: int = 10) -> Any: - """Generate a point cloud representing the semantic neighborhood of the query""" - from .utils import perform_completion_with_backoff - - # Generate more variations than needed for train/val split - n_total = int(n_synthetic * 1.3) # Generate 30% more for validation - - # Generate variations using LLM - prompt = f"""Generate {n_total} variations of this query that explore different aspects: '{query}' - - These should be queries a user might ask when looking for similar information. - Include different phrasings, related concepts, and specific aspects. 
- - Return as a JSON array of strings.""" - - # Use the LLM for query generation - provider = self.llm_config.get('provider', 'openai/gpt-4o-mini') if self.llm_config else 'openai/gpt-4o-mini' - api_token = self.llm_config.get('api_token') if self.llm_config else None - - # response = perform_completion_with_backoff( - # provider=provider, - # prompt_with_variables=prompt, - # api_token=api_token, - # json_response=True - # ) - - # variations = json.loads(response.choices[0].message.content) - - - # # Mock data with more variations for split - variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']} - - - variations = {'queries': [ - 'How do async and await work with coroutines in Python?', - 'What is the role of event loops in asynchronous programming?', - 'Can you explain the differences between async/await and traditional callback methods?', - 'How do coroutines interact with event loops in JavaScript?', - 'What are the benefits of using async await over promises in Node.js?', - # 'How to manage multiple coroutines with an event loop?', - # 'What are some common pitfalls when using async await with coroutines?', - # 'How do different programming languages implement async await and event loops?', - # 'What happens when an async function is called without await?', - # 'How does the event loop handle blocking operations?', - 'Can you nest async functions and how does that affect the event loop?', - 'What is the performance impact of using async/await?' 
- ]} - - # Split into train and validation - # all_queries = [query] + variations['queries'] - - # Randomly shuffle for proper train/val split (keeping original query in training) - import random - - # Keep original query always in training - other_queries = variations['queries'].copy() - random.shuffle(other_queries) - - # Split: 80% for training, 20% for validation - n_validation = max(2, int(len(other_queries) * 0.2)) # At least 2 for validation - val_queries = other_queries[-n_validation:] - train_queries = [query] + other_queries[:-n_validation] - - # Embed only training queries for now (faster) - train_embeddings = await self._get_embeddings(train_queries) - - # Store validation queries for later (don't embed yet to save time) - self._validation_queries = val_queries - - return train_embeddings, train_queries - - def compute_coverage_shape(self, query_points: Any, alpha: float = 0.5): - """Find the minimal shape that covers all query points using alpha shape""" - try: - import numpy as np - - if len(query_points) < 3: - return None - - # For high-dimensional embeddings (e.g., 384-dim, 768-dim), - # alpha shapes require exponentially more points than available. 
- # Instead, use a statistical coverage model - query_points = np.array(query_points) - - # Store coverage as centroid + radius model - coverage = { - 'center': np.mean(query_points, axis=0), - 'std': np.std(query_points, axis=0), - 'points': query_points, - 'radius': np.max(np.linalg.norm(query_points - np.mean(query_points, axis=0), axis=1)) - } - return coverage - except Exception: - # Fallback if computation fails - return None - - def _sample_boundary_points(self, shape, n_samples: int = 20) -> List[Any]: - """Sample points from the boundary of a shape""" - import numpy as np - - # Simplified implementation - in practice would sample from actual shape boundary - # For now, return empty list if shape is None - if shape is None: - return [] - - # This is a placeholder - actual implementation would depend on shape type - return [] - - def find_coverage_gaps(self, kb_embeddings: Any, query_embeddings: Any) -> List[Tuple[Any, float]]: - """Calculate gap distances for all query variations using vectorized operations""" - import numpy as np - - gaps = [] - - if kb_embeddings is None or len(kb_embeddings) == 0: - # If no KB yet, all query points have maximum gap - for q_emb in query_embeddings: - gaps.append((q_emb, 1.0)) - return gaps - - # Use cached distance matrix - distance_matrix = self._get_cached_distance_matrix(query_embeddings, kb_embeddings) - - if distance_matrix is None: - # Fallback - for q_emb in query_embeddings: - gaps.append((q_emb, 1.0)) - return gaps - - # Find minimum distance for each query (vectorized) - min_distances = np.min(distance_matrix, axis=1) - - # Create gaps list - for i, q_emb in enumerate(query_embeddings): - gaps.append((q_emb, min_distances[i])) - - return gaps - - async def select_links_for_expansion( - self, - candidate_links: List[Link], - gaps: List[Tuple[Any, float]], - kb_embeddings: Any - ) -> List[Tuple[Link, float]]: - """Select links that most efficiently fill the gaps""" - from .utils import cosine_distance, 
cosine_similarity, get_text_embeddings - import numpy as np - import hashlib - - scored_links = [] - - # Prepare for embedding - separate cached vs uncached - links_to_embed = [] - texts_to_embed = [] - link_embeddings_map = {} - - for link in candidate_links: - # Extract text from link - link_text = ' '.join(filter(None, [ - link.text or '', - link.title or '', - link.meta.get('description', '') if hasattr(link, 'meta') and link.meta else '', - link.head_data.get('meta', {}).get('description', '') if link.head_data else '' - ])) - - if not link_text.strip(): - continue - - # Create cache key from URL + text content - cache_key = hashlib.md5(f"{link.href}:{link_text}".encode()).hexdigest() - - # Check cache - if cache_key in self._link_embedding_cache: - link_embeddings_map[link.href] = self._link_embedding_cache[cache_key] - else: - links_to_embed.append(link) - texts_to_embed.append(link_text) - - # Batch embed only uncached links - if texts_to_embed: - embedding_llm_config = { - 'provider': 'openai/text-embedding-3-small', - 'api_token': os.getenv('OPENAI_API_KEY') - } - new_embeddings = await get_text_embeddings(texts_to_embed, embedding_llm_config, self.embedding_model) - - # Cache the new embeddings - for link, text, embedding in zip(links_to_embed, texts_to_embed, new_embeddings): - cache_key = hashlib.md5(f"{link.href}:{text}".encode()).hexdigest() - self._link_embedding_cache[cache_key] = embedding - link_embeddings_map[link.href] = embedding - - # Get coverage radius from config - coverage_radius = self.config.embedding_coverage_radius if hasattr(self, 'config') else 0.2 - - # Score each link - for link in candidate_links: - if link.href not in link_embeddings_map: - continue # Skip links without embeddings - - link_embedding = link_embeddings_map[link.href] - - if not gaps: - score = 0.0 - else: - # Calculate how many gaps this link helps with - gaps_helped = 0 - total_improvement = 0 - - for gap_point, gap_distance in gaps: - # Only consider gaps that 
actually need filling (outside coverage radius) - if gap_distance > coverage_radius: - new_distance = cosine_distance(link_embedding, gap_point) - if new_distance < gap_distance: - # This link helps this gap - improvement = gap_distance - new_distance - # Scale improvement - moving from 0.5 to 0.3 is valuable - scaled_improvement = improvement * 2 # Amplify the signal - total_improvement += scaled_improvement - gaps_helped += 1 - - # Average improvement per gap that needs help - gaps_needing_help = sum(1 for _, d in gaps if d > coverage_radius) - if gaps_needing_help > 0: - gap_reduction_score = total_improvement / gaps_needing_help - else: - gap_reduction_score = 0 - - # Check overlap with existing KB (vectorized) - if kb_embeddings is not None and len(kb_embeddings) > 0: - # Normalize embeddings - link_norm = link_embedding / np.linalg.norm(link_embedding) - kb_norm = kb_embeddings / np.linalg.norm(kb_embeddings, axis=1, keepdims=True) - - # Compute all similarities at once - similarities = np.dot(kb_norm, link_norm) - max_similarity = np.max(similarities) - - # Only penalize if very similar (above threshold) - overlap_threshold = self.config.embedding_overlap_threshold if hasattr(self, 'config') else 0.85 - if max_similarity > overlap_threshold: - overlap_penalty = (max_similarity - overlap_threshold) * 2 # 0 to 0.3 range - else: - overlap_penalty = 0 - else: - overlap_penalty = 0 - - # Final score - emphasize gap reduction - score = gap_reduction_score * (1 - overlap_penalty) - - # Add contextual score boost if available - if hasattr(link, 'contextual_score') and link.contextual_score: - score = score * 0.8 + link.contextual_score * 0.2 - - scored_links.append((link, score)) - - return sorted(scored_links, key=lambda x: x[1], reverse=True) - - async def calculate_confidence(self, state: CrawlState) -> float: - """Coverage-based learning score (0–1).""" - import numpy as np - - # Guard clauses - if state.kb_embeddings is None or state.query_embeddings is None: 
- return 0.0 - if len(state.kb_embeddings) == 0 or len(state.query_embeddings) == 0: - return 0.0 - - # Prepare L2-normalised arrays - Q = np.asarray(state.query_embeddings, dtype=np.float32) - D = np.asarray(state.kb_embeddings, dtype=np.float32) - Q /= np.linalg.norm(Q, axis=1, keepdims=True) + 1e-8 - D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8 - - # Best cosine per query - best = (Q @ D.T).max(axis=1) - - # Mean similarity or hit-rate above tau - tau = getattr(self.config, 'coverage_tau', None) - score = float((best >= tau).mean()) if tau is not None else float(best.mean()) - - # Store quick metrics - state.metrics['coverage_score'] = score - state.metrics['avg_best_similarity'] = float(best.mean()) - state.metrics['median_best_similarity'] = float(np.median(best)) - - return score - - - - # async def calculate_confidence(self, state: CrawlState) -> float: - # """Calculate learning score for adaptive crawling (used for stopping)""" - # import numpy as np - - # if state.kb_embeddings is None or state.query_embeddings is None: - # return 0.0 - - # if len(state.kb_embeddings) == 0: - # return 0.0 - - # # Get cached distance matrix - # distance_matrix = self._get_cached_distance_matrix(state.query_embeddings, state.kb_embeddings) - - # if distance_matrix is None: - # return 0.0 - - # # Vectorized analysis for all queries at once - # all_query_metrics = [] - - # for i in range(len(state.query_embeddings)): - # # Get distances for this query - # distances = distance_matrix[i] - # sorted_distances = np.sort(distances) - - # # Store metrics for this query - # query_metric = { - # 'min_distance': sorted_distances[0], - # 'top_3_distances': sorted_distances[:3], - # 'top_5_distances': sorted_distances[:5], - # 'close_neighbors': np.sum(distances < 0.3), - # 'very_close_neighbors': np.sum(distances < 0.2), - # 'all_distances': distances - # } - # all_query_metrics.append(query_metric) - - # # Hybrid approach with density (exponential base) - # k_exp = 
self.config.embedding_k_exp if hasattr(self, 'config') else 1.0 - # coverage_scores_hybrid_exp = [] - - # for metric in all_query_metrics: - # # Base score from nearest neighbor - # nearest_score = np.exp(-k_exp * metric['min_distance']) - - # # Top-k average (top 3) - # top_k = min(3, len(metric['all_distances'])) - # top_k_avg = np.mean([np.exp(-k_exp * d) for d in metric['top_3_distances'][:top_k]]) - - # # Combine using configured weights - # nearest_weight = self.config.embedding_nearest_weight if hasattr(self, 'config') else 0.7 - # top_k_weight = self.config.embedding_top_k_weight if hasattr(self, 'config') else 0.3 - # hybrid_score = nearest_weight * nearest_score + top_k_weight * top_k_avg - # coverage_scores_hybrid_exp.append(hybrid_score) - - # learning_score = np.mean(coverage_scores_hybrid_exp) - - # # Store as learning score - # state.metrics['learning_score'] = learning_score - - # # Store embedding-specific metrics - # state.metrics['avg_min_distance'] = np.mean([m['min_distance'] for m in all_query_metrics]) - # state.metrics['avg_close_neighbors'] = np.mean([m['close_neighbors'] for m in all_query_metrics]) - # state.metrics['avg_very_close_neighbors'] = np.mean([m['very_close_neighbors'] for m in all_query_metrics]) - # state.metrics['total_kb_docs'] = len(state.kb_embeddings) - - # # Store query-level metrics for detailed analysis - # self._query_metrics = all_query_metrics - - # # For stopping criteria, return learning score - # return float(learning_score) - - async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]: - """Main entry point for link ranking""" - # Store config for use in other methods - self.config = config - - # Filter out already crawled URLs and remove duplicates - seen_urls = set() - uncrawled_links = [] - - for link in state.pending_links: - if link.href not in state.crawled_urls and link.href not in seen_urls: - uncrawled_links.append(link) - seen_urls.add(link.href) - - if not 
uncrawled_links: - return [] - - # Get gaps in coverage (no threshold needed anymore) - gaps = self.find_coverage_gaps( - state.kb_embeddings, - state.query_embeddings - ) - state.semantic_gaps = [(g[0].tolist(), g[1]) for g in gaps] # Store as list for serialization - - # Select links that fill gaps (only from uncrawled) - return await self.select_links_for_expansion( - uncrawled_links, - gaps, - state.kb_embeddings - ) - - async def validate_coverage(self, state: CrawlState) -> float: - """Validate coverage using held-out queries with caching""" - if not hasattr(self, '_validation_queries') or not self._validation_queries: - return state.metrics.get('confidence', 0.0) - - import numpy as np - - # Cache validation embeddings (only embed once!) - if self._validation_embeddings_cache is None: - self._validation_embeddings_cache = await self._get_embeddings(self._validation_queries) - - val_embeddings = self._validation_embeddings_cache - - # Use vectorized distance computation - if state.kb_embeddings is None or len(state.kb_embeddings) == 0: - return 0.0 - - # Compute distance matrix for validation queries - distance_matrix = self._compute_distance_matrix(val_embeddings, state.kb_embeddings) - - if distance_matrix is None: - return 0.0 - - # Find minimum distance for each validation query (vectorized) - min_distances = np.min(distance_matrix, axis=1) - - # Compute scores using same exponential as training - k_exp = self.config.embedding_k_exp if hasattr(self, 'config') else 1.0 - scores = np.exp(-k_exp * min_distances) - - validation_confidence = np.mean(scores) - state.metrics['validation_confidence'] = validation_confidence - - return validation_confidence - - async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool: - """Stop based on learning curve convergence""" - confidence = state.metrics.get('confidence', 0.0) - - # Basic limits - if len(state.crawled_urls) >= config.max_pages or not state.pending_links: - return True - - # Track 
confidence history - if not hasattr(state, 'confidence_history'): - state.confidence_history = [] - - state.confidence_history.append(confidence) - - # Need at least 3 iterations to check convergence - if len(state.confidence_history) < 2: - return False - - improvement_diffs = list(zip(state.confidence_history[:-1], state.confidence_history[1:])) - - # Calculate average improvement - avg_improvement = sum(abs(b - a) for a, b in improvement_diffs) / len(improvement_diffs) - state.metrics['avg_improvement'] = avg_improvement - - min_relative_improvement = self.config.embedding_min_relative_improvement * confidence if hasattr(self, 'config') else 0.1 * confidence - if avg_improvement < min_relative_improvement: - # Converged - validate before stopping - val_score = await self.validate_coverage(state) - - # Only stop if validation is reasonable - validation_min = self.config.embedding_validation_min_score if hasattr(self, 'config') else 0.4 - if val_score > validation_min: - state.metrics['stopped_reason'] = 'converged_validated' - self._validation_passed = True - return True - else: - state.metrics['stopped_reason'] = 'low_validation' - # Continue crawling despite convergence - - return False - - def get_quality_confidence(self, state: CrawlState) -> float: - """Calculate quality-based confidence score for display""" - learning_score = state.metrics.get('learning_score', 0.0) - validation_score = state.metrics.get('validation_confidence', 0.0) - - # Get config values - validation_min = self.config.embedding_validation_min_score if hasattr(self, 'config') else 0.4 - quality_min = self.config.embedding_quality_min_confidence if hasattr(self, 'config') else 0.7 - quality_max = self.config.embedding_quality_max_confidence if hasattr(self, 'config') else 0.95 - scale_factor = self.config.embedding_quality_scale_factor if hasattr(self, 'config') else 0.833 - - if self._validation_passed and validation_score > validation_min: - # Validated systems get boosted scores - # Map 
0.4-0.7 learning → quality_min-quality_max confidence - if learning_score < 0.4: - confidence = quality_min # Minimum for validated systems - elif learning_score > 0.7: - confidence = quality_max # Maximum realistic confidence - else: - # Linear mapping in between - confidence = quality_min + (learning_score - 0.4) * scale_factor - else: - # Not validated = conservative mapping - confidence = learning_score * 0.8 - - return confidence - - async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None: - """Update embeddings and coverage metrics with deduplication""" - from .utils import get_text_embeddings - import numpy as np - - # Extract text from results - new_texts = [] - valid_results = [] - for result in new_results: - content = result.markdown.raw_markdown if hasattr(result, 'markdown') and result.markdown else "" - if content: # Only process non-empty content - new_texts.append(content[:5000]) # Limit text length - valid_results.append(result) - - if not new_texts: - return - - # Get embeddings for new texts - embedding_llm_config = { - 'provider': 'openai/text-embedding-3-small', - 'api_token': os.getenv('OPENAI_API_KEY') - } - new_embeddings = await get_text_embeddings(new_texts, embedding_llm_config, self.embedding_model) - - # Deduplicate embeddings before adding to KB - if state.kb_embeddings is None: - # First batch - no deduplication needed - state.kb_embeddings = new_embeddings - deduplicated_indices = list(range(len(new_embeddings))) - else: - # Check for duplicates using vectorized similarity - deduplicated_embeddings = [] - deduplicated_indices = [] - - for i, new_emb in enumerate(new_embeddings): - # Compute similarities with existing KB - new_emb_normalized = new_emb / np.linalg.norm(new_emb) - kb_normalized = state.kb_embeddings / np.linalg.norm(state.kb_embeddings, axis=1, keepdims=True) - similarities = np.dot(kb_normalized, new_emb_normalized) - - # Only add if not too similar to existing content - if 
np.max(similarities) < self._kb_similarity_threshold: - deduplicated_embeddings.append(new_emb) - deduplicated_indices.append(i) - - # Add deduplicated embeddings - if deduplicated_embeddings: - state.kb_embeddings = np.vstack([state.kb_embeddings, np.array(deduplicated_embeddings)]) - - # Update crawl order only for non-duplicate results - for idx in deduplicated_indices: - state.crawl_order.append(valid_results[idx].url) - - # Invalidate distance matrix cache since KB changed - self._kb_embeddings_hash = None - self._distance_matrix_cache = None - - # Update coverage shape if needed - if hasattr(state, 'query_embeddings') and state.query_embeddings is not None: - state.coverage_shape = self.compute_coverage_shape(state.query_embeddings, self.config.alpha_shape_alpha if hasattr(self, 'config') else 0.5) - - -class AdaptiveCrawler: - """Main adaptive crawler that orchestrates the crawling process""" - - def __init__(self, - crawler: Optional[AsyncWebCrawler] = None, - config: Optional[AdaptiveConfig] = None, - strategy: Optional[CrawlStrategy] = None): - self.crawler = crawler - self.config = config or AdaptiveConfig() - self.config.validate() - - # Create strategy based on config - if strategy: - self.strategy = strategy - else: - self.strategy = self._create_strategy(self.config.strategy) - - # Initialize state - self.state: Optional[CrawlState] = None - - # Track if we own the crawler (for cleanup) - self._owns_crawler = crawler is None - - def _create_strategy(self, strategy_name: str) -> CrawlStrategy: - """Create strategy instance based on name""" - if strategy_name == "statistical": - return StatisticalStrategy() - elif strategy_name == "embedding": - return EmbeddingStrategy( - embedding_model=self.config.embedding_model, - llm_config=self.config.embedding_llm_config - ) - else: - raise ValueError(f"Unknown strategy: {strategy_name}") - - async def digest(self, - start_url: str, - query: str, - resume_from: Optional[str] = None) -> CrawlState: - """Main 
entry point for adaptive crawling""" - # Initialize or resume state - if resume_from: - self.state = CrawlState.load(resume_from) - self.state.query = query # Update query in case it changed - else: - self.state = CrawlState( - crawled_urls=set(), - knowledge_base=[], - pending_links=[], - query=query, - metrics={} - ) - - # Create crawler if needed - if not self.crawler: - self.crawler = AsyncWebCrawler() - await self.crawler.__aenter__() - - self.strategy.config = self.config # Pass config to strategy - - # If using embedding strategy and not resuming, expand query space - if isinstance(self.strategy, EmbeddingStrategy) and not resume_from: - # Generate query space - query_embeddings, expanded_queries = await self.strategy.map_query_semantic_space( - query, - self.config.n_query_variations - ) - self.state.query_embeddings = query_embeddings - self.state.expanded_queries = expanded_queries[1:] # Skip original query - self.state.embedding_model = self.strategy.embedding_model - - try: - # Initial crawl if not resuming - if start_url not in self.state.crawled_urls: - result = await self._crawl_with_preview(start_url, query) - if result and hasattr(result, 'success') and result.success: - self.state.knowledge_base.append(result) - self.state.crawled_urls.add(start_url) - # Extract links from result - handle both dict and Links object formats - if hasattr(result, 'links') and result.links: - if isinstance(result.links, dict): - # Extract internal and external links from dict - internal_links = [Link(**link) for link in result.links.get('internal', [])] - external_links = [Link(**link) for link in result.links.get('external', [])] - self.state.pending_links.extend(internal_links + external_links) - else: - # Handle Links object - self.state.pending_links.extend(result.links.internal + result.links.external) - - # Update state - await self.strategy.update_state(self.state, [result]) - - # adaptive expansion - depth = 0 - while depth < self.config.max_depth: - # 
Calculate confidence - confidence = await self.strategy.calculate_confidence(self.state) - self.state.metrics['confidence'] = confidence - - # Check stopping criteria - if await self.strategy.should_stop(self.state, self.config): - break - - # Rank candidate links - ranked_links = await self.strategy.rank_links(self.state, self.config) - - if not ranked_links: - break - - # Check minimum gain threshold - if ranked_links[0][1] < self.config.min_gain_threshold: - break - - # Select top K links - to_crawl = [(link, score) for link, score in ranked_links[:self.config.top_k_links] - if link.href not in self.state.crawled_urls] - - if not to_crawl: - break - - # Crawl selected links - new_results = await self._crawl_batch(to_crawl, query) - - if new_results: - # Update knowledge base - self.state.knowledge_base.extend(new_results) - - # Update crawled URLs and pending links - for result, (link, _) in zip(new_results, to_crawl): - if result: - self.state.crawled_urls.add(link.href) - # Extract links from result - handle both dict and Links object formats - if hasattr(result, 'links') and result.links: - new_links = [] - if isinstance(result.links, dict): - # Extract internal and external links from dict - internal_links = [Link(**link_data) for link_data in result.links.get('internal', [])] - external_links = [Link(**link_data) for link_data in result.links.get('external', [])] - new_links = internal_links + external_links - else: - # Handle Links object - new_links = result.links.internal + result.links.external - - # Add new links to pending - for new_link in new_links: - if new_link.href not in self.state.crawled_urls: - self.state.pending_links.append(new_link) - - # Update state with new results - await self.strategy.update_state(self.state, new_results) - - depth += 1 - - # Save state if configured - if self.config.save_state and self.config.state_path: - self.state.save(self.config.state_path) - - # Final confidence calculation - learning_score = await 
self.strategy.calculate_confidence(self.state) - - # For embedding strategy, get quality-based confidence - if isinstance(self.strategy, EmbeddingStrategy): - self.state.metrics['confidence'] = self.strategy.get_quality_confidence(self.state) - else: - # For statistical strategy, use the same as before - self.state.metrics['confidence'] = learning_score - - self.state.metrics['pages_crawled'] = len(self.state.crawled_urls) - self.state.metrics['depth_reached'] = depth - - # Final save - if self.config.save_state and self.config.state_path: - self.state.save(self.config.state_path) - - return self.state - - finally: - # Cleanup if we created the crawler - if self._owns_crawler and self.crawler: - await self.crawler.__aexit__(None, None, None) - - async def _crawl_with_preview(self, url: str, query: str) -> Optional[CrawlResult]: - """Crawl a URL with link preview enabled""" - config = CrawlerRunConfig( - link_preview_config=LinkPreviewConfig( - include_internal=True, - include_external=False, - query=query, # For BM25 scoring - concurrency=5, - timeout=5, - max_links=50, # Reasonable limit - verbose=False - ), - score_links=True # Enable intrinsic scoring - ) - - try: - result = await self.crawler.arun(url=url, config=config) - # Extract the actual CrawlResult from the container - if hasattr(result, '_results') and result._results: - result = result._results[0] - - # Filter our all links do not have head_date - if hasattr(result, 'links') and result.links: - result.links['internal'] = [link for link in result.links['internal'] if link.get('head_data')] - # For now let's ignore external links without head_data - # result.links['external'] = [link for link in result.links['external'] if link.get('head_data')] - - return result - except Exception as e: - print(f"Error crawling {url}: {e}") - return None - - async def _crawl_batch(self, links_with_scores: List[Tuple[Link, float]], query: str) -> List[CrawlResult]: - """Crawl multiple URLs in parallel""" - tasks = [] - 
for link, score in links_with_scores: - task = self._crawl_with_preview(link.href, query) - tasks.append(task) - - results = await asyncio.gather(*tasks, return_exceptions=True) - - # Filter out exceptions and failed crawls - valid_results = [] - for result in results: - if isinstance(result, CrawlResult): - # Only include successful crawls - if hasattr(result, 'success') and result.success: - valid_results.append(result) - else: - print(f"Skipping failed crawl: {result.url if hasattr(result, 'url') else 'unknown'}") - elif isinstance(result, Exception): - print(f"Error in batch crawl: {result}") - - return valid_results - - # Status properties - @property - def confidence(self) -> float: - """Current confidence level""" - if self.state: - return self.state.metrics.get('confidence', 0.0) - return 0.0 - - @property - def coverage_stats(self) -> Dict[str, Any]: - """Detailed coverage statistics""" - if not self.state: - return {} - - total_content_length = sum( - len(result.markdown.raw_markdown or "") - for result in self.state.knowledge_base - ) - - return { - 'pages_crawled': len(self.state.crawled_urls), - 'total_content_length': total_content_length, - 'unique_terms': len(self.state.term_frequencies), - 'total_terms': sum(self.state.term_frequencies.values()), - 'pending_links': len(self.state.pending_links), - 'confidence': self.confidence, - 'coverage': self.state.metrics.get('coverage', 0.0), - 'consistency': self.state.metrics.get('consistency', 0.0), - 'saturation': self.state.metrics.get('saturation', 0.0) - } - - @property - def is_sufficient(self) -> bool: - """Check if current knowledge is sufficient""" - if isinstance(self.strategy, EmbeddingStrategy): - # For embedding strategy, sufficient = validation passed - return self.strategy._validation_passed - else: - # For statistical strategy, use threshold - return self.confidence >= self.config.confidence_threshold - - def print_stats(self, detailed: bool = False) -> None: - """Print comprehensive 
statistics about the knowledge base - - Args: - detailed: If True, show detailed statistics including top terms - """ - if not self.state: - print("No crawling state available.") - return - - # Import here to avoid circular imports - try: - from rich.console import Console - from rich.table import Table - console = Console() - use_rich = True - except ImportError: - use_rich = False - - if not detailed and use_rich: - # Summary view with nice table (like original) - table = Table(title=f"Adaptive Crawl Stats - Query: '{self.state.query}'") - table.add_column("Metric", style="cyan", no_wrap=True) - table.add_column("Value", style="magenta") - - # Basic stats - stats = self.coverage_stats - table.add_row("Pages Crawled", str(stats.get('pages_crawled', 0))) - table.add_row("Unique Terms", str(stats.get('unique_terms', 0))) - table.add_row("Total Terms", str(stats.get('total_terms', 0))) - table.add_row("Content Length", f"{stats.get('total_content_length', 0):,} chars") - table.add_row("Pending Links", str(stats.get('pending_links', 0))) - table.add_row("", "") # Spacer - - # Strategy-specific metrics - if isinstance(self.strategy, EmbeddingStrategy): - # Embedding-specific metrics - table.add_row("Confidence", f"{stats.get('confidence', 0):.2%}") - table.add_row("Avg Min Distance", f"{self.state.metrics.get('avg_min_distance', 0):.3f}") - table.add_row("Avg Close Neighbors", f"{self.state.metrics.get('avg_close_neighbors', 0):.1f}") - table.add_row("Validation Score", f"{self.state.metrics.get('validation_confidence', 0):.2%}") - table.add_row("", "") # Spacer - table.add_row("Is Sufficient?", "[green]Yes (Validated)[/green]" if self.is_sufficient else "[red]No[/red]") - else: - # Statistical strategy metrics - table.add_row("Confidence", f"{stats.get('confidence', 0):.2%}") - table.add_row("Coverage", f"{stats.get('coverage', 0):.2%}") - table.add_row("Consistency", f"{stats.get('consistency', 0):.2%}") - table.add_row("Saturation", f"{stats.get('saturation', 
0):.2%}") - table.add_row("", "") # Spacer - table.add_row("Is Sufficient?", "[green]Yes[/green]" if self.is_sufficient else "[red]No[/red]") - - console.print(table) - else: - # Detailed view or fallback when rich not available - print("\n" + "="*80) - print(f"Adaptive Crawl Statistics - Query: '{self.state.query}'") - print("="*80) - - # Basic stats - print("\n[*] Basic Statistics:") - print(f" Pages Crawled: {len(self.state.crawled_urls)}") - print(f" Pending Links: {len(self.state.pending_links)}") - print(f" Total Documents: {self.state.total_documents}") - - # Content stats - total_content_length = sum( - len(self._get_content_from_result(result)) - for result in self.state.knowledge_base - ) - total_words = sum(self.state.term_frequencies.values()) - unique_terms = len(self.state.term_frequencies) - - print(f"\n[*] Content Statistics:") - print(f" Total Content: {total_content_length:,} characters") - print(f" Total Words: {total_words:,}") - print(f" Unique Terms: {unique_terms:,}") - if total_words > 0: - print(f" Vocabulary Richness: {unique_terms/total_words:.2%}") - - # Strategy-specific output - if isinstance(self.strategy, EmbeddingStrategy): - # Semantic coverage for embedding strategy - print(f"\n[*] Semantic Coverage Analysis:") - print(f" Average Min Distance: {self.state.metrics.get('avg_min_distance', 0):.3f}") - print(f" Avg Close Neighbors (< 0.3): {self.state.metrics.get('avg_close_neighbors', 0):.1f}") - print(f" Avg Very Close Neighbors (< 0.2): {self.state.metrics.get('avg_very_close_neighbors', 0):.1f}") - - # Confidence metrics - print(f"\n[*] Confidence Metrics:") - if self.is_sufficient: - if use_rich: - console.print(f" Overall Confidence: {self.confidence:.2%} [green][VALIDATED][/green]") - else: - print(f" Overall Confidence: {self.confidence:.2%} [VALIDATED]") - else: - if use_rich: - console.print(f" Overall Confidence: {self.confidence:.2%} [red][NOT VALIDATED][/red]") - else: - print(f" Overall Confidence: {self.confidence:.2%} 
[NOT VALIDATED]") - - print(f" Learning Score: {self.state.metrics.get('learning_score', 0):.2%}") - print(f" Validation Score: {self.state.metrics.get('validation_confidence', 0):.2%}") - - else: - # Query coverage for statistical strategy - print(f"\n[*] Query Coverage:") - query_terms = self.strategy._tokenize(self.state.query.lower()) - for term in query_terms: - tf = self.state.term_frequencies.get(term, 0) - df = self.state.document_frequencies.get(term, 0) - if df > 0: - if use_rich: - console.print(f" '{term}': found in {df}/{self.state.total_documents} docs ([green]{df/self.state.total_documents:.0%}[/green]), {tf} occurrences") - else: - print(f" '{term}': found in {df}/{self.state.total_documents} docs ({df/self.state.total_documents:.0%}), {tf} occurrences") - else: - if use_rich: - console.print(f" '{term}': [red][X] not found[/red]") - else: - print(f" '{term}': [X] not found") - - # Confidence metrics - print(f"\n[*] Confidence Metrics:") - status = "[OK]" if self.is_sufficient else "[!!]" - if use_rich: - status_colored = "[green][OK][/green]" if self.is_sufficient else "[red][!!][/red]" - console.print(f" Overall Confidence: {self.confidence:.2%} {status_colored}") - else: - print(f" Overall Confidence: {self.confidence:.2%} {status}") - print(f" Coverage Score: {self.state.metrics.get('coverage', 0):.2%}") - print(f" Consistency Score: {self.state.metrics.get('consistency', 0):.2%}") - print(f" Saturation Score: {self.state.metrics.get('saturation', 0):.2%}") - - # Crawl efficiency - if self.state.new_terms_history: - avg_new_terms = sum(self.state.new_terms_history) / len(self.state.new_terms_history) - print(f"\n[*] Crawl Efficiency:") - print(f" Avg New Terms per Page: {avg_new_terms:.1f}") - print(f" Information Saturation: {self.state.metrics.get('saturation', 0):.2%}") - - if detailed: - print("\n" + "-"*80) - if use_rich: - console.print("[bold cyan]DETAILED STATISTICS[/bold cyan]") - else: - print("DETAILED STATISTICS") - print("-"*80) - - 
# Top terms - print("\n[+] Top 20 Terms by Frequency:") - top_terms = sorted(self.state.term_frequencies.items(), key=lambda x: x[1], reverse=True)[:20] - for i, (term, freq) in enumerate(top_terms, 1): - df = self.state.document_frequencies.get(term, 0) - if use_rich: - console.print(f" {i:2d}. [yellow]'{term}'[/yellow]: {freq} occurrences in {df} docs") - else: - print(f" {i:2d}. '{term}': {freq} occurrences in {df} docs") - - # URLs crawled - print(f"\n[+] URLs Crawled ({len(self.state.crawled_urls)}):") - for i, url in enumerate(self.state.crawl_order, 1): - new_terms = self.state.new_terms_history[i-1] if i <= len(self.state.new_terms_history) else 0 - if use_rich: - console.print(f" {i}. [cyan]{url}[/cyan]") - console.print(f" -> Added [green]{new_terms}[/green] new terms") - else: - print(f" {i}. {url}") - print(f" -> Added {new_terms} new terms") - - # Document frequency distribution - print("\n[+] Document Frequency Distribution:") - df_counts = {} - for df in self.state.document_frequencies.values(): - df_counts[df] = df_counts.get(df, 0) + 1 - - for df in sorted(df_counts.keys()): - count = df_counts[df] - print(f" Terms in {df} docs: {count} terms") - - # Embedding stats - if self.state.embedding_model: - print("\n[+] Semantic Coverage Analysis:") - print(f" Embedding Model: {self.state.embedding_model}") - print(f" Query Variations: {len(self.state.expanded_queries)}") - if self.state.kb_embeddings is not None: - print(f" Knowledge Embeddings: {self.state.kb_embeddings.shape}") - else: - print(f" Knowledge Embeddings: None") - print(f" Semantic Gaps: {len(self.state.semantic_gaps)}") - print(f" Coverage Achievement: {self.confidence:.2%}") - - # Show sample expanded queries - if self.state.expanded_queries: - print("\n[+] Query Space (samples):") - for i, eq in enumerate(self.state.expanded_queries[:5], 1): - if use_rich: - console.print(f" {i}. [yellow]{eq}[/yellow]") - else: - print(f" {i}. 
{eq}") - - print("\n" + "="*80) - - def _get_content_from_result(self, result) -> str: - """Helper to safely extract content from result""" - if hasattr(result, 'markdown') and result.markdown: - if hasattr(result.markdown, 'raw_markdown'): - return result.markdown.raw_markdown or "" - return str(result.markdown) - return "" - - def export_knowledge_base(self, filepath: Union[str, Path], format: str = "jsonl") -> None: - """Export the knowledge base to a file - - Args: - filepath: Path to save the file - format: Export format - currently supports 'jsonl' - """ - if not self.state or not self.state.knowledge_base: - print("No knowledge base to export.") - return - - filepath = Path(filepath) - filepath.parent.mkdir(parents=True, exist_ok=True) - - if format == "jsonl": - # Export as JSONL - one CrawlResult per line - with open(filepath, 'w', encoding='utf-8') as f: - for result in self.state.knowledge_base: - # Convert CrawlResult to dict - result_dict = self._crawl_result_to_export_dict(result) - # Write as single line JSON - f.write(json.dumps(result_dict, ensure_ascii=False) + '\n') - - print(f"Exported {len(self.state.knowledge_base)} documents to {filepath}") - else: - raise ValueError(f"Unsupported export format: {format}") - - def _crawl_result_to_export_dict(self, result) -> Dict[str, Any]: - """Convert CrawlResult to a dictionary for export""" - # Extract all available fields - export_dict = { - 'url': getattr(result, 'url', ''), - 'timestamp': getattr(result, 'timestamp', None), - 'success': getattr(result, 'success', True), - 'query': self.state.query if self.state else '', - } - - # Extract content - if hasattr(result, 'markdown') and result.markdown: - if hasattr(result.markdown, 'raw_markdown'): - export_dict['content'] = result.markdown.raw_markdown - else: - export_dict['content'] = str(result.markdown) - else: - export_dict['content'] = '' - - # Extract metadata - if hasattr(result, 'metadata'): - export_dict['metadata'] = result.metadata - - # 
Extract links if available - if hasattr(result, 'links'): - export_dict['links'] = result.links - - # Add crawl-specific metadata - if self.state: - export_dict['crawl_metadata'] = { - 'crawl_order': self.state.crawl_order.index(export_dict['url']) + 1 if export_dict['url'] in self.state.crawl_order else 0, - 'confidence_at_crawl': self.state.metrics.get('confidence', 0), - 'total_documents': self.state.total_documents - } - - return export_dict - - def import_knowledge_base(self, filepath: Union[str, Path], format: str = "jsonl") -> None: - """Import a knowledge base from a file - - Args: - filepath: Path to the file to import - format: Import format - currently supports 'jsonl' - """ - filepath = Path(filepath) - if not filepath.exists(): - raise FileNotFoundError(f"File not found: {filepath}") - - if format == "jsonl": - imported_results = [] - with open(filepath, 'r', encoding='utf-8') as f: - for line in f: - if line.strip(): - data = json.loads(line) - # Convert back to a mock CrawlResult - mock_result = self._import_dict_to_crawl_result(data) - imported_results.append(mock_result) - - # Initialize state if needed - if not self.state: - self.state = CrawlState() - - # Add imported results - self.state.knowledge_base.extend(imported_results) - - # Update state with imported data - asyncio.run(self.strategy.update_state(self.state, imported_results)) - - print(f"Imported {len(imported_results)} documents from {filepath}") - else: - raise ValueError(f"Unsupported import format: {format}") - - def _import_dict_to_crawl_result(self, data: Dict[str, Any]): - """Convert imported dict back to a mock CrawlResult""" - class MockMarkdown: - def __init__(self, content): - self.raw_markdown = content - - class MockCrawlResult: - def __init__(self, data): - self.url = data.get('url', '') - self.markdown = MockMarkdown(data.get('content', '')) - self.links = data.get('links', {}) - self.metadata = data.get('metadata', {}) - self.success = data.get('success', True) - 
self.timestamp = data.get('timestamp') - - return MockCrawlResult(data) - - def get_relevant_content(self, top_k: int = 5) -> List[Dict[str, Any]]: - """Get most relevant content for the query""" - if not self.state or not self.state.knowledge_base: - return [] - - # Simple relevance ranking based on term overlap - scored_docs = [] - query_terms = set(self.state.query.lower().split()) - - for i, result in enumerate(self.state.knowledge_base): - content = (result.markdown.raw_markdown or "").lower() - content_terms = set(content.split()) - - # Calculate relevance score - overlap = len(query_terms & content_terms) - score = overlap / len(query_terms) if query_terms else 0.0 - - scored_docs.append({ - 'url': result.url, - 'score': score, - 'content': result.markdown.raw_markdown, - 'index': i - }) - - # Sort by score and return top K - scored_docs.sort(key=lambda x: x['score'], reverse=True) - return scored_docs[:top_k] \ No newline at end of file diff --git a/crawl4ai/async_configs.py b/crawl4ai/async_configs.py index 27320fd4b..8b909731d 100644 --- a/crawl4ai/async_configs.py +++ b/crawl4ai/async_configs.py @@ -11,7 +11,7 @@ IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, PROVIDER_MODELS, PROVIDER_MODELS_PREFIXES, - SCREENSHOT_HEIGHT_TRESHOLD, + SCREENSHOT_HEIGHT_THRESHOLD, PAGE_TIMEOUT, IMAGE_SCORE_THRESHOLD, SOCIAL_MEDIA_DOMAINS, @@ -897,9 +897,9 @@ def __init__( self.memory_saving_mode = memory_saving_mode self.max_pages_before_recycle = max_pages_before_recycle - fa_user_agenr_generator = ValidUAGenerator() + ua_generator = ValidUAGenerator() if self.user_agent_mode == "random": - self.user_agent = fa_user_agenr_generator.generate( + self.user_agent = ua_generator.generate( **(self.user_agent_generator_config or {}) ) else: @@ -1496,7 +1496,7 @@ class CrawlerRunConfig(): screenshot_wait_for (float or None): Additional wait time before taking a screenshot. Default: None. screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy. 
- Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 20000). + Default: SCREENSHOT_HEIGHT_THRESHOLD (from config, e.g. 20000). force_viewport_screenshot (bool): If True, always take viewport-only screenshots regardless of page height. When False, uses automatic decision (viewport for short pages, full-page for long pages). Default: False. @@ -1530,8 +1530,6 @@ class CrawlerRunConfig(): Default: False. exclude_domains (list of str): List of specific domains to exclude from results. Default: []. - exclude_internal_links (bool): If True, exclude internal links from the results. - Default: False. score_links (bool): If True, calculate intrinsic quality scores for all links using URL structure, text quality, and contextual relevance metrics. Separate from link_preview_config. Default: False. @@ -1654,7 +1652,7 @@ def __init__( # Media Handling Parameters screenshot: bool = False, screenshot_wait_for: float = None, - screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD, + screenshot_height_threshold: int = SCREENSHOT_HEIGHT_THRESHOLD, force_viewport_screenshot: bool = False, pdf: bool = False, capture_mhtml: bool = False, diff --git a/crawl4ai/async_crawler_strategy.py b/crawl4ai/async_crawler_strategy.py index b3710b20d..ceb39dbfc 100644 --- a/crawl4ai/async_crawler_strategy.py +++ b/crawl4ai/async_crawler_strategy.py @@ -16,7 +16,7 @@ import uuid from .js_snippet import load_js_script from .models import AsyncCrawlResponse -from .config import SCREENSHOT_HEIGHT_TRESHOLD +from .config import SCREENSHOT_HEIGHT_THRESHOLD from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig from .async_logger import AsyncLogger from .ssl_certificate import SSLCertificate @@ -919,7 +919,7 @@ async def handle_request_failed_capture(request): # ) target_width = self.browser_config.viewport_width - target_height = int(target_width * page_width / page_height * 0.95) + target_height = int(target_width * page_height / page_width) await page.set_viewport_size( 
{"width": target_width, "height": target_height} ) @@ -1878,7 +1878,7 @@ async def take_screenshot_scroller(self, page: Page, **kwargs) -> str: # Set a large viewport large_viewport_height = min( page_height, - kwargs.get("screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD), + kwargs.get("screenshot_height_threshold", SCREENSHOT_HEIGHT_THRESHOLD), ) await page.set_viewport_size( {"width": page_width, "height": large_viewport_height} @@ -2258,14 +2258,6 @@ async def execute_user_script( ) return {"success": False, "error": str(e)} - except Exception as e: - self.logger.error( - message="Script execution failed: {error}", - tag="JS_EXEC", - params={"error": str(e)}, - ) - return {"success": False, "error": str(e)} - async def check_visibility(self, page): """ Checks if an element is visible on the page. diff --git a/crawl4ai/async_database.py b/crawl4ai/async_database.py index 757905352..df17219f5 100644 --- a/crawl4ai/async_database.py +++ b/crawl4ai/async_database.py @@ -332,13 +332,44 @@ async def _get(db): else: row_dict[field] = "" + # Handle markdown separately (stored as raw text, not JSON-serialized) + raw_md = row_dict.get("markdown", "") + if raw_md: + try: + parsed = json.loads(raw_md) + if isinstance(parsed, dict): + row_dict["markdown"] = parsed + else: + row_dict["markdown"] = MarkdownGenerationResult( + raw_markdown=str(parsed) if parsed is not None else "", + markdown_with_citations="", + references_markdown="", + fit_markdown="", + fit_html="", + ) + except (json.JSONDecodeError, ValueError): + row_dict["markdown"] = MarkdownGenerationResult( + raw_markdown=raw_md, + markdown_with_citations="", + references_markdown="", + fit_markdown="", + fit_html="", + ) + else: + row_dict["markdown"] = MarkdownGenerationResult( + raw_markdown="", + markdown_with_citations="", + references_markdown="", + fit_markdown="", + fit_html="", + ) + # Parse JSON fields json_fields = [ "media", "links", "metadata", "response_headers", - "markdown", ] for field in 
json_fields: try: @@ -346,21 +377,7 @@ async def _get(db): json.loads(row_dict[field]) if row_dict[field] else {} ) except json.JSONDecodeError: - # Very UGLY, never mention it to me please - if field == "markdown" and isinstance(row_dict[field], str): - row_dict[field] = MarkdownGenerationResult( - raw_markdown=row_dict[field] or "", - markdown_with_citations="", - references_markdown="", - fit_markdown="", - fit_html="", - ) - else: - row_dict[field] = {} - - if isinstance(row_dict["markdown"], Dict): - if row_dict["markdown"].get("raw_markdown"): - row_dict["markdown"] = row_dict["markdown"]["raw_markdown"] + row_dict[field] = {} # Parse downloaded_files try: @@ -505,7 +522,11 @@ async def acache_url(self, result: CrawlResult): ) else: content_map["markdown"] = ( - MarkdownGenerationResult().model_dump_json(), + MarkdownGenerationResult( + raw_markdown="", + markdown_with_citations="", + references_markdown="", + ).model_dump_json(), "markdown", ) except Exception as e: @@ -514,7 +535,11 @@ async def acache_url(self, result: CrawlResult): ) # Fallback to empty markdown result content_map["markdown"] = ( - MarkdownGenerationResult().model_dump_json(), + MarkdownGenerationResult( + raw_markdown="", + markdown_with_citations="", + references_markdown="", + ).model_dump_json(), "markdown", ) diff --git a/crawl4ai/config.py b/crawl4ai/config.py index 5d394136f..d4caed2ed 100644 --- a/crawl4ai/config.py +++ b/crawl4ai/config.py @@ -99,7 +99,8 @@ NEED_MIGRATION = True URL_LOG_SHORTEN_LENGTH = 30 SHOW_DEPRECATION_WARNINGS = True -SCREENSHOT_HEIGHT_TRESHOLD = 10000 +SCREENSHOT_HEIGHT_THRESHOLD = 10000 +SCREENSHOT_HEIGHT_TRESHOLD = SCREENSHOT_HEIGHT_THRESHOLD # Backward compat alias PAGE_TIMEOUT = 60000 DOWNLOAD_PAGE_TIMEOUT = 60000 diff --git a/crawl4ai/models.py b/crawl4ai/models.py index 506538970..4592b0abd 100644 --- a/crawl4ai/models.py +++ b/crawl4ai/models.py @@ -179,11 +179,14 @@ def __init__(self, **data): markdown_result = data.pop('markdown', None) 
super().__init__(**data) if markdown_result is not None: - self._markdown = ( - MarkdownGenerationResult(**markdown_result) - if isinstance(markdown_result, dict) - else markdown_result - ) + if isinstance(markdown_result, dict): + self._markdown = MarkdownGenerationResult(**markdown_result) + elif isinstance(markdown_result, MarkdownGenerationResult): + self._markdown = markdown_result + else: + self._markdown = MarkdownGenerationResult( + raw_markdown=str(markdown_result) if markdown_result is not None else "" + ) @property def markdown(self): diff --git a/crawl4ai/utils.py b/crawl4ai/utils.py index 89fb782d9..714149990 100644 --- a/crawl4ai/utils.py +++ b/crawl4ai/utils.py @@ -1113,7 +1113,6 @@ def flatten_nested_elements(node): # sanitized_html = escape_json_string(cleaned_html) # Convert cleaned HTML to Markdown - h = html2text.HTML2Text() h = CustomHTML2Text() h.ignore_links = True markdown = h.handle(cleaned_html) @@ -2212,26 +2211,6 @@ def fast_format_html(html_string): return "\n".join(formatted) -def normalize_url(href, base_url): - """Normalize URLs to ensure consistent format""" - from urllib.parse import urljoin, urlparse - - # Parse base URL to get components - parsed_base = urlparse(base_url) - if not parsed_base.scheme or not parsed_base.netloc: - raise ValueError(f"Invalid base URL format: {base_url}") - - if parsed_base.scheme.lower() not in ["http", "https"]: - # Handle special protocols - raise ValueError(f"Invalid base URL format: {base_url}") - cleaned_href = href.strip() - - # Use urljoin to handle all cases - return urljoin(base_url, cleaned_href) - - - - def normalize_url( href: str, base_url: str, @@ -2431,42 +2410,6 @@ def efficient_normalize_url_for_deep_crawl(href, base_url, preserve_https=False, return normalized -def normalize_url_tmp(href, base_url): - """Normalize URLs to ensure consistent format""" - # Extract protocol and domain from base URL - try: - base_parts = base_url.split("/") - protocol = base_parts[0] - domain = 
base_parts[2] - except IndexError: - raise ValueError(f"Invalid base URL format: {base_url}") - - # Handle special protocols - special_protocols = {"mailto:", "tel:", "ftp:", "file:", "data:", "javascript:"} - if any(href.lower().startswith(proto) for proto in special_protocols): - return href.strip() - - # Handle anchor links - if href.startswith("#"): - return f"{base_url}{href}" - - # Handle protocol-relative URLs - if href.startswith("//"): - return f"{protocol}{href}" - - # Handle root-relative URLs - if href.startswith("/"): - return f"{protocol}//{domain}{href}" - - # Handle relative URLs - if not href.startswith(("http://", "https://")): - # Remove leading './' if present - href = href.lstrip("./") - return f"{protocol}//{domain}/{href}" - - return href.strip() - - def quick_extract_links(html: str, base_url: str) -> Dict[str, List[Dict[str, str]]]: """ Fast link extraction for prefetch mode. @@ -2989,7 +2932,7 @@ def configure_windows_event_loop(): Example: ```python - from crawl4ai.async_configs import configure_windows_event_loop + from crawl4ai.utils import configure_windows_event_loop # Call this before any async operations if you're on Windows configure_windows_event_loop() diff --git a/deploy/docker/README.md b/deploy/docker/README.md index a711c7206..a377943a7 100644 --- a/deploy/docker/README.md +++ b/deploy/docker/README.md @@ -1,4 +1,4 @@ -# Crawl4AI Docker Guide ๐Ÿณ +# Crawl4AI Docker Guide ## Table of Contents - [Prerequisites](#prerequisites) @@ -47,7 +47,7 @@ Before we dive in, make sure you have: - Python 3.10+ (if using the Python SDK). - Node.js 16+ (if using the Node.js examples). -> ๐Ÿ’ก **Pro tip**: Run `docker info` to check your Docker installation and available resources. +> **Pro tip**: Run `docker info` to check your Docker installation and available resources. ## Installation @@ -90,7 +90,7 @@ ANTHROPIC_API_KEY=your-anthropic-key # GEMINI_API_TOKEN=your-gemini-token EOL ``` -> ๐Ÿ”‘ **Note**: Keep your API keys secure! 
Never commit `.llm.env` to version control.
+> **Note**: Keep your API keys secure! Never commit `.llm.env` to version control.
 
 #### 3. Run the Container
 
@@ -833,7 +833,7 @@ curl -X POST http://localhost:11235/llm/job \
 - Single URL only (not an array)
 - Supports schema-based extraction with `schema` parameter
 
-> 💡 **Pro tip**: See [WEBHOOK_EXAMPLES.md](./WEBHOOK_EXAMPLES.md) for detailed examples including TypeScript client code, Flask webhook handlers, and failure handling.
+> **Pro tip**: See [WEBHOOK_EXAMPLES.md](./WEBHOOK_EXAMPLES.md) for detailed examples including TypeScript client code, Flask webhook handlers, and failure handling.
 
 ---
 
@@ -975,7 +975,7 @@ You can override the default `config.yml`.
     ```
     *(Note: Ensure `my-custom-config.yml` is in the same directory as `docker-compose.yml`)*
 
-> 💡 When mounting, your custom file *completely replaces* the default one. Ensure it's a valid and complete configuration.
+> When mounting, your custom file *completely replaces* the default one. Ensure it's a valid and complete configuration.
 
 ### Configuration Recommendations
 
@@ -985,17 +985,17 @@ You can override the default `config.yml`.
    - Set up proper rate limiting to protect your server
    - Consider your environment before enabling HTTPS redirect
 
-2. **Resource Management** 💻
+2. **Resource Management**
    - Adjust memory_threshold_percent based on available RAM
    - Set timeouts according to your content size and network conditions
    - Use Redis for rate limiting in multi-container setups
 
-3. **Monitoring** 📊
+3. **Monitoring**
    - Enable Prometheus if you need metrics
    - Set DEBUG logging in development, INFO in production
    - Regular health check monitoring is crucial
 
-4. **Performance Tuning** ⚡
+4. **Performance Tuning**
    - Start with conservative rate limiter delays
    - Increase batch_process timeout for large content
    - Adjust stream_init timeout based on initial response times
@@ -1004,9 +1004,9 @@ You can override the default `config.yml`.
 
 We're here to help you succeed with Crawl4AI! Here's how to get support:
 
-- 📖 Check our [full documentation](https://docs.crawl4ai.com)
-- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
-- 💬 Join our [Discord community](https://discord.gg/crawl4ai)
+- Check our [full documentation](https://docs.crawl4ai.com)
+- Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
+- Join our [Discord community](https://discord.gg/crawl4ai)
 - ⭐ Star us on GitHub to show support!
 
 ## Summary
@@ -1030,4 +1030,4 @@ Remember, the examples in the `examples` folder are your friends - they show rea
 
 Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀
 
-Happy crawling! 🕷️
+Happy crawling!
diff --git a/docs/blog/release-v0.7.0.md b/docs/blog/release-v0.7.0.md
index 1b474a998..19ade1ae0 100644
--- a/docs/blog/release-v0.7.0.md
+++ b/docs/blog/release-v0.7.0.md
@@ -6,7 +6,7 @@
 
 Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.
 
-## 🎯 What's New at a Glance
+## What's New at a Glance
 
 - **Adaptive Crawling**: Your crawler now learns and adapts to website patterns
 - **Virtual Scroll Support**: Complete content extraction from infinite scroll pages
@@ -14,7 +14,7 @@ Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This rel
 - **Async URL Seeder**: Discover thousands of URLs in seconds with intelligent filtering
 - **Performance Optimizations**: Significant speed and memory improvements
 
-## 🧠 Adaptive Crawling: Intelligence Through Pattern Learning
+## Adaptive Crawling: Intelligence Through Pattern Learning
 
 **The Problem:** Websites change. Class names shift. IDs disappear. 
Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.
 
@@ -72,7 +72,7 @@ asyncio.run(main())
 - **Research Data Collection**: Build robust academic datasets that survive website redesigns
 - **Reduced Maintenance**: Cut selector update time by 80% for frequently-changing sites
 
-## 🌊 Virtual Scroll: Complete Content Capture
+## Virtual Scroll: Complete Content Capture
 
 **The Problem:** Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.
 
@@ -212,7 +212,7 @@ asyncio.run(main())
 - **Content Discovery**: Build topic-focused crawlers that stay on track
 - **SEO Audits**: Identify and prioritize high-value internal linking opportunities
 
-## 🎣 Async URL Seeder: Automated URL Discovery at Scale
+## Async URL Seeder: Automated URL Discovery at Scale
 
 **The Problem:** You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.
 
@@ -264,7 +264,7 @@ asyncio.run(main())
 - **SEO Audits**: Find every indexable page with content scoring
 - **Content Archival**: Ensure no content is left behind during site migrations
 
-## ⚡ Performance Optimizations
+## Performance Optimizations
 
 This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.
 
@@ -293,7 +293,7 @@ for url in urls:
 - **Concurrent Crawls**: Handle 5x more parallel requests
 
 
-## 🔧 Important Changes
+## Important Changes
 
 ### Breaking Changes
 - `link_extractor` renamed to `link_preview` (better reflects functionality)
@@ -336,7 +336,7 @@ Questions? Issues? 
I'm always listening: - Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN) - Twitter: [@unclecode](https://x.com/unclecode) -Happy crawling! ๐Ÿ•ท๏ธ +Happy crawling! --- diff --git a/docs/blog/release-v0.7.1.md b/docs/blog/release-v0.7.1.md index d5bfdaec9..517f1e14b 100644 --- a/docs/blog/release-v0.7.1.md +++ b/docs/blog/release-v0.7.1.md @@ -1,4 +1,4 @@ -# ๐Ÿ› ๏ธ Crawl4AI v0.7.1: Minor Cleanup Update +# Crawl4AI v0.7.1: Minor Cleanup Update *July 17, 2025 โ€ข 2 min read* @@ -6,13 +6,13 @@ A small maintenance release that removes unused code and improves documentation. -## ๐ŸŽฏ What's Changed +## What's Changed - **Removed unused StealthConfig** from `crawl4ai/browser_manager.py` - **Updated documentation** with better examples and parameter explanations - **Fixed virtual scroll configuration** examples in docs -## ๐Ÿงน Code Cleanup +## Code Cleanup Removed unused `StealthConfig` import and configuration that wasn't being used anywhere in the codebase. The project uses its own custom stealth implementation through JavaScript injection instead. @@ -22,7 +22,7 @@ from playwright_stealth import StealthConfig stealth_config = StealthConfig(...) # This was never used ``` -## ๐Ÿ“– Documentation Updates +## Documentation Updates - Fixed adaptive crawling parameter examples - Updated session management documentation diff --git a/docs/blog/release-v0.7.3.md b/docs/blog/release-v0.7.3.md index 59b2a9be3..5687eeed6 100644 --- a/docs/blog/release-v0.7.3.md +++ b/docs/blog/release-v0.7.3.md @@ -6,18 +6,18 @@ Today I'm releasing Crawl4AI v0.7.3โ€”the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready. 
-## ๐ŸŽฏ What's New at a Glance +## What's New at a Glance -- **๐Ÿ•ต๏ธ Undetected Browser Support**: Stealth mode for bypassing bot detection systems -- **๐ŸŽจ Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch -- **๐Ÿณ Flexible Docker LLM Providers**: Configure LLM providers via environment variables -- **๐Ÿง  Memory Monitoring**: Enhanced memory usage tracking and optimization tools -- **๐Ÿ“Š Enhanced Table Extraction**: Improved table access and DataFrame conversion -- **๐Ÿ’ฐ GitHub Sponsors**: 4-tier sponsorship system with custom arrangements -- **๐Ÿ”ง Bug Fixes**: Resolved several critical issues for better stability -- **๐Ÿ“š Documentation Updates**: Clearer examples and improved API documentation +- ** Undetected Browser Support**: Stealth mode for bypassing bot detection systems +- ** Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch +- ** Flexible Docker LLM Providers**: Configure LLM providers via environment variables +- ** Memory Monitoring**: Enhanced memory usage tracking and optimization tools +- ** Enhanced Table Extraction**: Improved table access and DataFrame conversion +- ** GitHub Sponsors**: 4-tier sponsorship system with custom arrangements +- ** Bug Fixes**: Resolved several critical issues for better stability +- ** Documentation Updates**: Clearer examples and improved API documentation -## ๐ŸŽจ Multi-URL Configurations: One Size Doesn't Fit All +## Multi-URL Configurations: One Size Doesn't Fit All **The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handlingโ€”caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic. 
@@ -82,7 +82,7 @@ async with AsyncWebCrawler() as crawler: - **Reduced Complexity**: No more if/else forests in your extraction code - **Better Performance**: Each URL gets exactly the processing it needs -## ๐Ÿ•ต๏ธ Undetected Browser Support: Stealth Mode Activated +## Undetected Browser Support: Stealth Mode Activated **The Problem:** Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content. @@ -161,7 +161,7 @@ result = await crawler.arun("https://bot-protected-site.com", config=config) - **Content Aggregation**: Collect news and social media despite anti-bot measures - **Compliance Testing**: Verify your own site's bot protection effectiveness -## ๐Ÿง  Memory Monitoring & Optimization +## Memory Monitoring & Optimization **The Problem:** Long-running crawl sessions consuming excessive memory, especially when processing large batches or heavy JavaScript sites. @@ -203,7 +203,7 @@ async with AsyncWebCrawler() as crawler: - **Performance Tuning**: Identify memory bottlenecks and optimization opportunities - **Scalability Planning**: Understand memory patterns for horizontal scaling -## ๐Ÿ“Š Enhanced Table Extraction +## Enhanced Table Extraction **The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear. @@ -240,15 +240,15 @@ if result.tables: - **ETL Pipelines**: Cleaner integration with data processing workflows - **Reporting**: Simplified table extraction for automated reporting systems -## ๐Ÿ’ฐ Community Support: GitHub Sponsors +## Community Support: GitHub Sponsors I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community. 
**Sponsorship Tiers:** -- **๐ŸŒฑ Supporter ($5/month)**: Community support + early feature previews +- ** Supporter ($5/month)**: Community support + early feature previews - **๐Ÿš€ Professional ($25/month)**: Priority support + beta access -- **๐Ÿข Business ($100/month)**: Direct consultation + custom integrations -- **๐Ÿ›๏ธ Enterprise ($500/month)**: Dedicated support + feature development +- ** Business ($100/month)**: Direct consultation + custom integrations +- ** Enterprise ($500/month)**: Dedicated support + feature development **Why Sponsor?** - Ensure continuous development and maintenance @@ -258,7 +258,7 @@ I've launched GitHub Sponsors to ensure Crawl4AI's continued development and sup [**Become a Sponsor โ†’**](https://github.com/sponsors/unclecode) -## ๐Ÿณ Docker: Flexible LLM Provider Configuration +## Docker: Flexible LLM Provider Configuration **The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images. 
@@ -311,7 +311,7 @@ response = requests.post("http://localhost:11235/crawl", json={ - **Development Flexibility**: Test locally with one provider, deploy with another - **Secure Configuration**: Keep API keys in `.llm.env` file, not in commands -## ๐Ÿ”ง Bug Fixes & Improvements +## Bug Fixes & Improvements This release includes several important bug fixes that improve stability and reliability: @@ -321,7 +321,7 @@ This release includes several important bug fixes that improve stability and rel - **Table Extraction**: Improved table detection and extraction accuracy - **Error Handling**: Better error messages and recovery from network failures -## ๐Ÿ“š Documentation Enhancements +## Documentation Enhancements Based on community feedback, we've updated: - Clearer examples for multi-URL configuration @@ -330,11 +330,11 @@ Based on community feedback, we've updated: - Added real-world URLs in examples for better understanding - New comprehensive demo showcasing all v0.7.3 features -## ๐Ÿ™ Acknowledgments +## Acknowledgments Thanks to our contributors and the entire community for feedback and bug reports. -## ๐Ÿ“š Resources +## Resources - [Full Documentation](https://docs.crawl4ai.com) - [GitHub Repository](https://github.com/unclecode/crawl4ai) @@ -345,6 +345,6 @@ Thanks to our contributors and the entire community for feedback and bug reports *Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deploymentโ€”they're game changers!* -**Happy Crawling! ๐Ÿ•ท๏ธ** +**Happy Crawling! ** *- The Crawl4AI Team* \ No newline at end of file diff --git a/docs/blog/release-v0.7.4.md b/docs/blog/release-v0.7.4.md index 72cfe3ae1..32118222d 100644 --- a/docs/blog/release-v0.7.4.md +++ b/docs/blog/release-v0.7.4.md @@ -6,15 +6,15 @@ Today I'm releasing Crawl4AI v0.7.4โ€”the Intelligent Table Extraction & Performance Update. 
This release introduces revolutionary LLM-powered table extraction with intelligent chunking, significant performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads. -## ๐ŸŽฏ What's New at a Glance +## What's New at a Glance - **๐Ÿš€ LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables -- **โšก Enhanced Concurrency**: True concurrency improvements for fast-completing tasks in batch operations -- **๐Ÿ”ง Browser Manager Fixes**: Resolved race conditions in concurrent page creation -- **โŒจ๏ธ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms +- ** Enhanced Concurrency**: True concurrency improvements for fast-completing tasks in batch operations +- ** Browser Manager Fixes**: Resolved race conditions in concurrent page creation +- **โŒจ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms - **๐Ÿ”— Advanced URL Processing**: Better handling of raw URLs and base tag link resolution - **๐Ÿ›ก๏ธ Enhanced Proxy Support**: Flexible proxy configuration with dict and string formats -- **๐Ÿณ Docker Improvements**: Better API handling and raw HTML support +- ** Docker Improvements**: Better API handling and raw HTML support ## ๐Ÿš€ LLMTableExtraction: Revolutionary Table Processing @@ -123,7 +123,7 @@ for table in result.tables: - **Government Data**: Extract census data, statistical tables from official sources - **Competitive Intelligence**: Process competitor pricing and feature tables -## โšก Enhanced Concurrency: True Performance Gains +## Enhanced Concurrency: True Performance Gains **The Problem:** The `arun_many()` method wasn't achieving true concurrency for fast-completing tasks, leading to sequential processing bottlenecks in batch operations. 
@@ -157,7 +157,7 @@ async with AsyncWebCrawler() as crawler: - **Monitoring Systems**: Faster health checks and status page monitoring - **Data Aggregation**: Improved performance for real-time data collection -## ๐Ÿ”ง Critical Stability Fixes +## Critical Stability Fixes ### Browser Manager Race Condition Resolution @@ -232,7 +232,7 @@ async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun("https://httpbin.org/ip") ``` -## ๐Ÿณ Docker & Infrastructure Improvements +## Docker & Infrastructure Improvements This release includes several Docker and infrastructure improvements: @@ -241,7 +241,7 @@ This release includes several Docker and infrastructure improvements: - **Documentation Updates**: Comprehensive Docker deployment examples - **Test Coverage**: Expanded test suite with better coverage -## ๐Ÿ“š Documentation & Examples +## Documentation & Examples Enhanced documentation includes: @@ -250,11 +250,11 @@ Enhanced documentation includes: - **Docker Deployment**: Complete deployment guide with examples - **Performance Optimization**: Guidelines for concurrent crawling -## ๐Ÿ™ Acknowledgments +## Acknowledgments Thanks to our contributors and community for feedback, bug reports, and feature requests that made this release possible. -## ๐Ÿ“š Resources +## Resources - [Full Documentation](https://docs.crawl4ai.com) - [GitHub Repository](https://github.com/unclecode/crawl4ai) @@ -265,6 +265,6 @@ Thanks to our contributors and community for feedback, bug reports, and feature *Crawl4AI v0.7.4 delivers intelligent table extraction and significant performance improvements. The new LLMTableExtraction strategy handles complex tables that were previously impossible to process, while concurrency improvements make batch operations 3-4x faster. Try the intelligent table extractionโ€”it's a game changer for data extraction workflows!* -**Happy Crawling! ๐Ÿ•ท๏ธ** +**Happy Crawling! 
** *- The Crawl4AI Team* \ No newline at end of file diff --git a/docs/blog/release-v0.7.5.md b/docs/blog/release-v0.7.5.md index 977d2fd9e..abf552570 100644 --- a/docs/blog/release-v0.7.5.md +++ b/docs/blog/release-v0.7.5.md @@ -6,7 +6,7 @@ Today I'm releasing Crawl4AI v0.7.5โ€”focused on extensibility and security. This update introduces the Docker Hooks System for pipeline customization, enhanced LLM integration, and important security improvements. -## ๐ŸŽฏ What's New at a Glance +## What's New at a Glance - **Docker Hooks System**: Custom Python functions at key pipeline points with function-based API - **Function-Based Hooks**: New `hooks_to_string()` utility with Docker client auto-conversion @@ -15,7 +15,7 @@ Today I'm releasing Crawl4AI v0.7.5โ€”focused on extensibility and security. Thi - **Bug Fixes**: Resolved multiple community-reported issues - **Improved Docker Error Handling**: Better debugging and reliability -## ๐Ÿ”ง Docker Hooks System: Pipeline Customization +## Docker Hooks System: Pipeline Customization Every scraping project needs custom logicโ€”authentication, performance optimization, content processing. Traditional solutions require forking or complex workarounds. Docker Hooks let you inject custom Python functions at 8 key points in the crawling pipeline. @@ -249,7 +249,7 @@ async def test_https_preservation(): print(f" โ†’ {link}") ``` -## ๐Ÿ› ๏ธ Bug Fixes and Improvements +## Bug Fixes and Improvements ### Major Fixes - **URL Processing**: Fixed '+' sign preservation in query parameters (#1332) @@ -286,7 +286,7 @@ browser_config = BrowserConfig( ) ``` -## ๐Ÿ”„ Breaking Changes +## Breaking Changes 1. **Python 3.10+ Required**: Upgrade from Python 3.9 2. 
**Proxy Parameter Deprecated**: Use new `proxy_config` structure @@ -310,9 +310,9 @@ python docs/releases_review/demo_v0.7.5.py ``` **Resources:** -- ๐Ÿ“– Documentation: [docs.crawl4ai.com](https://docs.crawl4ai.com) -- ๐Ÿ™ GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai) -- ๐Ÿ’ฌ Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN) -- ๐Ÿฆ Twitter: [@unclecode](https://x.com/unclecode) +- Documentation: [docs.crawl4ai.com](https://docs.crawl4ai.com) +- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai) +- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN) +- Twitter: [@unclecode](https://x.com/unclecode) -Happy crawling! ๐Ÿ•ท๏ธ +Happy crawling! diff --git a/docs/blog/release-v0.7.6.md b/docs/blog/release-v0.7.6.md index e27d19cc9..a33866c28 100644 --- a/docs/blog/release-v0.7.6.md +++ b/docs/blog/release-v0.7.6.md @@ -4,7 +4,7 @@ I'm excited to announce Crawl4AI v0.7.6, featuring a complete webhook infrastructure for the Docker job queue API! This release eliminates polling and brings real-time notifications to both crawling and LLM extraction workflows. -## ๐ŸŽฏ What's New +## What's New ### Webhook Support for Docker Job Queue API @@ -158,20 +158,20 @@ def handle_webhook(): app.run(port=8080) ``` -## ๐Ÿ“Š Performance Improvements +## Performance Improvements - **Reduced Server Load**: Eliminates constant polling requests - **Lower Latency**: Instant notification vs. 
polling interval delay - **Better Resource Usage**: Frees up client connections while jobs run in background - **Scalable Architecture**: Handles high-volume crawling workflows efficiently -## ๐Ÿ› Bug Fixes +## Bug Fixes - Fixed webhook configuration serialization for Pydantic HttpUrl fields - Improved error handling in webhook delivery service - Enhanced Redis task storage for webhook config persistence -## ๐ŸŒ Expected Real-World Impact +## Expected Real-World Impact ### For Web Scraping Workflows - **Reduced Costs**: Less API calls = lower bandwidth and server costs @@ -188,7 +188,7 @@ app.run(port=8080) - **Decoupling**: Decouple job submission from result processing - **Reliability**: Automatic retries ensure webhooks are delivered -## ๐Ÿ”„ Breaking Changes +## Breaking Changes **None!** This release is fully backward compatible. @@ -196,7 +196,7 @@ app.run(port=8080) - Existing code continues to work without modification - Polling is still supported for jobs without webhook config -## ๐Ÿ“š Documentation +## Documentation ### New Documentation - **[WEBHOOK_EXAMPLES.md](../deploy/docker/WEBHOOK_EXAMPLES.md)** - Comprehensive webhook usage guide @@ -206,7 +206,7 @@ app.run(port=8080) - **[Docker README](../deploy/docker/README.md)** - Added webhook sections - API documentation with webhook examples -## ๐Ÿ› ๏ธ Migration Guide +## Migration Guide No migration needed! Webhooks are opt-in: @@ -231,7 +231,7 @@ payload = { } ``` -## ๐Ÿ”ง Configuration +## Configuration ### Global Webhook Configuration (config.yml) @@ -274,7 +274,7 @@ docker run -d \ pip install --upgrade crawl4ai ``` -## ๐Ÿ’ก Pro Tips +## Pro Tips 1. **Use notification-only mode** for large results - fetch data separately to avoid large webhook payloads 2. **Set custom headers** for webhook authentication and request tracking @@ -282,7 +282,7 @@ pip install --upgrade crawl4ai 4. **Implement idempotent webhook handlers** - same webhook may be delivered multiple times on retry 5. 
**Use structured schemas** with LLM extraction for predictable webhook data -## ๐ŸŽฌ Demo +## Demo Try the release demo: @@ -297,11 +297,11 @@ This comprehensive demo showcases: - Webhook retry mechanism - Real-time webhook receiver -## ๐Ÿ™ Acknowledgments +## Acknowledgments Thank you to the community for the feedback that shaped this feature! Special thanks to everyone who requested webhook support for asynchronous job processing. -## ๐Ÿ“ž Support +## Support - **Documentation**: https://docs.crawl4ai.com - **GitHub Issues**: https://github.com/unclecode/crawl4ai/issues @@ -309,6 +309,6 @@ Thank you to the community for the feedback that shaped this feature! Special th --- -**Happy crawling with webhooks!** ๐Ÿ•ท๏ธ๐Ÿช +**Happy crawling with webhooks!** *- unclecode* diff --git a/docs/blog/release-v0.7.7.md b/docs/blog/release-v0.7.7.md index 190cd374b..b1ffb59b6 100644 --- a/docs/blog/release-v0.7.7.md +++ b/docs/blog/release-v0.7.7.md @@ -6,19 +6,19 @@ Today I'm releasing Crawl4AI v0.7.7โ€”the Self-Hosting & Monitoring Update. This release transforms Crawl4AI Docker from a simple containerized crawler into a complete self-hosting platform with enterprise-grade real-time monitoring, full operational transparency, and production-ready observability. 
-## ๐ŸŽฏ What's New at a Glance +## What's New at a Glance -- **๐Ÿ“Š Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status -- **๐Ÿ”Œ Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data -- **โšก WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards -- **๐ŸŽฎ Control Actions**: Manual browser management (kill, restart, cleanup) -- **๐Ÿ”ฅ Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion -- **๐Ÿงน Janitor Cleanup System**: Automatic resource management with event logging -- **๐Ÿ“ˆ Production Metrics**: 6 critical metrics for operational excellence -- **๐Ÿญ Integration Ready**: Prometheus, alerting, and log aggregation examples -- **๐Ÿ› Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more +- ** Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status +- ** Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data +- ** WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards +- ** Control Actions**: Manual browser management (kill, restart, cleanup) +- ** Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion +- ** Janitor Cleanup System**: Automatic resource management with event logging +- ** Production Metrics**: 6 critical metrics for operational excellence +- ** Integration Ready**: Prometheus, alerting, and log aggregation examples +- ** Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more -## ๐Ÿ“Š Real-time Monitoring Dashboard: Complete Visibility +## Real-time Monitoring Dashboard: Complete Visibility **The Problem:** Running Crawl4AI in Docker was like flying blind. Users had no visibility into what was happening inside the containerโ€”memory usage, active requests, browser pools, or errors. 
Troubleshooting required checking logs, and there was no way to monitor performance or manually intervene when issues occurred. @@ -29,10 +29,10 @@ Today I'm releasing Crawl4AI v0.7.7โ€”the Self-Hosting & Monitoring Update. This Before v0.7.7, Docker was just a containerized crawler. After v0.7.7, it's a complete self-hosting platform that gives you: - **๐Ÿ”’ Data Privacy**: Your data never leaves your infrastructure -- **๐Ÿ’ฐ Cost Control**: No per-request pricing or rate limits -- **๐ŸŽฏ Full Customization**: Complete control over configurations and strategies -- **๐Ÿ“Š Complete Transparency**: Real-time visibility into every aspect -- **โšก Performance**: Direct access without network overhead +- ** Cost Control**: No per-request pricing or rate limits +- ** Full Customization**: Complete control over configurations and strategies +- ** Complete Transparency**: Real-time visibility into every aspect +- ** Performance**: Direct access without network overhead - **๐Ÿ›ก๏ธ Enterprise Security**: Keep workflows behind your firewall ### Interactive Monitoring Dashboard @@ -47,7 +47,7 @@ Access the dashboard at `http://localhost:11235/dashboard` to see: The dashboard updates every 2 seconds via WebSocket, giving you live visibility into your crawling operations. -## ๐Ÿ”Œ Monitor API: Programmatic Access +## Monitor API: Programmatic Access **The Problem:** Monitoring dashboards are great for humans, but automation and integration require programmatic access. @@ -151,7 +151,7 @@ The Monitor API includes these endpoints: - `POST /monitor/actions/restart_browser` - Restart browser - `POST /monitor/stats/reset` - Reset accumulated statistics -## โšก WebSocket Streaming: Real-time Updates +## WebSocket Streaming: Real-time Updates **The Problem:** Polling the API every few seconds wastes resources and adds latency. Real-time dashboards need instant updates. 
@@ -193,7 +193,7 @@ asyncio.run(monitor_realtime()) - **Integration**: Feed live data into monitoring tools like Grafana - **Automation**: React to events in real-time without polling -## ๐Ÿ”ฅ Smart Browser Pool: 3-Tier Architecture +## Smart Browser Pool: 3-Tier Architecture **The Problem:** Creating a new browser for every request is slow and memory-intensive. Traditional browser pools are static and inefficient. @@ -246,9 +246,9 @@ asyncio.run(demonstrate_browser_pool()) **Pool Tiers:** -- **๐Ÿ”ฅ Permanent Browser**: Always-on, default configuration, instant response -- **โ™จ๏ธ Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access -- **โ„๏ธ Cold Pool**: On-demand browsers for variant configs, cleaned up when idle +- ** Permanent Browser**: Always-on, default configuration, instant response +- ** Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access +- ** Cold Pool**: On-demand browsers for variant configs, cleaned up when idle **Expected Real-World Impact:** - **Memory Efficiency**: 10x reduction in memory usage vs creating browsers per request @@ -256,7 +256,7 @@ asyncio.run(demonstrate_browser_pool()) - **Automatic Optimization**: Pool adapts to your usage patterns - **Resource Management**: Janitor automatically cleans up idle browsers -## ๐Ÿงน Janitor System: Automatic Cleanup +## Janitor System: Automatic Cleanup **The Problem:** Long-running crawlers accumulate idle browsers and consume memory over time. @@ -278,7 +278,7 @@ async def monitor_janitor_activity(): # 2025-11-14 10:20:00: Hot pool browser promoted (10 requests) ``` -## ๐ŸŽฎ Control Actions: Manual Management +## Control Actions: Manual Management **The Problem:** Sometimes you need to manually interveneโ€”kill a stuck browser, force cleanup, or restart resources. 
@@ -320,7 +320,7 @@ async def reset_stats(): print("๐Ÿ“Š Statistics reset for fresh monitoring") ``` -## ๐Ÿ“ˆ Production Integration Patterns +## Production Integration Patterns ### Prometheus Integration @@ -417,7 +417,7 @@ CRITICAL_METRICS = { } ``` -## ๐Ÿ› Critical Bug Fixes +## Critical Bug Fixes This release includes significant bug fixes that improve stability and performance: @@ -501,7 +501,7 @@ async with AsyncWebCrawler() as crawler: - **LLMConfig Documentation**: Fixed casing and variable name consistency (issue #1551) - **Python Version**: Dropped Python 3.9 support, now requires Python >=3.10 -## ๐Ÿ“Š Expected Real-World Impact +## Expected Real-World Impact ### For DevOps & Infrastructure Teams - **Full Visibility**: Know exactly what's happening inside your crawling infrastructure @@ -521,7 +521,7 @@ async with AsyncWebCrawler() as crawler: - **Troubleshooting**: Quickly identify and fix issues - **Learning**: See exactly how the browser pool works -## ๐Ÿ”„ Breaking Changes +## Breaking Changes **None!** This release is fully backward compatible. @@ -562,7 +562,7 @@ pip install --upgrade crawl4ai pip install crawl4ai==0.7.7 ``` -## ๐ŸŽฌ Try the Demo +## Try the Demo Run the comprehensive demo that showcases all monitoring features: @@ -580,7 +580,7 @@ python docs/releases_review/demo_v0.7.7.py 7. Production metrics and alerting patterns 8. Self-hosting value proposition -## ๐Ÿ“š Documentation +## Documentation ### New Documentation - **[Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/)** - Complete self-hosting documentation with monitoring @@ -592,7 +592,7 @@ python docs/releases_review/demo_v0.7.7.py - Production integration patterns - WebSocket streaming examples -## ๐Ÿ’ก Pro Tips +## Pro Tips 1. **Start with the dashboard** - Visit `/dashboard` to get familiar with the monitoring system 2. 
**Track the 6 key metrics** - Memory, success rate, latency, reuse rate, browser count, errors @@ -603,24 +603,24 @@ python docs/releases_review/demo_v0.7.7.py 7. **Check janitor logs** - Understand automatic cleanup patterns 8. **Use control actions judiciously** - Manual interventions are for exceptional cases -## ๐Ÿ™ Acknowledgments +## Acknowledgments Thank you to our community for the feedback, bug reports, and feature requests that shaped this release. Special thanks to everyone who contributed to the issues that were fixed in this version. The monitoring system was built based on real user needs for production deployments, and your input made it comprehensive and practical. -## ๐Ÿ“ž Support & Resources +## Support & Resources -- **๐Ÿ“– Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com) -- **๐Ÿ™ GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai) -- **๐Ÿ’ฌ Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN) -- **๐Ÿฆ Twitter**: [@unclecode](https://x.com/unclecode) -- **๐Ÿ“Š Dashboard**: `http://localhost:11235/dashboard` (when running) +- ** Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com) +- ** GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai) +- ** Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN) +- ** Twitter**: [@unclecode](https://x.com/unclecode) +- ** Dashboard**: `http://localhost:11235/dashboard` (when running) --- **Crawl4AI v0.7.7 delivers complete self-hosting with enterprise-grade monitoring. You now have full visibility and control over your web crawling infrastructure. The monitoring dashboard, comprehensive API, and WebSocket streaming give you everything needed for production deployments. 
Try the self-hosting platformโ€”it's a game changer for operational excellence!** -**Happy crawling with full visibility!** ๐Ÿ•ท๏ธ๐Ÿ“Š +**Happy crawling with full visibility!** *- unclecode* diff --git a/docs/deprecated/docker-deployment.md b/docs/deprecated/docker-deployment.md index db8446e32..0e76774dc 100644 --- a/docs/deprecated/docker-deployment.md +++ b/docs/deprecated/docker-deployment.md @@ -1,11 +1,11 @@ -# ๐Ÿณ Using Docker (Legacy) +# Using Docker (Legacy) Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository. ---
-🐳 Option 1: Docker Hub (Recommended) + Option 1: Docker Hub (Recommended) Choose the appropriate image based on your platform and needs: @@ -64,7 +64,7 @@ Note: Due to hardware constraints, only the basic version is recommended for Ras
-🐳 Option 2: Build from Repository + Option 2: Build from Repository Build the image locally based on your platform: @@ -117,7 +117,7 @@ curl http://localhost:11235/health
-🐳 Option 3: Using Docker Compose + Option 3: Using Docker Compose Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations. @@ -178,7 +178,7 @@ Deploy your own instance of Crawl4AI with one click: [![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge) -> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation. +> **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation. The deploy will: - Set up a Docker container with Crawl4AI diff --git a/docs/examples/c4a_script/amazon_example/README.md b/docs/examples/c4a_script/amazon_example/README.md index 234d603a9..134b86fb3 100644 --- a/docs/examples/c4a_script/amazon_example/README.md +++ b/docs/examples/c4a_script/amazon_example/README.md @@ -2,7 +2,7 @@ A real-world demonstration of Crawl4AI's multi-step crawling with LLM-generated automation scripts. -## 🎯 What This Example Shows +## What This Example Shows This example demonstrates advanced Crawl4AI features: - **LLM-Generated Scripts**: Automatically create C4A-Script from HTML snippets @@ -31,7 +31,7 @@ Products are extracted with: - Sponsored/Small Business badges - Direct product URLs -## 📁 Files +## Files - `amazon_r2d2_search.py` - Main example script - `header.html` - Amazon search bar HTML (provided) @@ -42,7 +42,7 @@ Products are extracted with: - `extracted_products.json` - Final scraped data - `search_results_screenshot.png` - Visual proof of results -## 🏃 Running the Example +## Running the Example 1.
**Prerequisites** ```bash @@ -65,7 +65,7 @@ Products are extracted with: - Extracts all products - Saves results to JSON -## ๐Ÿ“Š Sample Output +## Sample Output ```json [ @@ -83,7 +83,7 @@ Products are extracted with: ] ``` -## ๐Ÿ” Key Features Demonstrated +## Key Features Demonstrated ### Session Persistence ```python @@ -116,7 +116,7 @@ schema = { } ``` -## ๐Ÿ› ๏ธ Customization +## Customization ### Search Different Products Change the search term in the script generation: @@ -144,21 +144,21 @@ Adapt the approach for other e-commerce sites by: 2. Adjusting the search goals 3. Updating the extraction schema -## ๐ŸŽ“ Learning Points +## Learning Points 1. **No Manual Scripting**: LLM generates all automation code 2. **Session Management**: Maintain state across page navigations 3. **Robust Extraction**: Handle dynamic content and multiple products 4. **Error Handling**: Graceful fallbacks if generation fails -## ๐Ÿ› Troubleshooting +## Troubleshooting - **"No products found"**: Check if Amazon's HTML structure changed - **"Script generation failed"**: Ensure LLM API key is configured - **"Page timeout"**: Increase wait times in the config - **"Session lost"**: Ensure same session_id is used consistently -## ๐Ÿ“š Next Steps +## Next Steps - Try searching for different products - Add pagination to get more results diff --git a/docs/examples/c4a_script/tutorial/README.md b/docs/examples/c4a_script/tutorial/README.md index 2d6940bb8..890b60cbf 100644 --- a/docs/examples/c4a_script/tutorial/README.md +++ b/docs/examples/c4a_script/tutorial/README.md @@ -31,7 +31,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip http://localhost:8000 ``` -**๐ŸŒ Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo) +** Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo) ### 2. 
Try Your First Script @@ -43,16 +43,16 @@ IF (EXISTS `.cookie-banner`) THEN CLICK `.accept` CLICK `#start-tutorial` ``` -## ๐ŸŽฏ What You'll Learn +## What You'll Learn ### Core Features -- **๐Ÿ“ Text Editor**: Write C4A-Script with syntax highlighting -- **๐Ÿงฉ Visual Editor**: Build scripts using drag-and-drop Blockly interface -- **๐ŸŽฌ Recording Mode**: Capture browser actions and auto-generate scripts -- **โšก Live Execution**: Run scripts in real-time with instant feedback -- **๐Ÿ“Š Timeline View**: Visualize and edit automation steps +- ** Text Editor**: Write C4A-Script with syntax highlighting +- ** Visual Editor**: Build scripts using drag-and-drop Blockly interface +- ** Recording Mode**: Capture browser actions and auto-generate scripts +- ** Live Execution**: Run scripts in real-time with instant feedback +- ** Timeline View**: Visualize and edit automation steps -## ๐Ÿ“š Tutorial Content +## Tutorial Content ### Basic Commands - **Navigation**: `GO url` @@ -71,7 +71,7 @@ CLICK `#start-tutorial` - **Variables**: `SET name = "value"` - **Complex selectors**: CSS selectors in backticks -## ๐ŸŽฎ Interactive Playground Features +## Interactive Playground Features The tutorial includes a fully interactive web app with: @@ -100,7 +100,7 @@ The tutorial includes a fully interactive web app with: - Search functionality - Export options -## ๐Ÿ› ๏ธ Tutorial Features +## Tutorial Features ### Live Code Editor - Syntax highlighting @@ -125,7 +125,7 @@ Load pre-written examples demonstrating: - Multi-step form completion - Complex interaction sequences -## ๐Ÿ“– Tutorial Sections +## Tutorial Sections ### 1. Getting Started Learn basic commands and syntax: @@ -166,7 +166,7 @@ SET password = "pass123" login ``` -## ๐ŸŽฏ Practice Challenges +## Practice Challenges ### Challenge 1: Cookie & Popups Handle the cookie banner and newsletter popup that appear on page load. @@ -183,7 +183,7 @@ Complete the entire multi-step survey form. 
 ### Challenge 5: Full Workflow
 Create a script that logs in, browses products, and exports data.
 
-## 💡 Tips & Tricks
+## Tips & Tricks
 
 ### 1. Use Specific Selectors
 ```c4a
@@ -219,7 +219,7 @@ ENDPROC
 handle_popups
 ```
 
-## 🔧 Troubleshooting
+## Troubleshooting
 
 ### Common Issues
 
@@ -257,7 +257,7 @@ async with AsyncWebCrawler() as crawler:
     result = await crawler.arun(config=config)
 ```
 
-## 📁 Example Scripts
+## Example Scripts
 
 Check the `scripts/` folder for complete examples:
 - `01-basic-interaction.c4a` - Getting started
@@ -266,7 +266,7 @@ Check the `scripts/` folder for complete examples:
 - `04-multi-step-form.c4a` - Complex forms
 - `05-complex-workflow.c4a` - Full automation
 
-## 🏗️ Developer Guide
+## Developer Guide
 
 ### Project Architecture
 
@@ -336,7 +336,7 @@ THREADED = True
 - `POST /execute` - Script execution endpoint
 - `GET /examples/