diff --git a/CHANGELOG.md b/CHANGELOG.md
index c09b7d2e1..fbd6a3ff2 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -166,17 +166,17 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
### Added
- **๐ init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions
-- **๐ CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
-- **๐พ Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
-- **๐ PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
-- **๐ธ Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
+- **CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
+- **Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
+- **PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
+- **Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
- **๐ base_url Parameter**: Proper URL resolution for raw: HTML processing
-- **โก Prefetch Mode**: Two-phase deep crawling with fast link extraction
-- **๐ Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
-- **๐ HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
-- **๐ฅ๏ธ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
-- **๐ Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
-- **๐ Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines
+- **Prefetch Mode**: Two-phase deep crawling with fast link extraction
+- **Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
+- **HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
+- **Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
+- **Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
+- **Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines
### Fixed
- **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`)
@@ -201,7 +201,7 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
## [0.7.3] - 2025-08-09
### Added
-- **๐ต๏ธ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
+- **Undetected Browser Support**: New browser adapter pattern with stealth capabilities
- `browser_adapter.py` with undetected Chrome integration
- Bypass sophisticated bot detection systems (Cloudflare, Akamai, custom solutions)
- Support for headless stealth mode with anti-detection techniques
@@ -209,7 +209,7 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
- Comprehensive examples for anti-bot strategies and stealth crawling
- Full documentation guide for undetected browser usage
-- **๐จ Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
+- **Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
- Different crawling strategies for different URL patterns in a single batch
- Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
- Lambda function matchers for complex URL logic
@@ -217,7 +217,7 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
- Fallback configuration support when no patterns match
- First-match-wins configuration selection with optional fallback
-- **๐ง Memory Monitoring & Optimization**: Comprehensive memory usage tracking
+- **Memory Monitoring & Optimization**: Comprehensive memory usage tracking
- New `memory_utils.py` module for memory monitoring and optimization
- Real-time memory usage tracking during crawl sessions
- Memory leak detection and reporting
@@ -225,21 +225,21 @@ Song Binglin (q1uf3ng), by111 (August829), Jeongbean Jeon, wulonchia, secsys_cod
- Peak memory usage analysis and efficiency metrics
- Automatic cleanup suggestions for memory-intensive operations
-- **๐ Enhanced Table Extraction**: Improved table access and DataFrame conversion
+- **Enhanced Table Extraction**: Improved table access and DataFrame conversion
- Direct `result.tables` interface replacing generic `result.media` approach
- Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
- Enhanced table detection algorithms for better accuracy
- Table metadata including source XPath and headers
- Improved table structure preservation during extraction
-- **๐ฐ GitHub Sponsors Integration**: 4-tier sponsorship system
+- **GitHub Sponsors Integration**: 4-tier sponsorship system
- Supporter ($5/month): Community support + early feature previews
- Professional ($25/month): Priority support + beta access
- Business ($100/month): Direct consultation + custom integrations
- Enterprise ($500/month): Dedicated support + feature development
- Custom arrangement options for larger organizations
-- **๐ณ Docker LLM Provider Flexibility**: Environment-based LLM configuration
+- **Docker LLM Provider Flexibility**: Environment-based LLM configuration
- `LLM_PROVIDER` environment variable support for dynamic provider switching
- `.llm.env` file support for secure configuration management
- Per-request provider override capabilities in API endpoints
@@ -1172,14 +1172,14 @@ asyncio.run(browser_management_demo())
- Introduced `CacheMode` enum (`ENABLED`, `DISABLED`, `READ_ONLY`, `WRITE_ONLY`, `BYPASS`) and `always_bypass_cache` parameter in AsyncWebCrawler for fine-grained cache control. This replaces `bypass_cache`, `no_cache_read`, `no_cache_write`, and `always_by_pass_cache`.
-### ๐๏ธ Removals
+### Removals
- Removed deprecated: `crawl4ai/content_cleaning_strategy.py`.
- Removed internal class ContentCleaningStrategy
- Removed legacy cache control flags: `bypass_cache`, `disable_cache`, `no_cache_read`, `no_cache_write`, and `always_by_pass_cache`. These have been superseded by `cache_mode`.
-### โ๏ธ Other Changes
+### Other Changes
- Moved version file to `crawl4ai/__version__.py`.
- Added `crawl4ai/cache_context.py`.
@@ -1196,7 +1196,7 @@ asyncio.run(browser_management_demo())
- The synchronous version of `WebCrawler` is being phased out. While still available via `crawl4ai[sync]`, it will eventually be removed. Transition to `AsyncWebCrawler` is strongly recommended. Boolean cache control flags in `arun` are also deprecated, migrate to using the `cache_mode` parameter. See examples in the "New Features" section above for correct usage.
-### ๐ Bug Fixes
+### Bug Fixes
- Resolved issue with browser context closing unexpectedly in Docker. This significantly improves stability, particularly within containerized environments.
- Fixed memory leaks associated with incorrect asynchronous cleanup by removing the `__del__` method and ensuring the browser context is closed explicitly using context managers.
@@ -1680,8 +1680,8 @@ Significant improvements in text processing and performance:
- ๐ **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- ๐ค **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
-- โก **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
-- ๐ง **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
@@ -1689,40 +1689,40 @@ These changes address issue #68 and provide a foundation for faster, more effici
Major improvements in functionality, performance, and cross-platform compatibility! ๐
-- ๐ณ **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
-- ๐ **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
-- ๐ง **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
-- ๐ผ๏ธ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
-- โก **Performance boost**: Various improvements to enhance overall speed and performance.
+- **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
+- **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
+- **Performance boost**: Various improvements to enhance overall speed and performance.
A big shoutout to our amazing community contributors:
- [@aravindkarnam](https://github.com/aravindkarnam) for developing the textual description extraction feature.
- [@FractalMind](https://github.com/FractalMind) for creating the first official Docker Hub image and fixing Dockerfile errors.
- [@ketonkss4](https://github.com/ketonkss4) for identifying Selenium's new capabilities, helping us reduce dependencies.
-Your contributions are driving Crawl4AI forward! ๐
+Your contributions are driving Crawl4AI forward!
## [v0.2.75] - 2024-07-19
Minor improvements for a more maintainable codebase:
-- ๐ Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
-- ๐ Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
+- Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
+- Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
## [v0.2.74] - 2024-07-08
-A slew of exciting updates to improve the crawler's stability and robustness! ๐
+A slew of exciting updates to improve the crawler's stability and robustness!
-- ๐ป **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
+- **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
- ๐ก๏ธ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
-- ๐งน **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
-- ๐ฎ **Database cleanup**: Removed existing database file and initialized a new one.
+- **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
+- **Database cleanup**: Removed existing database file and initialized a new one.
## [v0.2.73] - 2024-07-03
-๐ก In this release, we've bumped the version to v0.2.73 and refreshed our documentation to ensure you have the best experience with our project.
+In this release, we've bumped the version to v0.2.73 and refreshed our documentation to ensure you have the best experience with our project.
* Supporting website need "with-head" mode to crawl the website with head.
* Fixing the installation issues for setup.py and dockerfile.
@@ -1730,23 +1730,23 @@ A slew of exciting updates to improve the crawler's stability and robustness!
## [v0.2.72] - 2024-06-30
-This release brings exciting updates and improvements to our project! ๐
+This release brings exciting updates and improvements to our project!
-* ๐ **Documentation Updates**: Our documentation has been revamped to reflect the latest changes and additions.
+* **Documentation Updates**: Our documentation has been revamped to reflect the latest changes and additions.
* ๐ **New Modes in setup.py**: We've added support for three new modes in setup.py: default, torch, and transformers. This enhances the project's flexibility and usability.
-* ๐ณ **Docker File Updates**: The Docker file has been updated to ensure seamless compatibility with the new modes and improvements.
-* ๐ท๏ธ **Temporary Solution for Headless Crawling**: We've implemented a temporary solution to overcome issues with crawling websites in headless mode.
+* **Docker File Updates**: The Docker file has been updated to ensure seamless compatibility with the new modes and improvements.
+* **Temporary Solution for Headless Crawling**: We've implemented a temporary solution to overcome issues with crawling websites in headless mode.
These changes aim to improve the overall user experience, provide more flexibility, and enhance the project's performance. We're thrilled to share these updates with you and look forward to continuing to evolve and improve our project!
## [0.2.71] - 2024-06-26
-**Improved Error Handling and Performance** ๐ง
+**Improved Error Handling and Performance**
-* ๐ซ Refactored `crawler_strategy.py` to handle exceptions and provide better error messages, making it more robust and reliable.
-* ๐ป Optimized the `get_content_of_website_optimized` function in `utils.py` for improved performance, reducing potential bottlenecks.
-* ๐ป Updated `utils.py` with the latest changes, ensuring consistency and accuracy.
-* ๐ซ Migrated to `ChromeDriverManager` to resolve Chrome driver download issues, providing a smoother user experience.
+* Refactored `crawler_strategy.py` to handle exceptions and provide better error messages, making it more robust and reliable.
+* Optimized the `get_content_of_website_optimized` function in `utils.py` for improved performance, reducing potential bottlenecks.
+* Updated `utils.py` with the latest changes, ensuring consistency and accuracy.
+* Migrated to `ChromeDriverManager` to resolve Chrome driver download issues, providing a smoother user experience.
These changes focus on refining the existing codebase, resulting in a more stable, efficient, and user-friendly experience. With these improvements, you can expect fewer errors and better performance in the crawler strategy and utility functions.
diff --git a/README-first.md b/README-first.md
index 2a21df395..d369627b8 100644
--- a/README-first.md
+++ b/README-first.md
@@ -27,12 +27,12 @@
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
-[โจ Check out latest update v0.7.0](#-recent-updates)
+[Check out latest update v0.7.0](#recent-updates)
-๐ **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
+**Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
-๐ค My Personal Story
+My Personal Story
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications a challenging yet rewarding experience that honed my skills in data extraction.
@@ -43,7 +43,7 @@ I made Crawl4AI open-source for two reasons. First, it’s my way of giving back
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
-## ๐ง Why Crawl4AI?
+## Why Crawl4AI?
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
2. **Lightning Fast**: Delivers results faster with real-time, cost-efficient performance.
@@ -102,96 +102,96 @@ crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
crwl https://www.example.com/products -q "Extract all product prices"
```
-## โจ Features
+## Features
-๐ Markdown Generation
+Markdown Generation
-- ๐งน **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
-- ๐ฏ **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
+- **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
+- **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- ๐ **Citations and References**: Converts page links into a numbered reference list with clean citations.
-- ๐ ๏ธ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
-- ๐ **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
+- **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
+- **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
-๐ Structured Data Extraction
+Structured Data Extraction
- ๐ค **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
-- ๐งฑ **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
-- ๐ **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
-- ๐ **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
-- ๐ง **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
+- **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
+- **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
+- **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
+- **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
-๐ Browser Integration
+Browser Integration
-- ๐ฅ๏ธ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
-- ๐ **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
-- ๐ค **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
+- **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
+- **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
+- **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
- ๐ **Session Management**: Preserve browser states and reuse them for multi-step crawling.
-- ๐งฉ **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
-- โ๏ธ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
-- ๐ **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
-- ๐ **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
+- **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
+- **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
+- **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
+- **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
-๐ Crawling & Scraping
+Crawling & Scraping
-- ๐ผ๏ธ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
+- **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- ๐ **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction.
-- ๐ธ **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
-- ๐ **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
+- **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
+- **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- ๐ **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
-- ๐ ๏ธ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
-- ๐พ **Caching**: Cache data for improved speed and to avoid redundant fetches.
-- ๐ **Metadata Extraction**: Retrieve structured metadata from web pages.
-- ๐ก **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
-- ๐ต๏ธ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
-- ๐ **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
+- **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
+- **Caching**: Cache data for improved speed and to avoid redundant fetches.
+- **Metadata Extraction**: Retrieve structured metadata from web pages.
+- **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
+- **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
+- **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
๐ Deployment
-- ๐ณ **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
-- ๐ **Secure Authentication**: Built-in JWT token authentication for API security.
-- ๐ **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
-- ๐ **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
-- โ๏ธ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
+- **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
+- **Secure Authentication**: Built-in JWT token authentication for API security.
+- **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
+- **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
+- **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
-๐ฏ Additional Features
+Additional Features
-- ๐ถ๏ธ **Stealth Mode**: Avoid bot detection by mimicking real users.
-- ๐ท๏ธ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
+- **Stealth Mode**: Avoid bot detection by mimicking real users.
+- **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
- ๐ **Link Analysis**: Extract and analyze all links for detailed data exploration.
- ๐ก๏ธ **Error Handling**: Robust error management for seamless execution.
-- ๐ **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
-- ๐ **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
-- ๐ **Community Recognition**: Acknowledges contributors and pull requests for transparency.
+- **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
+- **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
+- **Community Recognition**: Acknowledges contributors and pull requests for transparency.
## Try it Now!
-โจ Play around with this [](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
+Play around with this [](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
-โจ Visit our [Documentation Website](https://docs.crawl4ai.com/)
+Visit our [Documentation Website](https://docs.crawl4ai.com/)
-## Installation ๐ ๏ธ
+## Installation
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
-๐ Using pip
+Using pip
Choose the installation option that best fits your needs:
@@ -206,7 +206,7 @@ crawl4ai-setup # Setup the browser
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
-๐ **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
+**Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
1. Through the command line:
@@ -257,7 +257,7 @@ pip install -e ".[all]" # Install all optional features
-๐ณ Docker Deployment
+Docker Deployment
> ๐ **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
@@ -318,12 +318,12 @@ For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4
-## ๐ฌ Advanced Usage Examples ๐ฌ
+## Advanced Usage Examples
You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
-๐ Heuristic Markdown Generation with Clean and Fit Markdown
+Heuristic Markdown Generation with Clean and Fit Markdown
```python
import asyncio
@@ -361,7 +361,7 @@ if __name__ == "__main__":
-๐ฅ๏ธ Executing JavaScript & Extract Structured Data without LLMs
+Executing JavaScript & Extract Structured Data without LLMs
```python
import asyncio
@@ -434,7 +434,7 @@ if __name__ == "__main__":
-๐ Extracting Structured Data with LLMs
+Extracting Structured Data with LLMs
```python
import os
@@ -517,11 +517,11 @@ async def test_news_crawl():
-## โจ Recent Updates
+## Recent Updates
### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
-- **๐ง Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
+- **Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
```python
config = AdaptiveConfig(
confidence_threshold=0.7, # Min confidence to stop crawling
@@ -539,7 +539,7 @@ async def test_news_crawl():
# Crawler learns patterns and improves extraction over time
```
-- **๐ Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
+- **Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
```python
scroll_config = VirtualScrollConfig(
container_selector="[data-testid='feed']",
@@ -568,7 +568,7 @@ async def test_news_crawl():
# Links ranked by relevance and quality
```
-- **๐ฃ Async URL Seeder**: Discover thousands of URLs in seconds:
+- **Async URL Seeder**: Discover thousands of URLs in seconds:
```python
seeder = AsyncUrlSeeder(SeedingConfig(
source="sitemap+cc",
@@ -580,7 +580,7 @@ async def test_news_crawl():
urls = await seeder.discover("https://example.com")
```
-- **โก Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
+- **Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
@@ -625,16 +625,16 @@ We use pre-releases to:
For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
-## ๐ Documentation & Roadmap
+## Documentation & Roadmap
-> ๐จ **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
+> **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
-๐ Development TODOs
+Development TODOs
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
@@ -651,7 +651,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
-## ๐ค Contributing
+## Contributing
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
@@ -659,7 +659,7 @@ I'll help modify the license section with badges. For the halftone effect, here'
Here's the updated license section:
-## ๐ License & Attribution
+## License & Attribution
This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
@@ -711,7 +711,7 @@ Add this line to your documentation:
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```
-## ๐ Citation
+## Citation
If you use Crawl4AI in your research or project, please cite:
@@ -733,7 +733,7 @@ UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Com
GitHub. https://github.com/unclecode/crawl4ai
```
-## ๐ง Contact
+## Contact
For questions, suggestions, or feedback, feel free to reach out:
@@ -741,11 +741,11 @@ For questions, suggestions, or feedback, feel free to reach out:
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
-Happy Crawling! ๐ธ๏ธ๐
+Happy Crawling! ๐
-## ๐ Support Crawl4AI
+## Support Crawl4AI
-> ๐ **Sponsorship Program Just Launched!** Be among the first 50 **Founding Sponsors** and get permanent recognition in our Hall of Fame!
+> **Sponsorship Program Just Launched!** Be among the first 50 **Founding Sponsors** and get permanent recognition in our Hall of Fame!
Crawl4AI is the #1 trending open-source web crawler with 51K+ stars. Your support ensures we stay independent, innovative, and free forever.
@@ -756,20 +756,20 @@ Crawl4AI is the #1 trending open-source web crawler with 51K+ stars. Your suppor
-### ๐ค Sponsorship Tiers
+### Sponsorship Tiers
-- **๐ฑ Believer ($5/mo)**: Join the movement for data democratization
+- ** Believer ($5/mo)**: Join the movement for data democratization
- **๐ Builder ($50/mo)**: Get priority support and early feature access
-- **๐ผ Growing Team ($500/mo)**: Bi-weekly syncs and optimization help
-- **๐ข Data Infrastructure Partner ($2000/mo)**: Full partnership with dedicated support
+- ** Growing Team ($500/mo)**: Bi-weekly syncs and optimization help
+- ** Data Infrastructure Partner ($2000/mo)**: Full partnership with dedicated support
**Why sponsor?** Every tier includes real benefits. No more rate-limited APIs. Own your data pipeline. Build data sovereignty together.
[View All Tiers & Benefits →](https://github.com/sponsors/unclecode)
-### ๐ Our Sponsors
+### Our Sponsors
-#### ๐ Founding Sponsors (First 50)
+#### Founding Sponsors (First 50)
*Be part of history - [Become a Founding Sponsor](https://github.com/sponsors/unclecode)*
@@ -779,14 +779,14 @@ Thank you to all our sponsors who make this project possible!
-## ๐พ Mission
+## Mission
Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
-๐ Key Opportunities
+Key Opportunities
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
- **Authentic AI Data**: Provide AI systems with real human insights.
diff --git a/README.md b/README.md
index 733678be3..fb22e087a 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@
#### ๐ Crawl4AI Cloud API โ Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.
-๐ **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
+ **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
_We’ll be onboarding in phases and working closely with early users.
Limited slots._
@@ -37,20 +37,20 @@ Limited slots._
Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
-[✨ Check out latest update v0.9](#-recent-updates)
+[ Check out latest update v0.9](#-recent-updates)
-✨ **New in v0.9**: Major secure-by-default release of the Docker API server. Auth is on by default, the server binds loopback unless given a token, and the request body is now an untrusted trust boundary. Breaking changes for the self-hosted server only; the pip library is unchanged. If you self-host the Docker API, read the [migration guide](https://github.com/unclecode/crawl4ai/blob/main/deploy/docker/MIGRATION.md) before upgrading. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.9.0.md)
+ **New in v0.9**: Major secure-by-default release of the Docker API server. Auth is on by default, the server binds loopback unless given a token, and the request body is now an untrusted trust boundary. Breaking changes for the self-hosted server only; the pip library is unchanged. If you self-host the Docker API, read the [migration guide](https://github.com/unclecode/crawl4ai/blob/main/deploy/docker/MIGRATION.md) before upgrading. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.9.0.md)
-✨ Recent v0.8.7: Security-hardening release. Fixes critical Docker API vulnerabilities (RCE, SSRF, auth bypass, file write, XSS, hardcoded JWT secret), adds DomainMapper, and ships scraping, deep-crawl, and LLM fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.7.md)
+ Recent v0.8.7: Security-hardening release. Fixes critical Docker API vulnerabilities (RCE, SSRF, auth bypass, file write, XSS, hardcoded JWT secret), adds DomainMapper, and ships scraping, deep-crawl, and LLM fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.7.md)
-✨ Recent v0.8.6: Security hotfix that replaced `litellm` with `unclecode-litellm` due to a PyPI supply chain compromise.
+ Recent v0.8.6: Security hotfix that replaced `litellm` with `unclecode-litellm` due to a PyPI supply chain compromise.
-✨ Previous v0.8.0: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)
+ Previous v0.8.0: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)
-✨ Previous v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
+ Previous v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
- ๐ค My Personal Story
+ My Personal Story
I grew up on an Amstrad, thanks to my dad, and never stopped building. In grad school I specialized in NLP and built crawlers for research. That’s where I learned how much extraction matters.
@@ -121,9 +121,9 @@ crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
crwl https://www.example.com/products -q "Extract all product prices"
```
-## ๐ Support Crawl4AI
+## Support Crawl4AI
-> ๐ **Sponsorship Program Now Open!** After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for **startups** and **enterprises**. Be among the first 50 **Founding Sponsors** for permanent recognition in our Hall of Fame.
+> **Sponsorship Program Now Open!** After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for **startups** and **enterprises**. Be among the first 50 **Founding Sponsors** for permanent recognition in our Hall of Fame.
Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keeps it independent, innovative, and free for the community โ while giving you direct access to premium benefits.
@@ -134,12 +134,12 @@ Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keep
-### ๐ค Sponsorship Tiers
+### Sponsorship Tiers
-- **๐ฑ Believer ($5/mo)** โ Join the movement for data democratization
+- ** Believer ($5/mo)** โ Join the movement for data democratization
- **๐ Builder ($50/mo)** โ Priority support & early access to features
-- **๐ผ Growing Team ($500/mo)** โ Bi-weekly syncs & optimization help
-- **๐ข Data Infrastructure Partner ($2000/mo)** โ Full partnership with dedicated support
+- ** Growing Team ($500/mo)** โ Bi-weekly syncs & optimization help
+- ** Data Infrastructure Partner ($2000/mo)** โ Full partnership with dedicated support
*Custom arrangements available - see [SPONSORS.md](SPONSORS.md) for details & contact*
**Why sponsor?**
@@ -148,96 +148,96 @@ No rate-limited APIs. No lock-in. Build and own your data pipeline with direct g
[See All Tiers & Benefits →](https://github.com/sponsors/unclecode)
-## ✨ Features
+## Features
-๐ Markdown Generation
+Markdown Generation
-- ๐งน **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
-- ๐ฏ **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
+- **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
+- **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- ๐ **Citations and References**: Converts page links into a numbered reference list with clean citations.
-- ๐ ๏ธ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
-- ๐ **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
+- **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
+- **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
-๐ Structured Data Extraction
+Structured Data Extraction
- ๐ค **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
-- ๐งฑ **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
-- ๐ **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
-- ๐ **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
-- ๐ง **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
+- **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
+- **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
+- **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
+- **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
-๐ Browser Integration
+Browser Integration
-- ๐ฅ๏ธ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
-- ๐ **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
-- ๐ค **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
+- **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
+- **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
+- **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
- ๐ **Session Management**: Preserve browser states and reuse them for multi-step crawling.
-- ๐งฉ **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
-- โ๏ธ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
-- ๐ **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
-- ๐ **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
+- **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
+- **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
+- **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
+- **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
-๐ Crawling & Scraping
+Crawling & Scraping
-- ๐ผ๏ธ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
+- **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- ๐ **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction.
-- ๐ธ **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
-- ๐ **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
+- **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
+- **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- ๐ **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
-- ๐ ๏ธ **Customizable Hooks**: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
-- ๐พ **Caching**: Cache data for improved speed and to avoid redundant fetches.
-- ๐ **Metadata Extraction**: Retrieve structured metadata from web pages.
-- ๐ก **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
-- ๐ต๏ธ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
-- ๐ **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
+- **Customizable Hooks**: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
+- **Caching**: Cache data for improved speed and to avoid redundant fetches.
+- **Metadata Extraction**: Retrieve structured metadata from web pages.
+- **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
+- **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
+- **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
๐ Deployment
-- ๐ณ **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
-- ๐ **Secure Authentication**: Built-in JWT token authentication for API security.
-- ๐ **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
-- ๐ **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
-- โ๏ธ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
+- **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
+- **Secure Authentication**: Built-in JWT token authentication for API security.
+- **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
+- **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
+- **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
-๐ฏ Additional Features
+Additional Features
-- ๐ถ๏ธ **Stealth Mode**: Avoid bot detection by mimicking real users.
-- ๐ท๏ธ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
+- **Stealth Mode**: Avoid bot detection by mimicking real users.
+- **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
- ๐ **Link Analysis**: Extract and analyze all links for detailed data exploration.
- ๐ก๏ธ **Error Handling**: Robust error management for seamless execution.
-- ๐ **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
-- ๐ **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
-- ๐ **Community Recognition**: Acknowledges contributors and pull requests for transparency.
+- **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
+- **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
+- **Community Recognition**: Acknowledges contributors and pull requests for transparency.
## Try it Now!
-✨ Play around with this [](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
+ Play around with this [](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
-✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
+ Visit our [Documentation Website](https://docs.crawl4ai.com/)
-## Installation ๐ ๏ธ
+## Installation
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
-๐ Using pip
+Using pip
Choose the installation option that best fits your needs:
@@ -252,7 +252,7 @@ crawl4ai-setup # Setup the browser
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
-๐ **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
+ **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
1. Through the command line:
@@ -303,7 +303,7 @@ pip install -e ".[all]" # Install all optional features
-๐ณ Docker Deployment
+Docker Deployment
> ๐ **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
@@ -361,12 +361,12 @@ For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4
---
-## ๐ฌ Advanced Usage Examples ๐ฌ
+## Advanced Usage Examples
You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
-๐ Heuristic Markdown Generation with Clean and Fit Markdown
+Heuristic Markdown Generation with Clean and Fit Markdown
```python
import asyncio
@@ -404,7 +404,7 @@ if __name__ == "__main__":
-๐ฅ๏ธ Executing JavaScript & Extract Structured Data without LLMs
+Executing JavaScript & Extract Structured Data without LLMs
```python
import asyncio
@@ -477,7 +477,7 @@ if __name__ == "__main__":
-๐ Extracting Structured Data with LLMs
+Extracting Structured Data with LLMs
```python
import os
@@ -562,9 +562,9 @@ async def test_news_crawl():
---
-> **💡 Tip:** Some websites may use **CAPTCHA** based verification mechanisms to prevent automated access. If your workflow encounters such challenges, you may optionally integrate a third-party CAPTCHA-handling service such as [CapSolver](https://www.capsolver.com/blog/Partners/crawl4ai-capsolver/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration). They support reCAPTCHA v2/v3, Cloudflare Turnstile, Challenge, AWS WAF, and more. Please ensure that your usage complies with the target website’s terms of service and applicable laws.
+> ** Tip:** Some websites may use **CAPTCHA** based verification mechanisms to prevent automated access. If your workflow encounters such challenges, you may optionally integrate a third-party CAPTCHA-handling service such as [CapSolver](https://www.capsolver.com/blog/Partners/crawl4ai-capsolver/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration). They support reCAPTCHA v2/v3, Cloudflare Turnstile, Challenge, AWS WAF, and more. Please ensure that your usage complies with the target website’s terms of service and applicable laws.
-## ✨ Recent Updates
+## Recent Updates
Version 0.9.0 Release Highlights - Secure-by-Default Docker Server
@@ -624,17 +624,17 @@ Our biggest release since v0.8.0. Anti-bot detection with proxy escalation, Shad
)
```
-- **๐ Shadow DOM Flattening**:
+- ** Shadow DOM Flattening**:
- Extract content hidden inside shadow DOM components
```python
config = CrawlerRunConfig(flatten_shadow_dom=True)
```
-- **๐ Deep Crawl Cancellation**:
+- ** Deep Crawl Cancellation**:
- Stop long crawls gracefully with `cancel()` or `should_cancel` callback
- Works with BFS, DFS, and BestFirst strategies
-- **โ๏ธ Config Defaults API**:
+- ** Config Defaults API**:
- `set_defaults()` / `get_defaults()` / `reset_defaults()` on BrowserConfig and CrawlerRunConfig
- **๐ Critical Security Fixes**:
@@ -652,7 +652,7 @@ Our biggest release since v0.8.0. Anti-bot detection with proxy escalation, Shad
This release introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.
-- **๐ Deep Crawl Crash Recovery**:
+- ** Deep Crawl Crash Recovery**:
- `on_state_change` callback fires after each URL for real-time state persistence
- `resume_state` parameter to continue from a saved checkpoint
- JSON-serializable state for Redis/database storage
@@ -667,7 +667,7 @@ This release introduces crash recovery for deep crawls, a new prefetch mode for
)
```
-- **⚡ Prefetch Mode for Fast URL Discovery**:
+- ** Prefetch Mode for Fast URL Discovery**:
- `prefetch=True` skips markdown, extraction, and media processing
- 5-10x faster than full processing
- Perfect for two-phase crawling: discover first, process selectively
@@ -691,7 +691,7 @@ This release introduces crash recovery for deep crawls, a new prefetch mode for
This release focuses on stability with 11 bug fixes addressing issues reported by the community. No new features, but significant improvements to reliability.
-- **๐ณ Docker API Fixes**:
+- ** Docker API Fixes**:
- Fixed `ContentRelevanceFilter` deserialization in deep crawl requests (#1642)
- Fixed `ProxyConfig` JSON serialization in `BrowserConfig.to_dict()` (#1629)
- Fixed `.cache` folder permissions in Docker image (#1638)
@@ -724,11 +724,11 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
- Fixed relative URL resolution after JavaScript redirects (#1268)
- Fixed import statement formatting in extracted code (#1181)
-- **๐ฆ Dependency Updates**:
+- ** Dependency Updates**:
- Replaced deprecated PyPDF2 with pypdf (#1412)
- Pydantic v2 ConfigDict compatibility - no more deprecation warnings (#678)
-- **๐ง AdaptiveCrawler**:
+- ** AdaptiveCrawler**:
- Fixed query expansion to actually use LLM instead of hardcoded mock data (#1621)
[Full v0.7.8 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
@@ -738,7 +738,7 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
Version 0.7.7 Release Highlights - The Self-Hosting & Monitoring Update
-- **๐ Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
+- ** Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
```python
# Access the monitoring dashboard
# Visit: http://localhost:11235/dashboard
@@ -751,7 +751,7 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
# - Error monitoring with full context
```
-- **๐ Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
+- ** Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
```python
import httpx
@@ -769,12 +769,12 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
stats = await client.get("http://localhost:11235/monitor/endpoints/stats")
```
-- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
-- **๐ฅ Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
-- **๐งน Janitor System**: Automatic resource management with event logging
-- **๐ฎ Control Actions**: Manual browser management (kill, restart, cleanup) via API
-- **๐ Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration
-- **๐ Critical Bug Fixes**:
+- ** WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
+- ** Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
+- ** Janitor System**: Automatic resource management with event logging
+- ** Control Actions**: Manual browser management (kill, restart, cleanup) via API
+- ** Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration
+- ** Critical Bug Fixes**:
- Fixed async LLM extraction blocking issue (#1055)
- Enhanced DFS deep crawl strategy (#1607)
- Fixed sitemap parsing in AsyncUrlSeeder (#1598)
@@ -789,8 +789,8 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
Version 0.7.5 Release Highlights - The Docker Hooks & Security Update
-- **๐ง Docker Hooks System**: Complete pipeline customization with user-provided Python functions at 8 key points
-- **✨ Function-Based Hooks API (NEW)**: Write hooks as regular Python functions with full IDE support:
+- ** Docker Hooks System**: Complete pipeline customization with user-provided Python functions at 8 key points
+- ** Function-Based Hooks API (NEW)**: Write hooks as regular Python functions with full IDE support:
```python
from crawl4ai import hooks_to_string
from crawl4ai.docker_client import Crawl4aiDockerClient
@@ -827,8 +827,8 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
- **๐ค Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **๐ HTTPS Preservation**: Secure internal link handling with `preserve_https_for_internal_links=True`
-- **๐ Python 3.10+ Support**: Modern language features and enhanced performance
-- **๐ ๏ธ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration
+- ** Python 3.10+ Support**: Modern language features and enhanced performance
+- ** Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration
[Full v0.7.5 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
@@ -858,9 +858,9 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
print(f"Extracted table: {len(table['data'])} rows")
```
-- **⚡ Dispatcher Bug Fix**: Fixed sequential processing bottleneck in arun_many for fast-completing tasks
-- **๐งน Memory Management Refactor**: Consolidated memory utilities into main utils module for cleaner architecture
-- **๐ง Browser Manager Fixes**: Resolved race conditions in concurrent page creation with thread-safe locking
+- ** Dispatcher Bug Fix**: Fixed sequential processing bottleneck in arun_many for fast-completing tasks
+- ** Memory Management Refactor**: Consolidated memory utilities into main utils module for cleaner architecture
+- ** Browser Manager Fixes**: Resolved race conditions in concurrent page creation with thread-safe locking
- **๐ Advanced URL Processing**: Better handling of raw:// URLs and base tag link resolution
- **๐ก๏ธ Enhanced Proxy Support**: Flexible proxy configuration supporting both dict and string formats
@@ -871,7 +871,7 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update
-- **๐ต๏ธ Undetected Browser Support**: Bypass sophisticated bot detection systems:
+- ** Undetected Browser Support**: Bypass sophisticated bot detection systems:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
@@ -889,7 +889,7 @@ This release focuses on stability with 11 bug fixes addressing issues reported b
# Successfully bypass Cloudflare, Akamai, and custom bot detection
```
-- **๐จ Multi-URL Configuration**: Different strategies for different URL patterns in one batch:
+- ** Multi-URL Configuration**: Different strategies for different URL patterns in one batch:
```python
from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
@@ -915,7 +915,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
# Each URL gets the perfect configuration automatically
```
-- **๐ง Memory Monitoring**: Track and optimize memory usage during crawling:
+- ** Memory Monitoring**: Track and optimize memory usage during crawling:
```python
from crawl4ai.memory_utils import MemoryMonitor
@@ -930,7 +930,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
# Get optimization recommendations
```
-- **๐ Enhanced Table Extraction**: Direct DataFrame conversion from web tables:
+- ** Enhanced Table Extraction**: Direct DataFrame conversion from web tables:
```python
result = await crawler.arun("https://site-with-tables.com")
@@ -942,8 +942,8 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
print(f"Table: {df.shape[0]} rows × {df.shape[1]} columns")
```
-- **๐ฐ GitHub Sponsors**: 4-tier sponsorship system for project sustainability
-- **๐ณ Docker LLM Flexibility**: Configure providers via environment variables
+- ** GitHub Sponsors**: 4-tier sponsorship system for project sustainability
+- ** Docker LLM Flexibility**: Configure providers via environment variables
[Full v0.7.3 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
@@ -952,7 +952,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
-- **๐ง Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
+- ** Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
```python
config = AdaptiveConfig(
confidence_threshold=0.7, # Min confidence to stop crawling
@@ -970,7 +970,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
# Crawler learns patterns and improves extraction over time
```
-- **๐ Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
+- ** Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
```python
scroll_config = VirtualScrollConfig(
container_selector="[data-testid='feed']",
@@ -999,7 +999,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
# Links ranked by relevance and quality
```
-- **๐ฃ Async URL Seeder**: Discover thousands of URLs in seconds:
+- ** Async URL Seeder**: Discover thousands of URLs in seconds:
```python
seeder = AsyncUrlSeeder(SeedingConfig(
source="sitemap+cc",
@@ -1011,7 +1011,7 @@ from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode
urls = await seeder.discover("https://example.com")
```
-- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
+- ** Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
@@ -1022,7 +1022,7 @@ Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blo
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
-๐ Version Numbers Explained
+Version Numbers Explained
Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
@@ -1061,16 +1061,16 @@ For production environments, we recommend using the stable version. For testing
-## ๐ Documentation & Roadmap
+## Documentation & Roadmap
-> ๐จ **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
+> **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
-๐ Development TODOs
+Development TODOs
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [x] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
@@ -1087,7 +1087,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
-## ๐ค Contributing
+## Contributing
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
@@ -1095,7 +1095,7 @@ I'll help modify the license section with badges. For the halftone effect, here'
Here's the updated license section:
-## ๐ License & Attribution
+## License & Attribution
This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
@@ -1103,7 +1103,7 @@ This project is licensed under the Apache License 2.0, attribution is recommende
When using Crawl4AI, you must include one of the following attribution methods:
-๐ 1. Badge Attribution (Recommended)
+1. Badge Attribution (Recommended)
Add one of these badges to your README, documentation, or website:
| Theme | Badge |
@@ -1145,14 +1145,14 @@ HTML code for adding the badges:
-๐ 2. Text Attribution
+2. Text Attribution
Add this line to your documentation:
```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```
-## ๐ Citation
+## Citation
If you use Crawl4AI in your research or project, please cite:
@@ -1174,7 +1174,7 @@ UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Com
GitHub. https://github.com/unclecode/crawl4ai
```
-## ๐ง Contact
+## Contact
For questions, suggestions, or feedback, feel free to reach out:
@@ -1182,16 +1182,16 @@ For questions, suggestions, or feedback, feel free to reach out:
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
-Happy Crawling! 🕸️🚀
+Happy Crawling! 🚀
-## ๐พ Mission
+## Mission
Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
-๐ Key Opportunities
+Key Opportunities
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
- **Authentic AI Data**: Provide AI systems with real human insights.
@@ -1209,25 +1209,25 @@ We envision a future where AI is powered by real human knowledge, ensuring data
For more details, see our [full mission statement](./MISSION.md).
-## ๐ Current Sponsors
+## Current Sponsors
-### ๐ข Enterprise Sponsors & Partners
+### Enterprise Sponsors & Partners
Our enterprise sponsors and technology partners help scale Crawl4AI to power production-grade data pipelines.
| Company | About | Sponsorship Tier |
|------|------|----------------------------|
-| | Leveraging Thordata ensures seamless compatibility with any AI/ML workflows and data infrastructure, massively accessing web data with 99.9% uptime, backed by one-on-one customer support. | 🥈 Silver |
-| | NstProxy is a trusted proxy provider with over 110M+ real residential IPs, city-level targeting, 99.99% uptime, and low pricing at $0.1/GB, it delivers unmatched stability, scale, and cost-efficiency. | 🥈 Silver |
-| | Scrapeless provides production-grade infrastructure for Crawling, Automation, and AI Agents, offering Scraping Browser, 4 Proxy Types and Universal Scraping API. | 🥈 Silver |
-| | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥉 Bronze |
-| | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| 🥇 Gold |
-| | Kidocode is a hybrid technology and entrepreneurship school for kids aged 5–18, offering both online and on-campus education. | 🥇 Gold |
-| | Singapore-based Aleph Null is Asia’s leading edtech hub, dedicated to student-centric, AI-driven education, empowering learners with the tools to thrive in a fast-changing world. | 🥇 Gold |
+| | Leveraging Thordata ensures seamless compatibility with any AI/ML workflows and data infrastructure, massively accessing web data with 99.9% uptime, backed by one-on-one customer support. | Silver |
+| | NstProxy is a trusted proxy provider with over 110M+ real residential IPs, city-level targeting, 99.99% uptime, and low pricing at $0.1/GB, it delivers unmatched stability, scale, and cost-efficiency. | Silver |
+| | Scrapeless provides production-grade infrastructure for Crawling, Automation, and AI Agents, offering Scraping Browser, 4 Proxy Types and Universal Scraping API. | Silver |
+| | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | Bronze |
+| | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| Gold |
+|