A modular and configurable web search library with LLM summarization powered by OpenAI and DuckDuckGo.
- 🔍 Web Search: Integrated DuckDuckGo search engine
- 🤖 LLM Integration: OpenAI-powered content summarization
- ⚡ Async/Await: Fast asynchronous HTTP operations
- 🎯 Content Extraction: Trafilatura-based HTML content extraction
- 💾 Caching: Built-in cache management for efficiency
- 🛡️ Robots.txt: Respectful crawling with robots.txt compliance
- 🔧 Configurable: YAML-based configuration system
- 📦 Modular: Clean architecture with dependency injection
Install from PyPI:

```bash
pip install websearch-ai
```

Or install from source:

```bash
# Clone the repository
git clone <your-repo-url>
cd workdir

# Install with uv (recommended)
uv pip install -e .

# Or with pip
pip install -e .
```

To include development dependencies:

```bash
uv pip install -e ".[dev]"

# Or with pip
pip install -e ".[dev]"
```
```python
import asyncio

from websearch_ai import WebSearchPipeline, Settings


async def main():
    # Initialize settings (loaded from environment variables)
    settings = Settings.from_env()
    pipeline = WebSearchPipeline(settings)

    # Run a search
    results, answer = await pipeline.run("What are the latest AI developments?")

    # Process results
    print(f"Final Answer: {answer}")
    for result in results:
        print(f"- {result.title}: {result.url}")


if __name__ == "__main__":
    asyncio.run(main())
```
```bash
# Set your OpenAI API key
export OPENAI_API_KEY='your-api-key-here'

# Run a search
websearch-ai "latest nvidia earnings 2025"

# Or use as a module
python -m websearch_ai.cli "your search query"
```
```bash
# Required
export OPENAI_API_KEY='your-api-key-here'

# Optional
export OPENAI_MODEL='gpt-4o-mini'  # Default model
export LOG_LEVEL='INFO'            # Logging level
```
Create a `config.yaml` file (see `src/websearch_ai/config/config.example.yaml` for the full set of options):

```yaml
openai:
  api_key: ${OPENAI_API_KEY}
  model: gpt-4o-mini

search:
  max_results: 5
  timeout: 30

cache:
  enabled: true
  ttl: 86400
```
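The `${OPENAI_API_KEY}` value above suggests environment-variable interpolation in the YAML file. How the library resolves it is not documented here; as a minimal sketch of the general technique, assuming simple `${VAR}` syntax:

```python
import os
import re


def expand_env_placeholders(text: str) -> str:
    """Replace ${VAR} placeholders with environment values, leaving unknown ones intact."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        text,
    )


os.environ["OPENAI_API_KEY"] = "sk-demo"  # for illustration only
print(expand_env_placeholders("api_key: ${OPENAI_API_KEY}"))  # api_key: sk-demo
```

Leaving unknown placeholders untouched (rather than substituting an empty string) makes missing variables easy to spot in the rendered config.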
Main pipeline for orchestrating web search and summarization:

```python
from websearch_ai import WebSearchPipeline, Settings

settings = Settings.from_env()
pipeline = WebSearchPipeline(settings)
results, answer = await pipeline.run("query")
```
Configuration management using Pydantic:

```python
from websearch_ai import Settings

# From environment variables
settings = Settings.from_env()

# From a YAML file
settings = Settings.from_yaml("config.yaml")
```
Data model for search results:

```python
from websearch_ai import SearchResult

result = SearchResult(
    title="Example",
    url="https://example.com",
    snippet="Description",
    content="Full content",
)
```
Manages caching of search results and content:

```python
from websearch_ai import CacheManager

cache = CacheManager(cache_dir=".cache", ttl=86400)
await cache.get("key")
await cache.set("key", "value")
```
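The `ttl` parameter caps how long an entry stays valid. As a rough illustration of TTL semantics only (not `CacheManager`'s actual implementation, which is async and, per the `cache_dir` argument, file-backed), a minimal in-memory sketch:

```python
import time


class TTLCache:
    """Minimal in-memory cache with a per-instance time-to-live (illustrative only)."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store: dict = {}

    def set(self, key, value):
        # Record when the value was stored alongside the value itself
        self._store[key] = (time.monotonic(), value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: drop the entry and report a miss
            return default
        return value


cache = TTLCache(ttl=86400)
cache.set("query", ["result"])
print(cache.get("query"))  # ['result']
```

Using `time.monotonic()` rather than `time.time()` keeps expiry correct even if the wall clock is adjusted while the process runs.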
Manages LLM prompts from YAML files:

```python
from websearch_ai import PromptManager

prompts = PromptManager("prompts.yaml")
prompt = prompts.get("search_prompt", query="test")
```
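The `get(name, **kwargs)` signature implies named templates with keyword substitution. A sketch of that pattern using plain `str.format` (the real manager loads its templates from YAML; the template text here is hypothetical):

```python
# Hypothetical in-memory stand-in for the contents of prompts.yaml
PROMPTS = {
    "search_prompt": "Answer the question using web results: {query}",
}


def get_prompt(name: str, **kwargs) -> str:
    """Look up a template by name and fill in its placeholders."""
    return PROMPTS[name].format(**kwargs)


print(get_prompt("search_prompt", query="test"))
# Answer the question using web results: test
```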
Checks robots.txt compliance before fetching:

```python
from websearch_ai import RobotsChecker

checker = RobotsChecker()
allowed = await checker.can_fetch("https://example.com")
```
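Under the hood, robots.txt compliance comes down to parsing the file's rules and testing a URL against them. The standard library's `urllib.robotparser` demonstrates the check offline (this sketches the concept, not `RobotsChecker`'s internals):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, parsed directly instead of fetched over the network
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/public/page"))   # True
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
```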
DuckDuckGo search integration:

```python
from websearch_ai import SearchEngine

engine = SearchEngine(max_results=5)
results = await engine.search("query")
```
Async HTTP client with rate limiting:

```python
from websearch_ai import HTTPFetcher

fetcher = HTTPFetcher(timeout=30)
html = await fetcher.fetch("https://example.com")
```
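How `HTTPFetcher` rate-limits is not shown here; one common async pattern is to bound concurrency with a semaphore. A self-contained sketch, with `asyncio.sleep` standing in for the real HTTP request:

```python
import asyncio


async def fetch_with_limit(sem: asyncio.Semaphore, url: str) -> str:
    async with sem:  # at most N coroutines enter this block at once
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        return url


async def crawl(urls: list[str], max_concurrency: int = 2) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    # gather preserves the order of its arguments in the result list
    return await asyncio.gather(*(fetch_with_limit(sem, u) for u in urls))


urls = [f"https://example.com/{i}" for i in range(5)]
print(asyncio.run(crawl(urls)))
```

A semaphore caps in-flight requests; smoothing requests over time (e.g. N per second) would need an additional delay between acquisitions.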
OpenAI API client wrapper:

```python
from websearch_ai import LLMClient

client = LLMClient(api_key="key", model="gpt-4o-mini")
response = await client.complete("prompt")
```
```bash
# Install development dependencies
uv pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=websearch_ai --cov-report=html

# Run a specific test file
pytest src/websearch_ai/tests/test_models.py
```
```bash
# Format code
ruff format .

# Run the linter
ruff check .

# Fix linting issues
ruff check --fix .
```

- Python 3.12+
- OpenAI API key
- Internet connection
Core dependencies:
- `openai>=1.0.0` - OpenAI API client
- `pydantic>=2.0.0` - Data validation and settings
- `aiohttp>=3.9.0` - Async HTTP client
- `aiofiles>=23.0.0` - Async file operations
- `pyyaml>=6.0` - YAML configuration
- `ddgs>=0.1.0` - DuckDuckGo search
- `trafilatura>=1.6.0` - Content extraction
Development dependencies:
- `pytest>=7.0` - Testing framework
- `pytest-cov>=4.0` - Coverage reporting
- `pytest-asyncio>=0.21.0` - Async test support
- `ruff>=0.1.6` - Linting and formatting
- `pre-commit>=3.0` - Git hooks
```
src/websearch_ai/
├── __init__.py          # Package exports
├── cli.py               # Command-line interface
├── config/              # Configuration management
│   ├── settings.py
│   └── config.yaml
├── core/                # Core pipeline logic
│   ├── models.py
│   └── pipeline.py
├── clients/             # External service clients
│   ├── http.py
│   ├── llm.py
│   └── search.py
├── managers/            # Resource managers
│   ├── cache.py
│   ├── prompts.py
│   └── robots.py
├── filters/             # URL and content filters
│   └── url_filter.py
├── prompts/             # LLM prompt templates
│   └── prompts.yaml
├── tests/               # Test suite
└── docs/                # Documentation
```
See the `src/websearch_ai/examples/` directory for more usage examples.
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests and linting
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Davide Palleschi (davide@deepplants.com)
See CHANGELOG.md for version history and updates.