Intelligent job search system that uses LangGraph agents to search for jobs across multiple sources, extract contact information, and generate an interactive HTML report.
The system now includes LLM-enhanced capabilities that make searches significantly smarter:
- 🎯 Adaptive Keywords: LLM generates optimized keywords for each source and region based on your profile (+30-50% better relevance)
- 🧠 Semantic Matching: Deep relevance analysis beyond simple keywords (+40-60% better precision)
- 🔄 Hybrid Approach: Combines fast heuristic matching with intelligent semantic analysis
- 🌍 Regional Adaptation: Keywords and analysis specific to Hispanic vs English-speaking regions
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Install Playwright
playwright install chromium

# 3. Configure .env (copy from env.example)

# 4. 🚀 Start the system
python main.py
```

✨ Ready! The system will start searching for jobs automatically.
The complete process normally takes between 30 minutes and 1.5 hours, depending on:
- 💻 Your machine specifications (processor, RAM, disk speed)
- 🌐 Your internet connection speed
- 📊 Number of jobs found in each source
- 🔍 Number of enabled sources in the configuration
💡 Recommendation: Let the system run and don't close the terminal. The system will show real-time progress and generate the HTML report when finished.
- 🔍 Multi-Source Search: Searches LinkedIn, RemoteOK, We Work Remotely, Stack Overflow Jobs, GitHub Jobs, Findjobit
- 🤖 Specialized Agents: Each source has its own optimized agent with advanced anti-bot techniques
- 📧 Intelligent Email Extraction: Uses LLMs to extract contact emails from job descriptions
- 🎯 Intelligent Matching: Calculates match score between jobs and your profile using heuristic matching + deep semantic analysis with LLM (hybrid)
- 🤖 Adaptive Keywords: Dynamically generates optimized keywords by source and region using LLM (NEW)
- 🧠 Semantic Analysis: Intelligent semantic matching that understands synonyms and real context (NEW)
- 📊 Interactive HTML Report: Generates an HTML report with filters, statistics, and visualizations
- 🔄 LangGraph Architecture: Coordinated workflow using LangGraph StateGraph for agent orchestration
- 🛡️ Anti-Bot Protection: Advanced system with User-Agent rotation, circuit breakers, adaptive rate limiting, and more
- ⚙️ Flexible Configuration: Environment variables to fully customize system behavior
- ✅ Comprehensive Testing: Full test suite with pytest for unit testing and validation
- 🏗️ Clean Architecture: Modular design with shared utilities, base classes, and proper error handling
- 📊 Configuration Validation: Pydantic models for type-safe configuration validation
- 🎯 Agent Skills System: LLM prompts organized as reusable skills following the agent-skills standard, enabling easy prompt management, versioning, and experimentation
- Python 3.9+
- OpenAI or Anthropic API Key (for LLMs)
- Playwright (for web scraping)
- Resume file in Markdown format (or configure the path in environment variables)
```bash
cd job_search_agents
pip install -r requirements.txt
playwright install chromium
```

Copy the example file and customize it:
```
# Windows
copy env.example .env

# Linux/Mac
cp env.example .env
```

Then edit the `.env` file with your values. See the Environment Variables section for more details.
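Under the hood, the settings in `.env` end up as process environment variables. As an illustration of what a `.env` loader does (the project presumably uses a library such as python-dotenv; this stdlib sketch is for explanation only):

```python
import os
import tempfile
from pathlib import Path

def load_env(path):
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments ignored."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assign directly for the demo; python-dotenv by default does NOT
        # override variables that are already set in the environment.
        os.environ[key.strip()] = value.strip()

# Demo with a throwaway file (in the real project you edit .env itself)
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# contact info\nUSER_EMAIL=your_email@example.com\nMIN_MATCH_SCORE=60\n")

load_env(f.name)
print(os.environ["USER_EMAIL"])  # your_email@example.com
```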
Copy the profile example file:
```
# Windows
copy data\profile.json.example data\profile.json

# Linux/Mac
cp data/profile.json.example data/profile.json
```

Edit `data/profile.json` with your personal and professional information.
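The exact schema is defined by `data/profile.json.example`; the fields below are purely illustrative of what such a profile might contain:

```json
{
  "name": "Your Name",
  "email": "your_email@example.com",
  "skills": ["Python", "LangChain", "LLMOps"],
  "experience_years": 8,
  "preferred_roles": ["AI Engineer", "Machine Learning Engineer"],
  "languages": ["English", "Spanish"]
}
```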
The system requires the following variables to function:
| Variable | Description | Example |
|---|---|---|
| `USER_EMAIL` | Your contact email | `your_email@example.com` |
| `USER_PHONE` | Your phone (with country code) | `+1234567890` |
| `ANTHROPIC_API_KEY` | Anthropic API key (if `LLM_PROVIDER=anthropic`) | `sk-ant-api03-...` |
| `OPENAI_API_KEY` | OpenAI API key (if `LLM_PROVIDER=openai`) | `sk-...` |
```bash
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_api_key_here
LLM_MODEL=claude-3-5-sonnet-20241022
```

Get your API key at: https://console.anthropic.com/
```bash
LLM_PROVIDER=openai
OPENAI_API_KEY=your_api_key_here
LLM_MODEL=gpt-4o-mini
```

Get your API key at: https://platform.openai.com/api-keys
The system looks for your resume at the configured path. You can specify it in two ways:
```bash
# Absolute path (Windows example)
CV_PATH=C:\Users\YourUser\Mi hoja de vida\CVs_Principales\CV_Dev_Senior_AI_Improvement.md

# Relative path
CV_PATH=../CVs_Principales/CV_Dev_Senior_AI_Improvement.md
```

If you don't specify `CV_PATH`, the system will use `CVs_Principales/CV_Dev_Senior_AI_Improvement.md` (relative to the project base directory).
The env.example file contains all available variables with complete documentation. Here's a summary by category:
| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | If `LLM_PROVIDER=anthropic` | Anthropic API key |
| `OPENAI_API_KEY` | If `LLM_PROVIDER=openai` | OpenAI API key |
| `LLM_PROVIDER` | No (default: `openai`) | `openai` or `anthropic` |
| `LLM_MODEL` | No (default: `gpt-4o-mini`) | Model to use |
| Variable | Required | Description |
|---|---|---|
| `USER_EMAIL` | ✅ Yes | Contact email |
| `USER_PHONE` | ✅ Yes | Phone with country code |
| Variable | Default | Description |
|---|---|---|
| `MAX_JOBS_PER_SOURCE` | `50` | Maximum jobs per source |
| `MIN_MATCH_SCORE` | `60` | Minimum score to consider a job relevant (0-100) |
| `SEARCH_TIMEOUT` | `30` | Search timeout in seconds |
| `FAST_MODE` | `false` | Enable fast mode (reduced delays, human simulation disabled) |
| `EMAIL_EXTRACTION_CONCURRENCY` | `10` | Number of parallel email extractions |
| `EMAIL_BATCH_SIZE` | `5` | Batch size for email extraction |
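Conceptually, `EMAIL_EXTRACTION_CONCURRENCY` caps how many extractions run at once while `EMAIL_BATCH_SIZE` groups jobs into batches. A minimal sketch with a stand-in extractor (the real agent calls an LLM here, and the actual implementation may differ):

```python
import asyncio

EMAIL_EXTRACTION_CONCURRENCY = 10  # max extractions in flight at once
EMAIL_BATCH_SIZE = 5               # jobs grouped per batch

async def extract_email(job: str, sem: asyncio.Semaphore) -> str:
    # Stand-in for the real LLM call
    async with sem:
        await asyncio.sleep(0)  # simulate I/O
        return f"contact@{job}.example"

async def extract_all(jobs: list[str]) -> list[str]:
    sem = asyncio.Semaphore(EMAIL_EXTRACTION_CONCURRENCY)
    results: list[str] = []
    # Process jobs batch by batch so partial results are available early
    for i in range(0, len(jobs), EMAIL_BATCH_SIZE):
        batch = jobs[i:i + EMAIL_BATCH_SIZE]
        results += await asyncio.gather(*(extract_email(j, sem) for j in batch))
    return results

emails = asyncio.run(extract_all([f"job{i}" for i in range(12)]))
print(len(emails))  # 12
```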
| Variable | Default | Description |
|---|---|---|
| `SCRAPING_DELAY` | `2.0` | Delay between requests (seconds) |
| `MAX_RETRIES` | `3` | Maximum number of retries |
| `HEADLESS_BROWSER` | `true` | Run the browser without an interface |
| `PAGE_LOAD_TIMEOUT` | `30000` | Page load timeout (milliseconds) |
| `SELECTOR_TIMEOUT` | `10000` | Selector timeout (milliseconds) |
| `REQUEST_TIMEOUT` | `30` | HTTP request timeout (seconds) |
| `DESCRIPTION_MAX_LENGTH` | `2000` | Maximum length for job descriptions |
| `TITLE_DISPLAY_LENGTH` | `50` | Maximum display length for job titles |
| Variable | Default | Description |
|---|---|---|
| `USE_USER_AGENT_ROTATION` | `true` | Rotate the User-Agent automatically |
| `RANDOM_DELAY_ENABLED` | `true` | Random delays between requests |
| `MIN_DELAY` | `1.5` | Minimum delay (seconds) |
| `MAX_DELAY` | `4.0` | Maximum delay (seconds) |
| `ENABLE_BROWSER_STEALTH` | `true` | Browser stealth mode |
| `SIMULATE_HUMAN_BEHAVIOR` | `true` | Simulate human behavior |
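The delay settings above work roughly like this (a minimal sketch of the idea, not the project's adaptive rate limiter):

```python
import random
import time

MIN_DELAY = 1.5  # seconds
MAX_DELAY = 4.0

def next_delay(min_delay: float = MIN_DELAY, max_delay: float = MAX_DELAY) -> float:
    """Pick a randomized delay so request timing does not look machine-regular."""
    return random.uniform(min_delay, max_delay)

def throttled_fetch(url, fetch):
    """Wait a randomized interval, then perform the request via `fetch`."""
    time.sleep(next_delay())
    return fetch(url)
```

Fixed delays form an obvious pattern; a uniform random interval between `MIN_DELAY` and `MAX_DELAY` is the simplest way to break it.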
| Variable | Default | Description |
|---|---|---|
| `USE_CIRCUIT_BREAKER` | `true` | Enable the circuit breaker |
| `CIRCUIT_BREAKER_THRESHOLD` | `5` | Errors before activating |
| `CIRCUIT_BREAKER_TIMEOUT` | `300` | Circuit breaker timeout (seconds) |
| `USE_SESSION_PERSISTENCE` | `true` | Maintain persistent sessions |
| `USE_ADAPTIVE_RATE_LIMITING` | `true` | Adaptive rate limiting |
| `USE_REFERER_HEADERS` | `true` | Use Referer headers |
| `USE_SESSION_WARMUP` | `true` | Session warm-up before scraping |
| `USE_QUERY_VARIATIONS` | `true` | Generate query variations with the LLM |
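The circuit-breaker idea behind `USE_CIRCUIT_BREAKER`, `CIRCUIT_BREAKER_THRESHOLD`, and `CIRCUIT_BREAKER_TIMEOUT` can be sketched in a few lines (illustrative only; the project's `utils/circuit_breaker.py` may differ in detail):

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, stop calling the source
    until `timeout` seconds have passed, then allow a trial request."""

    def __init__(self, threshold: int = 5, timeout: float = 300.0):
        self.threshold = threshold
        self.timeout = timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.timeout:
            # Half-open: the timeout elapsed, allow a trial request
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

This protects both the source (no hammering a site that is blocking you) and the run itself (no time wasted on a dead source).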
| Variable | Default | Description |
|---|---|---|
| `USE_ADAPTIVE_KEYWORDS` | `true` | Generate adaptive keywords per source/region with the LLM |
| `USE_SEMANTIC_MATCHING` | `true` | Deep semantic relevance analysis with the LLM |
| `SEMANTIC_MATCHING_THRESHOLD` | `50` | Minimum heuristic score for semantic analysis (0-100) |
| `SEMANTIC_MAX_JOBS` | `100` | Maximum jobs to analyze semantically |
| `SEMANTIC_WEIGHT` | `0.6` | Weight of the semantic score in the final score (0-1) |
| `HEURISTIC_WEIGHT` | `0.4` | Weight of the heuristic score in the final score (0-1; must sum to 1.0 with `SEMANTIC_WEIGHT`) |
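How the weights and threshold combine can be sketched as follows (illustrative; the actual matcher logic lives in `matcher_agent.py` and may differ):

```python
SEMANTIC_WEIGHT = 0.6
HEURISTIC_WEIGHT = 0.4
SEMANTIC_MATCHING_THRESHOLD = 50

def final_score(heuristic: float, semantic: float) -> float:
    """Weighted blend of the two scores (weights must sum to 1.0)."""
    assert abs(SEMANTIC_WEIGHT + HEURISTIC_WEIGHT - 1.0) < 1e-9
    return SEMANTIC_WEIGHT * semantic + HEURISTIC_WEIGHT * heuristic

def score_job(heuristic: float, semantic: float) -> float:
    # Jobs below the heuristic threshold skip the (costly) LLM analysis
    if heuristic < SEMANTIC_MATCHING_THRESHOLD:
        return heuristic
    return final_score(heuristic, semantic)

score = score_job(80, 90)  # 0.6*90 + 0.4*80 = 86.0
weak = score_job(40, 90)   # below threshold: heuristic score only, no LLM call
```

The threshold acts as a cheap pre-filter: only jobs that already look plausible heuristically are worth an LLM call.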
| Variable | Default | Description |
|---|---|---|
| `CV_PATH` | `CVs_Principales/CV_Dev_Senior_AI_Improvement.md` | Path to the resume file |
| `OUTPUT_DIR` | `job_search_agents/results` | Output directory |
| `DATA_DIR` | `job_search_agents/data` | Data directory |
| Variable | Default | Description |
|---|---|---|
| `LOG_LEVEL` | `INFO` | Level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
| `LOG_FILE` | (optional) | Log file (if not specified, console only) |
| Variable | Default | Description |
|---|---|---|
| `USE_CACHE` | `true` | Enable caching |
| `CACHE_EXPIRY_HOURS` | `24` | Cache expiration (hours) |
| Variable | Description |
|---|---|
| `LINKEDIN_API_KEY` | Official LinkedIn API key |
| `REMOTEOK_API_KEY` | RemoteOK API key (premium) |
The system will execute the following flow:
- ✅ Validate configuration: Verify that all required variables are configured
- 📄 Parse your resume: Extract information from your resume from the configured path
- 🔍 Search for jobs: Query all enabled sources in parallel
- 📧 Extract emails: Use LLMs to find contact emails in job descriptions
- 🎯 Calculate matches: Compare each job with your profile and assign a score
- 📊 Generate report: Create an interactive HTML file with results
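The orchestrator wires these steps together with a LangGraph StateGraph. Stripped of the LangGraph specifics, the flow reduces to a shared state dict passed through a sequence of nodes; the sketch below uses illustrative stand-in functions, not the project's actual node implementations:

```python
# Stand-in node functions: each takes the shared state dict and returns it
# updated, mirroring how LangGraph StateGraph nodes pass state along edges.

def validate_config(state):
    state["valid"] = True
    return state

def parse_resume(state):
    state["profile"] = {"skills": ["Python", "LangChain"]}
    return state

def search_jobs(state):
    state["jobs"] = ["job-a", "job-b"]  # each source agent contributes here
    return state

def extract_emails(state):
    state["emails"] = {job: None for job in state["jobs"]}
    return state

def calculate_matches(state):
    state["scores"] = {job: 0 for job in state["jobs"]}
    return state

def generate_report(state):
    state["report"] = "results.html"
    return state

PIPELINE = [validate_config, parse_resume, search_jobs,
            extract_emails, calculate_matches, generate_report]

def run(state=None):
    state = state if state is not None else {}
    for node in PIPELINE:  # LangGraph follows graph edges instead of a list
        state = node(state)
    return state

final_state = run()
print(final_state["report"])  # results.html
```

In the real system the search step fans out to the source agents in parallel rather than running them as a single node.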
Example of the final output showing the HTML report generation, top 5 recommended jobs, search statistics, and total execution time (approximately 1 hour and 4 minutes in this example).
```
job_search_agents/
├── agents/                       # 🤖 Specialized agents
│   ├── orchestrator.py           # Main orchestrator (LangGraph)
│   ├── linkedin_agent.py         # LinkedIn agent
│   ├── indeed_agent.py           # Indeed agent (disabled)
│   ├── remote_jobs_agent.py      # Remote jobs agent
│   ├── tech_jobs_agent.py        # Tech jobs agent
│   ├── findjobit_agent.py        # Findjobit agent (LATAM)
│   ├── email_extractor_agent.py  # Email extraction
│   └── matcher_agent.py          # Profile matching
├── utils/                        # 🔧 Shared utilities
│   ├── cv_parser.py              # Resume parser
│   ├── html_generator.py         # HTML generator
│   ├── user_agent_rotator.py     # User-Agent rotation
│   ├── circuit_breaker.py        # Circuit breaker pattern
│   ├── adaptive_rate_limiter.py  # Adaptive rate limiting
│   ├── url_utils.py              # URL utilities
│   ├── http_helpers.py           # HTTP helper functions
│   ├── job_enricher.py           # Job enrichment utilities
│   ├── exceptions.py             # Custom exceptions
│   └── ...                       # More anti-bot utilities
├── config/                       # ⚙️ Configuration
│   ├── settings.py               # System configuration
│   ├── validators.py             # Pydantic validators
│   ├── config_loader.py          # YAML config loader
│   └── job_sources.yaml          # Job sources and keywords
├── tools/                        # 🛠️ Support tools
│   ├── web_scraper.py            # Advanced web scraping
│   ├── base_api_client.py        # Base API client class
│   ├── api_clients.py            # API clients
│   ├── email_validator.py        # Email validation
│   └── http_client_strategy.py   # HTTP strategies
├── tests/                        # 🧪 Test suite
│   ├── conftest.py               # Pytest fixtures
│   ├── test_agents/              # Agent tests
│   ├── test_utils/               # Utility tests
│   └── test_config/              # Configuration tests
├── templates/                    # 📄 HTML templates
│   └── results_template.html
├── skills/                       # 🎯 Agent Skills (LLM prompts)
│   ├── email-extractor/          # Skill: email extraction
│   │   └── SKILL.md
│   ├── job-matcher/              # Skill: job matching
│   │   └── SKILL.md
│   ├── query-variator/           # Skill: query variations
│   │   └── SKILL.md
│   ├── keyword-generator/        # Skill: adaptive keywords (NEW)
│   ├── semantic-matcher/         # Skill: semantic matching (NEW)
│   └── README.md                 # Skills documentation
├── data/                         # 💾 Data
│   ├── profile.json.example      # Profile example
│   └── profile.json              # Your profile (not uploaded to repo)
├── results/                      # 📊 HTML results
├── main.py                       # 🚀 Entry point
├── env.example                   # 📋 Environment variables example
├── .env                          # 🔐 Your variables (not uploaded to repo)
├── .gitignore                    # Files excluded from repo
└── requirements.txt              # 📦 Dependencies
```
Option 1: Static Keywords (Traditional)
Edit `config/job_sources.yaml` to change the base keywords:

```yaml
keywords:
  - "AI Engineer"
  - "LLMOps Engineer"
  - "Python Senior Developer"
  - "Machine Learning Engineer"
  # Add more keywords according to your profile
```

Option 2: Adaptive Keywords with LLM (Recommended - NEW)
If USE_ADAPTIVE_KEYWORDS=true (default), the system will automatically generate optimized keywords for each source and region based on your profile. Keywords in job_sources.yaml are used as a base and the LLM adapts them dynamically.
💡 Advantage: Adaptive keywords improve relevance by 30-50% compared to static keywords.
You can enable/disable sources in `config/job_sources.yaml`:

```yaml
job_sources:
  linkedin:
    enabled: true
    max_results: 50
  indeed:
    enabled: false  # Disabled: difficult access
    max_results: 50
  remoteok:
    enabled: false  # Disable this source
```

Control which jobs are shown in the report:

```bash
MIN_MATCH_SCORE=70  # Only jobs with score >= 70
```

If you experience frequent blocks, adjust these variables:
```bash
# Increase delays
SCRAPING_DELAY=3.0
MIN_DELAY=2.0
MAX_DELAY=5.0

# Enable all protections
USE_USER_AGENT_ROTATION=true
RANDOM_DELAY_ENABLED=true
ENABLE_BROWSER_STEALTH=true
SIMULATE_HUMAN_BEHAVIOR=true
USE_CIRCUIT_BREAKER=true
```

For debugging, change the log level:

```bash
LOG_LEVEL=DEBUG
LOG_FILE=debug.log
```

```bash
# Custom path for results
OUTPUT_DIR=/custom/path/results

# Custom path for data
DATA_DIR=/custom/path/data
```

The system generates the following files:
File generated at results/job_search_results_YYYYMMDD_HHMMSS.html with:
- ✅ Executive summary: General statistics
- 📋 Jobs table: Sorted by match score
- 🔍 Interactive filters: By source, score, keywords
- 📧 Consolidated email list: All emails found
- 📈 Statistics by source: Job distribution
- 🎯 Top recommended jobs: Best matches
`data/profile.json` with the profile extracted from your resume (generated automatically).
- Console: Real-time progress
- File (if `LOG_FILE` is configured): complete log for analysis
Cause: Missing required variables configuration.
Solution:
- Make sure you have a `.env` file in the `job_search_agents` directory
- Copy `env.example` to `.env` if it doesn't exist
- Configure `USER_EMAIL` and `USER_PHONE` in your `.env`:

```bash
USER_EMAIL=your_email@example.com
USER_PHONE=+1234567890
```
Cause: Missing API key for the configured provider.
Solution:
- If `LLM_PROVIDER=anthropic`, configure `ANTHROPIC_API_KEY`
- If `LLM_PROVIDER=openai`, configure `OPENAI_API_KEY`
- Verify that the variable is in your `.env` file
Cause: Missing dependencies.
Solution:
```bash
pip install -r requirements.txt
```

Cause: Playwright browser not installed.

Solution:

```bash
playwright install chromium
```

Cause: The resume file doesn't exist at the configured path.
Solution:
- Verify that the file exists
- Configure `CV_PATH` in your `.env` with the correct path:

```bash
CV_PATH=C:\full\path\to\your\resume.md
```

- Or place your resume at the default path: `CVs_Principales/CV_Dev_Senior_AI_Improvement.md`
Note: Indeed is disabled by default due to frequent blocks.
Cause: Too many requests or bot detection.
Solution:
- Increase delays:

```bash
SCRAPING_DELAY=5.0
MIN_DELAY=3.0
MAX_DELAY=8.0
```

- Enable all anti-bot protections
- Consider using official APIs if available
- Reduce `MAX_JOBS_PER_SOURCE` to make fewer requests
Cause: Timeout too short or slow connection.
Solution:
```bash
SEARCH_TIMEOUT=60  # Increase timeout to 60 seconds
```

Cause: Keywords too specific or disabled sources.
Solution:
- Review `config/job_sources.yaml` and verify that sources are enabled
- Adjust keywords to be more general
- Reduce `MIN_MATCH_SCORE` to see more results
The project includes a comprehensive test suite using pytest. Run tests to verify functionality:
```bash
# Run the full suite
pytest

# Run with coverage report
pytest --cov=. --cov-report=html

# Test utilities
pytest tests/test_utils/

# Test agents
pytest tests/test_agents/

# Test configuration
pytest tests/test_config/
```

- `tests/conftest.py`: Shared fixtures and test configuration
- `tests/test_utils/`: Tests for utility functions (URL utils, HTTP helpers, job enricher, etc.)
- `tests/test_agents/`: Tests for agent functionality (email extractor, matcher, etc.)
- `tests/test_config/`: Tests for configuration validation
The project uses an Agent Skills system inspired by the agent-skills standard to organize and manage LLM prompts. This approach separates business logic from LLM instructions, making prompts easier to maintain, version, and experiment with.
Skills are reusable instruction sets stored in SKILL.md files with YAML frontmatter. Each skill contains:
- Metadata: Name, description, version, tags
- System Message: Instructions for the LLM's role
- Human Message Template: User input template with variables
- Documentation: Usage examples, input/output specifications, and best practices
The system includes the following skills:
- `email-extractor`: Extracts contact emails from job descriptions using intelligent LLM analysis. Used by `EmailExtractorAgent`.
- `query-variator`: Generates natural variations of search queries to appear more human-like. Used by the `QueryVariator` utility.
- `keyword-generator`: Generates search keywords dynamically adapted to the profile, source, and region. Used by `KeywordGeneratorAgent`. (NEW)
- `semantic-matcher`: Semantically analyzes the relevance between jobs and the candidate profile. Used by `SemanticMatcherAgent`. (NEW)
- Separation of Concerns: Business logic is separated from LLM instructions
- Easy Maintenance: Update prompts without modifying Python code
- Better Documentation: Each skill documents its purpose, usage, and examples
- Experimentation: Test different prompt variations easily
- Versioning: Skills can be versioned independently
- Reusability: Skills can be shared between agents
Agents load skills using the `SkillLoader` utility:

```python
from utils.skill_loader import SkillLoader

# Initialize loader
skill_loader = SkillLoader()

# Load a skill and get a ChatPromptTemplate
prompt_template = skill_loader.load_skill("email-extractor")

# Use with LangChain
chain = prompt_template | llm | output_parser
result = chain.invoke({
    "description": job_description,
    "format_instructions": parser.get_format_instructions()
})
```

- Create a directory in `skills/` with the skill name (e.g., `skills/my-skill/`)
- Create a `SKILL.md` file with YAML frontmatter:
```markdown
---
name: my-skill
description: What this skill does
version: 1.0.0
agent: langgraph
tags:
  - tag1
  - tag2
---

# My Skill

## System Message

Your system instructions here...

## Human Message Template

Your template with {variables} here...
```

- Use the skill in your agent:
```python
from utils.skill_loader import SkillLoader

skill_loader = SkillLoader()
self.prompt_template = skill_loader.load_skill("my-skill")
```

The `SkillLoader` class provides several useful methods:
```python
# Load a skill
prompt = skill_loader.load_skill("email-extractor")

# Get skill metadata
metadata = skill_loader.get_skill_metadata("email-extractor")

# List all available skills
skills = skill_loader.list_available_skills()

# Validate a skill
is_valid, error = skill_loader.validate_skill("email-extractor")

# Clear the cache
skill_loader.clear_cache()
```

Each skill directory should contain:

- `SKILL.md`: Main skill file with frontmatter and instructions
- Optional: Additional documentation, examples, or reference files
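For a feel of what loading involves, here is a minimal stdlib sketch that splits a SKILL.md into frontmatter metadata and `##` sections (illustrative only; the real `SkillLoader` returns a LangChain `ChatPromptTemplate` and may parse differently):

```python
def parse_skill(text: str) -> dict:
    """Split a SKILL.md into frontmatter key/value pairs and '## ...' sections."""
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        # Skip YAML list items (e.g. tags); keep simple key: value pairs
        if ":" in line and not line.lstrip().startswith("-"):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    sections, current = {}, None
    for line in body.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {"meta": meta,
            "sections": {k: "\n".join(v).strip() for k, v in sections.items()}}

skill_md = """---
name: my-skill
version: 1.0.0
---
# My Skill
## System Message
You extract emails.
## Human Message Template
Description: {description}
"""
parsed = parse_skill(skill_md)
print(parsed["meta"]["name"])                # my-skill
print(parsed["sections"]["System Message"])  # You extract emails.
```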
- Version your skills: Use semantic versioning in the frontmatter
- Document thoroughly: Include usage examples and variable descriptions
- Test prompts: Validate skills before deploying
- Keep skills focused: One skill should handle one specific task
- Use descriptive names: Skill names should clearly indicate their purpose
Edit templates/results_template.html to customize the report design. The template uses HTML, CSS, and vanilla JavaScript.
- Create a new agent in `agents/` (e.g., `new_source_agent.py`)
- Create an API client in `tools/api_clients.py`, inheriting from `BaseAPIClient` if necessary:

```python
from tools.base_api_client import BaseAPIClient

class NewSourceClient(BaseAPIClient):
    def __init__(self):
        super().__init__(base_url="https://new-source.com/api")

    def search_jobs(self, keywords: List[str], **kwargs) -> List[Dict]:
        # Implementation
        ...
```

- Add the agent to the orchestrator in `agents/orchestrator.py`
- Configure it in `config/job_sources.yaml`:

```yaml
new_source:
  enabled: true
  max_results: 50
  base_url: "https://new-source.com"
```

- The configuration will be validated automatically using Pydantic models
Edit data/profile.json to reflect your exact profile. This file is used for:
- Job matching
- Relevant information extraction
- Score calculation
Instead of editing Python code to change LLM prompts, you can now edit the skill files directly:
- Navigate to the `skills/` directory
- Find the skill you want to modify (e.g., `email-extractor/SKILL.md`)
- Edit the `## System Message` or `## Human Message Template` sections
- Save the file - changes take effect immediately (no code changes needed!)
This makes it much easier to:
- Experiment with different prompt strategies
- Fine-tune LLM behavior without touching business logic
- Version prompt changes independently
- Document prompt improvements
Example: To improve email extraction, edit `skills/email-extractor/SKILL.md` and modify the system message instructions.
- ⚡ The system respects rate limits and has delays between requests to avoid blocks
- 🔒 LinkedIn has strong anti-scraping protection; may require authentication or official APIs
- 🔄 Some sources may change their HTML structures, requiring code updates
- 🎯 Matching uses keywords and heuristics; for better accuracy, consider using vector embeddings
- 📁 The `.env` and `data/profile.json` files are listed in `.gitignore` and are not uploaded to the repository
- 🔐 Never share your `.env` file, as it contains sensitive information
This is a personal project, but improvements are welcome:
- 🐛 Report bugs: Open an issue with problem details
- 💡 Suggest improvements: Ideas for new features
- 🔧 Scraping improvements: Optimizations and new anti-bot techniques
- ➕ New job sources: Add more job portals
- 🎯 Matching improvements: More accurate algorithms
- ⚡ Performance optimizations: Make the system faster
Personal use.
This project is licensed under the Hippocratic License 3.0, an ethical open source license that allows the use, modification, and distribution of the software with specific restrictions.
- ✅ Commercial and non-commercial use
- ✅ Modification and creation of derivative works
- ✅ Distribution and sublicensing
- ✅ Private use
The license prohibits the use of the software for:
- ❌ Aggressive military purposes or human rights violations
- ❌ Criminal activities or illegal acts
- ❌ Violations of fundamental rights (slavery, torture, discrimination, etc.)
- ❌ Illegal environmental damage
- ❌ Any use that violates the ethical standards defined in the Universal Declaration of Human Rights
For more details, see the LICENSE file or visit firstdonoharm.dev.
Note: This license is designed to promote ethical use of software while maintaining open source freedom for legitimate purposes.
Developed with ❤️ using LangGraph and LangChain for intelligent job search 🚀
Need help? Check the Troubleshooting section or consult the env.example file to see all available configuration options.
