This project discovers companies that use jobs.ashbyhq.com for their careers/job portals. It finds and verifies company names (slugs) for Ashby-hosted job boards:

`https://jobs.ashbyhq.com/{slug}`

It is built for real discovery and verification, not slug guessing.
- Finds candidates from multiple sources:
  - Search results that mention `jobs.ashbyhq.com`
  - Careers pages that embed Ashby (`window.Ashby`, `__ashbyBaseJobBoardUrl`, `/embed?version=2`); see the extraction sketch after this list
- Exports verified records by default.
- Supports resume with SQLite state and dedup.
- Includes retries, timeouts, rate limiting, and verbose logs.
- Search-provider backoff (see the backoff sketch after this list):
  - Starts at `20s`
  - Doubles up to `300s`
  - Disables the provider for the rest of the run after `10` blocked events
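The embed markers above can be recognized with a couple of simple checks. The snippet below is a minimal, hypothetical sketch, not the project's actual `extractors.py` code; the regex and helper names are assumptions:

```python
import re

# Hypothetical pattern for direct links to an Ashby job board.
ASHBY_URL_RE = re.compile(r"https?://jobs\.ashbyhq\.com/([A-Za-z0-9._-]+)")
# Markers listed above that indicate an embedded Ashby board.
EMBED_MARKERS = ("window.Ashby", "__ashbyBaseJobBoardUrl", "/embed?version=2")


def extract_candidate_slugs(html: str) -> set[str]:
    """Collect slugs from any jobs.ashbyhq.com URLs found in page HTML."""
    return {m.group(1).lower() for m in ASHBY_URL_RE.finditer(html)}


def looks_like_ashby_embed(html: str) -> bool:
    """True if the page embeds Ashby even without a direct job-board link."""
    return any(marker in html for marker in EMBED_MARKERS)
```

The provider backoff can be pictured with a small helper like the one below. This sketch only encodes the documented 20s start, 300s cap, and 10-blocked-events cutoff; the real `search_providers.py` may structure this differently:

```python
class ProviderBackoff:
    """Per-provider backoff: start at 20s, double up to 300s, disable after 10 blocks."""

    def __init__(self, initial: float = 20.0, maximum: float = 300.0, max_blocked: int = 10):
        self.delay = initial
        self.maximum = maximum
        self.max_blocked = max_blocked
        self.blocked_events = 0

    @property
    def disabled(self) -> bool:
        # After 10 blocked events the provider sits out the rest of the run.
        return self.blocked_events >= self.max_blocked

    def next_wait(self) -> float | None:
        """Record a blocked event; return how long to wait, or None once disabled."""
        self.blocked_events += 1
        if self.disabled:
            return None
        wait = self.delay
        self.delay = min(self.delay * 2, self.maximum)
        return wait
```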
- Python 3
- `httpx` (async HTTP)
- `beautifulsoup4` (HTML parsing)
- SQLite (`sqlite3`) for state and cache
- `tqdm` (progress bars)
- `pytest` + `pytest-asyncio` (tests)
```
ashby_discovery/
    __main__.py
    cli.py
    config.py
    pipeline.py
    discovery.py
    extractors.py
    verification.py
    enrichment.py
    http_client.py
    search_providers.py
    logging_utils.py
    storage.py
    output.py
    models.py
    utils.py
docs/
    ARCHITECTURE.md
    CLI_REFERENCE.md
    SQLITE_SCHEMA.md
    OPERATIONS.md
    diagrams/
        high-level-system-diagram.png
        in-depth-system-diagram.png
examples/
    example_verified_ashby_slugs.csv
    search_queries.txt
    seed_list.txt
requirements.txt
tests/
```
![High-level system diagram](docs/diagrams/high-level-system-diagram.png)
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run:

```bash
python -m ashby_discovery --max-results 500 --output-dir output
```

Verbose run:

```bash
python -m ashby_discovery --max-results 500 --output-dir output --verbose
```

Resume run:

```bash
python -m ashby_discovery --max-results 500 --output-dir output --resume
```

Seed-only mode (skip search):

```bash
python -m ashby_discovery \
  --input-company-list seed_list.txt \
  --output-dir output \
  --resume \
  --search-pages-per-query 0
```

Custom search query file:

```bash
python -m ashby_discovery \
  --output-dir output \
  --search-queries-file search_queries.txt
```

- `search_queries.txt`: default query list (`#` comments supported)
- `--input-company-list`: optional domains/URLs (one per line)
- Supported engines: `duckduckgo`, `brave`, `yahoo`
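Both input files are plain text with one entry per line. The contents below are illustrative only; the shipped defaults are in `examples/search_queries.txt` and `examples/seed_list.txt`.

`search_queries.txt` (illustrative entries; `#` starts a comment):

```text
# queries that tend to surface Ashby-hosted boards
"jobs.ashbyhq.com" careers
site:jobs.ashbyhq.com
```

`seed_list.txt` (illustrative entries, one domain or URL per line):

```text
example.com
https://example.org/careers
```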
Inside `--output-dir`:

- `verified_ashby_slugs.csv`
- `verified_ashby_slugs.json`
- `failures.jsonl`
- `discovery_state.sqlite` (or a timestamped DB for non-resume runs)

CSV/JSON fields:

- `slug`
- `inferred_company_name`
- `ashby_url`
- `source_type`
- `source_url`
- `verification_status`
- `notes`
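An illustrative CSV row with those fields (all values below are made up, including the `source_type` label):

```csv
slug,inferred_company_name,ashby_url,source_type,source_url,verification_status,notes
examplecorp,Example Corp,https://jobs.ashbyhq.com/examplecorp,search,https://example.com/careers,VERIFIED,
```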
With `--resume`, summary counts are cumulative and include this-run deltas:

```
Rows exported (cumulative): X (this run: Y)
Failures recorded (cumulative): X (this run: Y)
```
`--max-results` affects more than export (see the worked example after these notes):

- Final CSV/JSON rows are capped at `max_results`.
- Search URL collection budget is `max_results * 4`.
- Verification selection budget is `max_results * 4` (`verify_limit_multiplier=4`).
- `--request-delay` is global for all fetches (search, source scan, verification).
- Verification is heuristic (see the sketch after these notes):
  - Only HTTP `200` pages can be `VERIFIED`.
  - Score threshold is `positive >= 3` and `negative <= 2`.
- If the query file is missing or empty, discovery uses in-code fallback queries.
- Domain seeds are expanded to fixed paths: `/careers`, `/jobs`, `/company/careers`, `/about/careers`.
- Retry wait and cache TTL are fixed:
  - Retry backoff: `1.5 * attempt`
  - Cache TTL: `24h`
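As a worked example of the `--max-results` budgets above: with `--max-results 500`, the final export is capped at 500 rows, while search URL collection and verification selection each get a budget of `500 * 4 = 2000`.

The verification thresholds can be pictured as a small decision function. This is only a sketch of the documented thresholds; the scoring signals and the label used for non-verified records are assumptions, and the real logic lives in `verification.py`:

```python
def classify(status_code: int, positive_score: int, negative_score: int) -> str:
    """Apply the documented thresholds: only HTTP 200 pages can be VERIFIED,
    and the scores must satisfy positive >= 3 and negative <= 2."""
    if status_code == 200 and positive_score >= 3 and negative_score <= 2:
        return "VERIFIED"
    return "UNVERIFIED"  # label is an assumption; the project may use a different status
```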
- Architecture and flow: docs/ARCHITECTURE.md
- CLI flags and defaults: docs/CLI_REFERENCE.md
- SQLite schema and table purpose: docs/SQLITE_SCHEMA.md
- Logs, progress bars, troubleshooting: docs/OPERATIONS.md
Run tests:
```bash
pytest -q
```

Quick seed-only smoke check:

```bash
python -m ashby_discovery \
  --input-company-list seed_list.txt \
  --output-dir output \
  --search-pages-per-query 0
```